TCS has been a pioneer in fuelling the ambitions of young technologists like you. We are a global leader in the technology arena, and there's nothing that can stop us from growing together.
What we are looking for
Role: AI SRE (Docker, Kubernetes, Ansible)
Experience Range: 6–8 Years
Location: Bangalore
Must Have:
- Production experience in SRE / Infrastructure / ops for large-scale systems
- Strong programming/scripting skills (Python, Go, Java, or equivalent)
- Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
- Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
- Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
- Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
- Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
- Solid experience in capacity planning, performance tuning, scaling, and incident response
- Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
- Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
- Excellent communication, documentation, and cross-team collaboration skills
- Proven track record of reducing operational toil via automation (an illustrative sketch follows this list)
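As a flavour of the toil-reduction automation this role involves, here is a minimal Python sketch that flags crash-looping pods via kubectl. It assumes a configured kubectl context; the namespace name and restart threshold are hypothetical placeholders, not part of this job description.

```python
#!/usr/bin/env python3
"""Illustrative toil-reduction sketch: flag crash-looping pods in a namespace."""
import json
import subprocess

NAMESPACE = "genai-serving"   # hypothetical namespace
RESTART_THRESHOLD = 5         # hypothetical alerting threshold

def crash_looping_pods(namespace: str) -> list[str]:
    """Return names of pods whose total container restarts exceed the threshold."""
    raw = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    flagged = []
    for pod in json.loads(raw).get("items", []):
        restarts = sum(
            cs.get("restartCount", 0)
            for cs in pod.get("status", {}).get("containerStatuses", [])
        )
        if restarts > RESTART_THRESHOLD:
            flagged.append(pod["metadata"]["name"])
    return flagged

if __name__ == "__main__":
    for name in crash_looping_pods(NAMESPACE):
        print(f"restart threshold exceeded: {name}")
```

In practice a script like this would feed an alerting or ticketing workflow rather than print to stdout; it is shown only to indicate the kind of scripting expected.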
Good to Have:
- Understanding of SRE practices such as SLO-driven operations, error budgets, and blameless postmortems.
- Proficiency with open-source observability tooling such as OpenTelemetry, Grafana, Loki, Prometheus, and Cortex.
- Good knowledge of microservice-based architectures and industry standards for both public and private cloud.
- Knowledge of data pipeline technologies (Kafka, Spark, Flink, etc.)
- Good knowledge of various data stores (SQL databases, Redis, Kafka, Snowflake, etc.) used for cloud application storage.
- Experience with Generative AI development, embeddings, and fine-tuning of Generative AI models.
- Experience in high-performance computing (HPC), distributed GPU cluster scheduling (e.g. Slurm, Kubernetes GPU scheduling)
- Understanding of ModelOps / MLOps / LLMOps.
- Experience with chaos engineering, canary deployments, and blue/green rollouts (a canary-gate sketch follows this list)
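To illustrate the canary-deployment experience mentioned above, here is a minimal Python sketch of a promotion gate that queries the Prometheus HTTP API for a canary's 5xx error rate. The Prometheus URL, metric names, and 1% threshold are hypothetical placeholders, not prescribed by this posting.

```python
"""Illustrative canary-gate sketch: promote or roll back based on 5xx error rate."""
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # hypothetical endpoint
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{deployment="canary",code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{deployment="canary"}[5m]))'
)
MAX_ERROR_RATE = 0.01  # hypothetical gate: 1% errors over 5 minutes

def canary_error_rate() -> float:
    """Run an instant query and return the canary error rate as a fraction."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": ERROR_RATE_QUERY},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # An empty result means no canary traffic yet; treat that as 0% errors here.
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    rate = canary_error_rate()
    verdict = "promote" if rate <= MAX_ERROR_RATE else "roll back"
    print(f"canary 5xx rate over 5m: {rate:.4%} -> {verdict}")
```

A production gate would typically also check latency and saturation signals and be wired into the CD pipeline; this sketch only shows the shape of such a check.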
Essential Responsibilities:
- Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
- Design and build automation for core platform capabilities, reducing manual toil
- Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
- Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards (an error-budget calculation is sketched after this list)
- Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
- Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
- Optimize cost vs. performance tradeoffs in large-scale compute environments
- Harden systems for security, compliance, auditability, and data governance
- Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
- Define disaster recovery (DR) strategies, backup/restore practices, fault tolerance mechanisms
- Maintain runbooks, operational playbooks, documentation, and training materials
- Participate in on-call rotations and respond to production incidents 24/7 as needed
- Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability
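As a concrete illustration of the SLO and error-budget work listed above, the short Python sketch below computes the downtime budget implied by an availability target. The 99.9% target and 30-day window are example figures, not numbers taken from this job description.

```python
"""Illustrative error-budget sketch for an SLO-driven service."""

SLO_TARGET = 0.999              # hypothetical availability SLO (99.9%)
WINDOW_MINUTES = 30 * 24 * 60   # hypothetical 30-day rolling window

def error_budget_minutes(slo: float, window_minutes: int) -> float:
    """Total downtime the SLO allows over the window."""
    return (1.0 - slo) * window_minutes

def budget_remaining(observed_downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(SLO_TARGET, WINDOW_MINUTES)
    return 1.0 - observed_downtime_minutes / budget

if __name__ == "__main__":
    budget = error_budget_minutes(SLO_TARGET, WINDOW_MINUTES)
    print(f"30-day error budget at 99.9%: {budget:.1f} minutes")          # ~43.2 minutes
    print(f"budget remaining after 10 min of downtime: {budget_remaining(10):.0%}")
```

The same arithmetic underpins burn-rate alerting and release-freeze decisions when the budget nears exhaustion.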
Minimum Qualification:
- 15 years of full-time education
- Minimum of 50% marks in 10th, 12th, UG & PG (if applicable)
Other Details
Employment Type: Full Time, Permanent
Role Category: IT Infrastructure Services