TCS has been a pioneer in fuelling the ambitions of young technologists like you. We are a global leader in the technology arena, and there's nothing that can stop us from growing together.
What we are looking for
Role: AI SRE (Docker, Kubernetes, Ansible)
Experience Range: 6–8 Years
Location: Bangalore
Must Have:
- Production experience in SRE / Infrastructure / ops for large-scale systems
- Strong programming/scripting skills (Python, Go, Java, or equivalent)
- Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
- Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
- Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
- Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
- Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
- Solid experience in capacity planning, performance tuning, scaling, and incident response
- Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
- Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
- Excellent communication, documentation, and cross-team collaboration skills
- Proven track record of reducing operational toil via automation (an illustrative sketch follows this list)
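As a flavour of the toil-reduction automation this role involves, here is a minimal Python sketch that flags crash-looping pods via kubectl. It assumes a configured kubectl context; the namespace name and restart threshold are hypothetical placeholders, not part of this job description.

```python
#!/usr/bin/env python3
"""Illustrative toil-reduction sketch: flag crash-looping pods in a namespace."""
import json
import subprocess

NAMESPACE = "genai-serving"   # hypothetical namespace
RESTART_THRESHOLD = 5         # hypothetical alerting threshold

def crash_looping_pods(namespace: str) -> list[str]:
    """Return names of pods whose total container restarts exceed the threshold."""
    raw = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    flagged = []
    for pod in json.loads(raw).get("items", []):
        restarts = sum(
            cs.get("restartCount", 0)
            for cs in pod.get("status", {}).get("containerStatuses", [])
        )
        if restarts > RESTART_THRESHOLD:
            flagged.append(pod["metadata"]["name"])
    return flagged

if __name__ == "__main__":
    for name in crash_looping_pods(NAMESPACE):
        print(f"restart threshold exceeded: {name}")
```

In practice a script like this would feed an alerting or ticketing workflow rather than print to stdout; it is shown only to indicate the kind of scripting expected.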
Good to Have:
- Understanding of SRE practices such as SLO-driven operations, error budgets, and blameless postmortems.
- Proficiency with open-source observability tooling such as OpenTelemetry, Grafana, Loki, Prometheus, and Cortex.
- Good knowledge of microservice-based architectures and industry standards for both public and private cloud.
- Knowledge of data pipeline technologies (Kafka, Spark, Flink, etc.)
- Good knowledge of various data stores (SQL databases, Redis, Kafka, Snowflake, etc.) used for cloud application storage.
- Experience with Generative AI development, embeddings, and fine-tuning of Generative AI models.
- Experience in high-performance computing (HPC), distributed GPU cluster scheduling (e.g. Slurm, Kubernetes GPU scheduling)
- Understanding of ModelOps / MLOps / LLMOps.
- Experience with chaos engineering, canary deployments, and blue/green rollouts (a canary-gate sketch follows this list)
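To illustrate the canary-deployment experience mentioned above, here is a minimal Python sketch of a promotion gate that queries the Prometheus HTTP API for a canary's 5xx error rate. The Prometheus URL, metric names, and 1% threshold are hypothetical placeholders, not prescribed by this posting.

```python
"""Illustrative canary-gate sketch: promote or roll back based on 5xx error rate."""
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # hypothetical endpoint
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{deployment="canary",code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{deployment="canary"}[5m]))'
)
MAX_ERROR_RATE = 0.01  # hypothetical gate: 1% errors over 5 minutes

def canary_error_rate() -> float:
    """Run an instant query and return the canary error rate as a fraction."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": ERROR_RATE_QUERY},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # An empty result means no canary traffic yet; treat that as 0% errors here.
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    rate = canary_error_rate()
    verdict = "promote" if rate <= MAX_ERROR_RATE else "roll back"
    print(f"canary 5xx rate over 5m: {rate:.4%} -> {verdict}")
```

A production gate would typically also check latency and saturation signals and be wired into the CD pipeline; this sketch only shows the shape of such a check.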
Essential Responsibilities:
- Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
- Design and build automation for core platform capabilities, reducing manual toil
- Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
- Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards (an error-budget calculation is sketched after this list)
- Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
- Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
- Optimize cost vs. performance tradeoffs in large-scale compute environments
- Harden systems for security, compliance, auditability, and data governance
- Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
- Define disaster recovery (DR) strategies, backup/restore practices, fault tolerance mechanisms
- Maintain runbooks, operational playbooks, documentation, and training materials
- Participate in on-call rotations and respond to production incidents 24/7 as needed
- Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability
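As a concrete illustration of the SLO and error-budget work listed above, the short Python sketch below computes the downtime budget implied by an availability target. The 99.9% target and 30-day window are example figures, not numbers taken from this job description.

```python
"""Illustrative error-budget sketch for an SLO-driven service."""

SLO_TARGET = 0.999              # hypothetical availability SLO (99.9%)
WINDOW_MINUTES = 30 * 24 * 60   # hypothetical 30-day rolling window

def error_budget_minutes(slo: float, window_minutes: int) -> float:
    """Total downtime the SLO allows over the window."""
    return (1.0 - slo) * window_minutes

def budget_remaining(observed_downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(SLO_TARGET, WINDOW_MINUTES)
    return 1.0 - observed_downtime_minutes / budget

if __name__ == "__main__":
    budget = error_budget_minutes(SLO_TARGET, WINDOW_MINUTES)
    print(f"30-day error budget at 99.9%: {budget:.1f} minutes")          # ~43.2 minutes
    print(f"budget remaining after 10 min of downtime: {budget_remaining(10):.0%}")
```

The same arithmetic underpins burn-rate alerting and release-freeze decisions when the budget nears exhaustion.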
Minimum Qualification:
- 15 years of full-time education
- Minimum of 50% marks in 10th, 12th, UG & PG (if applicable)
Other Details
Employment Type: Full Time, Permanent
Role Category: IT Infrastructure Services