Zensar Technologies

Engineer Lead, Site Reliability

Posted 12 Days Ago
In-Office or Remote
Hiring Remotely in India
Senior level
The Engineer Lead - Site Reliability will provide leadership to enhance system reliability, mentor teams, and oversee incident management and chaos engineering initiatives for platforms in Banking Solutions and Capital Markets.

Job Description: Engineer Lead – Site Reliability Engineering (SRE)
Role Overview
The Engineer Lead – Site Reliability Engineering (SRE) will provide technical and thought leadership to ensure the reliability, resiliency, scalability, and observability of mission‑critical platforms supporting Banking Solutions, Payments, and Capital Markets.
This role blends advanced SRE practices, resiliency and chaos engineering, and service health governance with people and technical leadership. The Engineer Lead will define reliability standards, mentor teams, and partner closely with Engineering, DevOps, Security, and Product stakeholders to embed reliability as a core product feature.

What You Will Be Doing
SRE Leadership & Strategy

Act as the technical lead and reliability champion, driving SRE best practices across multiple teams and platforms.
Define and evangelize reliability standards, principles, and operational excellence frameworks.
Guide teams in balancing feature velocity with system reliability using error budgets and SLO-driven decision-making.
Mentor and coach engineers in SRE, observability, incident management, and automation.

Service Health, SLI/SLO & Observability

Define, implement, and govern Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
Establish standardized service health monitoring and reporting frameworks across platforms.
Design and maintain end-to-end observability solutions covering infrastructure, applications, APIs, and customer experience.
Drive reliability insights through dashboards, health scores, and executive-level metrics.

Resiliency Testing & Chaos Engineering

Lead resiliency engineering initiatives to validate system behavior under failure conditions.
Design and execute chaos engineering experiments to proactively identify weaknesses in architecture and operations.
Integrate resiliency testing into CI/CD pipelines and pre-production environments.
Partner with development and architecture teams to ensure systems are fault-tolerant, self-healing, and resilient by design.

Incident Management & Operational Excellence

Lead high-severity incident response efforts, providing clear technical and operational direction.
Establish and continuously improve incident management, escalation, and communication practices.
Drive blameless post-incident reviews, ensuring root causes are addressed and preventive actions are implemented.
Measure and improve operational KPIs such as MTTR, MTTD, and incident recurrence rates.

Automation, Platform Reliability & Cloud Operations

Champion automation-first approaches to reduce toil and manual intervention.
Oversee deployment pipelines, configuration management, and release reliability practices.
Guide teams on Infrastructure as Code (IaC), environment consistency, and cloud governance.
Ensure disaster recovery, backup, and failover strategies are tested and production-ready.

Cross-Functional Collaboration & Governance

Collaborate with Engineering, QA, DevOps, Security, Architecture, and Product teams to embed reliability into the SDLC.
Ensure platforms comply with security, regulatory, and audit requirements, especially in financial services environments.
Influence technical roadmaps to prioritize resiliency, stability, and customer experience.

Required Skills & Experience
Core Requirements

Strong experience with core SRE practices, including system reliability, incident management, automation, and observability.
Hands-on expertise in resiliency testing and chaos engineering methodologies.
Proven experience designing and operating SLI / SLO / Error Budget frameworks at scale.
Deep understanding of distributed systems, microservices architectures, and cloud-native platforms.
Experience with cloud platforms (AWS, Azure, and/or Google Cloud).
Hands-on experience with Docker and Kubernetes.
Expertise in monitoring, observability, and logging tools, such as:

Prometheus, Grafana, Datadog
Splunk, ELK Stack

Strong background in incident management, post-mortem facilitation, and production support.
Proficiency in automation and scripting (Python, Bash, Terraform, Ansible).
Experience managing and improving CI/CD pipelines (Jenkins, GitLab CI/CD, Azure DevOps).
Ability to lead technical discussions, influence decisions, and communicate effectively with senior stakeholders.
Strong ownership mindset with accountability for service reliability and customer outcomes.

Nice to Have (SRE+ Skills)

Experience with Harness Chaos Engineering (CE) or similar chaos engineering platforms.
Programming experience in Java, particularly for debugging, performance analysis, or building internal SRE tooling.
Experience implementing self-healing and auto-remediation workflows.
Exposure to banking, payments, or capital markets domains.
Familiarity with chaos engineering maturity models and reliability governance practices.

About Us

At Zensar, we’re “experience-led everything”. We are committed to conceptualizing, designing, engineering, marketing, and managing digital solutions and experiences for over 130 leading enterprises. We are a company driven by a bold purpose: Together, we shape experiences for better futures. Whether for our clients, our people, or the world around us, this belief powers everything we do. At the heart of our culture is ONE with Client - a set of four core values that reflect who we are and how we work: One Zensar, Nurturing, Empowering, and Client Focus.
Part of the $4.8 billion RPG Group, we’re a community of 10,000+ innovators across 30+ global locations, including Milpitas, Seattle, Princeton, Cape Town, London, Zurich, Singapore, and Mexico City. Explore Life at Zensar and join us to Grow. Own. Achieve. Learn. to be the best version of yourself.
We believe the best work happens when individuality is celebrated, growth is encouraged, and well-being is prioritized. We are an equal employment opportunity (EEO) and affirmative action employer, committed to creating an inclusive workplace. All qualified applicants will be considered without regard to race, creed, color, ancestry, religion, sex, national origin, citizenship, age, sexual orientation, gender identity, disability, marital status, family medical leave status, or protected veteran status.

Top Skills

Ansible
AWS
Azure
Azure DevOps
Bash
Datadog
Docker
ELK Stack
GitLab CI/CD
GCP
Grafana
Jenkins
Kubernetes
Prometheus
Python
Splunk
Terraform
