The Senior Site Reliability Engineer will enhance observability and reliability in large distributed systems through monitoring, incident response, and automation.
Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!
Job Title
Senior Site Reliability Engineer (SRE) – Observability & DevOps
Role Summary
We are looking for a Senior SRE who will own and evolve our observability and reliability platform. The ideal candidate has strong Linux fundamentals, hands-on experience with modern monitoring stacks, and the ability to design scalable alerting and metrics pipelines for large, distributed systems.
This role requires both deep technical expertise and a production-ownership mindset.
Primary Responsibilities
Observability & Monitoring
- Design, implement, and maintain end-to-end observability using:
  - Prometheus for metrics collection
  - Alertmanager for alert routing, deduplication, and escalation
  - Grafana for visualization and dashboards
  - AppDynamics for APM, transaction tracing, and application health
- Build actionable dashboards for:
  - SLIs, SLOs, and error budgets
  - Application, infrastructure, and platform health
- Reduce alert fatigue by implementing signal-based alerting and proper severity models
- Manage and optimize ClickHouse for:
  - High-volume metrics, logs, or traces
  - Long-term retention and fast analytical queries
- Work on schema design, performance tuning, and cost optimization
- Define and measure SRE best practices (SLIs, SLOs, SLAs)
- Participate in incident response, postmortems, and root cause analysis
- Drive reliability improvements through automation and capacity planning
- Develop tooling and automation using at least one scripting/programming language
- Automate monitoring onboarding, alert generation, and dashboard creation
- Improve operational efficiencies across DevOps tooling
- Strong Linux fundamentals
  - Troubleshooting, performance tuning, networking, system internals
- Scripting / Programming (any one or more):
  - Python (preferred), Bash, Go, or similar
- Observability Tools (hands-on):
  - Prometheus
  - Alertmanager
  - Grafana
  - AppDynamics
- Data Platform:
  - Hands-on experience with ClickHouse
- Metrics vs logs vs traces
- Golden signals (latency, traffic, errors, saturation)
- Alert thresholds, routing policies, escalation strategies
- Kubernetes monitoring (Prometheus Operator, kube-state-metrics)
- Infrastructure as Code (Terraform, Helm)
- CI/CD observability
- Cloud platforms (AWS / Azure / GCP)
- Experience managing observability at scale (100+ services / platforms)
- Ability to architect observability solutions, not just operate them
- Strong production troubleshooting and incident ownership
- Mentoring junior engineers
- Influence DevOps and SRE best practices across teams
- Communicate clearly with developers and leadership
- 5-7 years of experience in SRE / DevOps / Production Engineering
- Experience operating high-availability, large-scale systems
- Proven background in observability-driven reliability improvements
Top Skills
Alertmanager
AppDynamics
Bash
ClickHouse
Go
Grafana
Prometheus
Python


