Job Title: Senior Associate Cloud SRE
Education: Any Graduate
Experience: 4–8 years
Location: Mumbai (Hybrid Model)
Employment Type: Full-time
Overview
We are seeking a Site Reliability Engineer to deliver tier two managed services support for cloud operations across AWS and Azure environments. This role combines advanced troubleshooting and operational excellence with proactive reliability engineering, focusing on maintaining 24x7x365 service availability while continuously improving automation and operational efficiency across multi-cloud infrastructure.
Role Summary
As a Site Reliability Engineer supporting multi-cloud infrastructure (AWS and Azure), you will manage complex operational challenges and escalations while implementing reliability best practices across production systems. You will work collaboratively with customer teams and senior engineers to ensure system stability, automate operational workflows, and maintain comprehensive observability. This is a delivery-focused role requiring both advanced technical execution and operational ownership across cloud platforms.
Primary Responsibilities:
Tier 2 Multi-Cloud Operations & Managed Services:
AWS Operations:
Provide 24x7x365 tier two support and escalation handling for AWS environments
Execute complex operational tasks including:
Patching and managing Amazon Machine Images (AMIs)
Creating and configuring EC2 instances and RDS databases
Managing IAM roles, users, and policies
Configuring S3 bucket policies and Access Control Lists (ACLs)
Opening and managing network routes (VPC, subnets, security groups)
Restoring snapshots and database backups to lower environments
Increasing disk sizes (EBS volumes) and managing storage optimization
Implementing proper tagging for environment identification and cost allocation
Managing log archiving using CloudWatch Logs and S3
Azure Operations:
Provide equivalent tier two support for Azure cloud environments
Execute Azure-specific operational tasks including:
Managing and updating Azure Virtual Machine images
Creating and configuring Azure Virtual Machines and Azure SQL databases
Managing Microsoft Entra ID (formerly Azure Active Directory) identities, roles, and role-based access control (RBAC)
Configuring Azure Storage account policies and access controls
Managing Virtual Networks, Network Security Groups (NSGs), and route tables
Restoring VM snapshots and database backups to lower environments
Managing disk resizing and Azure Managed Disks optimization
Implementing Azure resource tagging and cost management
Managing log archiving using Azure Monitor and Log Analytics
Cross-Cloud Responsibilities:
Handle escalations from tier one support with deep technical analysis across both platforms
Provide root cause analysis for complex incidents in multi-cloud environments
Implement consistent operational standards across AWS and Azure
Support hybrid cloud connectivity and integration scenarios
Reliability & Incident Management:
Implement and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) across AWS and Azure in collaboration with senior engineers and customer stakeholders
Lead tier two incident response, performing advanced troubleshooting and resolution on both cloud platforms
Conduct thorough post-incident analysis with actionable remediation plans
Reduce reactive work by improving runbooks, alert configurations, and standard operating procedures for both clouds
Apply reliability engineering best practices with oversight and review
Mentor tier one engineers during incident response across multi-cloud scenarios
Automation & Infrastructure as Code:
Build and maintain CI/CD pipelines for infrastructure and application deployments on AWS and Azure
Automate complex operational tasks including patching, backups, and environment provisioning across both platforms
Develop infrastructure automation using Terraform for multi-cloud environments
Create sophisticated scripts and tooling to eliminate manual toil and improve operational efficiency
Implement Azure Resource Manager (ARM) templates or Bicep for Azure-specific automation
Follow established patterns and contribute continuous improvements
Document automation processes for knowledge sharing across cloud platforms
Containerization & Deployment:
Deploy and operate containerized workloads using Docker on AWS services (ECS, EKS) and Azure services (AKS, Azure Container Instances)
Support container reliability through proper health checks, autoscaling configurations, and resource management on both platforms
Implement safe deployment patterns (canary deployments, blue/green deployments) across AWS and Azure
Troubleshoot complex containerization and orchestration issues in multi-cloud Kubernetes environments
Follow and enhance established containerization standards across both cloud providers
Observability & Performance:
Configure and maintain comprehensive monitoring, logging, and alerting systems across AWS CloudWatch and Azure Monitor
Leverage observability data to identify issues and lead root cause analysis in multi-cloud environments
Contribute to performance tuning and cost optimization initiatives across both platforms
Ensure proper instrumentation and telemetry across AWS and Azure environments
Identify patterns and trends to prevent future incidents
Build custom dashboards and reports using CloudWatch, Azure Monitor, and third-party tools (Datadog, Grafana)
Collaboration & Customer Engagement:
Work closely with customer development and operations teams to improve system operability across cloud platforms
Participate in design reviews and reliability assessments for multi-cloud architectures
Communicate technical concepts, tradeoffs, and recommendations clearly to stakeholders
Provide regular operational updates and service reports covering both AWS and Azure
Act as technical liaison between customers and internal engineering teams
Required Qualifications & Experience:
3–5 years of hands-on experience in DevOps, SRE, or production operations roles
Proven experience operating production systems in AWS OR Azure (deep expertise in one required)
Working knowledge of, or exposure to, the secondary cloud platform (ability to learn and support)
Demonstrated experience managing containerized applications in production
Experience delivering managed services or supporting customer-facing infrastructure
Track record of handling complex technical escalations in cloud environments
Technical Skills - Primary Cloud Platform (AWS OR Azure):
For AWS-Primary Candidates:
AWS Services (Expert): Deep knowledge of EC2, RDS, S3, IAM, VPC, CloudWatch, Lambda, and related services
AWS Networking (Expert): Strong experience with VPCs, subnets, security groups, route tables, and VPN/Direct Connect
AWS Storage (Expert): Proficiency with EBS, S3, and backup/restore strategies
AWS Containers (Expert): Hands-on experience with ECS, EKS, or Fargate
Azure (Foundational): Basic understanding of Azure services with willingness to learn; exposure to Azure VMs, Storage, or networking is a plus
For Azure-Primary Candidates:
Azure Services (Expert): Deep knowledge of Azure VMs, Azure SQL, Storage Accounts, Microsoft Entra ID (formerly Azure AD), Virtual Networks, Azure Monitor
Azure Networking (Expert): Strong experience with VNets, NSGs, Application Gateway, Azure Firewall, and ExpressRoute
Azure Storage (Expert): Proficiency with Managed Disks, Blob Storage, and Azure Backup
Azure Containers (Expert): Hands-on experience with AKS (Azure Kubernetes Service) and Azure Container Instances
AWS (Foundational): Basic understanding of AWS services with willingness to learn; exposure to EC2, S3, or VPC is a plus
Technical Skills - Cross-Platform (All Candidates):
Infrastructure as Code: Proficiency with Terraform (preferred) or CloudFormation/ARM templates
CI/CD: Experience building and maintaining automated deployment pipelines (Azure DevOps, GitHub Actions, Jenkins, GitLab CI)
Scripting/Programming: Proficiency in Python, PowerShell, Bash, or similar languages
Containerization: Strong Docker skills and Kubernetes experience
Monitoring & Logging: Experience with cloud-native monitoring tools and/or third-party observability platforms (Datadog, Splunk, ELK, Grafana)
Version Control: Proficiency with Git and collaborative development workflows
Troubleshooting: Advanced diagnostic and problem-solving capabilities
Operational Capabilities:
Experience with 24x7 operations and tier two escalation support
Strong troubleshooting and root cause analysis skills
Understanding of networking concepts, security best practices, and compliance requirements
Familiarity with backup/restore procedures and disaster recovery planning
Ability to work under pressure during critical incidents
Experience coordinating across distributed teams
Willingness and ability to quickly learn the secondary cloud platform
Preferred Qualifications & Certifications:
AWS Certifications (for AWS-primary): Solutions Architect Associate, SysOps Administrator, or DevOps Engineer Professional
Azure Certifications (for Azure-primary): Azure Administrator Associate (AZ-104) or Azure Solutions Architect Expert (AZ-305)
Cloud-agnostic certifications (Terraform Associate, CKA, or SRE Foundation)
Additional Preferred Experience:
Any hands-on experience with both AWS and Azure (even if limited in one)
Experience with Kubernetes in production environments
Prior consulting or managed services provider experience
Experience with hybrid cloud or cloud migration projects
Experience with configuration management tools (Ansible, Chef, Puppet)
Knowledge of security and compliance frameworks (HIPAA, SOC 2, PCI-DSS)
Experience in high-traffic or mission-critical industries
Experience with cost optimization and FinOps practices
Multi-cloud architecture or implementation experience
About Us
Datavail is a leading provider of data management, application development, analytics, and cloud services, with more than 1,000 professionals helping clients build and manage applications and data via a world-class tech-enabled delivery platform and software solutions across all leading technologies. For more than 17 years, Datavail has worked with thousands of companies spanning different industries and sizes, and is an AWS Advanced Tier Consulting Partner, a Microsoft Solutions Partner for Data & AI and Digital & App Innovation (Azure), an Oracle Partner, and a MySQL Partner.
About the Team
