Company Introduction
We exist to wow our customers. We know we’re doing the right thing when we hear our customers say, “How did we ever live without Coupang?” Born out of an obsession to make shopping, eating, and living easier than ever, we are collectively disrupting the multi-billion-dollar commerce industry from the ground up and establishing an unparalleled reputation for being a leading and reliable force in South Korean commerce.
We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to keep growing and launching new services at the pace we have maintained since our inception. We are all entrepreneurial, surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people who like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.
Our mission to build the future of commerce is real. We push the boundaries of what’s possible to solve problems and break traditional tradeoffs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.
The ICT Reliability Engineering team is dedicated to maintaining the continuity and stability of Coupang’s enterprise IT services. The team operates and continuously improves monitoring systems for both IT infrastructure and applications, ensuring high visibility and rapid incident detection. In the event of service disruptions, the team collaborates closely with engineering and operations teams to resolve issues efficiently and manage key performance metrics. Additionally, the team leads regular disaster recovery (DR) tests to validate system resilience and ensure business continuity.
Key Responsibilities:
- Identify operational inefficiencies and automation opportunities within monitoring workflows and infrastructure.
- Design and implement automated solutions for deployment, configuration, and scaling of monitoring tools using Infrastructure-as-Code (IaC) technologies such as Terraform, Ansible, Puppet, or similar.
- Leverage REST APIs of platforms like Zabbix, SolarWinds, Prometheus, and Grafana to streamline and standardize monitoring setup and management.
- Develop reusable automation assets—scripts, templates, and modules—to ensure consistent monitoring practices across diverse environments.
- Automate Grafana dashboard creation and management, including templating, data source integration, and role-based access control.
- Integrate monitoring systems with alerting, ticketing, and reporting platforms to enable seamless incident management and visibility.
- Establish tagging strategies and observability standards to ensure uniform data collection and traceability across services.
- Support incident response by building automated diagnostics and enriching telemetry data for faster root cause analysis.
- Collaborate cross-functionally with DevOps and SRE teams to align monitoring automation with CI/CD pipelines and operational goals.
Infrastructure as Code (IaC) & Automation
- Terraform
- Ansible
- Puppet
- Scripting languages and tools: Python, Bash, PowerShell, SSH
- Zabbix
- SolarWinds
- Prometheus
- Grafana (including dashboard templating, provisioning, and API-based automation)
- Datadog or Dynatrace (as alternatives or complementary tools)
- Experience working with REST APIs for automation and integration
- Familiarity with JSON, YAML, and HTTP methods (GET, POST, PUT, DELETE)
- Jenkins, GitLab CI, GitHub Actions, or similar
- Docker and Kubernetes (for containerized environments)
- ServiceNow, Jira, VictorOps, xMatters, or similar
- Knowledge of event correlation and automated diagnostics
- AWS, Azure, or Google Cloud Platform
- Cloud-native monitoring tools like CloudWatch, Azure Monitor, or GCP Operations Suite
Soft Skills & Operational Mindset
- Strong problem-solving and gap analysis capabilities
- Ability to identify low-hanging fruit for automation
- Experience in cross-functional collaboration (DevOps, SRE, IT Ops)
- Understanding of observability principles and tagging strategies
Coupang's hybrid work model is designed to enable a culture of collaboration that acts as a catalyst to enrich the experience of employees. Employees are required to work at least 3 days in the office per week, with the flexibility to work from home 2 days a week, depending on the role requirement. Some businesses may require more time in the office due to the nature of the work.
Privacy Notice
- Your personal information will be collected and managed by Coupang as stated in the Application Privacy Notice located below: https://www.coupang.jobs/privacy-policy/


