CVS Health

Executive Director, AI Infrastructure & Platform Engineering

Posted 12 Hours Ago
In-Office or Remote
Hiring Remotely in Home, TN
Expert/Leader
Lead design, build, and operation of an on-prem AI compute platform (GPU, storage, network, Kubernetes/OpenShift). Establish SRE practices, 24/7 operations, observability, security/compliance, vendor and budget ownership, and hire and develop teams to ensure >99.99% availability and long-term operational sustainability.
The summary above was generated by AI

We’re building a world of health around every individual, shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold themselves accountable and prioritize safety and quality in everything they do. Join us and be part of something bigger: helping to simplify health care one person, one family and one community at a time.

Job Description

The Executive Director, AI Infrastructure & Platform Engineering is a senior engineering leadership role responsible for standing up, operating, and continuously improving CVS Health's on-premises AI compute platform. This position owns the physical and platform layers of CVS’s Enterprise AI Factory — a frontier-class GPU compute environment running NVIDIA Blackwell systems across a high-throughput RoCE v2 fabric, hosted in co-located data center facilities, with multi-site expansion underway.

Reporting to the Global Head of Infrastructure/AI Operations and Service Delivery, this leader will establish operational baselines across the full infrastructure stack — hardware, network fabric, GPU clusters, storage, and the operating systems and orchestration layers above — and build the Site Reliability Engineering practice that delivers the availability, reliability, and performance that frontier AI workloads demand.

This is a greenfield organizational build. The Executive Director will define the operating model, set the engineering standards, hire and develop the team, and establish the long-term operations capability that will govern CVS's AI infrastructure for years ahead.

Key Responsibilities

Strategy and Leadership:

  • Define and execute the long-range vision and strategy for AI infrastructure and platform engineering, with availability (>99.99%), reliability, and platform performance as the primary measures of success.
  • Recruit, hire, develop, and retain a high-performing engineering organization spanning infrastructure, network, platform reliability, observability, security, 24/7 operations, change and release management, and FinOps.
  • Establish clear ownership, accountability, and performance expectations across all functional teams; foster a culture of operational excellence, engineering rigor, and continuous improvement.
  • Provide executive-level communication to senior leadership on platform status, milestones, risk posture, and strategic initiatives.

Infrastructure and Platform Engineering:

  • Own the physical layer of the AI compute environment — GPU compute, storage, network fabric, capacity planning, and hardware lifecycle accountability.
  • Direct bare-metal Kubernetes and OpenShift operations, including cluster administration, GPU quota governance, infrastructure-as-code adoption, and availability baseline enforcement.
  • Govern high-performance network fabric operations — RoCE v2, spine-leaf topology, lossless Ethernet tuning, congestion management, and segmentation.
  • Establish and enforce operational baselines across every layer of the stack — hardware, fabric, platform, and workload — with deviations detected, escalated, and resolved within defined SLAs.
  • Direct Innovation POD strategy to develop self-healing and autonomous capabilities that proactively prevent service degradation before it impacts availability.

Operations and Reliability:

  • Build and sustain a high-performing 24/7 operations model — designed for sustainable, predictable coverage with no mandatory overtime and measurable team health and retention.
  • Drive end-to-end observability across the physical and platform layers, with continuous feedback loops connecting monitoring data to incident response, change decisions, and improvement cycles.
  • Oversee change management so every modification is risk-assessed, monitored during rollout, and baseline-validated post-deployment.
  • Ensure configuration consistency and drift detection across all platform components to prevent baseline degradation over time.
  • Lead GPU FinOps governance — utilization optimization, tenant quota enforcement, and cost reduction — in partnership with the Finance organization.

Security and Compliance:

  • Empower the Security SRE Lead to maintain a world-class security posture across the infrastructure and platform layers, with robust compliance to frameworks including HIPAA and NIST AI RMF.
  • Govern access controls, audit logging, vulnerability management, and network segmentation across the AI compute environment.

Program Transition and Operating Model:

  • Lead the operational transition from program-launch staffing to permanent CVS-owned operations — governing phased handoffs, competency validation, and milestone sign-offs to ensure minimal disruption to platform availability and business operations.
  • Establish and lead the long-term operating model by institutionalizing key technical, architectural, and delivery leadership capabilities into permanent CVS roles, ensuring the organization is fully self-sustaining at program close.

Vendor and Stakeholder Management:

  • Own vendor relationships, contract performance, and accountability across the hardware, networking, platform, and managed-services stack.
  • Manage budget ownership for the AI infrastructure and platform engineering organization, including capital planning and operational expense governance.

Required Qualifications

The successful candidate will demonstrate technical depth, executive presence, and a proven record of operating physical infrastructure at data center scale. The ideal candidate will bring the following experience, knowledge, and abilities:

  • 10+ years of engineering leadership experience, with substantial time directly owning physical infrastructure at data center scale — including hardware lifecycle, capacity planning, and facility coordination (power, cooling, rack-and-stack execution).
  • Hands-on production ownership of bare-metal Kubernetes or OpenShift. Managed cloud services (EKS, GKE, AKS) alone do not substitute for the practitioner expertise this role requires.
  • Fluency with high-speed cluster fabrics — RoCE v2, InfiniBand, EVPN-VXLAN, or carrier-grade equivalent — and the operational discipline these fabrics require (PFC, ECN, lossless tuning, congestion management).
  • 5+ years leading multiple technical teams simultaneously, including 24/7 operations organizations, with measurable team health, retention, and performance outcomes.
  • Proven success establishing and enforcing operational baselines, SLO / SLI / error-budget frameworks, and observability-driven continuous improvement in physical-infrastructure-anchored environments.
  • Hardware lifecycle, vendor accountability, and facility coordination experience — including capacity planning, RMA management, and multi-vendor escalation.
  • Experience leading operational transitions or organizational build-outs at scale, with business continuity and minimal disruption as non-negotiables.
  • Executive-level stakeholder communication, vendor negotiation, and budget ownership.

Preferred Qualifications

  • Hands-on experience with Cisco UCS, NVIDIA HGX / DGX / Blackwell systems, and VAST or comparable distributed NVMe storage.
  • Direct experience operating GPU clusters of 32 or more GPUs in production environments — including HPC, AI training, research computing, or comparable workloads.
  • NVIDIA AI Enterprise, NVIDIA Run:AI, NVIDIA Base Command Manager, or comparable GPU orchestration platform experience.
  • Healthcare or other regulated-industry background (HIPAA, NIST AI RMF, SOX, FedRAMP, ITAR).
  • Chaos engineering and AI-driven operations experience — predictive alerting and automated remediation patterns.
  • Background in innovation programs, POD structures, or centers of excellence.

Education

  • Required: Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or related technical field.

Pay Range

The typical pay range for this role is:

$175,100.00 - $334,750.00


This pay range represents the base hourly rate or base annual full-time salary for all positions in the job grade within which this position falls. The actual base salary offer will depend on a variety of factors including experience, education, geography and other relevant factors. This position is eligible for a CVS Health bonus, commission or short-term incentive program in addition to the base pay range listed above. This position also includes an award target in the company’s equity award program.

Our people fuel our future. Our teams reflect the customers, patients, members and communities we serve and we are committed to fostering a workplace where every colleague feels valued and that they belong.

Great benefits for great people

We take pride in offering a comprehensive and competitive mix of pay and benefits that reflects our commitment to our colleagues and their families.

This full‑time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well‑being of colleagues and their families. The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility.


Additional details about available benefits are provided during the application process and on Benefits Moments.

We anticipate the application window for this opening will close on: 09/30/2026

Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state and local laws.
