
CrowdStrike

Sr. ML Platform Engineer (Hybrid, IND)

Reposted 46 Minutes Ago
Hybrid
Bangalore, Bengaluru Urban, Karnataka
Senior level
Maintain and optimize CrowdStrike's ML infrastructure: diagnose distributed system incidents across Ray/Spark/SLURM/JupyterHub, profile and tune performance, build observability and runbooks, collaborate with ML teams, and mentor on debugging to ensure reliable training and inference pipelines.
The summary above was generated by AI

As a global leader in cybersecurity, CrowdStrike protects the people, processes and technologies that drive modern organizations. Since 2011, our mission hasn’t changed: we’re here to stop breaches, and we’ve redefined modern security with the world’s most advanced AI-native platform. We work on large-scale distributed systems, processing almost 3 trillion events per day, and this traffic is growing daily. Our customers span all industries, and they count on CrowdStrike to keep their businesses running, their communities safe and their lives moving forward. We’re also a mission-driven company. We cultivate a culture that gives every CrowdStriker both the flexibility and autonomy to own their careers. We’re always looking to add talented CrowdStrikers to the team who have limitless passion, a relentless focus on innovation and a fanatical commitment to our customers, our community and each other. Ready to join a mission that matters? The future of cybersecurity starts with you.

About the Role:

We're seeking a Sr. Engineer - ML Platform (Infrastructure & Debugging Specialist) to maintain and optimize CrowdStrike's mission-critical ML infrastructure. You'll diagnose complex distributed systems issues and ensure platform reliability for infrastructure processing billions of events daily.

What You'll Do:

Platform Reliability & Debugging:

  • Diagnose and resolve issues across Ray, Spark, Airflow, MLflow, JupyterHub, Kubeflow, and SLURM

  • Perform root cause analysis on production incidents affecting training and inference pipelines

  • Debug performance bottlenecks, resource contention, memory leaks, and scheduling conflicts

  • Develop debugging tools and diagnostic frameworks

System Optimization & Performance:

  • Profile and optimize Ray clusters and Spark jobs on Kubernetes (K8s) and in the cloud (EMR/Dataproc)

  • Troubleshoot JupyterHub spawner issues, kernel crashes, and resource allocation

  • Optimize SLURM job scheduling, GPU allocation, and HPC cluster utilization

Infrastructure & Monitoring:

  • Build observability solutions and automated health checks

  • Develop runbooks, alerting workflows, and incident response procedures

  • Maintain platform stability metrics (SLAs, error rates, latency)

Collaboration:

  • Partner with ML and ML Platform engineers to resolve workflow issues

  • Conduct post-mortems and mentor others on debugging techniques

What You'll Need:

  • 12+ years in distributed systems engineering

  • 5+ years debugging ML platforms in production

  • Deep expertise in three or more of: Ray, Spark, JupyterHub, SLURM, Kubernetes (K8s)

  • Performance profiling, optimization, and capacity planning

Technical Skills (Expertise in at least one):

  • Distributed ML: Ray, Spark, SLURM, Jupyter ecosystem (debugging failures, performance tuning)

  • ML Platforms: Airflow, MLflow, JupyterHub (troubleshooting core components)

  • Infrastructure: Kubernetes, Docker, AWS/GCP/Azure/OCI

  • Observability: Profiling tools, distributed tracing, Prometheus, Grafana, log aggregation

  • Programming: Expert Python debugging, multi-language proficiency, Linux/Unix

What Sets You Apart:

  • Open-source ML infrastructure contributions

  • Experience with high-throughput inference systems and reducing MTTR

  • Published debugging guides or tools

  • Chaos engineering and GPU/CUDA debugging experience

  • On-call and incident management experience

Benefits of Working at CrowdStrike:

  • Market leader in compensation and equity awards

  • Comprehensive physical and mental wellness programs

  • Competitive vacation and holidays for recharge

  • Paid parental and adoption leaves

  • Professional development opportunities for all employees regardless of level or role

  • Employee Networks, geographic neighborhood groups, and volunteer opportunities to build connections

  • Vibrant office culture with world-class amenities

  • Great Place to Work Certified™ across the globe

CrowdStrike is proud to be an equal opportunity employer. We are committed to fostering a culture of belonging where everyone is valued for who they are and empowered to succeed. We support veterans and individuals with disabilities through our affirmative action program.

CrowdStrike is committed to providing equal employment opportunity for all employees and applicants for employment. The Company does not discriminate in employment opportunities or practices on the basis of race, color, creed, ethnicity, religion, sex (including pregnancy or pregnancy-related medical conditions), sexual orientation, gender identity, marital or family status, veteran status, age, national origin, ancestry, physical disability (including HIV and AIDS), mental disability, medical condition, genetic information, membership or activity in a local human rights commission, status with regard to public assistance, or any other characteristic protected by law. We base all employment decisions--including recruitment, selection, training, compensation, benefits, discipline, promotions, transfers, lay-offs, return from lay-off, terminations and social/recreational programs--on valid job requirements.

If you need assistance accessing or reviewing the information on this website or need help submitting an application for employment or requesting an accommodation, please contact us at [email protected] for further assistance.

Top Skills

Airflow
AWS
Azure
CUDA
Dataproc
Distributed Tracing
Docker
EMR
GCP
Grafana
Jupyter
JupyterHub
Kubeflow
Kubernetes
Log Aggregation
MLflow
OCI
Profiling Tools
Prometheus
Python
Ray
SLURM
Spark

Similar Jobs at CrowdStrike

Yesterday
Hybrid
Bangalore, Bengaluru Urban, Karnataka, IND
Senior level
Cloud • Computer Vision • Information Technology • Sales • Security • Cybersecurity
This role involves designing, building, and scaling a Data+ML platform, collaborating on machine learning pipelines, and incorporating cloud services for deployment and execution.
Top Skills: Airflow, Spark, Flink, FluxCD, Java, Jupyter Notebooks, Kubernetes, MLflow, Python, Ray, Terraform, Vertex AI
46 Minutes Ago
Hybrid
Bangalore, Bengaluru Urban, Karnataka, IND
Senior level
Cloud • Computer Vision • Information Technology • Sales • Security • Cybersecurity
As a Senior Engineer in Data & ML Platform, you will build ML pipelines, design scalable data platforms, and collaborate with teams on ML solutions to support business decisions.
Top Skills: Airflow, Spark, Flink, GCP, Iceberg, Java, Jupyter Notebooks, Kubernetes, MLflow, Nvidia Workbench, Python, Scala, Terraform
46 Minutes Ago
Remote or Hybrid
KA, IND
Senior level
Cloud • Computer Vision • Information Technology • Sales • Security • Cybersecurity
The Sr. Software Engineer will develop feature extraction engines, collaborate with data scientists, and test software systems while working with complex file formats and reverse engineering.
Top Skills: AWS, Azure, Bitbucket, C++, GCP, Git, Jenkins, JIRA, Python, Rust
