Running Large-Scale GPU Workloads on Kubernetes with Slurm
Summary
Slinky, an open-source project by SchedMD (now part of NVIDIA), integrates Slurm with Kubernetes. Slurm is a cluster management and job scheduling system used by over 65% of TOP500 systems. The integration addresses the challenge of running large-scale AI training on Kubernetes while preserving existing Slurm investments. Slinky offers two approaches: slurm-bridge for native Kubernetes workloads, and slurm-operator for running full Slurm clusters on Kubernetes. This analysis focuses on slurm-operator, which represents each Slurm component (slurmctld, slurmdbd, slurmd, slurmrestd) as a Kubernetes custom resource backed by a Custom Resource Definition (CRD). The operator enables high availability, automatic configuration propagation, and autoscaling of worker pods through a HorizontalPodAutoscaler driven by the OpenMetrics support in Slurm v25.11. NVIDIA runs slurm-operator in production on clusters with over 1,000 GPU worker nodes and more than 8,000 GPUs, with performance comparable to non-containerized Slurm deployments.
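To make the autoscaling claim concrete, here is a minimal sketch of the replica calculation that the Kubernetes HorizontalPodAutoscaler documents: the desired count is the current count scaled by the ratio of the observed metric to its target, rounded up. The metric itself (for example, pending jobs per worker) and the min/max bounds are assumptions chosen for illustration; the source does not say which Slurm metric drives scaling.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=100, tolerance=0.1):
    """Approximate the documented HPA rule:
    desired = ceil(current_replicas * current_metric / target_metric).

    The metric is hypothetical here (e.g. pending Slurm jobs per worker pod).
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric
    # The HPA skips scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * current_metric / target_metric)

    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With 4 workers, an observed value of 30, and a target of 10, the function asks for 12 workers; a value of 10.5 against the same target falls inside the tolerance band and leaves the count at 4.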
Key takeaway
For CTOs or VPs of Engineering managing large-scale AI training infrastructure, Slinky's slurm-operator offers a path to unify Slurm and Kubernetes environments. The integration provides Kubernetes' operational benefits, such as unified monitoring, automated remediation, and non-disruptive rolling updates, while preserving existing Slurm investments. Consider deploying slurm-operator to streamline GPU cluster management and cut new cluster provisioning time from hours to minutes.
Key insights
Slinky's slurm-operator integrates Slurm with Kubernetes for scalable, high-performance AI training on GPU infrastructure.
Principles
- Containerize Slurm daemons as Kubernetes pods.
- Utilize Kubernetes CRDs for Slurm component management.
- Synchronize Slurm and Kubernetes states bidirectionally.
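The bidirectional synchronization principle can be pictured as a small reconcile step that compares both views of each worker and emits whatever action closes the gap. This is only an illustrative sketch: the `WorkerView` type, its fields, and the action names are hypothetical and are not taken from slurm-operator's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerView:
    """Hypothetical combined view of one worker from both systems."""
    name: str
    pod_cordoned: bool   # Kubernetes side: marked unschedulable
    slurm_drained: bool  # Slurm side: node is in a drain state


def reconcile(workers):
    """Return the actions needed to bring the two views back in sync."""
    actions = []
    for w in workers:
        if w.pod_cordoned and not w.slurm_drained:
            # Kubernetes -> Slurm: stop scheduling jobs on this node.
            actions.append(("drain_slurm_node", w.name))
        elif w.slurm_drained and not w.pod_cordoned:
            # Slurm -> Kubernetes: surface the drain on the pod.
            actions.append(("mirror_drain_to_pod", w.name))
        # When both sides agree, nothing needs to happen.
    return actions
```

A real operator would run this kind of comparison in a control loop and apply the actions through the Kubernetes and Slurm APIs; the sketch only shows the decision logic.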
Method
Slinky's slurm-operator defines Slurm components as Kubernetes CRDs and runs the Slurm daemons in containers. It relies on three Slurm features: configless mode, dynamic nodes, and the auth/slurm authentication plugin. It also integrates with existing databases and identity services.
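As a rough illustration of how configless mode and dynamic nodes fit a containerized worker, the sketch below assembles a `slurmd` command line using flags from the slurmd documentation: `-D` (stay in the foreground), `-Z` (register as a dynamic node), `--conf-server` (fetch configuration from the controller), and `--conf` (node parameters for a dynamic node). The controller hostname, feature names, and GRES string are assumptions, and this is not necessarily how slurm-operator builds its container entrypoint.

```python
def slurmd_args(controller_host, port=6817, features=(), gres=None):
    """Build a slurmd argument list for a configless, dynamic worker.

    controller_host is a hypothetical service name for slurmctld;
    6817 is slurmctld's default port.
    """
    args = ["slurmd", "-D", "-Z", "--conf-server", f"{controller_host}:{port}"]

    # Node parameters that a dynamic node passes at registration time.
    node_conf = []
    if features:
        node_conf.append("Feature=" + ",".join(features))
    if gres:
        node_conf.append(f"Gres={gres}")
    if node_conf:
        args += ["--conf", " ".join(node_conf)]

    return args
```

Because the worker pulls its configuration from the controller and registers itself, no slurm.conf has to be baked into the image or edited when pods come and go, which is what makes pod-level scaling practical.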
In practice
- Deploy slurm-operator via Helm charts.
- Use NVIDIA GPU Operator for automated GPU management.
- Enable per-job GPU metrics with DCGM Exporter integration.
Topics
- Slurm
- Kubernetes
- Slinky slurm-operator
- GPU Workloads
- NVIDIA GPU Operator
Code references
- SlinkyProject/slurm-bridge
- SlinkyProject/slurm-operator
- NVIDIA/gpu-operator
- NVIDIA/k8s-dra-driver-gpu
- NVIDIA/topograph
Best for: CTO, VP of Engineering/Data, Director of AI/ML, MLOps Engineer, AI Architect, DevOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.