Running Large-Scale GPU Workloads on Kubernetes with Slurm

Source: NVIDIA Technical Blog · Field: Technology & Digital (Artificial Intelligence & Machine Learning; Cloud Computing & IT Infrastructure) · Depth: Advanced, medium

Summary

Slinky, an open-source project by SchedMD (now part of NVIDIA), integrates Slurm with Kubernetes. Slurm is a cluster management and job scheduling system used by over 65% of TOP500 systems. The integration addresses the challenge of running large-scale AI training on Kubernetes while preserving existing Slurm investments. Slinky offers two approaches: slurm-bridge, for scheduling native Kubernetes workloads, and slurm-operator, for running full Slurm clusters on Kubernetes. This analysis focuses on slurm-operator, which represents each Slurm component (slurmctld, slurmdbd, slurmd, slurmrestd) as a Kubernetes custom resource defined by a Custom Resource Definition (CRD). The operator enables high availability, automatic configuration propagation, and autoscaling of worker pods through a HorizontalPodAutoscaler that relies on the OpenMetrics support in Slurm v25.11. NVIDIA runs slurm-operator in production on clusters with over 1,000 GPU worker nodes and 8,000+ GPUs, achieving performance comparable to non-containerized Slurm deployments.
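
To make the autoscaling claim concrete, here is a minimal Python sketch of the replica calculation that the Kubernetes HorizontalPodAutoscaler documentation describes: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance. The metric used in the example (pending Slurm jobs per worker pod) and the min/max bounds are assumptions chosen for illustration; the source does not say which Slurm metrics NVIDIA scales on.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,   # HPA skips scaling when the ratio is within 10% of 1.0
    min_replicas: int = 1,    # hypothetical bounds for this example
    max_replicas: int = 1000,
) -> int:
    """Approximate the documented HPA rule:
    desired = ceil(current_replicas * current_metric / target_metric)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds, as an HPA does with minReplicas/maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Assumed metric: average pending jobs per worker pod, with a target of 20.
    print(desired_replicas(current_replicas=10, current_metric=30.0, target_metric=20.0))
```

With 10 workers and an observed value 1.5 times the target, the sketch asks for 15 workers. A real deployment would also be subject to HPA stabilization windows and scaling policies, which this sketch leaves out.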

Key takeaway

For CTOs or VPs of Engineering managing large-scale AI training infrastructure, Slinky's slurm-operator offers a path to unify Slurm and Kubernetes environments. The integration brings Kubernetes' operational benefits, such as unified monitoring, automated remediation, and non-disruptive rolling updates, to Slurm clusters while preserving your Slurm investments. Consider deploying slurm-operator to streamline GPU cluster management and cut new cluster provisioning from hours to minutes.

Key insights

Slinky's slurm-operator integrates Slurm with Kubernetes for scalable, high-performance AI training on GPU infrastructure.

Method

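Slinky's slurm-operator models each Slurm component as a Kubernetes custom resource, so the Slurm daemons run as containers. It relies on three Slurm features: configless mode (daemons fetch their configuration from slurmctld instead of local files), dynamic nodes (slurmd instances register themselves with the controller at startup), and auth/slurm (Slurm's built-in authentication plugin). It also integrates with existing databases and identity services.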
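
The three Slurm features named above map to ordinary Slurm settings. The Python sketch below renders a slurm.conf fragment and a slurmd command line that enable them, to show roughly what a containerized deployment has to arrange. It is an illustration only and not the configuration slurm-operator actually generates; the function names, cluster name, hostname, and GPU count are hypothetical.

```python
def controller_conf(cluster_name: str, max_nodes: int) -> str:
    """Render a slurm.conf fragment enabling the three features (illustrative)."""
    lines = [
        f"ClusterName={cluster_name}",
        "AuthType=auth/slurm",                    # built-in auth plugin (alternative to MUNGE)
        "CredType=cred/slurm",                    # job credential plugin paired with auth/slurm
        "SlurmctldParameters=enable_configless",  # slurmctld serves config to daemons
        f"MaxNodeCount={max_nodes}",              # headroom for dynamically registered nodes
    ]
    return "\n".join(lines) + "\n"


def slurmd_args(controller_host: str, port: int = 6817, gpus: int = 8) -> list[str]:
    """Build a slurmd command line for a worker that self-registers."""
    return [
        "slurmd",
        "-Z",                                           # start as a dynamic node
        "--conf-server", f"{controller_host}:{port}",   # configless: pull config from slurmctld
        "--conf", f"Gres=gpu:{gpus}",                   # node attributes sent at registration
    ]


if __name__ == "__main__":
    # Hypothetical names for illustration.
    print(controller_conf("gpu-cluster", max_nodes=1024))
    print(" ".join(slurmd_args("slurm-controller")))
```

Because worker pods pull their configuration and register themselves, no per-node entry has to be written into slurm.conf before a pod starts, which is what lets worker counts change without manual edits.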

Best for: CTO, VP of Engineering/Data, Director of AI/ML, MLOps Engineer, AI Architect, DevOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.