Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling
Summary
The NVIDIA GB200 NVL72 introduces a GPU cluster design that extends NVLink coherence across an entire rack, unifying 72 NVIDIA Blackwell GPUs in 18 compute trays over fifth-generation NVLink. Each GPU gets 1.8 TB/s of bidirectional throughput, for about 130 TB/s of aggregate bandwidth within the rack. Communication that crosses an NVLink domain boundary falls back to InfiniBand or Ethernet and drops steeply, typically to 50 GB/s, so workload schedulers need to treat NVLink domains as hard boundaries. To address this, the Slurm workload manager introduced the topology/block plugin in version 23.11 and later enhanced it with segmented scheduling. This lets administrators and users express application-specific NVLink requirements as atomic groups of nodes, which improves job placement and performance and moves rack-scale orchestration from prototype to production grade.
Key takeaway
For MLOps Engineers and AI Architects deploying on NVIDIA GB200 NVL72 clusters, properly configuring Slurm's topology/block plugin is critical. You must define NVL72 domains as blocks in topology.yaml and educate users on effective `--segment` usage to prevent performance degradation and optimize resource utilization. Incorrect configurations will lead to fragmented allocations, increased queue times, and significantly reduced application performance due to the strict NVLink domain boundaries.
Key insights
NVIDIA GB200 NVL72's rack-scale NVLink coherence requires Slurm's topology/block plugin for optimal scheduling.
Principles
- Rack-scale locality is a hard constraint.
- NVLink domains are hard boundaries for jobs.
- Segment size impacts usable cluster capacity.
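The capacity effect of segment size can be checked with simple arithmetic. The sketch below assumes each block is one 18-node NVL72 domain and that a segment must fit entirely inside a single block. The function name `usable_nodes_per_block` is hypothetical and is not part of Slurm.

```python
BLOCK_SIZE = 18  # nodes per GB200 NVL72 domain, as stated in this summary


def usable_nodes_per_block(segment_size: int, free_nodes: int = BLOCK_SIZE) -> int:
    """Nodes in one block that whole segments of the given size can occupy."""
    if segment_size < 1:
        raise ValueError("segment_size must be at least 1")
    # A segment is atomic, so only whole segments count; the remainder is stranded.
    return (free_nodes // segment_size) * segment_size


if __name__ == "__main__":
    for size in (1, 2, 4, 8, 9, 16, 18):
        usable = usable_nodes_per_block(size)
        print(f"segment={size:2d}: {usable:2d} usable, {BLOCK_SIZE - usable} stranded")
```

Under these assumptions, segment sizes that do not divide 18 evenly leave nodes in every block that jobs of that shape cannot use, and a segment of 18 becomes unplaceable in a block as soon as one node is missing.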
Method
Configure Slurm's topology/block plugin using topology.yaml to define NVL72 domains as blocks. Use the `--segment` argument to specify atomic node groups for jobs, balancing scheduler efficiency and hardware locality. Enable the switch/nvidia_imex plugin for driver-level isolation.
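As a rough illustration of the user-facing side of this method, the sketch below assembles an `sbatch` command line that requests nodes in atomic segments. The helper `build_sbatch_command` and the script name `train.sh` are hypothetical. The divisibility check is an assumption made for this sketch, not a rule taken from the Slurm documentation.

```python
import shlex


def build_sbatch_command(script: str, nodes: int, segment: int) -> list[str]:
    """Return an sbatch argument list that requests `nodes` nodes in atomic segments."""
    if segment < 1 or nodes < 1:
        raise ValueError("nodes and segment must be positive")
    # Assumption of this sketch: reject node counts that do not split into whole segments.
    if nodes % segment != 0:
        raise ValueError(f"{nodes} nodes do not divide into segments of {segment}")
    return ["sbatch", f"--nodes={nodes}", f"--segment={segment}", script]


if __name__ == "__main__":
    # Hypothetical job: 32 nodes placed as two 16-node segments.
    cmd = build_sbatch_command("train.sh", nodes=32, segment=16)
    print(shlex.join(cmd))
```

The command is only built and printed here, not submitted. A wrapper like this could be one way to steer users toward segment sizes that suit the cluster before a job reaches the queue.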
In practice
- Define one Slurm block per GB200 NVL72 domain (18 nodes).
- Use `--segment=1` for maximum placement flexibility.
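- Prefer `--segment=16` over `--segment=18` for better availability: a segment that needs all 18 nodes of a block cannot be placed while any node in that block is unavailable.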
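The availability trade-off in these recommendations can be illustrated with a toy placement check. This is a simplified model, not Slurm's scheduling algorithm: it assumes every segment must land inside one block and looks only at free node counts per block. The function `can_place` and the sample cluster state are hypothetical.

```python
def can_place(free_per_block: list[int], nodes: int, segment: int) -> bool:
    """True if nodes/segment whole segments fit, each inside a single block."""
    if segment < 1 or nodes < 1 or nodes % segment != 0:
        raise ValueError("nodes must be a positive multiple of segment")
    segments_needed = nodes // segment
    # Each block can host as many whole segments as its free nodes allow.
    segments_available = sum(free // segment for free in free_per_block)
    return segments_available >= segments_needed


if __name__ == "__main__":
    # Four racks, each with one node drained: 17 of 18 nodes free per block.
    free = [17, 17, 17, 17]
    print(can_place(free, nodes=36, segment=18))  # no block has 18 free nodes
    print(can_place(free, nodes=32, segment=16))  # two blocks each host one segment
    print(can_place(free, nodes=32, segment=1))   # any free nodes will do
```

In this model a single drained node per rack blocks every 18-node segment, while 16-node segments still fit, which is the reasoning behind leaving headroom in the segment size.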
Topics
- NVIDIA GB200 NVL72
- Slurm Workload Manager
- Block Scheduling
- NVLink Topology
- Segmented Scheduling
Best for: MLOps Engineer, AI Architect, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.