Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling

2026-05-07 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Intermediate, long

Summary

The NVIDIA GB200 NVL72 introduces a new GPU cluster design that extends NVLink coherence across an entire rack, unifying 72 NVIDIA Blackwell GPUs across 18 compute trays with fifth-generation NVLink. This architecture provides 1.8 TB/s of bidirectional throughput per GPU, for roughly 130 TB/s of aggregate bandwidth within the rack. Communication that crosses an NVLink domain boundary, however, incurs a steep performance drop, typically to 50 GB/s over InfiniBand or Ethernet. Workload scheduling therefore has to treat NVLink domains as hard boundaries. To address this, the Slurm workload manager introduced the topology/block plugin in version 23.11 and later enhanced it with segmented scheduling. Segments let administrators and users express an application's NVLink requirements as atomic groups of nodes, so the scheduler can optimize job placement and performance. Together, these features move rack-scale orchestration from prototype to production grade.
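The bandwidth figures above can be checked with quick arithmetic. The following is a rough sketch that uses only the numbers quoted in this summary; the variable names are illustrative, and the ratio compares a bidirectional NVLink figure with a typical network figure, so treat it as an order-of-magnitude estimate.

```python
# Back-of-the-envelope check of the bandwidth figures quoted in the summary.
GPUS_PER_RACK = 72
NVLINK_TBPS_PER_GPU = 1.8   # TB/s, bidirectional, per GPU
CROSS_DOMAIN_GBPS = 50.0    # GB/s, typical InfiniBand/Ethernet figure quoted above

# Aggregate NVLink bandwidth inside one rack.
aggregate_tbps = GPUS_PER_RACK * NVLINK_TBPS_PER_GPU

# Approximate per-GPU slowdown when traffic leaves the NVLink domain.
slowdown = (NVLINK_TBPS_PER_GPU * 1000) / CROSS_DOMAIN_GBPS

print(f"Aggregate NVLink bandwidth: {aggregate_tbps:.1f} TB/s")
print(f"Per-GPU drop across a domain boundary: about {slowdown:.0f}x")
```

The aggregate comes to 129.6 TB/s, which the article rounds to 130 TB/s.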

Key takeaway

For MLOps Engineers and AI Architects deploying on NVIDIA GB200 NVL72 clusters, properly configuring Slurm's topology/block plugin is critical. You must define NVL72 domains as blocks in topology.yaml and educate users on effective `--segment` usage to prevent performance degradation and optimize resource utilization. Because NVLink domain boundaries are strict, a misconfigured topology leads to fragmented allocations, longer queue times, and significantly lower application performance.

Key insights

NVIDIA GB200 NVL72's rack-scale NVLink coherence requires Slurm's topology/block plugin for optimal scheduling.

Principles

Rack-scale locality is a hard constraint.
NVLink domains are hard boundaries for jobs.
Segment size impacts usable cluster capacity.
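The last principle can be illustrated with a deliberately simplified model. This is not Slurm's placement algorithm; it only counts how many nodes of one fully idle 18-node block can be covered by whole segments of a given size. The function name `usable_nodes` is hypothetical.

```python
BLOCK_SIZE = 18  # nodes per GB200 NVL72 domain, as stated in this summary


def usable_nodes(segment_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Nodes in one fully idle block that whole segments can cover."""
    if not 1 <= segment_size <= block_size:
        raise ValueError("a segment must fit inside a single block")
    return (block_size // segment_size) * segment_size


for seg in (1, 2, 4, 8, 9, 16, 18):
    used = usable_nodes(seg)
    print(f"--segment={seg:<2} -> {used:2d}/{BLOCK_SIZE} nodes covered, "
          f"{BLOCK_SIZE - used} left over")
```

Under this model, segment sizes that do not divide 18 evenly leave nodes in each block that only jobs with smaller segments can use.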

Method

Configure Slurm's topology/block plugin using topology.yaml to define NVL72 domains as blocks. Use the `--segment` argument to specify atomic node groups for jobs, balancing scheduler efficiency and hardware locality. Enable the switch/nvidia_imex plugin for driver-level isolation.
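On the user side, the method comes down to passing a segment size at submission time. The sketch below builds such a submission command. The helper `sbatch_command`, the script name `train.sh`, and the rule that the node count must be a whole multiple of the segment size are assumptions made for this illustration, not requirements confirmed by the article.

```python
import shlex


def sbatch_command(script: str, nodes: int, segment: int, block_size: int = 18) -> str:
    """Build an sbatch command line that requests segmented placement."""
    if not 1 <= segment <= block_size:
        raise ValueError("a segment cannot be larger than one NVLink block")
    # Assumption for this sketch: jobs are sized as whole segments.
    if nodes % segment != 0:
        raise ValueError("node count should be a multiple of the segment size")
    args = ["sbatch", f"--nodes={nodes}", f"--segment={segment}", script]
    return shlex.join(args)


print(sbatch_command("train.sh", nodes=32, segment=16))
```

Validating the request before submission catches a segment larger than a block early, rather than leaving the job pending in the queue.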

In practice

Define one Slurm block per GB200 NVL72 domain (18 nodes).
Use `--segment=1` for maximum placement flexibility.
Avoid `--segment=18` in favor of `--segment=16` for better availability.
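The preference for `--segment=16` over `--segment=18` is easier to see with numbers. The sketch below assumes a hypothetical cluster of eight blocks in which a few nodes are drained or otherwise unavailable. The node counts are invented for illustration, and the check is a simplification of what the scheduler actually does.

```python
def blocks_that_fit(segment_size, healthy_per_block):
    """Count blocks with enough healthy, idle nodes to host one segment."""
    return sum(1 for healthy in healthy_per_block if healthy >= segment_size)


# Hypothetical cluster: 8 blocks of 18 nodes, three of them missing 1 or 2 nodes.
healthy = [18, 17, 18, 16, 18, 18, 17, 18]

for seg in (18, 16):
    eligible = blocks_that_fit(seg, healthy)
    print(f"--segment={seg}: {eligible} of {len(healthy)} blocks can host a segment")
```

In this example a single unavailable node disqualifies a block for an 18-node segment, while a 16-node segment still fits in every block.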

Topics

NVIDIA GB200 NVL72
Slurm Workload Manager
Block Scheduling
NVLink Topology
Segmented Scheduling

Best for: MLOps Engineer, AI Architect, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.