Running AI Workloads on Rack-Scale Supercomputers: From Hardware to Topology-Aware Scheduling

Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Intermediate, medium

Summary

NVIDIA's GB200 NVL72 and GB300 NVL72 systems, built on the Blackwell architecture, are rack-scale supercomputers, each with 18 compute trays and high-bandwidth networking. Running AI and HPC workloads on them means bridging the gap between a hierarchical hardware topology and the flat resource model that schedulers typically assume. NVIDIA Mission Control addresses this by providing a rack-scale control plane for Grace Blackwell NVL72 systems and integrating with workload managers such as Slurm and NVIDIA Run:ai. It uses system-level identifiers, the Cluster UUID and the Clique ID, to represent NVLink and IMEX domains, so schedulers can make topology-aware placement decisions. The goal is better performance, isolation, and manageability for multi-node NVLink workloads. The same approach extends to Kubernetes through ComputeDomains and the NVIDIA k8s-dra-driver-gpu, and the open-source Topograph tool automates topology discovery.

Key takeaway

AI architects and HPC platform operators deploying NVIDIA Grace Blackwell NVL72 systems should understand how NVIDIA Mission Control works with Slurm, Kubernetes, or Run:ai. The integration exposes NVLink and IMEX domains to the scheduler, which prevents jobs from being placed across domain boundaries where they would lose NVLink bandwidth, and supports efficient AI factory operations at scale. Make topology-aware scheduling a priority to get the most out of the hardware and the workloads running on it.
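To make the idea of a misaligned placement concrete, here is a minimal sketch of a placement check that keeps a job inside one NVLink domain. It is not how Slurm, Run:ai, or Mission Control choose nodes. The function name `pick_single_domain`, the domain labels, the hostnames, and the best-fit rule (take the smallest domain that still fits) are all assumptions made for illustration.

```python
def pick_single_domain(free_by_domain, nodes_needed):
    """Pick nodes for a job from a single NVLink domain, or return None.

    free_by_domain: dict mapping a domain label to a list of free hostnames.
    Best-fit rule (an assumption): choose the smallest domain that still
    fits, so larger domains stay available for larger jobs.
    """
    candidates = [
        (len(hosts), domain)
        for domain, hosts in free_by_domain.items()
        if len(hosts) >= nodes_needed
    ]
    if not candidates:
        # No single domain can hold the job. A real scheduler would queue
        # the job or fall back to a cross-domain placement.
        return None
    _, domain = min(candidates)
    return domain, sorted(free_by_domain[domain])[:nodes_needed]


# Hypothetical free-node inventory for two NVLink domains.
free = {
    "rack-a": ["a01", "a02", "a03"],
    "rack-b": ["b01", "b02"],
}
two_node_job = pick_single_domain(free, 2)
three_node_job = pick_single_domain(free, 3)
four_node_job = pick_single_domain(free, 4)
```

With this inventory, a two-node job lands entirely in `rack-b`, a three-node job in `rack-a`, and a four-node job gets no single-domain placement.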

Key insights

NVIDIA Mission Control and related tools bridge hardware topology with scheduler abstractions for optimal AI/HPC workload placement.

Method

NVIDIA Mission Control integrates with Slurm and Run:ai, using Cluster UUIDs and Clique IDs to map NVLink/IMEX domains to scheduler-consumable blocks or ComputeDomains, ensuring topology-aware job placement and resource isolation.
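The mapping described above can be sketched as a grouping step: nodes that report the same Cluster UUID and Clique ID are treated as one NVLink domain, and each domain becomes one scheduler block. This is an illustrative sketch, not Mission Control's or Topograph's implementation. The record fields (`hostname`, `cluster_uuid`, `clique_id`), the sample values, and the helper names are hypothetical. The rendered lines are modeled on the block syntax of Slurm's `topology.conf`, so check them against the documentation for your Slurm version before relying on them.

```python
from collections import defaultdict


def group_into_blocks(nodes):
    """Group hostnames by (cluster_uuid, clique_id).

    Each distinct pair is assumed to identify one NVLink domain.
    Returns a dict ordered by key, with sorted hostnames per domain.
    """
    domains = defaultdict(list)
    for node in nodes:
        key = (node["cluster_uuid"], node["clique_id"])
        domains[key].append(node["hostname"])
    return {key: sorted(hosts) for key, hosts in sorted(domains.items())}


def render_block_lines(blocks):
    """Render one 'BlockName=... Nodes=...' line per domain.

    The line format is an assumption modeled on Slurm's block topology
    syntax; block names are generated here, not read from hardware.
    """
    lines = []
    for index, hosts in enumerate(blocks.values(), start=1):
        lines.append(f"BlockName=block{index:02d} Nodes={','.join(hosts)}")
    return lines


# Hypothetical inventory: two racks, one of them split into two cliques.
nodes = [
    {"hostname": "a02", "cluster_uuid": "uuid-A", "clique_id": "1"},
    {"hostname": "b01", "cluster_uuid": "uuid-B", "clique_id": "1"},
    {"hostname": "a01", "cluster_uuid": "uuid-A", "clique_id": "1"},
    {"hostname": "a03", "cluster_uuid": "uuid-A", "clique_id": "2"},
]
blocks = group_into_blocks(nodes)
block_lines = render_block_lines(blocks)
for line in block_lines:
    print(line)
```

Keying on the pair rather than on the Cluster UUID alone is a deliberate choice in this sketch: nodes that share a UUID but report different Clique IDs are kept in separate blocks.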

Best for: AI Architect, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.