NVIDIA Blackwell Tops MLPerf Training 6.0 with Industry-Leading Scale and Performance
Summary
NVIDIA dominated MLPerf Training v6.0, achieving the fastest time to train at scale and the highest per-accelerator performance across all benchmarks, including new pretraining tests for DeepSeek-V3 (671B-parameter MoE) and GPT-OSS-20B (20B-parameter MoE). The NVIDIA GB300 NVL72 system, integrating 72 Blackwell Ultra GPUs and 36 Grace CPUs, set new performance records. The platform scaled to 8,192 Blackwell GPUs in cloud environments, using NVIDIA Spectrum-X Ethernet and Quantum InfiniBand for scale-out networking. Key software innovations contributed to these results: full-iteration CUDA graphs, CuTe DSL kernel fusions (yielding gains of more than 8% on DeepSeek-V3 and 93% on GPT-OSS), MXFP8 attention, and various router and communication optimizations. Continuous full-stack co-design improved GB300 DeepSeek-V3 throughput by about 1.3x in three months, from 1,298 TFLOPS/GPU to 1,648 TFLOPS/GPU. The Blackwell Ultra GB300 also showed a significant performance uplift over GB200, with gains of up to 1.6x for DeepSeek-V3.
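The three-month throughput gain quoted above can be checked with simple arithmetic. The short sketch below uses only the two per-GPU figures from the summary; the variable names are illustrative.

```python
# Per-GPU throughput figures quoted in the summary (TFLOPS per GPU).
baseline_tflops = 1298.0
optimized_tflops = 1648.0

# Speedup is the ratio of optimized to baseline throughput.
speedup = optimized_tflops / baseline_tflops

# Relative gain expressed as a percentage over the baseline.
gain_percent = (speedup - 1.0) * 100.0

print(f"speedup: {speedup:.2f}x, gain: {gain_percent:.1f}%")
# prints "speedup: 1.27x, gain: 27.0%"
```

The ratio comes to about 1.27, which the summary rounds to 1.3x.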
Key takeaway
For AI architects designing large-scale training infrastructure, NVIDIA's MLPerf Training v6.0 results indicate that the Blackwell platform performs and scales well on current MoE workloads. Prioritize systems that pair advanced networking such as Spectrum-X with a software stack that is still being actively optimized. The roughly 1.3x DeepSeek-V3 throughput gain on GB300 in three months shows that such a stack can keep improving efficiency after deployment, which shortens training time for generative AI models.
Key insights
NVIDIA's Blackwell platform achieved MLPerf Training v6.0 dominance through full-stack hardware-software co-design and advanced optimizations.
Principles
- Full-stack co-design drives continuous performance gains.
- Efficient scale-out networking is critical for MoE models.
- Eliminating CPU-GPU synchronization boosts large-scale training.
Method
NVIDIA optimized MoE training by implementing full-iteration CUDA graphs, CuTe DSL kernel fusions, MXFP8 attention, and 1F1B all-to-all overlap, alongside network fabric enhancements for large-scale GPU clusters.
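To see why overlapping all-to-all communication with compute matters, consider a toy cost model. This is a minimal sketch under stated assumptions, not NVIDIA's 1F1B schedule: the function names and the millisecond figures are hypothetical, and perfect overlap is an idealization.

```python
def step_time_serial(compute_ms, comm_ms):
    """Communication waits for compute to finish, so the two costs add up."""
    return compute_ms + comm_ms


def step_time_overlapped(compute_ms, comm_ms):
    """Ideal overlap: the shorter phase hides entirely behind the longer one."""
    return max(compute_ms, comm_ms)


# Hypothetical per-micro-batch costs, chosen only for illustration.
compute_ms = 10.0
all_to_all_ms = 4.0

serial = step_time_serial(compute_ms, all_to_all_ms)
overlapped = step_time_overlapped(compute_ms, all_to_all_ms)

print(f"serial: {serial} ms, overlapped: {overlapped} ms, "
      f"ratio: {serial / overlapped:.2f}x")
```

In this idealized model the benefit grows as communication time approaches compute time, which is the regime expert-parallel MoE training tends to operate in at large scale.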
In practice
- Utilize full-iteration CUDA graphs for dynamic MoE architectures.
- Employ CuTe DSL for advanced kernel fusions and data locality.
- Adopt MXFP8 precision for attention blocks to improve throughput.
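As a rough illustration of the MXFP8 item, the sketch below computes the shared per-block scales that microscaling formats rely on. It assumes the block layout from the OCP Microscaling (MX) specification: 32 elements per block, a power-of-two shared scale, and FP8 E4M3 elements with a maximum magnitude of 448. It is not NVIDIA's attention kernel, the helper names are hypothetical, and it skips the final rounding of each element to the E4M3 mantissa.

```python
import math

BLOCK_SIZE = 32     # elements sharing one scale (assumed, per the MX spec)
E4M3_EMAX = 8       # exponent of the largest E4M3 normal: 448 = 1.75 * 2**8
E4M3_MAX = 448.0    # largest representable E4M3 magnitude


def block_scale(block):
    """Return the power-of-two scale shared by one block of values."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 1.0
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1,
    # so floor(log2(max_abs)) is exactly e - 1.
    _, e = math.frexp(max_abs)
    return 2.0 ** ((e - 1) - E4M3_EMAX)


def scale_blocks(values):
    """Divide each block by its shared scale and clamp to the E4M3 range."""
    scaled, scales = [], []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        s = block_scale(block)
        scales.append(s)
        # Rounding to E4M3 precision is omitted in this sketch.
        scaled.extend(max(-E4M3_MAX, min(E4M3_MAX, v / s)) for v in block)
    return scaled, scales


# Two blocks with very different magnitudes each get their own scale.
values = [0.5] * BLOCK_SIZE + [768.0] * BLOCK_SIZE
scaled, scales = scale_blocks(values)
print(scales)                  # one scale per block
print(scaled[0], scaled[32])   # both land inside the E4M3 range
```

Because each block carries its own scale, small and large activations can both use most of the 8-bit range, which is what makes FP8 viable for blocks such as attention.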
Topics
- MLPerf Training
- NVIDIA Blackwell
- Mixture of Experts
- CUDA Graphs
- Kernel Fusion
- Spectrum-X Ethernet
- AI Infrastructure
Best for: MLOps Engineer, Research Scientist, Investor, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.