NVIDIA Blackwell Tops MLPerf Training 6.0 with Industry-Leading Scale and Performance

2026-06-16 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Expert, medium

Summary

NVIDIA posted the fastest time to train at scale and the highest per-accelerator performance across all benchmarks in MLPerf Training v6.0, including the new pretraining tests for DeepSeek-V3 (a 671B-parameter MoE) and GPT-OSS-20B (a 20B-parameter MoE). The NVIDIA GB300 NVL72 system, which integrates 72 Blackwell Ultra GPUs and 36 Grace CPUs, set new performance records. The platform scaled to 8,192 Blackwell GPUs in cloud environments, using NVIDIA Spectrum-X Ethernet and Quantum InfiniBand for scale-out networking. Several software optimizations contributed to these results: full-iteration CUDA graphs, CuTe DSL kernel fusions (yielding gains of more than 8% on DeepSeek-V3 and 93% on GPT-OSS-20B), MXFP8 attention, and various router and communication optimizations. Continuous full-stack co-design raised GB300 DeepSeek-V3 throughput from 1,298 TFLOPS/GPU to 1,648 TFLOPS/GPU in three months, a gain of roughly 1.3x. Blackwell Ultra GB300 also outperformed GB200, by up to 1.6x on DeepSeek-V3.
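As a quick sanity check on the throughput figures quoted in the summary, the ratio of the two per-GPU numbers can be computed directly. The sketch below uses only those two values; the variable names are illustrative.

```python
# Per-GPU DeepSeek-V3 throughput on GB300, as quoted in the summary (TFLOPS/GPU).
earlier_tflops_per_gpu = 1298.0
later_tflops_per_gpu = 1648.0

# Ratio of the later figure to the earlier one.
speedup = later_tflops_per_gpu / earlier_tflops_per_gpu

# The same change expressed as a percentage increase.
percent_gain = (speedup - 1.0) * 100.0

print(f"speedup: {speedup:.2f}x, gain: {percent_gain:.1f}%")
```

The ratio comes out slightly under 1.3, which is consistent with the rounded 1.3x figure.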

Key takeaway

For AI architects designing large-scale training infrastructure, NVIDIA's MLPerf Training v6.0 results show the Blackwell platform leading on per-accelerator performance and scaling to thousands of GPUs. Prioritize systems that pair scale-out networking such as Spectrum-X with a continuously optimized software stack. The 1.3x DeepSeek-V3 throughput gain on GB300 in three months shows that full-stack co-design keeps improving efficiency after deployment, which shortens training time and time-to-market for generative AI models.

Key insights

NVIDIA's Blackwell platform led MLPerf Training v6.0 through full-stack hardware-software co-design, combining new hardware with software optimizations such as full-iteration CUDA graphs and kernel fusion.

Principles

Full-stack co-design drives continuous performance gains.
Efficient scale-out networking is critical for MoE models.
Eliminating CPU-GPU synchronization boosts large-scale training.

Method

NVIDIA optimized MoE training with full-iteration CUDA graphs, CuTe DSL kernel fusions, MXFP8 attention, and 1F1B (one-forward-one-backward) all-to-all overlap, alongside network fabric enhancements for large-scale GPU clusters.

In practice

Use full-iteration CUDA graphs for dynamic MoE architectures.
Employ CuTe DSL for advanced kernel fusions and data locality.
Adopt MXFP8 precision for attention blocks to improve throughput.
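To make the MXFP8 item concrete, the following is a minimal pure-Python sketch of block-scaled FP8 quantization in the style of the OCP Microscaling (MX) formats: 32 elements share one power-of-two scale, and each element is rounded to FP8 E4M3. It is not NVIDIA's attention kernel, which the source does not describe. The scale-selection rule, the function names, and the sample data are assumptions made for illustration.

```python
import math

BLOCK_SIZE = 32          # elements that share one scale in the MX formats
E4M3_MAX = 448.0         # largest finite FP8 E4M3 magnitude
E4M3_EMAX = 8            # exponent of the largest E4M3 binade (2**8 <= 448)
E4M3_EMIN = -6           # exponent of the smallest normal E4M3 value
E4M3_MANTISSA_BITS = 3


def floor_log2(x: float) -> int:
    """Exact floor(log2(x)) for x > 0, read from the float's binary exponent."""
    _, exponent = math.frexp(x)  # x = m * 2**exponent with 0.5 <= m < 1
    return exponent - 1


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    magnitude = min(abs(x), E4M3_MAX)
    exponent = max(floor_log2(magnitude), E4M3_EMIN)  # subnormals reuse EMIN
    step = 2.0 ** (exponent - E4M3_MANTISSA_BITS)     # value spacing in this binade
    rounded = min(round(magnitude / step) * step, E4M3_MAX)
    return math.copysign(rounded, x)


def quantize_block(block):
    """Return (shared_scale, fp8_values) for one block of up to 32 floats."""
    assert 0 < len(block) <= BLOCK_SIZE
    amax = max(abs(v) for v in block)
    # Assumed rule: pick a power-of-two scale (an E8M0 scale stores only an
    # exponent) so the block maximum lands in the top E4M3 binade.
    shared_exp = floor_log2(amax) - E4M3_EMAX if amax > 0.0 else 0
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    return scale, [round_to_e4m3(v / scale) for v in block]


def dequantize_block(scale, fp8_values):
    """Recover approximate original values from the shared scale and elements."""
    return [scale * q for q in fp8_values]


# Hypothetical sample data: one block of 32 evenly spaced values.
block = [0.1 * i for i in range(BLOCK_SIZE)]
scale, quantized = quantize_block(block)
restored = dequantize_block(scale, quantized)
max_abs_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale = 2**{floor_log2(scale)}, max abs error = {max_abs_error:.4f}")
```

Because the scale is shared per block rather than per tensor, one outlier only reduces precision for the 32 values in its own block. Production libraries may choose the scale differently, for example by rounding it up to avoid saturation.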

Topics

MLPerf Training
NVIDIA Blackwell
Mixture of Experts
CUDA Graphs
Kernel Fusion
Spectrum-X Ethernet
AI Infrastructure

Code references

NVIDIA/Megatron-LM

Best for: MLOps Engineer, Research Scientist, Investor, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.