NVIDIA Extreme Co-Design Delivers New MLPerf Inference Records
Summary
NVIDIA's Blackwell Ultra GPUs achieved the highest throughput across the widest range of models and scenarios in the MLPerf Inference v6.0 benchmarks, bringing NVIDIA's cumulative MLPerf training and inference wins since 2018 to 291. This round introduced new tests, including DeepSeek-R1 Interactive, Qwen3-VL-235B-A22B (the first multimodal model), GPT-OSS-120B, WAN-2.2-T2V-A14B (text-to-video), and DLRMv3 (generative recommendation). NVIDIA was the only platform to submit results on all new models and scenarios. On GB300 NVL72 systems it demonstrated performance gains of up to 2.7x on DeepSeek-R1 and 1.5x on Llama 3.1 405B, attributed to software optimizations such as TensorRT-LLM updates and to scale-out inference with Quantum-X800 InfiniBand.
Key takeaway
For AI engineers and CTOs evaluating infrastructure for large-scale AI deployments, NVIDIA's MLPerf Inference v6.0 results indicate that its full-stack approach, which combines Blackwell Ultra GPUs, optimized software such as TensorRT-LLM, and Quantum-X800 InfiniBand networking, delivers industry-leading throughput and cost efficiency. Consider NVIDIA's integrated platform for demanding generative AI, multimodal, and recommendation workloads where maximizing token output and reducing operational cost are the priorities.
Key insights
Co-designed hardware, software, and models are crucial for maximizing AI factory throughput and minimizing token cost.
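To make the link between throughput and token cost concrete, the following is a minimal sketch of the arithmetic. The hourly system cost and baseline throughput are hypothetical placeholders, not figures from the MLPerf results; only the 2.7x speedup is taken from the summary above.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Cost to generate one million tokens at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs, chosen only for illustration.
HOURLY_COST_USD = 300.0
BASELINE_TOKENS_PER_SECOND = 100_000.0
SPEEDUP = 2.7  # the "up to 2.7x" DeepSeek-R1 gain cited in the summary

baseline = cost_per_million_tokens(HOURLY_COST_USD, BASELINE_TOKENS_PER_SECOND)
optimized = cost_per_million_tokens(
    HOURLY_COST_USD, BASELINE_TOKENS_PER_SECOND * SPEEDUP
)

# At a fixed hourly cost, the relative saving depends only on the speedup:
# reduction = 1 - 1 / SPEEDUP
reduction = 1 - optimized / baseline

print(f"baseline:  ${baseline:.4f} per 1M tokens")
print(f"optimized: ${optimized:.4f} per 1M tokens")
print(f"reduction: {reduction:.1%}")
```

Because the hourly cost cancels out, a 2.7x throughput gain on the same hardware cuts cost per token by roughly 63% regardless of the placeholder values chosen here.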
Principles
- Rigorous benchmarks are essential for evaluating real-world AI inference performance.
- Continuous software optimization improves performance on existing hardware.
- Scale-out networking enables massive token processing rates.
Method
NVIDIA achieved performance gains through faster kernels, Optimized Attention Data Parallel, disaggregated serving, Wide Expert Parallel (WideEP), Multi-Token Prediction (MTP), and KV-aware routing.
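As a rough illustration of the idea behind KV-aware routing, the toy sketch below sends each request to the worker whose cached token prefixes overlap most with the incoming prompt, discounted by that worker's current load. It is not NVIDIA's implementation: the `Worker` class, the `route` function, and the `load_penalty` parameter are hypothetical names introduced here, and a real router would track KV-cache blocks rather than whole prompts.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical inference worker with a record of prompts it has cached."""
    name: str
    active_requests: int = 0
    cached_prefixes: list = field(default_factory=list)


def shared_prefix_len(a, b) -> int:
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_cache_hit(worker: Worker, tokens) -> int:
    """Longest prefix of `tokens` already present in the worker's cache."""
    return max(
        (shared_prefix_len(p, tokens) for p in worker.cached_prefixes),
        default=0,
    )


def route(workers, tokens, load_penalty: float = 1.0) -> Worker:
    """Pick the worker with the best trade-off between cache reuse and load."""
    def score(w: Worker) -> float:
        return best_cache_hit(w, tokens) - load_penalty * w.active_requests

    chosen = max(workers, key=score)  # ties go to the earliest worker
    chosen.active_requests += 1
    chosen.cached_prefixes.append(tuple(tokens))
    return chosen


workers = [Worker("gpu-0"), Worker("gpu-1")]
system_prompt = (1, 2, 3, 4, 5, 6, 7, 8)  # token ids shared by many requests

first = route(workers, system_prompt + (9,))    # no cache anywhere yet
second = route(workers, system_prompt + (10,))  # reuses the prefix cached by `first`
third = route(workers, (50, 51))                # unrelated prompt, goes to the idle worker

print(first.name, second.name, third.name)
```

The second request lands on the same worker as the first because the shared system prompt outweighs the load penalty, while the unrelated third request goes to the idle worker. Tuning `load_penalty` shifts the balance between cache reuse and even load distribution.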
In practice
- Use TensorRT-LLM for LLM inference serving.
- Employ NVIDIA Dynamo for distributed inference serving.
- Integrate Quantum-X800 InfiniBand for scale-out inference.
Topics
- MLPerf Inference v6.0
- NVIDIA Blackwell Ultra GPUs
- TensorRT-LLM
- DeepSeek-R1
- Multimodal AI models
Best for: AI Engineer, CTO, VP of Engineering/Data, Machine Learning Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.