NVIDIA Extreme Co-Design Delivers New MLPerf Inference Records

2026-04-01 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Advanced, medium

Summary

NVIDIA's Blackwell Ultra GPUs achieved the highest throughput across the widest range of models and scenarios in the MLPerf Inference v6.0 benchmarks, bringing NVIDIA's cumulative MLPerf training and inference wins since 2018 to 291. This round introduced new tests: DeepSeek-R1 Interactive, Qwen3-VL-235B-A22B (the suite's first multimodal model), GPT-OSS-120B, WAN-2.2-T2V-A14B (text-to-video), and DLRMv3 (generative recommendation). NVIDIA was the only platform to submit results on all new models and scenarios. On GB300 NVL72 systems it demonstrated performance gains of up to 2.7x on DeepSeek-R1 and 1.5x on Llama 3.1 405B, attributed to software optimizations such as TensorRT-LLM updates and to scale-out inference with Quantum-X800 InfiniBand.

Key takeaway

For AI engineers and CTOs evaluating infrastructure for large-scale AI deployments, NVIDIA's MLPerf Inference v6.0 results indicate that its full-stack approach, which combines Blackwell Ultra GPUs with optimized software such as TensorRT-LLM and with Quantum-X800 InfiniBand networking, delivers industry-leading throughput and cost efficiency. Consider NVIDIA's integrated platform for demanding generative AI, multimodal, and recommendation workloads where maximizing token output and reducing operational cost are the priorities.

Key insights

Co-designed hardware, software, and models are crucial for maximizing AI factory throughput and minimizing token cost.
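
The link between throughput and token cost can be made concrete with a little arithmetic. The sketch below assumes a fixed hourly system cost and a steady token rate; the dollar figure and the baseline throughput are hypothetical values picked for readability, and only the 2.7x ratio comes from the summary above.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Cost of generating one million tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical figures, not taken from the MLPerf results.
HOURLY_COST_USD = 100.0
BASELINE_TOKENS_PER_SECOND = 10_000.0

# The "up to 2.7x" DeepSeek-R1 gain quoted in the summary.
SPEEDUP = 2.7

baseline = cost_per_million_tokens(HOURLY_COST_USD, BASELINE_TOKENS_PER_SECOND)
optimized = cost_per_million_tokens(
    HOURLY_COST_USD, BASELINE_TOKENS_PER_SECOND * SPEEDUP
)

print(f"baseline:   ${baseline:.2f} per million tokens")
print(f"optimized:  ${optimized:.2f} per million tokens")
print(f"cost ratio: {optimized / baseline:.2f}")
```

Under these assumptions the cost per token scales with the inverse of the speedup, so a 2.7x throughput gain on unchanged hardware cuts token cost to roughly 37% of the baseline.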

Principles

Rigorous benchmarks are essential for assessing real-world AI inference performance.
Continuous software optimization improves existing hardware performance.
Scale-out networking enables massive token processing rates.

Method

NVIDIA achieved performance gains through faster kernels, Optimized Attention Data Parallel, disaggregated serving, Wide Expert Parallel (WideEP), Multi-Token Prediction (MTP), and KV-aware routing.
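
Of the techniques listed, KV-aware routing is the easiest to illustrate in isolation: send each request to the worker that already holds the longest matching prefix in its KV cache, while penalizing busy workers. The sketch below is a toy model of that idea, not NVIDIA's implementation. `Worker`, `route`, `BLOCK_SIZE`, and `load_penalty` are all hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (hypothetical; kept small for the demo)


def block_keys(tokens: list[int]) -> list[tuple[int, ...]]:
    """Key each full block by the whole prefix that ends at that block."""
    return [
        tuple(tokens[:end])
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE)
    ]


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached: set = field(default_factory=set)

    def cached_prefix_blocks(self, tokens: list[int]) -> int:
        """Count leading blocks of the request already present in this cache."""
        count = 0
        for key in block_keys(tokens):
            if key not in self.cached:
                break
            count += 1
        return count

    def admit(self, tokens: list[int]) -> None:
        """Record the request's blocks as cached and count it as in flight."""
        self.cached.update(block_keys(tokens))
        self.active_requests += 1


def route(workers: list[Worker], tokens: list[int], load_penalty: float = 0.5) -> Worker:
    """Pick the worker with the best trade-off between cache reuse and load."""
    def score(worker: Worker) -> float:
        return worker.cached_prefix_blocks(tokens) - load_penalty * worker.active_requests

    best = max(workers, key=score)  # ties go to the earliest worker in the list
    best.admit(tokens)
    return best


workers = [Worker("a"), Worker("b")]
system_prompt = list(range(8))  # shared prefix: two full blocks

first = route(workers, system_prompt + [100, 101, 102, 103])
second = route(workers, system_prompt + [200, 201, 202, 203])
third = route(workers, [900, 901, 902, 903])
print(first.name, second.name, third.name)
```

The second request shares the system prompt with the first, so it follows it to the same worker and can reuse two cached blocks. The third request has no overlap, so the load penalty steers it to the idle worker.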

In practice

Use TensorRT-LLM for LLM inference serving.
Use NVIDIA Dynamo for distributed inference serving.
Use Quantum-X800 InfiniBand for scale-out inference.

Topics

MLPerf Inference v6.0
NVIDIA Blackwell Ultra GPUs
TensorRT-LLM
DeepSeek-R1
Multi-modal AI Models

Best for: AI Engineer, CTO, VP of Engineering/Data, Machine Learning Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.