Lambda's MLPerf Inference v6.0: hardware leap, software maturity, research breakthrough

2026-04-01 · Source: The Lambda Deep Learning Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

Lambda's MLPerf Inference v6.0 results demonstrate significant advancements in AI inference performance across hardware, software, and routing optimizations. The NVIDIA Blackwell Ultra GPU system achieved up to 29% higher iso-GPU throughput on GPT-OSS-120B compared to NVIDIA HGX B200. Software stack improvements, specifically upgrading from NVIDIA CUDA 12.9 to CUDA 13.1 on NVIDIA HGX B200, yielded up to a 9% throughput gain on Llama 3.1 8B. Additionally, a collaboration with Stevens Institute of Technology introduced BLAZE, a runtime MoE routing optimization, which reduced time-to-first-token (TTFT) P99 latency by 31% on GPT-OSS-120B without requiring model retraining. These results highlight progress in closing the gap between benchmark performance and real-world production deployments for frontier AI models.

Key takeaway

AI architects evaluating new infrastructure should note that the NVIDIA Blackwell Ultra GPU delivers up to 29% higher iso-GPU throughput than NVIDIA HGX B200 on MoE models such as GPT-OSS-120B, and that the entire model fits on a single GPU. For teams currently on NVIDIA HGX B200, a software upgrade from CUDA 12.9 to CUDA 13.1 can deliver up to a 9% throughput gain (measured on Llama 3.1 8B) without hardware changes. If latency is the critical constraint, consider BLAZE, which cut TTFT P99 by 31% on GPT-OSS-120B and requires no model retraining.

Key insights

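Three independent levers improved inference results: newer hardware (up to 29% higher throughput on Blackwell Ultra), a newer software stack (up to 9% higher throughput with CUDA 13.1), and runtime MoE routing (31% lower TTFT P99 with BLAZE).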

Principles

Generational hardware lifts improve throughput.
Software stack maturity directly impacts performance.
Runtime optimizations can reduce latency without retraining.

Method

BLAZE optimizes MoE routing by dynamically biasing scores to steer ambiguous tokens away from overloaded experts, reducing TTFT P99 latency by 31% with minimal overhead.
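
The idea described above can be illustrated with a small sketch. This is not the BLAZE implementation, whose details are not given here; it is a minimal toy router written under stated assumptions. The names route_token, margin, bias_strength, and capacity are hypothetical, as is the rule that a token counts as "ambiguous" when its top two expert probabilities are within margin of each other.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(logits, expert_load, capacity, top_k=1,
                bias_strength=1.0, margin=0.1):
    """Pick top_k experts for one token (illustrative only).

    Hypothetical parameters, not taken from the source:
      expert_load   -- tokens currently queued per expert
      capacity      -- load above which an expert counts as overloaded
      margin        -- top-1 vs top-2 probability gap below which the
                       token is treated as ambiguous
      bias_strength -- how strongly overload lowers an expert's score
    """
    probs = softmax(logits)
    ranked = sorted(probs, reverse=True)
    ambiguous = len(ranked) > 1 and (ranked[0] - ranked[1]) < margin

    adjusted = list(probs)
    if ambiguous:
        # Only ambiguous tokens are steered; confident tokens keep
        # their original expert choice.
        for e, load in enumerate(expert_load):
            overload = max(0.0, load / capacity - 1.0)
            adjusted[e] -= bias_strength * overload

    order = sorted(range(len(adjusted)),
                   key=lambda e: adjusted[e], reverse=True)
    return order[:top_k]


if __name__ == "__main__":
    load = [10, 2, 2]  # expert 0 is overloaded relative to capacity=4
    print(route_token([2.0, 1.95, 0.0], load, capacity=4))  # ambiguous token
    print(route_token([5.0, 1.0, 0.0], load, capacity=4))   # confident token
```

In this toy setup the ambiguous token is moved off the overloaded expert while the confident token is left alone, which is the behavior the method description implies. A production router would operate on batched tensors and would need its bias tuned so that model quality is not affected.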

In practice

Upgrade CUDA for software-driven throughput gains.
Consider Blackwell Ultra for MoE model performance.
Implement BLAZE for MoE latency reduction.
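
Before adopting any of these changes, it may help to measure tail latency the same way before and after. The sketch below computes a P99 with the nearest-rank method and the relative reduction between two runs. The function names are hypothetical, the sample values are made up for illustration, and nearest-rank is one of several percentile conventions; it is not necessarily the one MLPerf uses.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Nearest rank: smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def relative_reduction(before, after):
    """Fractional reduction from before to after (0.31 means 31% lower)."""
    return (before - after) / before


if __name__ == "__main__":
    # Made-up TTFT samples in milliseconds, for illustration only.
    baseline_ttft = [120, 130, 125, 140, 900, 135, 128, 132, 138, 126]
    candidate_ttft = [118, 129, 124, 139, 610, 133, 127, 131, 136, 125]
    p99_before = percentile_nearest_rank(baseline_ttft, 99)
    p99_after = percentile_nearest_rank(candidate_ttft, 99)
    print(p99_before, p99_after, relative_reduction(p99_before, p99_after))
```

With only a handful of samples the P99 is simply the worst request, so a real comparison needs enough requests per run for the tail to be meaningful.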

Topics

MLPerf Inference v6.0
NVIDIA Blackwell Ultra GPUs
BLAZE Routing Optimization
Large Language Model Inference
GPU Performance Benchmarking

Code references

deepseek-ai/EPLB

Best for: AI Architect, NLP Engineer, CTO, Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Lambda Deep Learning Blog.