Lambda's MLPerf Inference v6.0: hardware leap, software maturity, research breakthrough

· Source: The Lambda Deep Learning Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

Lambda's MLPerf Inference v6.0 results show inference gains from three sources: hardware, the software stack, and expert routing. On hardware, the NVIDIA Blackwell Ultra GPU system achieved up to 29% higher iso-GPU throughput on GPT-OSS-120B than NVIDIA HGX B200. On software, upgrading NVIDIA HGX B200 from NVIDIA CUDA 12.9 to CUDA 13.1 yielded up to 9% higher throughput on Llama 3.1 8B. On routing, a collaboration with Stevens Institute of Technology introduced BLAZE, a runtime mixture-of-experts (MoE) routing optimization that reduced P99 time-to-first-token (TTFT) latency by 31% on GPT-OSS-120B without retraining the model. Together, these results narrow the gap between benchmark performance and real-world production deployments of frontier AI models.

Key takeaway

AI architects evaluating new infrastructure should note that the NVIDIA Blackwell Ultra GPU delivers up to 29% higher throughput on GPT-OSS-120B, an MoE model, and that the entire model fits on a single GPU. Teams already on NVIDIA HGX B200 can gain up to 9% throughput (measured on Llama 3.1 8B) by updating to CUDA 13.1, with no hardware changes. Where latency is the critical constraint, BLAZE offers a 31% reduction in P99 TTFT on MoE models, again without model retraining.
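Teams that want to check a latency claim of this kind against their own traffic need a consistent way to compute P99 TTFT and the relative change between two runs. The following is a minimal sketch using the nearest-rank percentile definition. The function names `percentile` and `relative_reduction` are hypothetical, and the sample values are synthetic placeholders, not figures from Lambda's submission.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to limit floating-point rounding in the rank.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional improvement of candidate over baseline (0.31 means 31% lower)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline


if __name__ == "__main__":
    # Synthetic TTFT samples in milliseconds, for illustration only.
    baseline_ttft_ms = [float(v) for v in range(101, 301)]
    candidate_ttft_ms = [v * 0.8 for v in baseline_ttft_ms]
    p99_before = percentile(baseline_ttft_ms, 99)
    p99_after = percentile(candidate_ttft_ms, 99)
    print(f"P99 before: {p99_before:.1f} ms, after: {p99_after:.1f} ms")
    print(f"Relative reduction: {relative_reduction(p99_before, p99_after):.1%}")
```

Nearest-rank is only one of several percentile conventions. Whichever one is used, both runs should use the same definition and comparable request mixes.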

Key insights

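Hardware, software, and routing optimizations each improved inference: up to 29% higher throughput from Blackwell Ultra, up to 9% higher throughput from the CUDA 13.1 upgrade, and 31% lower P99 TTFT from BLAZE.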

Method

BLAZE optimizes MoE routing at runtime by dynamically biasing router scores so that ambiguous tokens are steered away from overloaded experts. This reduced P99 TTFT latency by 31% with minimal overhead.
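The idea of load-aware score biasing can be illustrated with a toy router. This is a sketch of the general technique only and is not BLAZE's actual algorithm, which the summary does not specify. It makes several assumptions: a token counts as "ambiguous" when the gap between its two highest expert scores is below `margin_threshold`, an expert counts as "overloaded" when its running token count exceeds an even share, and routing is top-1 for simplicity. All names and parameters here are hypothetical.

```python
from typing import List


def route_tokens(
    logits: List[List[float]],
    margin_threshold: float = 0.5,
    bias_strength: float = 1.0,
) -> List[int]:
    """Toy top-1 MoE router with load-aware biasing (illustrative only).

    logits[t][e] is the router score of token t for expert e.
    Requires at least two experts. Returns the chosen expert per token.
    """
    num_experts = len(logits[0])
    fair_share = len(logits) / num_experts  # ideal token count per expert
    load = [0] * num_experts                # tokens assigned so far
    assignment: List[int] = []

    for scores in logits:
        ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
        margin = scores[ranked[0]] - scores[ranked[1]]

        if margin >= margin_threshold:
            # Confident token: keep the router's original choice.
            choice = ranked[0]
        else:
            # Ambiguous token: penalize each expert in proportion to how far
            # its load exceeds the fair share, then pick the best biased score.
            biased = [
                s - bias_strength * max(0.0, load[e] - fair_share) / fair_share
                for e, s in enumerate(scores)
            ]
            choice = max(range(num_experts), key=lambda e: biased[e])

        load[choice] += 1
        assignment.append(choice)

    return assignment


if __name__ == "__main__":
    # Three confident tokens overload expert 0; the fourth token is ambiguous.
    example = [[2.0, 0.0], [2.0, 0.0], [3.0, 0.0], [1.0, 0.9]]
    print(route_tokens(example))                     # [0, 0, 0, 1]
    print(route_tokens(example, bias_strength=0.0))  # [0, 0, 0, 0]
```

Because only low-margin tokens are rerouted, tokens with a clear expert preference keep the assignment the trained router gave them, which is one plausible reason such a scheme would not require retraining.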

Best for: AI Architect, NLP Engineer, CTO, Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Lambda Deep Learning Blog.