A new GPT-OSS benchmark and DeepSeek R1 updates for latency-optimized reasoning
Summary
The MLPerf Inference v6.0 release expands its coverage of open-weight large language models with two additions. First, the GPT-OSS 120B benchmark uses a Mixture-of-Experts (MoE) model with 117B total parameters (5.1B active per token) designed for mathematics, scientific reasoning, and coding. The benchmark uses separate datasets for performance and accuracy: performance is measured on low-effort summarization with a maximum output of 10,240 tokens, while accuracy is measured on high-effort tasks (AIME 2024, LiveCodeBench v6, and GPQA-Diamond) with a maximum output of 32,768 tokens and accuracy targets of 82.92%, 74.95%, and 84.68%, respectively.
Second, a new DeepSeek-R1 interactive scenario adds a low-latency workload for real-time reasoning, with 99th percentile time to first token (TTFT) <= 1.5s and time per output token (TPOT) <= 15ms. This scenario also introduces the first standard for speculative decoding in MLPerf, using EAGLE-style decoding with the DeepSeek-R1 MTP Head.
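To make the interactive latency limits concrete, here is a minimal sketch of how a set of measured latencies could be checked against them. It assumes both limits apply at the 99th percentile and uses a nearest-rank percentile; the summary does not say how MLPerf computes the percentile, and the function names are hypothetical.

```python
import math

# Limits quoted in the summary for the DeepSeek-R1 interactive scenario.
TTFT_P99_LIMIT_S = 1.5    # time to first token, seconds
TPOT_P99_LIMIT_S = 0.015  # time per output token, seconds (15 ms)


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (integer pct) using the nearest-rank method.

    Assumption: nearest-rank is an illustrative choice, not necessarily
    the method the benchmark harness uses.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_interactive_limits(ttft_s, tpot_s):
    """True if the p99 TTFT and p99 TPOT are both within the quoted limits."""
    return (
        percentile_nearest_rank(ttft_s, 99) <= TTFT_P99_LIMIT_S
        and percentile_nearest_rank(tpot_s, 99) <= TPOT_P99_LIMIT_S
    )
```

With 100 requests, a single slow outlier still passes under this definition, while two outliers above a limit do not.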
Key takeaway
For MLOps engineers deploying open-weight LLMs, MLPerf Inference v6.0 adds two benchmarks that map to production workloads. Evaluate models like GPT-OSS 120B with separate performance and accuracy datasets, since throughput on short, low-effort tasks and correctness on long, high-effort reasoning are different workloads. For real-time reasoning applications, consider speculative decoding with the DeepSeek-R1 MTP Head to meet the scenario's TTFT and TPOT limits.
Key insights
MLPerf Inference v6.0 adds benchmarks for open-weight LLMs, separates performance and accuracy datasets for GPT-OSS 120B, and standardizes speculative decoding for the DeepSeek-R1 interactive scenario.
Principles
- Benchmarks must reflect evolving LLM architectures.
- Separate datasets let performance and accuracy each be measured on a workload suited to it.
- Low-latency scenarios require specific decoding standards.
Method
MLPerf v6.0 defines a split-dataset strategy for GPT-OSS 120B, with distinct datasets for performance (summarization) and accuracy (complex reasoning), and standardizes EAGLE-style speculative decoding for the DeepSeek-R1 interactive scenario.
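The split-dataset strategy can be sketched as two run configurations plus a check of accuracy scores against the quoted targets. This is an illustration only: the class, constant, and function names are hypothetical, and the direct comparison assumes no tolerance band, which the summary does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for one GPT-OSS 120B benchmark run."""
    purpose: str
    reasoning_effort: str
    max_output_tokens: int


# Token limits and effort levels as quoted in the summary.
PERFORMANCE_RUN = RunConfig("performance", "low", 10_240)   # summarization
ACCURACY_RUN = RunConfig("accuracy", "high", 32_768)        # reasoning tasks

# Accuracy targets in percent, as quoted in the summary.
ACCURACY_TARGETS_PCT = {
    "AIME 2024": 82.92,
    "LiveCodeBench v6": 74.95,
    "GPQA-Diamond": 84.68,
}


def failed_targets(scores_pct):
    """Return the datasets whose score is missing or below its target.

    Assumption: scores are compared directly against the targets.
    """
    return [
        name
        for name, target in ACCURACY_TARGETS_PCT.items()
        if scores_pct.get(name, 0.0) < target
    ]
```

For example, `failed_targets({"AIME 2024": 83.0, "LiveCodeBench v6": 74.0, "GPQA-Diamond": 85.0})` returns only the LiveCodeBench entry.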
In practice
- Evaluate LLMs with split performance/accuracy datasets.
- Implement EAGLE-style speculative decoding for low latency.
- Use OpenAI Harmony chat format for reasoning control.
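The draft-then-verify idea behind speculative decoding can be sketched with toy next-token functions. This is a generic greedy variant, not EAGLE itself: it does not model the EAGLE draft head or the DeepSeek-R1 MTP Head, and `target_next` and `draft_next` are hypothetical callables that map a token list to the next token. A real system would verify all drafted tokens in one batched forward pass of the target model rather than with sequential calls.

```python
def speculative_step(target_next, draft_next, context, k):
    """Run one draft-then-verify step and return the tokens it accepts."""
    # Draft k tokens autoregressively with the cheap draft model.
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(context + proposed))

    # Verify drafted tokens against the target model's greedy choice.
    accepted = []
    for tok in proposed:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Generate tokens with repeated speculative steps (greedy decoding)."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target_next, draft_next, out, k))
    # A step can overshoot the budget, so trim to the requested length.
    return out[: len(prompt) + max_new_tokens]
```

Because every accepted token is either confirmed or supplied by the target model, the output matches plain greedy decoding with the target alone; the latency benefit comes from accepting several tokens per target pass when the draft agrees.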
Topics
- MLPerf Inference
- Large Language Models
- GPT-OSS 120B
- DeepSeek-R1
- Speculative Decoding
- LLM Benchmarking
- Mixture-of-Experts
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by MLCommons.