A new GPT-OSS benchmark and DeepSeek R1 updates for latency-optimized reasoning

2026-03-24 · Source: MLCommons · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, medium

Summary

MLPerf Inference v6.0 expands its coverage of open-weight large language models with two additions. First, the GPT-OSS 120B benchmark uses a Mixture-of-Experts (MoE) model with 117B total parameters (5.1B active per token) designed for mathematics, scientific reasoning, and coding. The benchmark uses separate datasets for performance and accuracy: the performance dataset is low-effort summarization with a maximum output of 10,240 tokens, while the accuracy datasets are high-effort tasks (AIME 2024, LiveCodeBench v6, and GPQA-Diamond) with a maximum output of 32,768 tokens and accuracy targets of 82.92%, 74.95%, and 84.68% respectively. Second, a new DeepSeek-R1 interactive scenario adds a low-latency workload for real-time reasoning, with a 99th-percentile time to first token (TTFT) <= 1.5 s and time per output token (TPOT) <= 15 ms. The scenario also introduces the first standard for speculative decoding in MLPerf, using EAGLE-style decoding with the DeepSeek-R1 MTP (multi-token prediction) head.
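The interactive latency limits can be checked against measured samples along the following lines. This is a minimal Python sketch, not the MLPerf LoadGen implementation: the function names, the nearest-rank percentile method, and the TPOT formula (decode time divided by the output tokens after the first) are assumptions for illustration, as is applying the 99th percentile to TPOT as well as TTFT.

```python
# Limits quoted in the summary for the DeepSeek-R1 interactive scenario.
TTFT_LIMIT_S = 1.5    # time to first token, seconds
TPOT_LIMIT_S = 0.015  # time per output token, seconds (15 ms)


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100 (assumed method)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, so no floating-point rounding is involved.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def tpot(total_latency_s, ttft_s, output_tokens):
    """Assumed TPOT definition: decode time spread over tokens after the first."""
    if output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (total_latency_s - ttft_s) / (output_tokens - 1)


def meets_interactive_targets(ttft_samples_s, tpot_samples_s, pct=99):
    """True if the pct-th percentile of both metrics is within the limits."""
    return (
        percentile_nearest_rank(ttft_samples_s, pct) <= TTFT_LIMIT_S
        and percentile_nearest_rank(tpot_samples_s, pct) <= TPOT_LIMIT_S
    )
```

With 100 samples, the nearest-rank 99th percentile is the 99th smallest value, so a single slow outlier does not fail the check but two do.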

Key takeaway

For MLOps engineers deploying open-weight LLMs, MLPerf Inference v6.0 adds two benchmarks worth tracking. Evaluate models like GPT-OSS 120B with separate performance and accuracy datasets, because throughput on short, low-effort outputs and accuracy on long, high-effort reasoning are different production workloads. For real-time reasoning applications, consider speculative decoding with the DeepSeek-R1 MTP head to meet the interactive scenario's latency targets (TTFT <= 1.5 s, TPOT <= 15 ms).

Key insights

MLPerf Inference v6.0 introduces new benchmarks for open-weight LLMs, separates performance and accuracy datasets, and standardizes speculative decoding.

Principles

Benchmarks must reflect evolving LLM architectures.
Separate datasets let performance and accuracy each be measured on a workload suited to it.
Low-latency scenarios require specific decoding standards.

Method

MLPerf Inference v6.0 defines a split-dataset strategy for GPT-OSS 120B, with one dataset for performance (low-effort summarization) and others for accuracy (high-effort reasoning), and specifies EAGLE-style speculative decoding for the DeepSeek-R1 interactive scenario.
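The split-dataset strategy can be written down as a small configuration plus a target check. The sketch below uses only the figures quoted in the summary; the class and function names are hypothetical, and treating "pass" as measured accuracy at or above the target is an assumption, since the official rules may define the threshold differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    name: str
    reasoning_effort: str   # "low" or "high", as described in the summary
    max_output_tokens: int


# Performance runs: low-effort summarization, up to 10,240 output tokens.
PERFORMANCE_SET = DatasetSpec("summarization", "low", 10_240)

# Accuracy runs: high-effort tasks, up to 32,768 output tokens.
ACCURACY_SETS = [
    DatasetSpec("AIME 2024", "high", 32_768),
    DatasetSpec("LiveCodeBench v6", "high", 32_768),
    DatasetSpec("GPQA-Diamond", "high", 32_768),
]

# Accuracy targets in percent, in the order the summary lists them.
ACCURACY_TARGETS_PCT = {
    "AIME 2024": 82.92,
    "LiveCodeBench v6": 74.95,
    "GPQA-Diamond": 84.68,
}


def accuracy_shortfalls(measured_pct):
    """Return {dataset: (measured, target)} for every dataset that is
    missing from measured_pct or scores below its target."""
    shortfalls = {}
    for name, target in ACCURACY_TARGETS_PCT.items():
        measured = measured_pct.get(name)
        if measured is None or measured < target:
            shortfalls[name] = (measured, target)
    return shortfalls
```

An empty result from `accuracy_shortfalls` means every accuracy dataset met its target; the performance dataset is deliberately absent from the check because it is used only for throughput and latency runs.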

In practice

Evaluate LLMs with split performance/accuracy datasets.
Implement EAGLE-style speculative decoding for low latency.
Use OpenAI Harmony chat format for reasoning control.
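To make the speculative decoding idea concrete, here is a toy draft-and-verify loop in Python. It is a generic greedy speculative decoder, not EAGLE and not the DeepSeek-R1 MTP head: both "models" are hypothetical callables that map a token prefix to the next token, and verification is done position by position, whereas a real system would score all drafted positions in one batched forward pass of the target model.

```python
from typing import Callable, List

# Hypothetical interface: a greedy next-token function over a token prefix.
NextToken = Callable[[List[int]], int]


def speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    num_draft: int,
    max_new_tokens: int,
) -> List[int]:
    """Greedy draft-and-verify decoding.

    The output matches plain greedy decoding with target_next; the draft
    model only changes how many target steps can be checked per round.
    """
    if num_draft < 1:
        raise ValueError("num_draft must be at least 1")
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        k = min(num_draft, goal - len(tokens))

        # 1) Draft: the cheap model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))

        # 2) Verify: accept the longest prefix the target agrees with.
        for i in range(k):
            expected = target_next(tokens + proposal[:i])
            if proposal[i] != expected:
                # Keep the agreed prefix, then the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every drafted token was accepted.
            tokens.extend(proposal)
    return tokens
```

Each round adds at least one token even when the first drafted token is rejected, so the loop always terminates. The latency benefit in practice comes from how many drafted tokens the target accepts per verification pass, which this toy version does not measure.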

Topics

MLPerf Inference
Large Language Models
GPT-OSS 120B
DeepSeek-R1
Speculative Decoding
LLM Benchmarking
Mixture-of-Experts

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by MLCommons.