**Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding**
Summary
On March 19, 2026, NVIDIA introduced SPEED-Bench, a unified benchmark for evaluating Speculative Decoding (SD) in Large Language Models (LLMs). SD accelerates LLM inference by having a lightweight draft model speculate several future tokens, which the target model then verifies, improving throughput while preserving the target model's output distribution. Existing SD benchmarks are often fragmented and unrepresentative of real-world data and serving conditions, and they fail to account for the fact that SD performance depends on the data, the serving regime, and the system. SPEED-Bench addresses these gaps with two dataset splits: a "Qualitative" split with 880 prompts across 11 diverse semantic categories to measure draft accuracy (conditional acceptance rates and acceptance lengths), and a "Throughput" split with 1,536 prompts per Input Sequence Length (ISL) bucket (1k to 32k tokens) to evaluate system-level speedups under high concurrency. It also includes a unified measurement framework integrated with production inference engines such as TensorRT-LLM, vLLM, and SGLang; the framework keeps evaluation consistent by handling tokenization and prompt formatting outside the engines.
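The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SPEED-Bench code or any engine's implementation: both "models" are toy functions (`toy_target`, `toy_draft`, hypothetical names), decoding is greedy, and verification is done position by position, whereas a real engine scores all drafted positions in one target forward pass. With sampling instead of greedy decoding, engines use a rejection-sampling rule to keep the target distribution; that variant is not shown here.

```python
from typing import Callable, List, Tuple

# A "model" is any function mapping a token prefix to the next token (greedy).
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain autoregressive decoding with a single model."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], max_new: int, k: int = 3
) -> Tuple[List[int], List[int]]:
    """Greedy speculative decoding. Returns (tokens, accepted drafts per step)."""
    out = list(prompt)
    accepted_per_step: List[int] = []
    while len(out) - len(prompt) < max_new:
        # 1) The draft model speculates k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The target keeps the longest prefix of the proposal it agrees with.
        accepted = 0
        for i in range(k):
            if target(out + proposal[:i]) != proposal[i]:
                break
            accepted += 1
        out.extend(proposal[:accepted])
        # 3) The target always adds one token of its own: a correction after a
        #    rejection, or a bonus token when every draft token was accepted.
        out.append(target(out))
        accepted_per_step.append(accepted)
    # The last step may overshoot max_new, so the token list is truncated.
    return out[: len(prompt) + max_new], accepted_per_step


# Toy models (assumptions for illustration): the target counts modulo 10,
# and the draft agrees with it except right after token 4.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


tokens, accepted = speculative_decode(toy_target, toy_draft, [0], max_new=12)
# Greedy speculative decoding must reproduce the target-only output exactly.
assert tokens == greedy_decode(toy_target, [0], max_new=12)
```

The per-step `accepted` counts are the raw material for the acceptance metrics the Qualitative split reports; the speedup comes from emitting several tokens per target step whenever the draft is right.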
Key takeaway
For AI Engineers and Research Scientists optimizing LLM inference, SPEED-Bench provides a consistent way to evaluate Speculative Decoding performance. Integrate it into your evaluation workflow to assess draft model quality across diverse semantic domains and to measure system-level speedups under realistic production serving conditions, rather than relying on less representative benchmarks such as those built on random token inputs.
Key insights
SPEED-Bench offers a unified benchmark for Speculative Decoding, addressing real-world data diversity and serving conditions.
Principles
- SD performance is data-, serving-regime-, and system-dependent.
- Semantic diversity is critical for exposing domain-dependent SD behavior.
- Random tokens distort SD and MoE throughput measurements.
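Domain dependence only becomes visible when acceptance statistics are grouped by category. The sketch below assumes that, for each prompt, the number of accepted draft tokens per verification step has been logged. The function names, the record layout, and the category labels are hypothetical, not SPEED-Bench's API, and the definitions used (acceptance length as accepted draft tokens plus the one token the target contributes per step; conditional acceptance rate at position k as the share of steps accepting position k among those that accepted all earlier positions) are a common convention assumed here rather than taken from the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def acceptance_length(accepted_per_step: List[int]) -> float:
    """Mean tokens emitted per target step: accepted drafts + 1 target token."""
    if not accepted_per_step:
        raise ValueError("need at least one verification step")
    return sum(a + 1 for a in accepted_per_step) / len(accepted_per_step)


def conditional_acceptance_rates(
    accepted_per_step: List[int], draft_len: int
) -> List[float]:
    """Rate at position k, given that positions 1..k-1 were all accepted."""
    rates: List[float] = []
    for k in range(1, draft_len + 1):
        reached = sum(1 for a in accepted_per_step if a >= k - 1)
        passed = sum(1 for a in accepted_per_step if a >= k)
        rates.append(passed / reached if reached else float("nan"))
    return rates


def acceptance_length_by_category(
    records: Iterable[Tuple[str, List[int]]]
) -> Dict[str, float]:
    """records: one (category, accepted_per_step) pair per prompt.

    Steps are pooled per category before averaging, so long generations
    weigh more than short ones.
    """
    pooled: Dict[str, List[int]] = defaultdict(list)
    for category, steps in records:
        pooled[category].extend(steps)
    return {cat: acceptance_length(steps) for cat, steps in pooled.items()}


# Placeholder logs for illustration only; these are not benchmark results.
example_logs = [("coding", [3, 3]), ("writing", [0, 1]), ("coding", [2])]
by_category = acceptance_length_by_category(example_logs)
```

A single aggregate number would hide the gap between categories that this grouping exposes, which is the motivation for a semantically diverse prompt set.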
Method
SPEED-Bench uses two data splits (Qualitative for semantic diversity, Throughput for system speedups) and a unified measurement framework that pre-tokenizes inputs for consistent evaluation across production inference engines.
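One way to picture the pre-tokenization step: format and tokenize each prompt once, outside any engine, then hand the same token IDs to every engine under test. The sketch below is an assumption-laden illustration, not the SPEED-Bench framework. `PreparedPrompt`, `prepare`, `TEMPLATE`, and `toy_tokenize` are hypothetical names, and the toy tokenizer is a stand-in for the target model's real tokenizer and chat template.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical chat template; a real run would use the target model's own.
TEMPLATE = "<user> {prompt} <assistant>"


@dataclass(frozen=True)
class PreparedPrompt:
    token_ids: Tuple[int, ...]
    fingerprint: str  # short hash of the token IDs, for cross-engine checks


def prepare(
    prompt: str, template: str, tokenize: Callable[[str], List[int]]
) -> PreparedPrompt:
    """Apply the template and tokenize once, so engines never re-tokenize."""
    text = template.format(prompt=prompt)
    ids = tuple(tokenize(text))
    digest = hashlib.sha256(",".join(map(str, ids)).encode()).hexdigest()[:16]
    return PreparedPrompt(ids, digest)


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: one deterministic ID per whitespace-separated word."""
    return [sum(map(ord, word)) % 50_000 for word in text.split()]


prepared = prepare("Explain speculative decoding", TEMPLATE, toy_tokenize)
# Every engine adapter would receive prepared.token_ids and could log
# prepared.fingerprint to confirm that all runs saw identical input.
```

Fixing the token IDs up front removes one source of variation between engines: differences in default templates or tokenizer settings can no longer change the input that is being measured.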
In practice
- Evaluate SD draft accuracy across diverse semantic domains.
- Measure system-level speedups under varying batch sizes and ISLs.
- Avoid random token inputs for SD and MoE throughput benchmarks.
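For the throughput side of this checklist, a small helper can turn measured throughputs into a speedup grid over batch size and ISL. This is a minimal sketch assuming throughput has already been measured in output tokens per second for each configuration, once without SD and once with it. The names `speedup_grid` and `regressions` are hypothetical, and the numbers in the usage lines are placeholders, not benchmark results.

```python
from typing import Dict, List, Tuple

# (batch_size, input_sequence_length) -> measured output tokens per second
Grid = Dict[Tuple[int, int], float]


def speedup_grid(baseline: Grid, speculative: Grid) -> Grid:
    """Speedup of SD over the baseline for every (batch size, ISL) pair."""
    mismatched = baseline.keys() ^ speculative.keys()
    if mismatched:
        raise ValueError(f"configurations measured in only one run: {sorted(mismatched)}")
    return {cfg: speculative[cfg] / baseline[cfg] for cfg in sorted(baseline)}


def regressions(speedups: Grid, threshold: float = 1.0) -> List[Tuple[int, int]]:
    """Configurations where SD is slower than the baseline."""
    return [cfg for cfg, s in speedups.items() if s < threshold]


# Placeholder measurements for illustration only.
baseline = {(1, 1024): 100.0, (64, 1024): 2000.0}
with_sd = {(1, 1024): 250.0, (64, 1024): 1800.0}

speedups = speedup_grid(baseline, with_sd)
slower = regressions(speedups)
```

Reporting the whole grid, rather than one headline number, shows in which serving regimes SD helps and in which it does not.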
Topics
- Speculative Decoding
- LLM Inference Benchmarking
- Semantic Diversity
- Inference Throughput
- Production Inference Engines
Best for: AI Scientist, Research Scientist, AI Engineer, AI Researcher, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.