Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

This study introduces Speculative Decoding Scaling Laws (SDSL), an analytical framework for choosing the draft model in speculative decoding so as to maximize Large Language Model (LLM) inference throughput. Unlike prior empirical methods, SDSL analytically connects key hyperparameters of pre-trained LLMs to the throughput of the inference system, so that throughput-optimal hyperparameters can be predicted before pre-training. The framework models the token acceptance rate ($\alpha$) as an affine function of draft model perplexity and target model perplexity, with draft model quality being the dominant factor. It derives the relationship $N_{\text{opt}}=M_{0}+\mu M$ between the throughput-optimal draft model size $N_{\text{opt}}$ and the target model size $M$, which implies that the optimal draft model should be approximately 200 times smaller than the target model. This relationship is robust across different model families. The analysis, validated with wall-clock latency measurements, also indicates that training dataset size has only a mild impact on throughput.

Key takeaway

For AI Engineers and Research Scientists designing LLM inference systems, this framework provides a principled way to select the draft model size for speculative decoding. To maximize throughput, start from a draft model roughly 200 times smaller than your target model. Because the optimum is predicted analytically, the method reduces the empirical search and compute needed to deploy efficient LLM inference.
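
As a rough illustration of the sizing rule above, the following sketch evaluates $N_{\text{opt}}=M_{0}+\mu M$ for a given target size. The function name `optimal_draft_params` is hypothetical, and the defaults are assumptions: $\mu = 1/200$ is read off the "approximately 200 times smaller" statement, and $M_{0} = 0$ is a placeholder because this summary does not report the fitted offset.

```python
def optimal_draft_params(target_params: float,
                         mu: float = 1 / 200,
                         m0: float = 0.0) -> float:
    """Evaluate N_opt = M_0 + mu * M.

    mu and m0 are assumed defaults, not fitted values from the paper:
    mu = 1/200 mirrors the "about 200 times smaller" rule of thumb and
    m0 = 0 stands in for the unreported offset.
    """
    if target_params <= 0:
        raise ValueError("target_params must be positive")
    return m0 + mu * target_params


# Example: suggested draft sizes for two hypothetical target models.
for target in (7e9, 70e9):
    draft = optimal_draft_params(target)
    print(f"{target / 1e9:.0f}B target -> ~{draft / 1e6:.0f}M draft")
```

Treat the result as a starting point for choosing among available draft checkpoints, not as an exact prescription.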

Key insights

SDSL analytically optimizes speculative decoding throughput by predicting optimal draft model sizes based on scaling laws.

Method

The method models the token acceptance rate as an affine function of draft and target model perplexities, then combines this model with pre-training scaling laws to express throughput in terms of model size and training data.
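
The pipeline described above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the paper's implementation. The loss form $L(N, D) = E + A/N^{a} + B/D^{b}$ is a commonly used parametric scaling law, and all constants here (including the affine coefficients `C0`, `C_DRAFT`, `C_TARGET`) are illustrative placeholders rather than fitted values. The speedup expression is the standard speculative decoding estimate, which assumes independent per-token acceptance, and the cost ratio is crudely approximated by the parameter ratio.

```python
import math

# Illustrative placeholder constants, NOT values fitted in the paper.
E, A, B, A_EXP, B_EXP = 1.69, 406.4, 410.7, 0.34, 0.28  # scaling-law form
C0, C_DRAFT, C_TARGET = 1.0, -0.02, 0.005               # affine alpha model


def perplexity(n_params: float, n_tokens: float) -> float:
    """Perplexity exp(L) with L(N, D) = E + A / N**a + B / D**b."""
    loss = E + A / n_params**A_EXP + B / n_tokens**B_EXP
    return math.exp(loss)


def acceptance_rate(draft_ppl: float, target_ppl: float) -> float:
    """Affine acceptance-rate model, clipped to the interval [0, 1)."""
    alpha = C0 + C_DRAFT * draft_ppl + C_TARGET * target_ppl
    return min(max(alpha, 0.0), 1.0 - 1e-9)


def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass with gamma drafts."""
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding; cost_ratio = draft/target cost."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


def best_draft_size(target_n, n_tokens, candidates, gamma=4):
    """Return the candidate draft size with the highest estimated speedup."""
    target_ppl = perplexity(target_n, n_tokens)

    def score(draft_n):
        alpha = acceptance_rate(perplexity(draft_n, n_tokens), target_ppl)
        # Assumption: per-token cost scales with parameter count.
        return speedup(alpha, gamma, draft_n / target_n)

    return max(candidates, key=score)


candidates = [70e9 / k for k in (10, 50, 100, 200, 500, 1000)]
best = best_draft_size(70e9, 1e12, candidates)
print(f"best draft under placeholder constants: {best / 1e6:.0f}M parameters")
```

With placeholder constants the selected size is not expected to reproduce the reported 200x ratio. The sketch only shows how the pieces connect: a larger draft model raises the acceptance rate but also raises the per-step drafting cost, and the optimum balances the two.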

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.