Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple
Summary
This study introduces Speculative Decoding Scaling Laws (SDSL), an analytical framework that optimizes Large Language Model (LLM) inference throughput by systematically selecting draft models for speculative decoding. Unlike prior empirical methods, SDSL analytically connects key hyperparameters of pre-trained LLMs to the throughput of the inference system, so optimal hyperparameters can be predicted before pre-training. The framework establishes an affine relationship between draft model perplexity, target model perplexity, and token acceptance rate ($\alpha$), with draft model quality being the dominant factor. It derives the affine relationship $N_{\text{opt}}=M_{0}+\mu M$ between the throughput-optimal draft model size $N_{\text{opt}}$ and the target model size $M$, showing that the throughput-optimal draft model should be approximately 200 times smaller than the target model, a result that is robust across model families. The analysis, validated with wall-clock latency measurements, also indicates that training dataset size has only a mild impact on throughput.
Key takeaway
For AI engineers and research scientists designing LLM inference systems, this framework provides a principled approach to selecting draft model sizes for speculative decoding. To maximize throughput, choose a draft model around 200 times smaller than your target model. This analytical method reduces the need for extensive empirical search and computational resources, streamlining the deployment of efficient LLM inference.
Key insights
SDSL analytically optimizes speculative decoding throughput by predicting optimal draft model sizes based on scaling laws.
Principles
- Draft model quality is the dominant factor for token acceptance.
- Optimal draft model size grows affinely (approximately linearly) with target model size.
- Training dataset size has only a mild impact on throughput.
Method
The method models the token acceptance rate as an affine function of draft and target model perplexities, then combines this with pre-training scaling laws to express throughput in terms of model size and training data.
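The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the loss is a Chinchilla-style parametric form with placeholder coefficients, the affine coefficients `c0`, `c_draft`, and `c_target` are made up, the speedup expression is the standard expected-tokens formula from the speculative decoding literature, the draft-to-target cost ratio is assumed equal to the parameter ratio, and a grid search stands in for the analytical optimum that SDSL derives.

```python
import math

# Placeholder Chinchilla-style loss coefficients (assumptions, not SDSL fits).
E, A, B = 1.69, 406.4, 410.7
A_EXP, B_EXP = 0.34, 0.28


def perplexity(n_params, n_tokens):
    """Perplexity implied by a parametric pre-training scaling law."""
    loss = E + A / n_params**A_EXP + B / n_tokens**B_EXP
    return math.exp(loss)


def acceptance_rate(ppl_draft, ppl_target, c0=0.9, c_draft=-0.02, c_target=0.005):
    """Affine model of token acceptance; coefficients are hypothetical."""
    alpha = c0 + c_draft * ppl_draft + c_target * ppl_target
    return min(max(alpha, 0.0), 0.999)  # keep alpha a valid probability


def speedup(alpha, gamma, cost_ratio):
    """Expected speedup when drafting gamma tokens per verification step.

    cost_ratio is the draft/target latency ratio per forward pass.
    """
    expected_tokens = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    return expected_tokens / (gamma * cost_ratio + 1.0)


def best_draft(n_target, n_tokens, gammas=range(1, 9), steps=200):
    """Grid-search draft size and draft length for the highest modeled speedup."""
    ppl_target = perplexity(n_target, n_tokens)
    best = (None, None, 0.0)
    lo, hi = math.log(1e6), math.log(n_target)
    for i in range(steps):
        n_draft = math.exp(lo + (hi - lo) * i / steps)  # log-spaced candidates
        alpha = acceptance_rate(perplexity(n_draft, n_tokens), ppl_target)
        for gamma in gammas:
            # Assumption: latency per pass is proportional to parameter count.
            s = speedup(alpha, gamma, n_draft / n_target)
            if s > best[2]:
                best = (n_draft, gamma, s)
    return best


n_draft, gamma, s = best_draft(n_target=70e9, n_tokens=1.4e12)
print(f"draft ~{n_draft:.3g} params, gamma={gamma}, modeled speedup {s:.2f}x")
```

Because every coefficient here is a placeholder, the printed optimum is only illustrative and should not be expected to reproduce the 200x ratio reported in the paper.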
In practice
- Select draft models approximately 200x smaller than target models.
- Prioritize draft model perplexity for throughput gains.
- Reuse pre-training scaling law parameters for SDSL coefficients.
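As a rough illustration of the first recommendation, the sketch below applies $N_{\text{opt}}=M_{0}+\mu M$ and then picks the closest available checkpoint. The defaults `mu=1/200` and `m0=0.0` are assumptions chosen to match the "approximately 200x smaller" rule of thumb, not the fitted values from the paper, and the helper names and checkpoint sizes are hypothetical.

```python
import math


def target_draft_size(n_target, mu=1 / 200, m0=0.0):
    """Draft size suggested by N_opt = M_0 + mu * M (mu and m0 are assumed)."""
    return m0 + mu * n_target


def pick_checkpoint(n_target, available_sizes, mu=1 / 200, m0=0.0):
    """Return the available draft size closest to the suggestion in log space."""
    goal = target_draft_size(n_target, mu, m0)
    return min(available_sizes, key=lambda n: abs(math.log(n) - math.log(goal)))


# Hypothetical checkpoint sizes for a 70B-parameter target model.
candidates = [68e6, 160e6, 410e6, 1.1e9]
chosen = pick_checkpoint(70e9, candidates)
print(f"suggested ~{target_draft_size(70e9):.3g} params, chosen {chosen:.3g}")
# Under these assumptions the 410e6 checkpoint is selected.
```

Comparing sizes in log space treats a checkpoint twice as large and one half as large as equally distant from the suggestion, which seems a reasonable choice when candidate sizes span orders of magnitude.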
Topics
- Speculative Decoding
- LLM Inference Optimization
- Scaling Laws
- Draft Model Sizing
- Throughput Optimization
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.