An Interpretable Latency Model for Speculative Decoding in LLM Serving
Summary
This work presents an interpretable latency model for speculative decoding (SD) in large language model (LLM) serving systems. It targets the problem of understanding how SD behaves under varying production load, where the effective batch size is dynamic rather than fixed. The framework infers the effective batch size from the request rate using Little's Law and decomposes per-request demand into load-independent and load-dependent components for the prefill, drafting, and verification stages. Validated against extensive vLLM measurements across parameters such as model sizes, request rates, and draft lengths, the model accurately predicts observed latency. It explains why SD speedups often shrink under higher server load, shows how draft length, acceptance rate, and model sizes influence latency, and extends to Mixture-of-Experts models.
Key takeaway
For AI Architects optimizing LLM serving systems, this latency model is a practical guide to configuring speculative decoding. Draft length, acceptance rate, and the relative sizes of the verifier and drafter models directly affect latency under varying loads. Use the framework to predict performance degradation and to check that deployed systems keep their desired speedups as request rates fluctuate.
Key insights
A new latency model explains speculative decoding performance in LLM serving under dynamic loads.
Principles
- SD speedups diminish with increased server load.
- Effective batch size can be inferred via Little's Law.
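Little's Law states that the mean number of requests in a system equals the arrival rate times the mean time each request spends in it. Because latency itself depends on batch size, the effective batch size is a fixed point of that relation. The sketch below is a minimal illustration, assuming a hypothetical linear latency curve (`base_latency + per_request_latency * batch`); the function name and coefficients are illustrative and not taken from the original article.

```python
def effective_batch_size(request_rate, base_latency, per_request_latency, iters=200):
    """Solve B = request_rate * latency(B) by fixed-point iteration.

    Assumption (not from the article): latency(B) = base_latency + per_request_latency * B,
    with request_rate in requests/second and latencies in seconds.
    """
    # The iteration only converges while the system is stable.
    if request_rate * per_request_latency >= 1.0:
        raise ValueError("unstable: request rate exceeds what this latency curve can serve")
    batch = 0.0
    for _ in range(iters):
        # Little's Law: mean requests in flight = arrival rate * mean time in system.
        batch = request_rate * (base_latency + per_request_latency * batch)
    return batch


if __name__ == "__main__":
    for rate in (1.0, 2.0, 4.0, 8.0):
        print(rate, round(effective_batch_size(rate, 1.0, 0.1), 3))
```

Under this assumed curve, the effective batch size grows faster than linearly with request rate, which is one way to see why a fixed-batch benchmark can misrepresent behavior under production load.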
Method
The method decomposes per-request demand into load-independent and load-dependent components for prefill, drafting, and verification, inferring effective batch size from request rate.
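As a rough illustration of this decomposition, each stage can be given a load-independent term plus a term that scales with effective batch size. The sketch below assumes a linear form and hypothetical names (`StageCost`, `request_latency`) and coefficients; the summary does not specify the functional form the authors use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageCost:
    fixed: float      # load-independent seconds per invocation (assumed)
    per_batch: float  # extra seconds per unit of effective batch size (assumed)

    def time(self, batch_size: float) -> float:
        return self.fixed + self.per_batch * batch_size


def request_latency(prefill, draft, verify, batch_size, draft_len, num_rounds):
    """Latency of one request: one prefill, then num_rounds draft-and-verify rounds.

    Each round runs the drafter draft_len times and the verifier once.
    """
    per_round = draft_len * draft.time(batch_size) + verify.time(batch_size)
    return prefill.time(batch_size) + num_rounds * per_round


if __name__ == "__main__":
    prefill = StageCost(fixed=0.05, per_batch=0.01)
    draft = StageCost(fixed=0.002, per_batch=0.0005)
    verify = StageCost(fixed=0.02, per_batch=0.002)
    for batch in (1, 4, 16):
        print(batch, request_latency(prefill, draft, verify, batch, draft_len=3, num_rounds=10))
```

Keeping the two terms separate per stage is what makes such a model interpretable: the load-dependent coefficients show which stage dominates as batch size rises.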
In practice
- Configure SD draft length for optimal latency.
- Analyze acceptance rate's impact on serving latency.
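One way to reason about these two knobs together is the standard speculative decoding estimate of tokens produced per verification round, (1 - a^(k+1)) / (1 - a), for draft length k and per-token acceptance rate a. That estimate assumes each drafted token is accepted independently with the same probability, which is a simplification and not something this summary attributes to the article. The cost values and function names below are hypothetical.

```python
def expected_tokens_per_round(acceptance_rate, draft_len):
    """Expected tokens emitted per draft-and-verify round (i.i.d. acceptance assumed)."""
    if acceptance_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


def best_draft_length(acceptance_rate, draft_cost, verify_cost, max_len):
    """Draft length in 1..max_len that minimizes estimated seconds per generated token.

    draft_cost and verify_cost are per-invocation times at the current load (assumed inputs).
    """
    def seconds_per_token(k):
        round_time = k * draft_cost + verify_cost
        return round_time / expected_tokens_per_round(acceptance_rate, k)

    return min(range(1, max_len + 1), key=seconds_per_token)


if __name__ == "__main__":
    # As the drafter becomes relatively more expensive, shorter drafts win.
    for draft_cost in (0.001, 0.005, 0.02):
        print(draft_cost, best_draft_length(0.8, draft_cost, verify_cost=0.03, max_len=8))
```

Feeding in per-stage costs measured or predicted at the current effective batch size, rather than at batch size one, is what lets a sweep like this reflect load.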
Topics
- Speculative Decoding
- LLM Serving
- Latency Modeling
- vLLM
- Mixture-of-Experts
Best for: AI Architect, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.