Why MoE models get more from speculative decoding - Cohere
Summary
Cohere's analysis of speculative decoding (SD) on Mixture-of-Experts (MoE) models confirms earlier predictions of a non-monotonic speedup curve: gains first rise with batch size, then decline. Benchmarking a Cohere MoE model with SD ($K=3$) revealed a sweet spot at moderate batch sizes, whereas a dense model (Command A, 111B) showed a monotonic decrease. The difference is attributed to the MoE model's low arithmetic intensity, which keeps it bandwidth-bound up to larger batch sizes. The study also quantifies temporal correlation in expert routing, showing that it reduces unique expert loading by 20-31% relative to independence baselines. Finally, an Amdahl's Law decomposition explains an additional speedup at batch size 1: SD amortizes the fixed overhead of non-expert operations, an effect that routing analysis alone does not fully capture.
Key takeaway
For NLP engineers optimizing MoE inference, the interplay between model sparsity, expert routing, and batch size determines how much speculative decoding helps. Co-optimize $k/N$ (the fraction of experts activated per token) and the shared-to-routed expert ratio for your target batch size to maximize speculative decoding gains. Doing so keeps the model in the bandwidth-bound "sweet spot", where temporal correlation in routing and fixed-overhead amortization add to the gains.
Key insights
MoE sparsity and expert routing significantly enhance speculative decoding performance, especially at moderate batch sizes.
Principles
- MoE's low arithmetic intensity creates a non-monotonic SD speedup curve.
- Temporal correlation in expert routing reduces unique expert loading.
- Fixed-overhead amortization boosts SD at very low batch sizes.
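The second principle is measured against an independence baseline, which can be sketched in a few lines. The sketch below assumes each token picks $k$ of $N$ routed experts uniformly and independently of every other token; this is a simplification and not necessarily the exact baseline used in the article. The configuration `N=128`, `k=8` is hypothetical.

```python
def expected_unique_experts(n_experts: int, k: int, tokens: int) -> float:
    """Expected number of distinct experts hit by `tokens` tokens,
    assuming each token picks k of n_experts uniformly and independently."""
    if not 0 < k <= n_experts:
        raise ValueError("need 0 < k <= n_experts")
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    # Probability that a given expert is selected by none of the tokens.
    p_miss = (1.0 - k / n_experts) ** tokens
    return n_experts * (1.0 - p_miss)


if __name__ == "__main__":
    N, k, K = 128, 8, 3  # hypothetical MoE config; K draft tokens as in the summary
    for batch in (1, 4, 16, 64):
        decode = expected_unique_experts(N, k, batch)            # plain decoding step
        verify = expected_unique_experts(N, k, batch * (K + 1))  # SD verify step
        print(f"batch={batch:3d}  decode={decode:6.1f}  "
              f"verify={verify:6.1f}  ratio={verify / decode:.2f}")
```

Under this baseline the verify step loads fewer than $K+1$ times the experts of a plain step once experts start to overlap. Temporal correlation between consecutive tokens would push the measured count below the baseline, which is the 20-31% reduction the summary reports.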
Method
The study used vLLM to benchmark SD speedup across batch sizes for dense and MoE models, analyzing expert routing decisions via a modified `enable_return_routed_experts` API and applying Amdahl's Law.
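The Amdahl's Law step can be illustrated with a toy cost model. This is a sketch under stated assumptions, not the article's actual decomposition: a plain decode step is normalized to cost 1, a share `fixed_frac` of it is non-expert work that does not grow when verifying $K+1$ tokens, the remaining expert-loading share scales by `expert_load_ratio`, and draft-model cost is ignored. All parameter names and values are hypothetical.

```python
def sd_speedup(fixed_frac: float, expert_load_ratio: float,
               tokens_per_step: float) -> float:
    """Toy Amdahl-style speedup of an SD verify step over plain decoding.

    fixed_frac: share of a plain decode step spent on non-expert work.
    expert_load_ratio: expert-loading cost of the verify step relative
        to a plain decode step (1.0 means no extra experts are loaded).
    tokens_per_step: average number of tokens emitted per verify step.
    """
    if not 0.0 <= fixed_frac <= 1.0:
        raise ValueError("fixed_frac must be in [0, 1]")
    if expert_load_ratio <= 0.0:
        raise ValueError("expert_load_ratio must be positive")
    # Fixed work is paid once per step; only the expert part scales.
    step_cost = fixed_frac + (1.0 - fixed_frac) * expert_load_ratio
    return tokens_per_step / step_cost


if __name__ == "__main__":
    # Same routing behavior, different shares of fixed overhead.
    for fixed in (0.0, 0.3, 0.6):
        print(f"fixed_frac={fixed:.1f}  speedup={sd_speedup(fixed, 3.0, 2.5):.2f}")
```

Holding the routing term constant, a larger fixed share yields a larger speedup, which is the sense in which routing analysis alone understates the gain at batch size 1.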
In practice
- Lower $k/N$ to maximize SD benefit at high target batch sizes.
- Increase the routed-to-shared expert ratio for high target batch sizes.
- Favor shared experts for low target batch sizes.
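To see why a lower $k/N$ suits higher batch sizes, one can estimate how many tokens per step it takes before most experts are loaded anyway, at which point sparsity no longer saves bandwidth. The sketch below is a rough guide, assuming uniform, independent routing (which ignores temporal correlation and load imbalance); the function name, the 90% threshold, and the expert counts are hypothetical.

```python
import math


def tokens_to_load_fraction(k: int, n_experts: int, fraction: float) -> float:
    """Tokens per step at which the expected share of loaded experts
    reaches `fraction`, assuming uniform, independent top-k routing.

    Solves 1 - (1 - k/N)**T = fraction for T.
    """
    if not 0 < k < n_experts:
        raise ValueError("need 0 < k < n_experts")
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be in (0, 1)")
    return math.log(1.0 - fraction) / math.log(1.0 - k / n_experts)


if __name__ == "__main__":
    K = 3  # draft tokens per step, as in the summary
    for k in (2, 4, 8):
        tokens = tokens_to_load_fraction(k, 128, 0.9)
        # Each SD verify step processes (K + 1) tokens per sequence.
        print(f"k/N={k}/128  tokens={tokens:6.1f}  "
              f"approx. batch size={tokens / (K + 1):5.1f}")
```

Halving $k/N$ roughly doubles the batch size at which expert loading saturates, which is consistent with the first recommendation above.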
Topics
- Speculative Decoding
- Mixture-of-Experts
- Expert Routing
- Arithmetic Intensity
- Inference Optimization
Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cohere.com via Google News.