BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding
Summary
BudgetDraft is a multi-view sparse training method for speculative-decoding drafters at mid-to-long context lengths (4K-16K). In resource-constrained deployments, the drafter runs on a sparse KV cache while the verifier uses the full KV cache; this sparse/full mismatch lowers acceptance rates, and the degradation grows with context length. BudgetDraft exposes the drafter to multiple sampled KV budgets during training and aligns each sparse view with a shared full-cache teacher target. Its training objective combines an acceptance-aware loss on a full-cache branch with a multi-view loss on a sparse-cache branch. Reported end-to-end speedups over autoregressive decoding reach up to 6.55x, 4.46x, and 2.10x at 4K, 8K, and 16K context lengths, respectively, on the PG-19, LongBench, and LWM benchmarks, while maintaining memory efficiency.
Key takeaway
For MLOps Engineers deploying LLMs with speculative decoding at mid-to-long contexts (4K-16K) under GPU memory constraints, BudgetDraft is worth evaluating. It is reported to recover acceptance rates and speed up inference by up to 6.55x at 4K context length without increasing your memory footprint. The drafter becomes robust to sparse KV caches through training alone, so no extra inference-time components are needed.
Key insights
BudgetDraft uses multi-view sparse training to align sparse and full KV caches, improving speculative decoding acceptance rates.
Principles
- Sparse/full KV cache mismatch degrades speculative decoding performance.
- Training with varied KV budgets improves drafter robustness across sparsity levels.
- Aligning sparse views with a full-cache teacher enhances acceptance rates.
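To see why a drop in acceptance rate matters, the sketch below uses the standard simplified model from the speculative decoding literature, in which each drafted token is accepted independently with probability alpha. This model and the function name `expected_tokens_per_step` are assumptions for illustration; they are not taken from the BudgetDraft article, and the alpha values are made up.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verifier pass.

    alpha: per-token acceptance probability (assumed i.i.d., a simplification).
    gamma: number of tokens the drafter proposes per pass.
    The verifier always contributes one token, so the result is at least 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the verifier's bonus token.
        return float(gamma + 1)
    # Closed form of the geometric series 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Illustrative values only: a well-matched drafter vs. a mismatched one.
    for alpha in (0.8, 0.6):
        print(alpha, expected_tokens_per_step(alpha, gamma=4))
```

With four drafted tokens per pass, lowering alpha from 0.8 to 0.6 cuts the expected yield from roughly 3.36 to roughly 2.31 tokens per verifier pass, which is why recovering acceptance under a sparse cache translates into end-to-end speedup.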
Method
BudgetDraft trains a drafter with multiple sampled KV budgets, aligning each sparse view to a shared full-cache teacher. It combines an acceptance-aware loss on a full-cache branch with a multi-view loss on a sparse-cache branch.
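The two-branch objective described above can be sketched in a few lines of NumPy. This is a toy illustration under several assumptions, not the authors' implementation: the drafter is a single attention read over the cache, a sparse view keeps only the most recent `budget` cache entries, the multi-view term is a KL divergence to the teacher, and the acceptance-aware term is stood in for by the expected rejection probability of standard speculative sampling, 1 - sum(min(p, q)). The summary does not give the actual loss definitions, and all names here (`toy_drafter_probs`, `budgetdraft_style_loss`, `lam`) are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def toy_drafter_probs(query, keys, values, w_out, budget=None):
    """Toy drafter: one attention read over the KV cache, then a vocab projection.

    budget=None uses the full cache. Otherwise only the most recent `budget`
    entries are kept (an assumed sparsification rule for this sketch).
    """
    if budget is not None:
        keys, values = keys[-int(budget):], values[-int(budget):]
    scores = keys @ query / np.sqrt(query.shape[-1])
    context = softmax(scores) @ values
    return softmax(context @ w_out)


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def rejection_prob(p_target, q_draft):
    """Expected rejection probability under standard speculative sampling."""
    return float(1.0 - np.minimum(p_target, q_draft).sum())


def budgetdraft_style_loss(query, keys, values, w_out, teacher_probs, budgets, lam=1.0):
    # Full-cache branch: stand-in for the acceptance-aware loss.
    full = toy_drafter_probs(query, keys, values, w_out)
    loss_full = rejection_prob(teacher_probs, full)
    # Sparse-cache branch: every sampled budget is aligned to the same teacher target.
    sparse_terms = [
        kl(teacher_probs, toy_drafter_probs(query, keys, values, w_out, b))
        for b in budgets
    ]
    return loss_full + lam * float(np.mean(sparse_terms))


rng = np.random.default_rng(0)
seq_len, d, vocab = 64, 16, 32
keys = rng.normal(size=(seq_len, d))
values = rng.normal(size=(seq_len, d))
query = rng.normal(size=d)
w_out = rng.normal(size=(d, vocab))
teacher_probs = softmax(rng.normal(size=vocab))  # placeholder full-cache teacher target
budgets = rng.choice([8, 16, 32], size=3)        # KV budgets sampled for this step
loss = budgetdraft_style_loss(query, keys, values, w_out, teacher_probs, budgets)
print(loss)
```

In a real training loop the drafter would be a trainable network and this scalar would be backpropagated in an autodiff framework. The point of the sketch is only the structure: one shared teacher target, one full-cache term, and an average over several sparse views with different budgets.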
In practice
- Apply multi-view training to improve sparse KV cache performance.
- Consider acceptance-aware loss for speculative decoding drafters.
- Utilize BudgetDraft for memory-friendly long-context inference.
Topics
- Speculative Decoding
- Sparse KV Cache
- Multi-View Training
- Long Context Inference
- GPU Memory Optimization
- LLM Speedup
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.