Draft-OPD: On-Policy Distillation for Speculative Draft Models
Summary
Draft-OPD is an on-policy distillation method that speeds up large language model inference by improving the draft models used in speculative decoding. Draft models such as EAGLE-3 and DFlash are commonly trained with supervised fine-tuning (SFT), which often plateaus because of an offline-to-inference mismatch: the drafter learns from fixed, target-generated trajectories but is evaluated on the blocks it proposes itself. Draft-OPD addresses this with two mechanisms: target-assisted rollout, which keeps continuations stable, and replay of drafting from the error positions that verification exposes. The draft model thereby learns from target feedback on both accepted and rejected proposals, which focuses training on the draft-induced errors that limit speculative acceptance. In the reported experiments, Draft-OPD achieves over 5x lossless acceleration for thinking models across diverse tasks, improving upon EAGLE-3 by 23% and DFlash by 13%.
Key takeaway
For machine learning engineers optimizing large language model inference, Draft-OPD is an alternative to training draft models with supervised fine-tuning alone. Consider on-policy distillation when SFT gains plateau, since it directly targets the offline-to-inference mismatch. The reported result is over 5x lossless acceleration for thinking models.
Key insights
Draft-OPD improves speculative draft models by training them on the errors they make under their own policy, as exposed by verification during target-assisted rollouts.
Principles
- Supervised fine-tuning for draft models plateaus due to offline-to-inference mismatch.
- On-policy distillation can address this by learning from draft-induced states.
- Learning from both accepted and rejected proposals is crucial for draft model improvement.
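To make the last principle concrete, here is a minimal sketch of a distillation loss averaged over every position the draft model visited, whether its proposal was accepted or rejected. The summary does not state the loss Draft-OPD uses, so the forward KL divergence, the function names, and the toy distributions below are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """Forward KL(p || q) between two discrete distributions (assumed loss, not from the source)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(target_dists: List[Sequence[float]],
                      draft_dists: List[Sequence[float]]) -> float:
    """Mean KL(target || draft) over the given positions."""
    if len(target_dists) != len(draft_dists) or not target_dists:
        raise ValueError("need the same non-zero number of target and draft distributions")
    per_position = [kl_divergence(t, d) for t, d in zip(target_dists, draft_dists)]
    return sum(per_position) / len(per_position)


# Hypothetical two-token vocabulary, two draft-visited positions.
# Position 0: draft agrees with the target (proposal accepted).
# Position 1: draft disagrees with the target (proposal rejected).
target_dists = [[0.9, 0.1], [0.1, 0.9]]
draft_dists = [[0.9, 0.1], [0.8, 0.2]]
accepted = [True, False]

loss_all = distillation_loss(target_dists, draft_dists)
loss_accepted_only = distillation_loss(
    [t for t, a in zip(target_dists, accepted) if a],
    [d for d, a in zip(draft_dists, accepted) if a],
)
```

In this toy case the accepted-only loss is zero, so the entire training signal comes from the rejected position. That is the intuition behind learning from both kinds of proposal.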
Method
Draft-OPD uses target-assisted rollout for stable continuations and replays drafting from verification-exposed error positions, focusing training on draft-induced errors that limit acceptance.
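One way to read this description is as a loop in which the draft proposes a block, the target verifies it, the target's token replaces the first mismatch, and the mismatch position is recorded for later replay. The sketch below follows that reading under greedy verification with toy next-token functions. All names (`verify_block`, `target_assisted_rollout`, `replay_contexts`, the toy models) are hypothetical, and the actual Draft-OPD procedure may differ in its details.

```python
from typing import Callable, List, Optional, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def verify_block(context: List[Token], proposal: List[Token],
                 target_next: NextToken) -> Tuple[int, Optional[Token]]:
    """Greedy verification: return (accepted count, target token at first mismatch or None)."""
    ctx = list(context)
    for i, tok in enumerate(proposal):
        expected = target_next(ctx)
        if tok != expected:
            return i, expected
        ctx.append(tok)
    return len(proposal), None


def target_assisted_rollout(prompt: List[Token], draft_next: NextToken,
                            target_next: NextToken, block_size: int,
                            max_new_tokens: int) -> Tuple[List[Token], List[int]]:
    """Roll out with draft proposals, letting the target correct the first error in each block."""
    seq = list(prompt)
    error_positions: List[int] = []
    while len(seq) - len(prompt) < max_new_tokens:
        proposal: List[Token] = []
        for _ in range(block_size):
            proposal.append(draft_next(seq + proposal))
        accepted, correction = verify_block(seq, proposal, target_next)
        seq.extend(proposal[:accepted])
        if correction is not None:
            error_positions.append(len(seq))  # index where the draft diverged
            seq.append(correction)            # target keeps the continuation on track
        # Simplification: no bonus target token is added when the whole block is accepted.
    limit = len(prompt) + max_new_tokens
    return seq[:limit], [p for p in error_positions if p < limit]


def replay_contexts(seq: List[Token], error_positions: List[int]) -> List[List[Token]]:
    """Prefixes ending just before each exposed error, from which drafting would be replayed."""
    return [seq[:p] for p in error_positions]


# Toy models (assumptions): the target counts upward mod 10; the draft errs after token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


rollout, errors = target_assisted_rollout([0], toy_draft, toy_target,
                                          block_size=4, max_new_tokens=8)
contexts = replay_contexts(rollout, errors)
```

The replay contexts are the states where the draft actually failed under its own policy. Training would then query the target at those states rather than only along fixed target-generated trajectories.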
In practice
- Focus draft model training on errors exposed during verification.
- Utilize target-assisted rollouts for stable sequence generation.
- Consider on-policy distillation to overcome SFT limitations.
Topics
- Speculative Decoding
- Large Language Models
- On-Policy Distillation
- Draft Models
- Inference Acceleration
- Supervised Fine-Tuning
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.