Draft-OPD: On-Policy Distillation for Speculative Draft Models

2026-05-28 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Draft-OPD is an on-policy distillation method for training the draft models used in speculative decoding, with the goal of faster large language model inference. Draft models such as EAGLE-3 and DFlash are typically trained with supervised fine-tuning (SFT), which often plateaus because of an offline-to-inference mismatch: the drafter learns from fixed target-generated trajectories but is evaluated on the blocks it proposes itself. Draft-OPD addresses this in two ways. It uses target-assisted rollout to keep continuations stable, and it replays drafting from the error positions that verification exposes. The draft model therefore receives target feedback on both accepted and rejected proposals, which concentrates training on the draft-induced errors that limit speculative acceptance. In the reported experiments, Draft-OPD achieves over 5x lossless acceleration for thinking models across diverse tasks, improving on EAGLE-3 by 23% and on DFlash by 13%.
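To make concrete why acceptance limits the achievable speedup, here is a minimal sketch of the standard expected-tokens-per-step estimate from the speculative decoding literature. It is not taken from the article, and it assumes each drafted token is accepted independently with the same probability alpha, which real drafters only approximate. The function name and parameters are illustrative.

```python
def expected_tokens_per_step(alpha: float, block_size: int) -> float:
    """Expected tokens emitted per target verification step.

    Assumes (illustratively) that every drafted token is accepted
    independently with probability `alpha`. The step always emits at
    least one token, because the target supplies a correction or a
    bonus token after the accepted prefix.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if block_size < 0:
        raise ValueError("block_size must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one bonus token.
        return float(block_size + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^block_size
    return (1.0 - alpha ** (block_size + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for a in (0.6, 0.8, 0.9):
        print(a, round(expected_tokens_per_step(a, 7), 2))
```

Under this simplified model, with a block of 7 drafted tokens, raising alpha from 0.8 to 0.9 moves the estimate from roughly 4.2 to roughly 5.7 tokens per step. These numbers are illustrative only and are not results from the article; tokens per step is also not the same as wall-clock speedup, since drafting itself has a cost.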

Key takeaway

For machine learning engineers optimizing LLM inference, Draft-OPD targets the plateau that SFT-trained draft models tend to hit. If your drafter's acceptance has stopped improving under SFT, consider on-policy distillation, since it directly addresses the offline-to-inference mismatch. The reported experiments show over 5x lossless acceleration for thinking models, which translates into higher throughput for LLM deployments.

Key insights

Draft-OPD improves speculative draft models by learning from their own policy's errors during target-assisted rollouts.

Principles

Supervised fine-tuning for draft models plateaus due to offline-to-inference mismatch.
On-policy distillation can address this by learning from draft-induced states.
Learning from both accepted and rejected proposals is crucial for draft model improvement.
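The principles above do not name a specific loss. As a minimal sketch, assuming a forward KL divergence from the target distribution to the draft distribution (an assumption on our part, not something the summary states), a distillation objective over draft-induced positions could look like the following. All function names are hypothetical, and plain Python lists stand in for model logits.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def distillation_loss(
    target_logits: Sequence[Sequence[float]],
    draft_logits: Sequence[Sequence[float]],
) -> float:
    """Mean KL(target || draft) over positions.

    ASSUMPTION: forward KL is used here for illustration only. Positions are
    included whether the drafted token was accepted or rejected, so rejected
    proposals also contribute a learning signal.
    """
    if len(target_logits) != len(draft_logits) or not target_logits:
        raise ValueError("need the same non-zero number of positions")
    per_position = [
        kl_divergence(softmax(t), softmax(d))
        for t, d in zip(target_logits, draft_logits)
    ]
    return sum(per_position) / len(per_position)
```

The point of the sketch is the input, not the formula: the logits are gathered at states the draft model reached by its own proposals, rather than at states taken from fixed target-generated trajectories.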

Method

Draft-OPD uses target-assisted rollout for stable continuations and replays drafting from verification-exposed error positions, focusing training on draft-induced errors that limit acceptance.
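The method description above can be read as two phases: a rollout in which the target verifies and corrects the draft, and a replay in which the draft is re-run from each position where verification rejected it. The sketch below is one possible interpretation under that reading, not the authors' implementation. `rollout`, `replay`, `Example`, and the toy models are hypothetical names; greedy token matching stands in for real verification, and the target is called once per token for clarity, whereas a real system would score a whole block in one forward pass.

```python
from typing import Callable, List, NamedTuple, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # toy greedy next-token function


class Example(NamedTuple):
    context: Tuple[Token, ...]  # state the draft was in when it proposed
    draft_token: Token          # what the draft proposed
    target_token: Token         # target feedback for that state
    accepted: bool              # whether the proposal matched the target


def rollout(target: Model, draft: Model, prompt: List[Token],
            block_size: int, max_new: int):
    """Target-assisted rollout: the draft proposes, the target verifies and corrects."""
    seq = list(prompt)
    examples: List[Example] = []
    error_positions: List[int] = []
    while len(seq) - len(prompt) < max_new:
        block: List[Token] = []
        for _ in range(block_size):
            block.append(draft(seq + block))
        for tok in block:
            expected = target(seq)
            examples.append(Example(tuple(seq), tok, expected, tok == expected))
            # Accepted tokens equal `expected`, so seq always stays on the target's path.
            seq.append(expected)
            if tok != expected:
                error_positions.append(len(seq) - 1)  # index of the corrected token
                break
        else:
            seq.append(target(seq))  # whole block accepted: target adds a bonus token
    return seq, examples, error_positions


def replay(target: Model, draft: Model, seq: List[Token],
           error_positions: List[int], block_size: int) -> List[Example]:
    """Re-draft from each verification-exposed error position.

    ASSUMPTION: the draft keeps following its own tokens during replay, so the
    target labels states that the draft itself induced.
    """
    examples: List[Example] = []
    for pos in error_positions:
        ctx = list(seq[:pos])  # prefix right before the rejected token
        for _ in range(block_size):
            proposed, expected = draft(ctx), target(ctx)
            examples.append(Example(tuple(ctx), proposed, expected, proposed == expected))
            ctx.append(proposed)
    return examples


def toy_target(ctx: List[Token]) -> Token:
    return (3 * ctx[-1] + 1) % 7


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the toy target except after token 2, where it errs.
    return 3 if ctx[-1] == 2 else toy_target(ctx)


if __name__ == "__main__":
    seq, on_policy, errors = rollout(toy_target, toy_draft, [1], block_size=3, max_new=8)
    replayed = replay(toy_target, toy_draft, seq, errors, block_size=3)
    print(seq, errors, len(on_policy), len(replayed))
```

In this toy setup the rollout output matches what the target alone would generate, which mirrors the lossless property, while the replayed examples cover the states around the draft's mistake. A real training loop would presumably compare full distributions at those states rather than single tokens.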

In practice

Focus draft model training on errors exposed during verification.
Utilize target-assisted rollouts for stable sequence generation.
Consider on-policy distillation to overcome SFT limitations.

Topics

Speculative Decoding
Large Language Models
On-Policy Distillation
Draft Models
Inference Acceleration
Supervised Fine-Tuning

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.