A Survey of On-Policy Distillation for Large Language Models
Summary
This comprehensive survey reviews On-Policy Distillation (OPD) for Large Language Models (LLMs), an approach that addresses the exposure bias inherent in traditional off-policy methods. Off-policy distillation trains student models on static, teacher-generated data, so at inference time the student conditions on its own, possibly flawed, prefixes and its prediction errors compound during autoregressive generation. OPD, grounded in interactive imitation learning, instead has students generate their own sequences and receive iterative teacher feedback on them. The survey unifies the fragmented OPD literature through an $f$-divergence framework over on-policy samples. It categorizes methods along three orthogonal dimensions: feedback signal (logit-based, outcome-based, self-play), teacher access (white-box, black-box, teacher-free), and loss granularity (token-level, sequence-level, hybrid). The analysis covers representative techniques such as GKD, MiniLLM, and SPIN, examines industrial applications such as DeepSeek-R1's transfer of reasoning from a 671-billion-parameter teacher to students of 1.5 to 70 billion parameters, and identifies future research directions.
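As a rough illustration of what an $f$-divergence objective over on-policy samples can look like, one common token-level form is written below. The notation is an assumption made here for illustration ($\pi_\theta$ for the student, $\pi_T$ for the teacher, $D_f$ for the chosen divergence) and may differ from the survey's own symbols.

$$
\mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|y|} D_f\big( \pi_T(\cdot \mid x, y_{<t}) \,\|\, \pi_\theta(\cdot \mid x, y_{<t}) \big) \right]
$$

The key difference from off-policy distillation is the sampling distribution: the sequences $y$ come from the student $\pi_\theta$ itself rather than from a fixed teacher-generated dataset. Different choices of $f$ recover forward KL, reverse KL, or Jensen-Shannon divergence.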
Key takeaway
If you deploy smaller LLMs, be aware that traditional off-policy distillation introduces exposure bias, which limits performance on multi-step generation. On-Policy Distillation (OPD) mitigates this by letting student models learn from their own generated outputs. Consider hybrid-granularity losses and choose the $f$-divergence per task, for example Reverse KL for reasoning tasks, to achieve more robust and accurate capability transfer.
Key insights
OPD overcomes off-policy distillation's exposure bias by enabling LLMs to learn from self-generated trajectories with teacher feedback.
Principles
- Off-policy training creates train-test mismatch.
- On-policy feedback reduces autoregressive error.
- Divergence choice determines whether the student is mode-seeking or mode-covering.
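The last principle above can be made concrete with a small numeric sketch. The teacher and the two students below are made-up three-token distributions, not values from the survey; the point is only to show how forward KL and reverse KL rank the same pair of students differently.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical teacher with two strong modes (tokens 0 and 2).
teacher = [0.495, 0.01, 0.495]

# Two hypothetical students that cannot match the teacher exactly.
mode_covering = [0.34, 0.32, 0.34]  # spreads mass over every token
mode_seeking = [0.98, 0.01, 0.01]   # commits to a single teacher mode

for name, student in [("covering", mode_covering), ("seeking", mode_seeking)]:
    forward = kl(teacher, student)  # forward KL: KL(teacher || student)
    reverse = kl(student, teacher)  # reverse KL: KL(student || teacher)
    print(f"{name:9s} forward={forward:.3f} reverse={reverse:.3f}")
```

With these numbers, forward KL assigns the lower loss to the spread-out student, while reverse KL assigns the lower loss to the student that commits to one mode. That is the sense in which reverse KL is called mode-seeking.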
Method
In OPD, the student LLM generates its own trajectories, receives teacher feedback on them (logit-based, outcome-based, or self-play), and uses that feedback to refine its policy iteratively.
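The loop described above can be sketched with a toy tabular model. Everything here is an assumption for illustration: the three-token vocabulary, the `TEACHER` table, the `opd_train` function, and the choice of a logit-based, token-level reverse KL update. It is not the procedure of any specific method named in the survey.

```python
import math
import random

VOCAB = 3  # toy vocabulary with token ids 0..2

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(q, p):
    """KL(q || p), with q the student and p the teacher."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical teacher: next-token distribution given the previous token.
TEACHER = {
    0: [0.1, 0.8, 0.1],
    1: [0.1, 0.1, 0.8],
    2: [0.8, 0.1, 0.1],
}

def opd_train(steps=2000, seq_len=5, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Student: one logit vector per context (previous token), starting uniform.
    student = {ctx: [0.0] * VOCAB for ctx in range(VOCAB)}
    for _ in range(steps):
        prev = 0  # fixed start token
        for _ in range(seq_len):
            q = softmax(student[prev])
            p = TEACHER[prev]  # teacher feedback on a state the student reached
            kl = reverse_kl(q, p)
            # Exact gradient of KL(q || p) with respect to the student logits.
            for k in range(VOCAB):
                grad = q[k] * (math.log(q[k] / p[k]) - kl)
                student[prev][k] -= lr * grad
            # On-policy step: the next state is sampled from the student.
            prev = rng.choices(range(VOCAB), weights=q)[0]
    return student

if __name__ == "__main__":
    trained = opd_train()
    for ctx in range(VOCAB):
        print(ctx, [round(x, 3) for x in softmax(trained[ctx])])
```

The detail that makes this on-policy is the last line of the inner loop: the student's own sample decides which state is visited next, so the teacher is only ever queried on prefixes the student actually produces.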
In practice
- Match $f$-divergence to task (e.g., Reverse KL for math).
- Combine token and sequence losses for complex reasoning.
- Leverage privileged information in self-distillation.
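As a minimal sketch of the second recommendation, the snippet below mixes a token-level term with a sequence-level term. The function names, the mixing weight `alpha`, and the use of a reward-weighted sequence log-likelihood as the sequence-level term are all assumptions made for illustration; this summary does not specify how the surveyed hybrid methods combine the two.

```python
import math

def token_level_loss(student_probs, teacher_probs):
    """Mean per-position reverse KL, KL(student || teacher)."""
    total = 0.0
    for q, p in zip(student_probs, teacher_probs):
        total += sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)
    return total / len(student_probs)

def sequence_level_loss(student_probs, tokens, reward):
    """Reward-weighted negative log-likelihood of the whole sampled sequence."""
    log_prob = sum(math.log(q[t]) for q, t in zip(student_probs, tokens))
    return -reward * log_prob

def hybrid_loss(student_probs, teacher_probs, tokens, reward, alpha=0.5):
    """Convex mix of the two terms; alpha is a hypothetical tuning knob."""
    token_term = token_level_loss(student_probs, teacher_probs)
    sequence_term = sequence_level_loss(student_probs, tokens, reward)
    return alpha * token_term + (1.0 - alpha) * sequence_term

# Made-up two-step example over a three-token vocabulary.
student_probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
teacher_probs = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
tokens = [0, 1]   # tokens the student actually sampled
reward = 1.0      # e.g. 1.0 if the final answer was judged correct

print(hybrid_loss(student_probs, teacher_probs, tokens, reward))
```

Setting `alpha` to 1.0 recovers pure token-level distillation and 0.0 recovers the pure sequence-level term, which makes it easy to compare the two extremes on the same samples.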
Topics
- On-Policy Distillation
- Large Language Models
- Knowledge Distillation
- Exposure Bias
- f-Divergence
- Self-Distillation
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.