A Survey of On-Policy Distillation for Large Language Models

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Natural Language Processing) · Depth: Expert, extended

Summary

This survey reviews On-Policy Distillation (OPD) for Large Language Models (LLMs), a family of methods that addresses the exposure bias inherent in traditional off-policy distillation. Off-policy distillation trains the student on static, teacher-generated data, so the student never sees its own mistakes during training, and its prediction errors compound during autoregressive inference. OPD, grounded in interactive imitation learning, has the student generate its own sequences and receive iterative teacher feedback on them. The survey unifies the fragmented OPD literature through an $f$-divergence framework over on-policy samples. It categorizes methods along three orthogonal dimensions: feedback signal (logit-based, outcome-based, self-play), teacher access (white-box, black-box, teacher-free), and loss granularity (token-level, sequence-level, hybrid). The analysis covers representative techniques such as GKD, MiniLLM, and SPIN, examines industrial applications such as DeepSeek-R1's transfer of reasoning from a 671-billion-parameter teacher to students of 1.5 billion to 70 billion parameters, and identifies future research directions.
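One way to write the $f$-divergence view at token level is sketched below. The notation is an assumption made for illustration ($\pi_T$ for the teacher, $\pi_\theta$ for the student, $\mathcal{D}$ for the prompt set, $\mathcal{V}$ for the vocabulary) and may differ from the survey's own formulation. The defining feature is the inner expectation, which is taken over sequences sampled from the student rather than from the teacher.

```latex
\[
\mathcal{L}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}}\;
    \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      D_f\!\left( \pi_T(\cdot \mid x, y_{<t}) \,\middle\|\, \pi_\theta(\cdot \mid x, y_{<t}) \right)
    \right],
\qquad
D_f(P \,\|\, Q) = \sum_{v \in \mathcal{V}} Q(v)\, f\!\left( \frac{P(v)}{Q(v)} \right).
\]
% f(u) = u \log u  gives forward KL:  KL(\pi_T \| \pi_\theta)
% f(u) = -\log u   gives reverse KL:  KL(\pi_\theta \| \pi_T)
```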

Key takeaway

Machine learning engineers deploying smaller LLMs should recognize that traditional off-policy distillation introduces exposure bias, which limits performance on multi-step generation. On-Policy Distillation (OPD) mitigates this by letting the student model learn from its own generated outputs. Consider hybrid-granularity losses, and choose the $f$-divergence per task (for example, reverse KL for reasoning tasks) to achieve more robust and accurate capability transfer.
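To make the choice of divergence concrete, here is a minimal sketch comparing forward and reverse KL on a single next-token distribution. The three-token distributions are invented for illustration and do not come from the survey. The teacher is bimodal; one hypothetical student commits to a single mode and the other spreads its mass evenly.

```python
import math

def forward_kl(teacher, student):
    """KL(teacher || student): penalizes the student for missing teacher mass."""
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)

def reverse_kl(teacher, student):
    """KL(student || teacher): penalizes student mass where the teacher has little."""
    return sum(q * math.log(q / p) for p, q in zip(teacher, student) if q > 0)

# Illustrative (made-up) distributions over a 3-token vocabulary.
teacher = [0.495, 0.01, 0.495]   # two strong modes, one unlikely token
peaked = [0.98, 0.01, 0.01]      # student that commits to a single mode
spread = [0.34, 0.32, 0.34]      # student that covers everything evenly

for name, student in [("peaked", peaked), ("spread", spread)]:
    print(f"{name}: forward={forward_kl(teacher, student):.3f} "
          f"reverse={reverse_kl(teacher, student):.3f}")
```

In this toy case forward KL scores the spread student better, while reverse KL scores the peaked student better. This is the usual mass-covering versus mode-seeking contrast, and it is one plausible reading of why reverse KL is suggested for reasoning tasks, where committing to one valid answer can matter more than covering all of them.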

Key insights

OPD overcomes off-policy distillation's exposure bias by enabling LLMs to learn from self-generated trajectories with teacher feedback.

Method

In OPD, the student LLM generates its own trajectories, receives feedback on those outputs (teacher logits, outcome signals, or self-play), and uses that feedback to refine its policy iteratively.
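The loop described here can be sketched with a toy tabular model. This is not the implementation of GKD, MiniLLM, or any other method in the survey. It assumes a logit-based, white-box teacher and a token-level forward-KL loss, and every name in it (`TEACHER`, `student_rollout`, `opd_step`, `train`) is hypothetical.

```python
import math
import random

VOCAB = 3  # hypothetical toy vocabulary: token ids 0, 1, 2

# Hypothetical white-box teacher: next-token distribution given the previous token.
TEACHER = {
    0: [0.1, 0.8, 0.1],
    1: [0.1, 0.1, 0.8],
    2: [0.8, 0.1, 0.1],
}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def student_rollout(logits, start, length, rng):
    """On-policy step: sample a trajectory from the student's own policy."""
    prev, contexts = start, []
    for _ in range(length):
        contexts.append(prev)
        probs = softmax(logits[prev])
        prev = rng.choices(range(VOCAB), weights=probs, k=1)[0]
    return contexts

def opd_step(logits, contexts, lr):
    """Feedback step: query the teacher on the contexts the student visited."""
    for ctx in contexts:
        q = softmax(logits[ctx])
        p = TEACHER[ctx]
        # Gradient of KL(p || softmax(z)) with respect to z is softmax(z) - p.
        for v in range(VOCAB):
            logits[ctx][v] -= lr * (q[v] - p[v])

def mean_kl(logits):
    """Average forward KL between teacher and student over all contexts."""
    total = 0.0
    for ctx, p in TEACHER.items():
        q = softmax(logits[ctx])
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return total / len(TEACHER)

def train(steps=500, length=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = {ctx: [0.0] * VOCAB for ctx in range(VOCAB)}  # uniform student
    before = mean_kl(logits)
    for _ in range(steps):
        contexts = student_rollout(logits, rng.randrange(VOCAB), length, rng)
        opd_step(logits, contexts, lr)
    return before, mean_kl(logits)

if __name__ == "__main__":
    before, after = train()
    print(f"mean KL before: {before:.4f}, after: {after:.4f}")
```

The design point is in `student_rollout`: the contexts used for training come from the student's own samples, so the teacher's feedback lands on the states the student will actually reach at inference time. An off-policy variant would instead draw the contexts from teacher-generated sequences.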

Topics

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.