A Survey of On-Policy Distillation for Large Language Models

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

A comprehensive survey reviews On-Policy Distillation (OPD) for Large Language Models (LLMs), which addresses the exposure bias inherent in traditional off-policy methods. Off-policy distillation trains student models on static, teacher-generated data, so the student never sees its own mistakes during training and its prediction errors compound during autoregressive inference. OPD, grounded in interactive imitation learning, has students generate their own sequences and receive iterative teacher feedback on them. The survey unifies the fragmented OPD literature through an $f$-divergence framework over on-policy samples. It categorizes methods along three orthogonal dimensions: feedback signal (logit-based, outcome-based, self-play), teacher access (white-box, black-box, teacher-free), and loss granularity (token-level, sequence-level, hybrid). The analysis covers representative techniques such as GKD, MiniLLM, and SPIN, examines industrial applications such as DeepSeek-R1's transfer of reasoning from a 671-billion-parameter teacher to students ranging from 1.5 billion to 70 billion parameters, and identifies future research directions.
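
As a rough illustration of what an $f$-divergence objective over on-policy samples can look like, the block below writes out one common token-level form. The notation is an assumption made for this summary and is not taken from the survey: $\pi_T$ is the teacher, $\pi_\theta$ the student, $\mathcal{D}$ a prompt set, and $f$ a convex function with $f(1) = 0$.

```latex
\mathcal{L}_{\mathrm{OPD}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      D_f\!\left( \pi_T(\cdot \mid x, y_{<t}) \,\middle\|\, \pi_\theta(\cdot \mid x, y_{<t}) \right)
    \right],
\qquad
D_f(P \,\|\, Q) = \sum_{v} Q(v)\, f\!\left( \frac{P(v)}{Q(v)} \right)
```

Under this convention, $f(u) = u \log u$ gives the forward KL, $\mathrm{KL}(\pi_T \,\|\, \pi_\theta)$, and $f(u) = -\log u$ gives the reverse KL, $\mathrm{KL}(\pi_\theta \,\|\, \pi_T)$. The key difference from off-policy distillation is that the expectation is taken over sequences $y$ sampled from the student rather than from the teacher or a fixed dataset.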

Key takeaway

Machine learning engineers deploying smaller LLMs should recognize that traditional off-policy distillation introduces exposure bias, which limits performance on multi-step generation. On-Policy Distillation (OPD) mitigates this by letting student models learn from their own generated outputs. Consider implementing hybrid-granularity losses and adaptively choosing the $f$-divergence to suit the task, such as reverse KL for reasoning, to achieve more robust and accurate capability transfer.

Key insights

OPD overcomes off-policy distillation's exposure bias by enabling LLMs to learn from self-generated trajectories with teacher feedback.

Principles

Off-policy training creates train-test mismatch.
On-policy feedback reduces autoregressive error.
Divergence choice determines whether the student is mode-seeking or mode-covering.
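
The effect of divergence choice can be checked numerically. The sketch below is a toy example and does not come from the survey: the `teacher`, `covering`, and `seeking` distributions are hypothetical values chosen so that the teacher has two modes and each candidate student can only approximate it.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical teacher next-token distribution with two modes (tokens 0 and 2).
teacher = [0.49, 0.02, 0.49]

# Two hypothetical students that cannot match the teacher exactly.
covering = [1 / 3, 1 / 3, 1 / 3]  # spreads mass over every token
seeking = [0.96, 0.02, 0.02]      # commits to a single teacher mode

for name, student in [("covering", covering), ("seeking", seeking)]:
    forward = kl(teacher, student)  # forward KL: KL(teacher || student)
    reverse = kl(student, teacher)  # reverse KL: KL(student || teacher)
    print(f"{name:9s} forward={forward:.3f} reverse={reverse:.3f}")
```

In this toy setting, forward KL scores the covering student as the better fit, while reverse KL prefers the student that commits to one mode. That is the usual intuition behind choosing reverse KL when a confident single answer matters more than coverage.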

Method

In OPD, the student LLM generates its own trajectories, receives teacher feedback on these outputs (logit-based, outcome-based, or self-play), and iteratively refines its policy.
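
The loop described above can be sketched on a toy problem. This is a minimal illustration under strong assumptions, not the procedure of any specific method in the survey: the "models" are hypothetical bigram tables over a three-token vocabulary, the teacher is white-box (full next-token distributions are available), the loss is token-level reverse KL, and sampled contexts are treated as fixed, so no gradient flows through sampling.

```python
import math
import random

VOCAB = 3  # toy vocabulary with token ids 0..2


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(q, p):
    """KL(q || p): student q measured against teacher p."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Hypothetical teacher: next-token distribution given the previous token.
TEACHER = {
    0: [0.1, 0.8, 0.1],
    1: [0.1, 0.1, 0.8],
    2: [0.8, 0.1, 0.1],
}


def train(steps=2000, seq_len=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Student: one logit vector per previous token, initialised to uniform.
    logits = {prev: [0.0] * VOCAB for prev in range(VOCAB)}
    for _ in range(steps):
        # 1) The student samples its own trajectory (on-policy generation).
        prev, visited = 0, []
        for _ in range(seq_len):
            visited.append(prev)
            q = softmax(logits[prev])
            prev = rng.choices(range(VOCAB), weights=q)[0]
        # 2) The teacher is queried only at contexts the student visited.
        # 3) The student takes a gradient step on reverse KL at each one.
        for ctx in visited:
            q = softmax(logits[ctx])
            p = TEACHER[ctx]
            kl = reverse_kl(q, p)
            for k in range(VOCAB):
                # Analytic gradient of KL(softmax(z) || p) w.r.t. logit z_k.
                grad = q[k] * (math.log(q[k] / p[k]) - kl)
                logits[ctx][k] -= lr * grad / len(visited)
    return logits


if __name__ == "__main__":
    learned = train()
    for ctx in range(VOCAB):
        print(ctx, [round(x, 3) for x in softmax(learned[ctx])])
```

The point of the sketch is the data flow rather than the model: training contexts come from the student's own samples, so the student is corrected in exactly the states it reaches at inference time. An off-policy variant would instead draw `visited` from teacher-generated sequences.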

In practice

Match $f$-divergence to task (e.g., Reverse KL for math).
Combine token and sequence losses for complex reasoning.
Leverage privileged information in self-distillation.
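
One way to read "combine token and sequence losses" is as a weighted sum of a token-level divergence and a sequence-level outcome term. The sketch below is an assumption about what such a hybrid could look like, not a formula from the survey: `alpha`, `reward`, and `baseline` are hypothetical parameters, and the sequence term is a plain REINFORCE-style surrogate.

```python
import math


def token_reverse_kl(student_probs, teacher_probs):
    """Mean over positions of KL(student || teacher) on next-token distributions."""
    per_token = [
        sum(q * math.log(q / p) for q, p in zip(q_t, p_t) if q > 0)
        for q_t, p_t in zip(student_probs, teacher_probs)
    ]
    return sum(per_token) / len(per_token)


def sequence_term(student_probs, sampled_tokens, reward, baseline=0.0):
    """REINFORCE-style surrogate: -(reward - baseline) * log-prob of the sample."""
    log_prob = sum(math.log(q_t[y]) for q_t, y in zip(student_probs, sampled_tokens))
    return -(reward - baseline) * log_prob


def hybrid_loss(student_probs, teacher_probs, sampled_tokens, reward,
                alpha=0.5, baseline=0.0):
    """Weighted sum of the token-level and sequence-level terms (alpha in [0, 1])."""
    tok = token_reverse_kl(student_probs, teacher_probs)
    seq = sequence_term(student_probs, sampled_tokens, reward, baseline)
    return alpha * tok + (1.0 - alpha) * seq


# Usage with hypothetical numbers: a two-token sample over a three-token vocabulary.
student = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
teacher = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]]
print(hybrid_loss(student, teacher, sampled_tokens=[0, 1], reward=1.0, alpha=0.5))
```

Setting `alpha=1.0` recovers a purely token-level (logit-based) objective, and `alpha=0.0` a purely outcome-based one. In a real training setup the probabilities would come from model forward passes and the loss would be differentiated by an autograd framework.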

Topics

On-Policy Distillation
Large Language Models
Knowledge Distillation
Exposure Bias
f-Divergence
Self-Distillation

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.