OPRD: On-Policy Representation Distillation
Summary
On-Policy Representation Distillation (OPRD) is a knowledge distillation method for large language models that addresses two limitations of standard On-Policy Distillation (OPD). OPD typically supervises the student only in output space, so its Monte Carlo KL estimates carry persistent sampling variance over large vocabularies (such as Qwen's ~150k tokens), and it treats the teacher as a black box. OPRD instead aligns student and teacher hidden-state representations at selected layers on the same rollouts, bypassing the language model head entirely. In theory, this eliminates the sampling variance and gives the student richer per-layer structural information. Empirically, OPRD closes the student-teacher performance gap on the AIME 2024/2025 and AIMO benchmarks, where output-space OPD baselines plateau. It also trains 1.44x faster and uses 54% less memory than top-k OPD.
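To make the sampling-variance point concrete, here is a minimal sketch of a single-sample Monte Carlo estimate of the reverse KL between a student and a teacher next-token distribution, compared with the exact sum over the vocabulary. Everything in it is an illustrative assumption rather than the article's setup: the toy vocabulary of 1,000 tokens, the random logits, and the helper names `softmax`, `exact_reverse_kl`, and `sampled_reverse_kl` are all hypothetical.

```python
import math
import random
import statistics


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def exact_reverse_kl(student, teacher):
    """KL(student || teacher), summed over the whole vocabulary."""
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0)


def sampled_reverse_kl(student, teacher, rng, num_samples=1):
    """Monte Carlo estimate: mean log-ratio at tokens sampled from the student."""
    tokens = rng.choices(range(len(student)), weights=student, k=num_samples)
    return sum(math.log(student[t] / teacher[t]) for t in tokens) / num_samples


rng = random.Random(0)
vocab_size = 1000  # toy size (assumption); real vocabularies are far larger
student = softmax([rng.gauss(0, 2) for _ in range(vocab_size)])
teacher = softmax([rng.gauss(0, 2) for _ in range(vocab_size)])

exact = exact_reverse_kl(student, teacher)
estimates = [sampled_reverse_kl(student, teacher, rng) for _ in range(5000)]

print(f"exact KL:                        {exact:.3f}")
print(f"mean of single-sample estimates: {statistics.mean(estimates):.3f}")
print(f"std of single-sample estimates:  {statistics.stdev(estimates):.3f}")
```

The estimator is unbiased, so its mean tracks the exact value, but each individual estimate is noisy because it depends on which token was drawn. A loss computed directly on hidden states involves no such draw over the vocabulary.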
Key takeaway
For machine learning engineers distilling large language models, OPRD offers a compelling alternative to output-space on-policy distillation. If sampling variance or memory usage is limiting your distillation runs, consider OPRD's hidden-state alignment. In the reported experiments, it closes the student-teacher performance gap on reasoning benchmarks such as AIME and AIMO while training 1.44x faster and using 54% less memory than top-k OPD.
Key insights
OPRD improves on-policy distillation by aligning hidden states, eliminating sampling variance and providing richer structural information.
Principles
- Hidden-state alignment removes the sampling variance of output-space KL estimates.
- Intermediate representations offer richer distillation signals.
- Bypassing the LM head improves distillation efficiency.
Method
OPRD aligns student and teacher hidden-state representations across selected layers on the same rollouts, entirely bypassing the language model head to eliminate sampling variance.
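The summary does not specify the exact loss, so the following is only a rough sketch of what per-layer hidden-state alignment could look like. It rests on several assumptions not taken from the article: a mean squared distance between L2-normalized vectors, a fixed linear projection from the student's width to the teacher's, and a hand-chosen pairing of student and teacher layers. The names `alignment_loss`, `project`, `layer_pairs`, and `projections` are hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def project(vec, weight):
    """Map a student vector to the teacher's width.

    weight has shape [teacher_dim][student_dim] (an assumed linear projection).
    """
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def alignment_loss(student_hidden, teacher_hidden, layer_pairs, projections):
    """Mean squared distance between normalized hidden states.

    student_hidden[layer][token] and teacher_hidden[layer][token] are vectors
    computed on the same rollout tokens. No vocabulary logits are involved,
    so the LM head is never evaluated.
    """
    total, count = 0.0, 0
    for s_layer, t_layer in layer_pairs:
        weight = projections[s_layer]
        for s_vec, t_vec in zip(student_hidden[s_layer], teacher_hidden[t_layer]):
            s_unit = l2_normalize(project(s_vec, weight))
            t_unit = l2_normalize(t_vec)
            total += sum((a - b) ** 2 for a, b in zip(s_unit, t_unit))
            count += 1
    return total / count


# Toy rollout of two tokens: student layer 2 (width 2) paired with
# teacher layer 4 (width 3). All values are made up for illustration.
teacher_hidden = {4: [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]}
projections = {2: [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]}
layer_pairs = [(2, 4)]

aligned_student = {2: [[1.0, 0.0], [0.0, 1.0]]}
misaligned_student = {2: [[0.0, 1.0], [1.0, 0.0]]}

aligned = alignment_loss(aligned_student, teacher_hidden, layer_pairs, projections)
misaligned = alignment_loss(misaligned_student, teacher_hidden, layer_pairs, projections)
print(f"aligned loss:    {aligned:.6f}")
print(f"misaligned loss: {misaligned:.6f}")
```

For a fixed rollout this loss is a deterministic function of the two models' activations: calling it twice on the same inputs returns the same value, which is the property that a sampled output-space KL estimate lacks. In a real training loop the projection would presumably be learned and the teacher's activations held fixed, but those details are not given in the summary.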
In practice
- Distill large language models more efficiently.
- Improve student model performance on reasoning tasks.
- Reduce memory footprint during distillation.
Topics
- Knowledge Distillation
- Representation Learning
- Large Language Models
- Model Compression
- On-Policy Distillation
- AIME
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.