OPRD: On-Policy Representation Distillation
Summary
On-Policy Representation Distillation (OPRD) is a method designed to overcome two limitations of existing on-policy distillation (OPD) for large language models. Current OPD variants supervise the student only in the output space, which leads to high sampling variance and an information bottleneck at the LM head. OPRD moves supervision into the hidden-state space: on the same on-policy rollouts, it aligns student and teacher intermediate representations across selected layers and response positions. The resulting supervision is dense and deterministic, which removes sampling variance and exposes richer structural information. Empirically, OPRD closes the student–teacher gap on the AIME 2024, AIME 2025, and AIMO mathematics benchmarks, while training 1.44x faster and using up to 54% less actor-update transient memory than top-k OPD on an 8x A100 (80 GB) GPU setup.
Key takeaway
For machine learning engineers optimizing LLM post-training, OPRD is a distillation approach worth evaluating. It can be integrated into a pipeline either on its own or in combination with existing OPD, with reported benefits of higher accuracy, faster training (a 1.44x speed-up), and lower actor-update transient memory (up to 54% less than top-k OPD). By providing a more stable and informative training signal, it closes the student–teacher performance gap on complex reasoning tasks.
Key insights
OPRD lifts on-policy distillation to hidden-state space, providing deterministic, richer supervision and eliminating output-space limitations.
Principles
- Output-space distillation suffers from high sampling variance.
- LM-head projection creates an information bottleneck.
- Hidden-state alignment offers richer, deterministic supervision.
Method
OPRD aligns the student's intermediate hidden representations with the teacher's across selected transformer layers and response positions, using a normalized mean-squared error objective computed on on-policy rollouts.
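To make the objective concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The summary does not specify what "normalized" means, so this sketch assumes each hidden vector is scaled to unit L2 norm before the squared error is taken. It also assumes student and teacher hidden states have the same width, and that layers are matched through a hypothetical `layer_pairs` list. The names `oprd_loss`, `layer_pairs`, and `response_mask` are illustrative.

```python
import math


def l2_normalize(vec, eps=1e-8):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def normalized_mse(student_vec, teacher_vec):
    """Mean-squared error between unit-normalized vectors.

    Assumption: "normalized" is read as L2-normalizing both vectors first.
    """
    s = l2_normalize(student_vec)
    t = l2_normalize(teacher_vec)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


def oprd_loss(student_hidden, teacher_hidden, layer_pairs, response_mask):
    """Average normalized MSE over selected layers and response positions.

    student_hidden[layer][position] and teacher_hidden[layer][position] are
    hidden-state vectors computed on the same student-generated rollout.
    layer_pairs is a list of (student_layer, teacher_layer) indices
    (hypothetical; the source only says "selected layers").
    response_mask[position] is True for response tokens, False for prompt tokens.
    """
    total, count = 0.0, 0
    for s_layer, t_layer in layer_pairs:
        for pos, is_response in enumerate(response_mask):
            if not is_response:
                continue  # supervise response positions only
            total += normalized_mse(
                student_hidden[s_layer][pos], teacher_hidden[t_layer][pos]
            )
            count += 1
    return total / count if count else 0.0
```

Because the loss is a direct function of both models' hidden states on a fixed rollout, evaluating it twice on the same inputs gives the same value, with no token sampling involved in the supervision signal itself. In a real training loop, the teacher states would be treated as constants and gradients would flow only through the student.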
In practice
- Integrate OPRD for multi-model RL merging to reduce memory.
- Apply OPRD in on-policy self-distillation for lower variance.
- Combine OPRD with existing OPD objectives for additive gains.
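One simple way to read the last point is as a weighted sum of the two losses. The sketch below assumes that form; the source does not say how the objectives are combined, and `combined_loss` and `oprd_weight` are hypothetical names, not taken from the paper.

```python
def combined_loss(opd_loss, oprd_loss, oprd_weight=1.0):
    """Weighted sum of an output-space OPD loss and a hidden-state OPRD loss.

    Assumption: the objectives are combined additively with a single
    scalar weight on the OPRD term.
    """
    if oprd_weight < 0:
        raise ValueError("oprd_weight must be non-negative")
    return opd_loss + oprd_weight * oprd_loss
```

Setting `oprd_weight` to 0 recovers the plain OPD objective, which makes it easy to compare runs with and without the representation term.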
Topics
- On-Policy Distillation
- Representation Distillation
- Large Language Models
- Hidden States
- Mathematical Reasoning
- Model Compression
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.