OPRD: On-Policy Representation Distillation

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

On-Policy Representation Distillation (OPRD) is a knowledge distillation method for large language models that addresses two limitations of standard On-Policy Distillation (OPD). OPD supervises the student only in output space: its Monte Carlo KL estimates over a large vocabulary (roughly 150k tokens for Qwen) carry persistent sampling variance, and it treats the teacher as a black box. OPRD instead aligns student and teacher hidden-state representations at selected layers on the same rollouts, bypassing the language model head entirely. In theory, this eliminates the sampling variance and supplies richer per-layer structural information. Empirically, OPRD closes the student-teacher performance gap on the AIME 2024/2025 and AIMO benchmarks, where output-space OPD baselines plateau. It also trains 1.44x faster and uses 54% less memory than top-k OPD.

Key takeaway

For machine learning engineers optimizing large language models, OPRD offers an alternative to output-space on-policy distillation. If sampling variance or memory usage is limiting your distillation runs, consider OPRD's hidden-state alignment. The reported results show it closing the student-teacher performance gap on reasoning benchmarks such as AIME and AIMO while training 1.44x faster and using 54% less memory than top-k OPD.

Key insights

OPRD improves on-policy distillation by aligning hidden states, eliminating sampling variance and providing richer structural information.

Principles

Hidden-state alignment removes the sampling variance of output-space KL estimates.
Intermediate representations offer richer distillation signals.
Bypassing the LM head improves distillation efficiency.
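
To make the sampling-variance point concrete, here is a toy sketch in Python with NumPy. It is not the paper's estimator: the two distributions are random stand-ins for a student and a teacher, the function names are hypothetical, and the KL direction (student relative to teacher, with tokens sampled from the student) is an assumption. It compares the exact KL divergence, summed over a full vocabulary, with single-token Monte Carlo estimates of the same quantity.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def exact_kl(p, q):
    """Exact KL(p || q), summed over every vocabulary entry."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
vocab_size = 150_000  # roughly the Qwen vocabulary size cited in the summary

# Random stand-ins for next-token distributions (assumption, not real models).
student = softmax(rng.normal(size=vocab_size))
teacher = softmax(rng.normal(size=vocab_size))

kl = exact_kl(student, teacher)

# Each sampled token yields one unbiased but noisy estimate of the KL.
tokens = rng.choice(vocab_size, size=2000, p=student)
per_token = np.log(student[tokens]) - np.log(teacher[tokens])

print(f"exact KL:               {kl:.4f}")
print(f"mean of token estimates: {per_token.mean():.4f}")
print(f"std of token estimates:  {per_token.std():.4f}")
```

The individual estimates average out near the exact value but scatter widely around it. That scatter is the variance an output-space objective carries when it relies on sampled tokens, and it is what a loss computed directly on hidden states avoids.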

Method

OPRD aligns student and teacher hidden-state representations across selected layers on the same rollouts, entirely bypassing the language model head to eliminate sampling variance.
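
The alignment step can be sketched roughly as follows. The summary does not say which distance, normalization, or layer mapping OPRD uses, so everything specific here is an assumption made for illustration: mean squared error as the distance, a per-pair linear projection to bridge differing hidden sizes, feature-wise normalization, and the hypothetical names `alignment_loss`, `projections`, and `layer_pairs`. Random arrays stand in for the hidden states both models would produce on the same rollout.

```python
import numpy as np

def normalize(h, eps=1e-6):
    """Standardize each token's hidden vector (assumed preprocessing step)."""
    mean = h.mean(axis=-1, keepdims=True)
    std = h.std(axis=-1, keepdims=True)
    return (h - mean) / (std + eps)

def alignment_loss(student_hiddens, teacher_hiddens, projections, layer_pairs):
    """Average MSE between projected student states and teacher states.

    student_hiddens[i]: (seq_len, d_student) array for student layer i
    teacher_hiddens[j]: (seq_len, d_teacher) array for teacher layer j
    projections[k]:     (d_student, d_teacher) matrix for the k-th pair
    layer_pairs:        list of (student_layer, teacher_layer) indices
    """
    total = 0.0
    for proj, (s_idx, t_idx) in zip(projections, layer_pairs):
        s = normalize(student_hiddens[s_idx] @ proj)  # map into teacher width
        t = normalize(teacher_hiddens[t_idx])
        total += float(np.mean((s - t) ** 2))
    return total / len(layer_pairs)

# Toy stand-ins for hidden states computed on one shared rollout.
rng = np.random.default_rng(0)
seq_len, d_student, d_teacher = 8, 16, 32
student_hiddens = [rng.normal(size=(seq_len, d_student)) for _ in range(4)]
teacher_hiddens = [rng.normal(size=(seq_len, d_teacher)) for _ in range(8)]

layer_pairs = [(1, 3), (3, 7)]  # hypothetical student-to-teacher layer mapping
projections = [0.1 * rng.normal(size=(d_student, d_teacher)) for _ in layer_pairs]

loss = alignment_loss(student_hiddens, teacher_hiddens, projections, layer_pairs)
print(f"alignment loss: {loss:.4f}")
```

No vocabulary-sized tensor and no token sampling appear in the loss: given a rollout, the value is a deterministic function of the two models' hidden states. In a real training loop the projections would be learned alongside the student and the teacher states would be held fixed.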

In practice

Distill large language models more efficiently.
Improve student model performance on reasoning tasks.
Reduce memory footprint during distillation.

Topics

Knowledge Distillation
Representation Learning
Large Language Models
Model Compression
On-Policy Distillation
AIME

Code references

ShenzhiYang2000/OPRD

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.