OPRD: On-Policy Representation Distillation

2026-05-31 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

On-Policy Representation Distillation (OPRD) addresses two limitations of existing on-policy distillation (OPD) for large language models. Current OPD variants supervise the student only in the output space, which introduces high sampling variance and forces the training signal through the information bottleneck of the LM head. OPRD moves supervision into hidden-state space: on the same on-policy rollouts, it aligns student and teacher intermediate representations at selected layers and response positions. The resulting signal is dense and deterministic, so it removes the sampling variance of the distillation target and exposes richer structural information. Empirically, the authors report that OPRD closes the student–teacher gap on the AIME 2024, AIME 2025, and AIMO mathematics benchmarks, while training 1.44x faster and using up to 54% less actor-update transient memory than top-k OPD on an 8x A100 (80G) GPU setup.

Key takeaway

For machine learning engineers optimizing LLM post-training, OPRD is worth evaluating as a distillation method, either standalone or combined with existing OPD objectives. Relative to top-k OPD, the reported benefits are higher accuracy on mathematical reasoning benchmarks, 1.44x faster training, and up to 54% less actor-update transient memory. The authors attribute the narrower student–teacher gap on complex reasoning tasks to a more stable and informative training signal.

Key insights

OPRD lifts on-policy distillation into hidden-state space, giving deterministic, richer supervision that avoids the sampling variance and LM-head bottleneck of output-space objectives.

Principles

Output-space distillation suffers from high sampling variance.
LM-head projection creates an information bottleneck.
Hidden-state alignment offers richer, deterministic supervision.
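
The first principle can be illustrated with a small NumPy sketch. It assumes, as a stand-in not confirmed by the source, that the output-space signal is a reverse KL term estimated from tokens sampled from the student. The function names, the vocabulary size, and the logits are hypothetical. The sampled estimate matches the exact value on average but varies from draw to draw, whereas a quantity computed from the full distribution (or from hidden states) is the same on every evaluation.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def exact_reverse_kl(student_logits, teacher_logits):
    # KL(student || teacher) summed over the full vocabulary: deterministic.
    ls, lt = log_softmax(student_logits), log_softmax(teacher_logits)
    return float((np.exp(ls) * (ls - lt)).sum())

def sampled_reverse_kl(student_logits, teacher_logits, rng, num_samples=1):
    # Assumption: a per-token estimate log p_s(x) - log p_t(x) with x ~ p_s.
    # Each draw is an unbiased but noisy estimate of the exact value.
    ls, lt = log_softmax(student_logits), log_softmax(teacher_logits)
    tokens = rng.choice(len(ls), size=num_samples, p=np.exp(ls))
    return (ls - lt)[tokens]

rng = np.random.default_rng(0)
student_logits = rng.normal(size=50)  # hypothetical 50-token vocabulary
teacher_logits = rng.normal(size=50)

exact = exact_reverse_kl(student_logits, teacher_logits)
draws = sampled_reverse_kl(student_logits, teacher_logits, rng, num_samples=20000)
print(f"exact={exact:.3f} sampled mean={draws.mean():.3f} sampled std={draws.std():.3f}")
```

The standard deviation printed for the sampled estimator is the variance that a deterministic target does not incur.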

Method

OPRD aligns the student's intermediate hidden representations with the teacher's at selected transformer layers and response positions, minimizing a normalized mean-squared error objective computed on on-policy rollouts.
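
A minimal sketch of such an objective follows, written in NumPy for readability. The summary does not specify the normalization, so this sketch assumes each hidden vector is scaled to unit L2 norm before the squared error is taken. It also assumes student and teacher share a hidden size; a learned projection would be needed otherwise. The function name `normalized_mse` and its arguments are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def normalized_mse(student_h, teacher_h, response_mask, eps=1e-8):
    """Mean squared error between unit-normalized hidden states.

    student_h, teacher_h: shape (num_layers, seq_len, hidden_dim), already
        restricted to the selected layers of the same on-policy rollout.
    response_mask: shape (seq_len,), 1 for response tokens, 0 for prompt tokens.
    """
    # Assumption: "normalized" means scaling each hidden vector to unit L2 norm.
    s = student_h / (np.linalg.norm(student_h, axis=-1, keepdims=True) + eps)
    t = teacher_h / (np.linalg.norm(teacher_h, axis=-1, keepdims=True) + eps)
    per_token = ((s - t) ** 2).mean(axis=-1)  # (num_layers, seq_len)
    mask = response_mask.astype(per_token.dtype)[None, :]
    # Average over the selected layers and the response positions only.
    denom = mask.sum() * per_token.shape[0]
    return float((per_token * mask).sum() / max(denom, 1.0))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 5, 8))          # 2 selected layers, 5 tokens
student = teacher + 0.1 * rng.normal(size=teacher.shape)
mask = np.array([0, 0, 1, 1, 1])              # first two tokens are prompt
print(normalized_mse(student, teacher, mask))
```

Because the teacher's hidden states are a fixed function of the rollout, the target involves no token sampling: evaluating the loss twice on the same rollout gives the same value.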

In practice

Integrate OPRD for multi-model RL merging to reduce memory.
Apply OPRD in on-policy self-distillation for lower variance.
Combine OPRD with existing OPD objectives for additive gains.
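
The last item, combining a representation term with an output-space term, might look like the sketch below. The summary does not say how the two terms are combined, so the weighted sum, the weight `rep_weight`, the use of reverse KL as the output-space term, and every function name here are illustrative assumptions.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def output_space_loss(student_logits, teacher_logits):
    # Assumed stand-in for an OPD term: KL(student || teacher) per position,
    # averaged over positions. Shapes: (seq_len, vocab_size).
    ls, lt = log_softmax(student_logits), log_softmax(teacher_logits)
    return float((np.exp(ls) * (ls - lt)).sum(axis=-1).mean())

def hidden_state_loss(student_h, teacher_h, eps=1e-8):
    # Assumed stand-in for the representation term: MSE between
    # unit-normalized hidden vectors. Shapes: (seq_len, hidden_dim).
    s = student_h / (np.linalg.norm(student_h, axis=-1, keepdims=True) + eps)
    t = teacher_h / (np.linalg.norm(teacher_h, axis=-1, keepdims=True) + eps)
    return float(((s - t) ** 2).mean())

def combined_loss(student_logits, teacher_logits, student_h, teacher_h,
                  rep_weight=1.0):
    # Hypothetical weighted sum; rep_weight trades off the two signals.
    return (output_space_loss(student_logits, teacher_logits)
            + rep_weight * hidden_state_loss(student_h, teacher_h))
```

Setting `rep_weight=0.0` recovers the output-space term alone, which makes it straightforward to ablate the representation term in an existing pipeline.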

Topics

On-Policy Distillation
Representation Distillation
Large Language Models
Hidden States
Mathematical Reasoning
Model Compression

Code references

ShenzhiYang2000/OPRD

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.