OPRD: On-Policy Representation Distillation

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

On-Policy Representation Distillation (OPRD) addresses two limitations of existing on-policy distillation (OPD) for large language models. Current OPD variants supervise the student only in the output space, which introduces high sampling variance and an information bottleneck at the LM head. OPRD moves supervision into hidden-state space: it aligns student and teacher intermediate representations across selected layers and response positions on the same on-policy rollouts. The resulting supervision is dense and deterministic, which removes sampling variance and exposes richer structural information. Empirically, OPRD is reported to close the student–teacher gap on the AIME 2024, AIME 2025, and AIMO mathematics benchmarks while training 1.44x faster and using up to 54% less actor-update transient memory than top-k OPD on an 8x A100 (80G) GPU setup.

Key takeaway

For machine learning engineers optimizing LLM post-training, OPRD is a distillation approach worth evaluating, either standalone or composed with existing OPD. Relative to top-k OPD, the reported benefits are a 1.44x training speed-up and up to 54% less actor-update transient memory, alongside a closed student–teacher performance gap on complex reasoning tasks. The gains are attributed to a more stable and informative training signal.

Key insights

OPRD lifts on-policy distillation into hidden-state space, providing deterministic, richer supervision that avoids the sampling variance and LM-head bottleneck of output-space methods.

Method

OPRD aligns the student's intermediate hidden representations with the teacher's across selected transformer layers and response positions, using a normalized mean-squared-error objective computed on on-policy rollouts.
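The objective described above can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation. Several details are assumptions made for illustration: "normalized" is taken to mean unit L2 normalization of each hidden vector before the squared error, student and teacher are assumed to share a hidden width (a real setup may need a learned projection), and the names `oprd_loss`, `layer_pairs`, and `response_mask` are hypothetical.

```python
import math


def normalize(vec, eps=1e-8):
    """Scale a vector to unit L2 norm (assumed meaning of 'normalized')."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def oprd_loss(student_hidden, teacher_hidden, layer_pairs, response_mask):
    """Mean squared error between unit-normalized hidden states.

    student_hidden / teacher_hidden: dict mapping layer index to a list of
        per-position hidden vectors computed on the same rollout.
    layer_pairs: (student_layer, teacher_layer) tuples chosen for alignment
        (hypothetical; the two models may differ in depth).
    response_mask: 1 for response positions, 0 for prompt positions.
    """
    total, count = 0.0, 0
    for s_layer, t_layer in layer_pairs:
        for pos, keep in enumerate(response_mask):
            if not keep:
                continue  # only response positions are supervised
            s = normalize(student_hidden[s_layer][pos])
            t = normalize(teacher_hidden[t_layer][pos])
            total += sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
            count += 1
    return total / count if count else 0.0


# Toy rollout: two positions (prompt, response), hidden width 2.
student = {0: [[1.0, 0.0], [1.0, 0.0]], 1: [[0.0, 2.0], [3.0, 0.0]]}
teacher = {0: [[5.0, 5.0], [2.0, 0.0]], 3: [[0.0, 1.0], [0.0, 1.0]]}
mask = [0, 1]

loss = oprd_loss(student, teacher, layer_pairs=[(0, 0), (1, 3)], response_mask=mask)
print(loss)
```

No teacher tokens are sampled anywhere in this computation: given a rollout, the loss is a deterministic function of the two sets of hidden states, which is the property the summary contrasts with output-space OPD.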

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.