PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

2026-06-15 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

PowerOPD stabilizes on-policy distillation (OPD) for large language models. Standard OPD estimates the reverse-KL objective from student-sampled tokens and suffers from sample inefficiency, unstable generation dynamics, and a significant performance gap relative to exact full-vocabulary OPD. These pathologies stem from the unbounded log-ratio reward, which produces extremely high-variance gradients concentrated at early positions. PowerOPD replaces it with a family of natively bounded, sign-consistent rewards based on the Box-Cox power transformation, parameterized by alpha > 0. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to +6.37/+5.71 over vanilla OPD, +3.01/+3.54 over post-hoc stabilization, and +2.59/+8.90 over full-vocabulary OPD. It also reduces wall-clock time by 59.2% and peak GPU memory by 23.1%, and larger alpha values yield higher accuracy and significantly smaller gradient norms.

Key takeaway

For machine learning engineers and AI scientists working on LLM distillation, PowerOPD addresses the training instability and performance gap of standard OPD. Replacing the unbounded log-ratio reward with a natively bounded, power-transformed reward improved both accuracy and efficiency in the reported experiments, including a 59.2% reduction in wall-clock time and a 23.1% reduction in peak GPU memory. It is worth evaluating in on-policy distillation pipelines, particularly with Qwen3 teacher-student pairs, the setting in which it was tested.

Key insights

PowerOPD stabilizes on-policy distillation by using bounded power-transformed rewards, significantly improving performance and efficiency.

Principles

Unbounded log-ratio rewards cause high-variance gradients in OPD.
Bounded rewards from Box-Cox transformation stabilize training.
Larger alpha in PowerOPD improves accuracy and shortens responses.
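
The first principle can be made concrete with the sampled reverse-KL estimator. The block below is a sketch in standard notation; the symbols \pi_S (student), \pi_T (teacher), x (prompt), and y_t (token at position t) are assumed here and do not appear in the source.

```latex
\mathrm{KL}(\pi_S \,\|\, \pi_T)
  = \mathbb{E}_{y \sim \pi_S}\!\left[\sum_t
      \log \frac{\pi_S(y_t \mid x, y_{<t})}{\pi_T(y_t \mid x, y_{<t})}\right],
\qquad
r_t = \log \frac{\pi_T(y_t \mid x, y_{<t})}{\pi_S(y_t \mid x, y_{<t})},
\qquad
r_t \to -\infty \ \text{as}\ \pi_T(y_t \mid x, y_{<t}) \to 0 .
```

A single student-sampled token that the teacher considers nearly impossible can therefore contribute an arbitrarily large negative reward, which is one way to read the high-variance gradients described above.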

Method

PowerOPD applies a Box-Cox power transformation, parameterized by alpha > 0, to the log-ratio reward, creating natively bounded, sign-consistent rewards to stabilize on-policy distillation.
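
The sketch below illustrates how a Box-Cox-based reward can be bounded and sign-consistent. The source does not state the exact PowerOPD formula, so the form used here, the difference of Box-Cox-transformed teacher and student probabilities, is an assumption chosen only because it matches the description. The function names `box_cox`, `log_ratio_reward`, and `power_reward` are hypothetical.

```python
import math


def box_cox(p: float, alpha: float) -> float:
    """Standard Box-Cox power transform; tends to log(p) as alpha -> 0."""
    if alpha == 0.0:
        return math.log(p)
    return (p ** alpha - 1.0) / alpha


def log_ratio_reward(p_teacher: float, p_student: float) -> float:
    """Vanilla OPD per-token reward: unbounded below as p_teacher -> 0."""
    return math.log(p_teacher) - math.log(p_student)


def power_reward(p_teacher: float, p_student: float, alpha: float) -> float:
    """ASSUMED form, not confirmed by the source:
    box_cox(p_teacher) - box_cox(p_student) = (p_T**alpha - p_S**alpha) / alpha.
    For probabilities in (0, 1] and alpha > 0 this lies in (-1/alpha, 1/alpha)
    and has the same sign as the log-ratio reward.
    """
    return box_cox(p_teacher, alpha) - box_cox(p_student, alpha)


if __name__ == "__main__":
    p_student = 0.5
    for p_teacher in (1e-12, 1e-3, 0.5, 0.9):
        vanilla = log_ratio_reward(p_teacher, p_student)
        bounded = power_reward(p_teacher, p_student, alpha=1.0)
        print(f"p_T={p_teacher:g}  log-ratio={vanilla:.3f}  power(alpha=1)={bounded:.3f}")
```

Under this assumed form the reward magnitude is capped at 1/alpha, so the cap tightens as alpha grows. That is consistent with the reported smaller gradient norms at larger alpha, though it does not confirm the mechanism.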

In practice

Implement PowerOPD with alpha > 0 for stable LLM distillation.
Consider larger alpha values for improved accuracy and shorter outputs.
Reduce GPU memory and wall-clock time in OPD training.

Topics

On-Policy Distillation
Large Language Models
Reward Transformation
Box-Cox Transformation
Mathematical Reasoning
Qwen3

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.