PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation
Summary
PowerOPD stabilizes on-policy distillation (OPD) for large language models. Standard OPD estimates the reverse-KL objective from student-sampled tokens and suffers from sample inefficiency, unstable generation dynamics, and a significant performance gap relative to exact full-vocabulary OPD. These pathologies stem from the unbounded log-ratio reward, which produces extremely high-variance gradients concentrated at early positions. PowerOPD replaces it with a family of natively bounded, sign-consistent rewards based on the Box-Cox power transformation, parameterized by alpha > 0. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieved benchmark-averaged Avg@8/Pass@8 gains of up to +6.37/+5.71 over vanilla OPD, +3.01/+3.54 over post-hoc stabilization, and +2.59/+8.90 over full-vocabulary OPD. It also reduced wall-clock time by 59.2% and peak GPU memory by 23.1%. Larger alpha values led to higher accuracy and significantly smaller gradient norms.
Key takeaway
For engineers and researchers running on-policy distillation of large language models, PowerOPD targets both training instability and the performance gap to full-vocabulary OPD. Replacing the unbounded log-ratio reward with a natively bounded, power-transformed reward produced the accuracy gains reported above, along with a 59.2% reduction in wall-clock time and a 23.1% reduction in peak GPU memory. It is worth evaluating in on-policy distillation pipelines, especially when working with Qwen3 or similar teacher-student pairs.
Key insights
PowerOPD stabilizes on-policy distillation by using bounded power-transformed rewards, significantly improving performance and efficiency.
Principles
- Unbounded log-ratio rewards cause high-variance gradients in OPD.
- Bounded rewards from Box-Cox transformation stabilize training.
- Larger alpha in PowerOPD improves accuracy and shortens responses.
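To make the first principle concrete, here is a minimal sketch of the per-token log-ratio reward, assuming the common convention reward = log p_teacher - log p_student evaluated at a token sampled from the student. The function name and the probability values are hypothetical and chosen only for illustration. The point is that the reward has no lower bound: the less likely the teacher finds a student-sampled token, the more negative the reward becomes.

```python
import math


def log_ratio_reward(p_teacher: float, p_student: float) -> float:
    """Log-ratio reward for one student-sampled token (assumed convention)."""
    return math.log(p_teacher) - math.log(p_student)


# Hypothetical student probability for the sampled token.
p_student = 0.5

# As the teacher's probability for the same token shrinks,
# the reward keeps decreasing without any floor.
for p_teacher in (1e-1, 1e-4, 1e-8, 1e-16):
    reward = log_ratio_reward(p_teacher, p_student)
    print(f"p_teacher={p_teacher:.0e}  reward={reward:.2f}")
```

A single token with a very small teacher probability can therefore dominate a batch's gradient, which is consistent with the high-variance behavior described above.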
Method
PowerOPD applies a Box-Cox power transformation, parameterized by alpha > 0, to the log-ratio reward, creating natively bounded, sign-consistent rewards to stabilize on-policy distillation.
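The summary does not give the exact reward formula, so the following is only a sketch of one plausible form, under the assumption that the Box-Cox transform (x^alpha - 1) / alpha is applied to the teacher and student probabilities separately and the results are subtracted. The function name `power_reward` is hypothetical. Under this assumption the reward stays within [-1/alpha, 1/alpha] for probabilities in [0, 1], keeps the sign of the log-ratio, and approaches the log-ratio as alpha approaches 0.

```python
import math


def power_reward(p_teacher: float, p_student: float, alpha: float) -> float:
    """Assumed bounded reward: BoxCox(p_teacher) - BoxCox(p_student).

    The "-1" terms of the two Box-Cox transforms cancel, leaving
    (p_teacher**alpha - p_student**alpha) / alpha.
    This is an illustrative form, not necessarily the paper's definition.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return (p_teacher ** alpha - p_student ** alpha) / alpha


# Hypothetical probabilities: the teacher strongly disagrees with the student.
p_teacher, p_student = 1e-16, 0.5
log_ratio = math.log(p_teacher) - math.log(p_student)

for alpha in (0.25, 0.5, 1.0):
    reward = power_reward(p_teacher, p_student, alpha)
    # The magnitude never exceeds 1/alpha, however extreme the log-ratio is.
    print(f"alpha={alpha}  reward={reward:.4f}  bound={1 / alpha:.1f}  log_ratio={log_ratio:.2f}")
```

In this sketch a larger alpha gives a tighter bound on the reward magnitude, which is in line with the smaller gradient norms reported for larger alpha.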
In practice
- Implement PowerOPD with alpha > 0 for stable LLM distillation.
- Consider larger alpha values for improved accuracy and shorter outputs.
- Expect lower peak GPU memory and wall-clock time in OPD training (reported reductions of 23.1% and 59.2%).
Topics
- On-Policy Distillation
- Large Language Models
- Reward Transformation
- Box-Cox Transformation
- Mathematical Reasoning
- Qwen3
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.