OPD+: Rethinking the Advantage Design for On-Policy Distillation
Summary
OPD+ is a new approach to On-Policy Distillation (OPD), a technique for transferring capabilities from large teacher language models to smaller student models using a reinforcement learning-style objective. The paper identifies a flaw in existing OPD methods: the common stop-gradient design, adopted for stability, yields biased estimates of the reward objective and its gradients when general divergence functions are used. To revisit this design mathematically, the authors build a generic optimization framework based on f-divergence and prove the bias within it. OPD+ corrects the bias, improves on the baseline KL approach, and supports a range of f-divergence choices. It was validated on mathematical reasoning and tool-use benchmarks.
Key takeaway
Machine learning engineers who optimize student language models via on-policy distillation should re-evaluate the common practice of using stop-gradients. According to the paper, this design biases reward and gradient estimates when general divergence functions are used, which can hinder performance. OPD+ and its f-divergence-based framework offer a corrected alternative, validated on mathematical reasoning and tool-use benchmarks.
Key insights
Stop-gradient in on-policy distillation biases reward estimates; OPD+ offers a corrected framework for improved teacher-student model transfer.
Principles
- Stop-gradients bias RL reward estimates.
- f-divergence provides a generic basis for the optimization framework.
- Correcting bias improves model distillation.
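The first principle can be made concrete with a standard score-function identity. The sketch below assumes a student policy $\pi_\theta$, a teacher $\pi_T$, and a per-sample reward $r_\theta(x)$ that depends on the student through the ratio $\pi_\theta(x)/\pi_T(x)$. This notation is illustrative and is not taken from the paper.

```latex
\nabla_\theta \, \mathbb{E}_{x \sim \pi_\theta}\!\left[ r_\theta(x) \right]
  = \underbrace{\mathbb{E}_{x \sim \pi_\theta}\!\left[ r_\theta(x)\, \nabla_\theta \log \pi_\theta(x) \right]}_{\text{kept under stop-gradient}}
  + \underbrace{\mathbb{E}_{x \sim \pi_\theta}\!\left[ \nabla_\theta r_\theta(x) \right]}_{\text{dropped under stop-gradient}}
```

For the reverse-KL reward $r_\theta(x) = \log \pi_T(x) - \log \pi_\theta(x)$, the dropped term equals $-\mathbb{E}_{x \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(x)] = 0$, so nothing is lost. For other reward shapes the term generally does not vanish, which is consistent with the bias the paper reports for general divergence functions.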
Method
OPD+ proposes a generic f-divergence-based optimization framework to correct biased reward and gradient estimates from stop-gradient operations in on-policy distillation, supporting various f-divergence choices.
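The bias that the method targets can be checked numerically on a toy problem. The following is a minimal sketch, not the OPD+ algorithm: it assumes a three-token categorical student parameterized by logits, a fixed teacher distribution, and divergences written as an expectation under the student of a cost `h(p/q)`. The names `COSTS`, `divergence`, and `grad`, and the example distributions, are hypothetical choices made for illustration. It compares the full gradient with the gradient obtained when the student probability inside the cost is frozen (the stop-gradient case).

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Per-sample cost h(u) with u = p(x) / q(x), so that D = sum_x p(x) * h(u).
# These three forms are standard identities, chosen here for illustration.
COSTS = {
    "reverse_kl": lambda u: math.log(u),       # KL(p || q)
    "forward_kl": lambda u: -math.log(u) / u,  # KL(q || p)
    "chi_square": lambda u: u - 1.0,           # Pearson chi-square
}

def divergence(logits, teacher, h, frozen_logits=None):
    """Expectation under the student; the cost uses frozen student
    probabilities when frozen_logits is given (stop-gradient)."""
    p = softmax(logits)
    p_in_cost = p if frozen_logits is None else softmax(frozen_logits)
    return sum(pi * h(ci / qi) for pi, ci, qi in zip(p, p_in_cost, teacher))

def grad(logits, teacher, h, stop_gradient, eps=1e-6):
    """Central finite-difference gradient with respect to the logits."""
    frozen = list(logits) if stop_gradient else None
    g = []
    for k in range(len(logits)):
        up = list(logits)
        up[k] += eps
        down = list(logits)
        down[k] -= eps
        diff = divergence(up, teacher, h, frozen) - divergence(down, teacher, h, frozen)
        g.append(diff / (2 * eps))
    return g

student_logits = [0.5, -0.2, 0.1]  # hypothetical student
teacher_probs = [0.6, 0.3, 0.1]    # hypothetical teacher

gaps = {}
for name, h in COSTS.items():
    full = grad(student_logits, teacher_probs, h, stop_gradient=False)
    sg = grad(student_logits, teacher_probs, h, stop_gradient=True)
    gaps[name] = max(abs(a - b) for a, b in zip(full, sg))
    print(f"{name}: max |full - stop_grad| = {gaps[name]:.6f}")
```

In this toy setting the two gradients agree for reverse KL up to finite-difference error, while they differ for the forward-KL and chi-square costs. Exact expectations are used instead of sampling so that the gap reflects bias only, not Monte Carlo noise.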
In practice
- Apply OPD+ for mathematical reasoning tasks.
- Use OPD+ for tool-use capability transfer.
- Explore various f-divergence functions.
Topics
- On-Policy Distillation
- Language Model Distillation
- Reinforcement Learning
- f-Divergence
- Gradient Bias
- Mathematical Reasoning
- Tool-Use
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.