On-policy distillation: one of the hottest terms on PapersWithCode [R]
Summary
Niels from Hugging Face highlights on-policy distillation (OPD) as one of the most prominent terms on PapersWithCode.co. OPD is a key post-training method used in models such as Qwen 3.6, Qwen 3.7, GLM-5.1, and DeepSeek-V4. The core idea is that a "hint model" identifies specific errors in a model's rollout and inserts hint tokens into the trajectory. With the hints in context, the original model assigns lower probabilities to the error tokens. The model is then trained to match these new probabilities, so it learns to correct its mistakes without generating new rollouts. Niels also describes the PapersWithCode initiative: publications are fetched from Hugging Face's daily paper submissions, which users upvote. Research relevance is currently determined by GitHub star velocity, with plans to also incorporate the trending scores of linked models, datasets, and Spaces.
Key takeaway
For machine learning engineers refining large language models after pre-training, on-policy distillation (OPD) offers a targeted error-correction mechanism. It is worth investigating when you need to address specific rollout mistakes: it avoids the computational cost of regenerating full trajectories while teaching the model to downweight the tokens that caused the error. Consider integrating the technique if your models exhibit specific, correctable errors in their generated outputs, as reported for the Qwen and GLM families.
Key insights
On-policy distillation corrects specific model errors by using a hint model to guide probability adjustments without new rollouts.
Principles
- Training on the student's own mistakes is crucial.
- Staying on-policy constrains the search space.
- Multiple teachers can reduce bias.
Method
A hint model identifies errors in a rollout and inserts hint tokens into the trajectory. The original model is then trained to match the probabilities it produces with those hints in context, which downweights the specific error tokens.
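The method above can be sketched as a per-token distillation loss. This is a minimal illustration under stated assumptions, not the implementation used by any of the models named here: the function names, the toy logits, and the use of reverse KL (a common choice in on-policy distillation) are all assumptions, and `hinted_logits` is a hypothetical stand-in for the original model's outputs when the rollout is re-scored with hint tokens in context.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, target_probs, eps=1e-12):
    """KL(student || target) at a single token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, target_probs)
    )


def opd_loss(student_logits, hinted_logits):
    """Mean per-position reverse KL over one existing rollout.

    No new rollout is sampled: both inputs score the same trajectory.
    """
    per_position = [
        reverse_kl(softmax(s), softmax(h))
        for s, h in zip(student_logits, hinted_logits)
    ]
    return sum(per_position) / len(per_position), per_position


# Toy rollout: three positions over a three-token vocabulary.
# At position 1 the student strongly prefers token 1 (the assumed error).
student_logits = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.1], [1.0, 1.0, 1.0]]

# Hypothetical hint-conditioned logits: identical except at position 1,
# where the hint shifts probability away from the error token.
hinted_logits = [[2.0, 0.5, 0.1], [2.5, 0.3, 0.1], [1.0, 1.0, 1.0]]

loss, per_position = opd_loss(student_logits, hinted_logits)
print(loss, per_position)
```

In this toy case the positions where the hint changes nothing contribute zero loss, so the training signal concentrates on the position where the error token is downweighted.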
In practice
- Apply OPD as a post-training technique.
- Use OPD for models like Qwen, GLM, and DeepSeek.
- Consider multiple teachers for balanced OPD.
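One way to read the multiple-teacher suggestion above is to distill toward a mixture of teacher distributions rather than a single one. The sketch below assumes a simple weighted average of probabilities; the source does not say how teachers are combined, so `mix_teachers`, the uniform default weights, and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_teachers(teacher_logits, weights=None):
    """Weighted average of teacher distributions at one token position.

    Averaging is an assumed combination rule, not one stated in the source.
    """
    probs = [softmax(logits) for logits in teacher_logits]
    if weights is None:
        weights = [1.0 / len(probs)] * len(probs)  # uniform by default
    vocab_size = len(probs[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probs))
        for i in range(vocab_size)
    ]


# Two toy teachers that disagree about the best token.
teacher_a = [3.0, 0.0, 0.0]
teacher_b = [0.0, 3.0, 0.0]

target = mix_teachers([teacher_a, teacher_b])
print(target)
```

With uniform weights, neither teacher's preference dominates the target, which is the sense in which a mixture can temper the bias of any single teacher.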
Topics
- On-policy Distillation
- Large Language Models
- Post-training Techniques
- PapersWithCode
- Error Correction
- Knowledge Distillation
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.