On-Policy Distillation with Curriculum Turn-level Guidance for Multi-turn Agents
Summary
Guided On-Policy Distillation (Guided-OPD) is a new algorithm for transferring the capabilities of large multi-turn agent models to smaller student models. It targets a key failure mode of vanilla On-Policy Distillation (OPD): student errors compound across turns and push trajectories away from the state distribution the teacher is familiar with, which makes the teacher's supervision less reliable. Guided-OPD mitigates this by mixing teacher- and student-generated turns within each rollout, with a curriculum that gradually reduces the teacher's intervention probability to zero. Early rollouts receive strong guidance that keeps them close to the teacher's distribution, before training transitions to a purely on-policy regime. On ALFWorld, ScienceWorld, and WebShop, with Qwen3 students distilled from a Qwen3-30B-A3B teacher, Guided-OPD achieved an average 21.1% improvement in Score and a 25.5% increase in Success Rate over vanilla OPD, with greater benefits for smaller student models.
Key takeaway
For machine learning engineers deploying multi-turn agents, Guided-OPD is a way to reduce the inference cost of large models by making distillation into smaller students more effective. If your student model degrades during distillation because its errors compound across turns, consider this curriculum-based approach. In the reported benchmarks it improves Score and Success Rate over vanilla OPD, especially for smaller student models, which makes efficient, capable multi-turn agents more feasible.
Key insights
Guided On-Policy Distillation improves capability transfer to multi-turn student agents through curriculum-based teacher intervention, which keeps student errors from compounding.
Principles
- Student errors compound in sequential tasks.
- Early guidance stabilizes learning trajectories.
- Gradual withdrawal of guidance recovers on-policy behavior.
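The first two principles can be illustrated with a toy calculation. This is not from the original article: it assumes each student turn independently goes wrong with probability eps, that teacher turns never go wrong, and that one bad turn is enough to leave the teacher's familiar states. The function name and parameters are hypothetical.

```python
def stay_on_distribution_prob(eps: float, turns: int, p_teacher: float = 0.0) -> float:
    """Toy model: chance that a rollout never leaves the teacher's familiar states.

    Assumptions (illustrative only, not taken from the article):
      - each student turn errs independently with probability eps
      - teacher turns never err
      - each turn is taken by the teacher with probability p_teacher
    """
    per_turn_error = (1.0 - p_teacher) * eps
    return (1.0 - per_turn_error) ** turns


if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        prob = stay_on_distribution_prob(eps=0.1, turns=20, p_teacher=p)
        print(f"p_teacher={p:.1f}: {prob:.3f}")
```

Under these assumptions, a 10% per-turn error rate over 20 turns leaves roughly a 0.12 chance of a clean rollout with no guidance, and roughly 0.36 when the teacher takes half the turns. Real agents will not follow this model exactly; it only shows why small per-turn errors matter more as trajectories get longer.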
Method
Guided-OPD mixes teacher- and student-generated turns within each rollout. The probability of teacher intervention follows a curriculum that decays to zero, so early rollouts receive strong guidance and later ones are purely on-policy.
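The rollout mixing described above might be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not state the shape of the curriculum, so a linear decay is assumed, and every name here (teacher_prob, mixed_rollout, env_step, the Policy type) is hypothetical. How the distillation loss treats teacher-written versus student-written turns is also not covered by the summary and is left out.

```python
import random
from typing import Callable, List, Tuple

# A policy maps the interaction history to the next action (hypothetical interface).
Policy = Callable[[List[str]], str]
# An environment step maps an action to (observation, done) (hypothetical interface).
EnvStep = Callable[[str], Tuple[str, bool]]


def teacher_prob(step: int, decay_steps: int, p0: float = 1.0) -> float:
    """Teacher intervention probability at a given training step.

    Assumed schedule: linear decay from p0 to zero over decay_steps,
    then zero (purely on-policy) afterwards.
    """
    if decay_steps <= 0:
        return 0.0
    return max(0.0, p0 * (1.0 - step / decay_steps))


def mixed_rollout(
    teacher: Policy,
    student: Policy,
    env_step: EnvStep,
    first_obs: str,
    p_teacher: float,
    max_turns: int,
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Collect one rollout, choosing teacher or student independently per turn."""
    history = [first_obs]
    authors = []  # who produced each turn, kept for later bookkeeping
    for _ in range(max_turns):
        use_teacher = rng.random() < p_teacher
        actor = teacher if use_teacher else student
        action = actor(history)
        authors.append("teacher" if use_teacher else "student")
        obs, done = env_step(action)
        history += [action, obs]
        if done:
            break
    return history, authors
```

In a training loop, p_teacher would be recomputed from teacher_prob at each step, so rollouts drift from mostly teacher-driven to fully student-driven as the schedule reaches zero.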
In practice
- Distill large multi-turn agents to smaller models.
- Improve student agent performance on complex tasks.
- Apply curriculum learning to agent distillation.
Topics
- Multi-turn Agents
- On-Policy Distillation
- Curriculum Learning
- Agent Guidance
- Model Distillation
- Qwen3
Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.