On-Policy Distillation with Curriculum Turn-level Guidance for Multi-turn Agents

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Guided On-Policy Distillation (Guided-OPD) is an algorithm for transferring capabilities from large multi-turn agent models to smaller student models. It targets a key failure mode of standard On-Policy Distillation (OPD): student errors compound and push trajectories away from the state distribution the teacher is familiar with, which makes the teacher's supervision less reliable. Guided-OPD mitigates this by mixing teacher- and student-generated turns within each rollout and following a curriculum that gradually reduces the teacher's intervention probability to zero. Early in the curriculum, strong guidance keeps trajectories close to the teacher's distribution; by the end, training is purely on-policy. On ALFWorld, ScienceWorld, and WebShop, with Qwen3 students distilled from a Qwen3-30B-A3B teacher, Guided-OPD improves on vanilla OPD by an average of 21.1% in Score and 25.5% in Success Rate, with greater benefits for smaller student models.
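The curriculum can be pictured as a schedule that maps a training step to the teacher's intervention probability. The sketch below is only an illustration: the summary states that the probability is reduced gradually to zero, but not the shape of the decay, so the linear form, the function name `intervention_prob`, and the parameters `decay_steps` and `p_start` are all assumptions.

```python
def intervention_prob(step: int, decay_steps: int, p_start: float = 1.0) -> float:
    """Hypothetical linear curriculum for the teacher's intervention probability.

    Returns p_start at step 0, decays linearly, and stays at 0.0 from
    decay_steps onward (the purely on-policy regime). The linear shape is
    an assumption, not something the source specifies.
    """
    if decay_steps <= 0:
        raise ValueError("decay_steps must be positive")
    if not 0.0 <= p_start <= 1.0:
        raise ValueError("p_start must be a probability in [0, 1]")
    # Fraction of the guided phase that is still ahead, clamped at zero.
    remaining = max(0.0, 1.0 - step / decay_steps)
    return p_start * remaining


# Example: inspect the schedule at a few training steps.
for step in (0, 250, 500, 1000, 2000):
    print(step, intervention_prob(step, decay_steps=1000))
```

Any monotone schedule that reaches zero would fit the description equally well; linear decay is used here only because it is the simplest to read.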

Key takeaway

For Machine Learning Engineers deploying multi-turn agents, Guided-OPD is a way to cut the inference cost of large models by distilling them into smaller students. If your student model degrades because its errors compound during on-policy distillation, consider this curriculum-based approach. It reports average gains of 21.1% in Score and 25.5% in Success Rate over vanilla OPD, with greater benefits for smaller student models, which makes efficient, capable multi-turn agents more feasible for your applications.

Key insights

Guided On-Policy Distillation improves capability transfer for multi-turn agents through curriculum-based teacher intervention, which limits the compounding of student errors.

Principles

Method

Guided-OPD mixes teacher- and student-generated turns within each rollout. The teacher's intervention probability follows a curriculum that decays to zero, giving strong guidance early and purely on-policy rollouts later.
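A minimal sketch of turn-level mixing, under stated assumptions: each turn is assigned to the teacher with probability `p_teacher` by an independent draw, and the environment is reduced to a callable that returns the next observation and a done flag. The names `mixed_rollout`, `Policy`, and `EnvStep` are hypothetical, and the sketch covers only how a mixed trajectory is collected, not the distillation loss computed on it.

```python
import random
from typing import Callable, List, Tuple

Policy = Callable[[str], str]                # observation -> action
EnvStep = Callable[[str], Tuple[str, bool]]  # action -> (next observation, done)
Turn = Tuple[str, str, str]                  # (who acted, observation, action)


def mixed_rollout(
    student: Policy,
    teacher: Policy,
    env_step: EnvStep,
    first_obs: str,
    p_teacher: float,
    max_turns: int,
    rng: random.Random,
) -> List[Turn]:
    """Collect one rollout whose turns come from the teacher or the student.

    With p_teacher = 0.0 every turn is the student's (purely on-policy);
    with p_teacher = 1.0 every turn is the teacher's.
    """
    turns: List[Turn] = []
    obs = first_obs
    for _ in range(max_turns):
        # rng.random() lies in [0.0, 1.0), so the two extremes behave exactly.
        use_teacher = rng.random() < p_teacher
        actor = teacher if use_teacher else student
        action = actor(obs)
        turns.append(("teacher" if use_teacher else "student", obs, action))
        obs, done = env_step(action)
        if done:
            break
    return turns


# Example with toy stand-ins for the two policies and the environment.
toy_student: Policy = lambda obs: "student-action"
toy_teacher: Policy = lambda obs: "teacher-action"
toy_env: EnvStep = lambda action: ("next-observation", False)

rollout = mixed_rollout(
    toy_student, toy_teacher, toy_env, "start",
    p_teacher=0.5, max_turns=6, rng=random.Random(0),
)
for who, _, action in rollout:
    print(who, action)
```

In a training loop, `p_teacher` would be lowered over time according to the curriculum until it reaches zero, at which point the loop reduces to ordinary on-policy rollouts.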

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.