SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation
Summary
SAGE-OPD is a verifier-free selective intervention framework for multi-turn on-policy distillation (OPD) in LLM agents. It targets a weakness of standard dense token-level OPD: unlike the single-turn setting, multi-turn interaction can compound errors across turns and make teacher supervision unreliable. SAGE-OPD observes environment feedback and uses teacher judgment to decide, turn by turn, whether to skip or intervene on a student response. It then weights the token-level distillation loss by teacher confidence to mitigate unreliable signals on corrupted histories, and normalizes the loss to preserve its overall scale. On ALFWorld, ScienceWorld, and SearchQA, SAGE-OPD consistently outperforms baselines, with up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Ablation studies confirm that each of its three core components contributes.
Key takeaway
For machine learning engineers building multi-turn LLM agents: if compounding errors and brittle supervision are hurting on-policy distillation, consider selective teacher intervention. Combine turn-level judgment (Skip, Weak, Strong) with teacher confidence weighting so that supervision is applied only where it is needed and reliable. In SAGE-OPD, this strategy improves generalization, including a relative gain of up to 13.3% in unseen success rate on ALFWorld over standard OPD.
Key insights
Multi-turn on-policy distillation benefits from selective, confidence-weighted teacher intervention and loss normalization, improving generalization.
Principles
- Uniform token-level OPD is brittle in multi-turn agents.
- Selective intervention improves generalization over dense supervision.
- Teacher confidence weighting enhances supervision reliability.
Method
SAGE-OPD performs multi-turn on-policy generation, then uses environment feedback and teacher judgment (Skip, Weak, Strong) to determine turn-level intervention. This is combined with teacher confidence weighting and loss normalization for the OPD loss.
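The turn-level decision described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Judgment` enum, the numeric values in `TURN_WEIGHT`, and the rule-based `toy_judge` (a stand-in for a teacher LLM that reads the environment feedback) are all assumptions introduced here.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Judgment(Enum):
    SKIP = "skip"      # student turn looks fine: no teacher supervision
    WEAK = "weak"      # minor issue: light supervision
    STRONG = "strong"  # clear mistake: full supervision


# Assumed mapping from judgment to a per-turn supervision weight.
# The summary does not give these values; they are illustrative only.
TURN_WEIGHT = {Judgment.SKIP: 0.0, Judgment.WEAK: 0.5, Judgment.STRONG: 1.0}

# A turn is (student_action, environment_feedback).
Turn = Tuple[str, str]


def turn_weights(turns: List[Turn], judge: Callable[[str, str], Judgment]) -> List[float]:
    """Ask the judge about every turn and return one supervision weight per turn."""
    return [TURN_WEIGHT[judge(action, feedback)] for action, feedback in turns]


def toy_judge(action: str, feedback: str) -> Judgment:
    """Hypothetical rule-based stand-in for the teacher's judgment."""
    if "Nothing happens" in feedback:
        return Judgment.STRONG
    if "already" in feedback:
        return Judgment.WEAK
    return Judgment.SKIP


episode = [
    ("go to desk 1", "You arrive at desk 1."),
    ("take mug 3 from desk 1", "Nothing happens."),
    ("open drawer 1", "The drawer is already open."),
]
weights = turn_weights(episode, toy_judge)
```

In a real setup the judge would be a call to the teacher model, and the resulting per-turn weight would be broadcast to every token the student generated in that turn.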
In practice
- Implement turn-level intervention via teacher judgment.
- Weight distillation loss by teacher predictive confidence.
- Apply loss normalization to preserve training signal scale.
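The last two items above can be combined into a single loss, sketched here in plain Python. Several choices are assumptions rather than details from the source: teacher confidence is taken as the teacher's maximum token probability, the per-token loss is a reverse KL divergence, and normalization divides by the sum of the weights so that the result stays on the scale of an ordinary per-token mean. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def reverse_kl(student: Sequence[float], teacher: Sequence[float]) -> float:
    """KL(student || teacher) for one token; assumes teacher > 0 wherever student > 0."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0.0)


def weighted_opd_loss(
    student_probs: List[Sequence[float]],
    teacher_probs: List[Sequence[float]],
    token_turn_weights: Sequence[float],
    eps: float = 1e-8,
) -> float:
    """Confidence-weighted, normalized distillation loss over a token sequence."""
    weights = []
    losses = []
    for s, t, w in zip(student_probs, teacher_probs, token_turn_weights):
        confidence = max(t)             # assumed confidence proxy: teacher's top probability
        weights.append(w * confidence)  # turn-level weight times token-level confidence
        losses.append(reverse_kl(s, t))
    total = sum(weights)
    if total <= eps:
        return 0.0  # every token was skipped: no supervision signal
    # Dividing by the weight sum gives a weighted mean, so down-weighting
    # tokens does not shrink the overall loss scale.
    return sum(w * l for w, l in zip(weights, losses)) / total


student = [[0.5, 0.5], [0.9, 0.1]]
teacher = [[0.5, 0.5], [0.5, 0.5]]
loss_all = weighted_opd_loss(student, teacher, [1.0, 1.0])
loss_skip_second = weighted_opd_loss(student, teacher, [1.0, 0.0])
```

With equal weights and equal teacher confidence the function reduces to the plain mean of the per-token losses. A tensor-based version would follow the same structure, with the weights detached from the gradient.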
Topics
- On-Policy Distillation
- Multi-Turn LLM Agents
- Selective Intervention
- Teacher Confidence Weighting
- Loss Normalization
- ALFWorld Benchmark
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.