SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation
Summary
SAGE-OPD is a verifier-free selective intervention framework for multi-turn On-Policy Distillation (OPD) in LLM agents. It addresses the brittleness of standard dense token-level OPD in multi-turn environments, where early errors compound and the teacher ends up supervising from histories that are already corrupted. Rather than supervising every turn uniformly, SAGE-OPD observes environment feedback and uses teacher judgment to decide which student responses need intervention, skipping turns where none is necessary. To mitigate compounding errors, it weights token-level distillation by teacher confidence, which reduces the influence of uncertain teacher distributions on corrupted histories. SAGE-OPD also applies loss normalization so that the overall loss scale is preserved under selective turn-level weighting. Experiments on agent tasks show that SAGE-OPD consistently outperforms baselines, achieving up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Ablation studies confirm that its core components are complementary.
Key takeaway
Machine Learning Engineers developing multi-turn LLM agents should consider selective intervention strategies for on-policy distillation. This approach, exemplified by SAGE-OPD, mitigates compounding errors and yields up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Consider implementing turn-level intervention, teacher confidence weighting, and loss normalization to make agent training more reliable and effective.
Key insights
Multi-turn on-policy distillation benefits from selective, confidence-weighted teacher intervention and loss normalization to mitigate compounding errors.
Principles
- Multi-turn on-policy distillation benefits from selective intervention.
- Teacher confidence weighting improves supervision reliability.
- Loss normalization preserves overall loss scale.
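As a rough illustration of the confidence-weighting principle, the sketch below turns a teacher next-token distribution into a scalar confidence. The summary does not say how SAGE-OPD measures teacher confidence, so the formula used here (one minus normalized entropy) and the function name `teacher_confidence` are assumptions chosen for illustration only.

```python
import math


def teacher_confidence(probs):
    """Map a teacher next-token distribution to a confidence in [0, 1].

    Assumption (not from the source): confidence = 1 - H(p) / log(V),
    where H is Shannon entropy and V is the number of candidate tokens.
    A peaked distribution scores near 1, a uniform one scores 0.
    """
    vocab_size = len(probs)
    if vocab_size < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Clamp at 0 to absorb tiny negative values from floating-point error.
    return max(0.0, 1.0 - entropy / math.log(vocab_size))


peaked = [0.97, 0.01, 0.01, 0.01]   # teacher is sure of the next token
flat = [0.25, 0.25, 0.25, 0.25]     # teacher is maximally uncertain
print(teacher_confidence(peaked) > teacher_confidence(flat))
```

Under this assumed measure, tokens where the teacher is unsure (for example, because the history is already off track) would receive small weights in the distillation loss.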
Method
SAGE-OPD observes environment feedback, uses teacher judgment for turn-level intervention, weights token-level distillation by teacher confidence, and applies loss normalization to preserve loss scale.
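The way these pieces might combine into a single training objective can be sketched as follows. This is a minimal sketch, not the paper's formulation: the per-token divergence (KL from teacher to student), the data layout, and the normalization by total weight are all assumptions, and the names `token_kl` and `selective_weighted_loss` are hypothetical.

```python
import math


def token_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) at a single token position."""
    return sum(
        t * (math.log(t) - math.log(max(s, eps)))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )


def selective_weighted_loss(turns):
    """Confidence-weighted distillation loss over selected turns only.

    turns: list of dicts with keys
      'intervene': bool, whether the teacher chose to intervene on this turn
      'tokens': list of (teacher_probs, student_probs, confidence) tuples

    Assumed normalization: divide by the sum of weights, so the result is a
    weighted mean and stays on the same scale as a plain per-token average
    no matter how many turns are skipped.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for turn in turns:
        if not turn["intervene"]:
            continue  # skipped turns contribute nothing to the loss
        for teacher_probs, student_probs, confidence in turn["tokens"]:
            weighted_sum += confidence * token_kl(teacher_probs, student_probs)
            weight_total += confidence
    return weighted_sum / weight_total if weight_total > 0.0 else 0.0
```

With every turn selected and every confidence set to 1, this reduces to the ordinary mean per-token KL, which is one way to read "preserving the overall loss scale".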
In practice
- Implement turn-level intervention for multi-turn agents.
- Weight distillation by teacher confidence.
- Apply loss normalization in selective OPD.
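For the first item in the list above, a turn-level selection pass could look roughly like this. The function `select_turns`, the judge signature, and `toy_judge` are all hypothetical: in a real setup the judge would be a call to the teacher model that reads the environment feedback, whereas the toy rule here (flag turns whose feedback says nothing happened) only stands in for that call so the example runs on its own.

```python
def select_turns(trajectory, judge):
    """Return one boolean per turn: True where the judge requests intervention.

    trajectory: list of (student_action, env_feedback) string pairs
    judge: hypothetical callable(history, action, feedback) -> bool,
           standing in for the teacher's judgment on that turn
    """
    mask = []
    history = []
    for action, feedback in trajectory:
        mask.append(bool(judge(history, action, feedback)))
        history.append((action, feedback))  # later turns see earlier ones
    return mask


def toy_judge(history, action, feedback):
    # Toy stand-in for a teacher call: intervene when the environment
    # reports that the student's action had no effect.
    return "nothing happens" in feedback.lower()


trajectory = [
    ("go to desk 1", "You arrive at desk 1."),
    ("take lamp 2", "Nothing happens."),
    ("take mug 1", "You pick up the mug 1."),
]
print(select_turns(trajectory, toy_judge))
```

The resulting mask marks which turns receive distillation loss; turns marked False are skipped entirely.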
Topics
- On-Policy Distillation
- LLM Agents
- Multi-Turn Interaction
- Selective Intervention
- Teacher Confidence Weighting
- Loss Normalization
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.