SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation

2026-06-17 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

SAGE-OPD is a verifier-free selective intervention framework for multi-turn On-Policy Distillation (OPD) in LLM agents. It targets the brittleness of standard dense token-level OPD in multi-turn environments, where early errors compound and the teacher ends up supervising from corrupted histories, making its guidance unreliable. Instead of supervising every turn uniformly, SAGE-OPD observes environment feedback and uses teacher judgment to decide which student responses to intervene on, skipping turns where intervention is not necessary. To mitigate compounding errors, it weights token-level distillation by teacher confidence, which reduces the influence of uncertain teacher distributions on corrupted histories. It also applies loss normalization so that selective turn-level weighting does not change the overall loss scale. In experiments on agent tasks, SAGE-OPD consistently outperforms baselines, achieving up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Ablation studies confirm that its core components are complementary.

Key takeaway

Machine Learning Engineers developing multi-turn LLM agents should consider selective intervention strategies for on-policy distillation. SAGE-OPD, one such approach, mitigates compounding errors and achieves up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Its three components are turn-level intervention, teacher confidence weighting, and loss normalization.

Key insights

Multi-turn on-policy distillation benefits from selective, confidence-weighted teacher intervention and loss normalization to mitigate compounding errors.

Principles

Multi-turn on-policy distillation benefits from selective intervention.
Teacher confidence weighting improves supervision reliability.
Loss normalization preserves overall loss scale.

Method

SAGE-OPD observes environment feedback, uses teacher judgment for turn-level intervention, weights token-level distillation by teacher confidence, and applies loss normalization to preserve loss scale.
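
The three pieces of the method can be illustrated with a minimal sketch. The summary does not give the paper's formulas, so everything specific here is an assumption made for illustration: teacher confidence is taken as one minus normalized entropy, the per-token loss is reverse KL (student to teacher), and normalization divides by the total weight so the result stays on the scale of a per-token mean. The function names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def teacher_confidence(p):
    """Assumed confidence score: 1 minus normalized entropy, in [0, 1]."""
    if len(p) < 2:
        return 1.0
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return 1.0 - entropy / math.log(len(p))

def selective_opd_loss(turns):
    """Confidence-weighted, turn-selective, normalized distillation loss.

    turns: list of dicts with keys
      'intervene': bool, the turn-level decision (assumed to come from the teacher)
      'teacher':   list of per-token teacher distributions
      'student':   list of per-token student distributions
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for turn in turns:
        if not turn["intervene"]:
            continue  # skipped turns contribute no supervision
        for p_teacher, p_student in zip(turn["teacher"], turn["student"]):
            w = teacher_confidence(p_teacher)
            # Assumption: reverse KL, a common choice for on-policy distillation.
            weighted_sum += w * kl_divergence(p_student, p_teacher)
            weight_total += w
    # Assumed normalization: divide by total weight so the loss stays on a
    # per-token scale no matter how many turns were skipped or down-weighted.
    return weighted_sum / weight_total if weight_total > 0 else 0.0

turns = [
    {"intervene": True,
     "teacher": [[0.9, 0.1], [0.5, 0.5]],
     "student": [[0.6, 0.4], [0.5, 0.5]]},
    {"intervene": False,  # skipped: the large mismatch here is ignored
     "teacher": [[1.0, 0.0]],
     "student": [[0.1, 0.9]]},
]
loss = selective_opd_loss(turns)
```

In this toy input the second token of the first turn has a uniform teacher distribution, so its weight is zero, and the second turn is skipped entirely. The loss therefore reduces to the divergence on the single confident token.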

In practice

Implement turn-level intervention for multi-turn agents.
Weight distillation by teacher confidence.
Apply loss normalization in selective OPD.
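
The first item in the list above can be sketched as a pass over a recorded rollout that asks a judge, turn by turn, whether to intervene. How SAGE-OPD's teacher actually makes this decision is not described in this summary, so the `Turn` record, `mark_interventions`, and the keyword-based `toy_judge` below are hypothetical stand-ins; in practice the judge would be a call to the teacher model that sees the history, the student response, and the environment feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Turn:
    student_response: str
    env_feedback: str
    intervene: bool = False

# A judge sees the prior history, the student's response, and the feedback.
Judge = Callable[[List[Tuple[str, str]], str, str], bool]

def mark_interventions(turns: List[Turn], judge: Judge) -> List[Turn]:
    """Set the intervene flag on each turn using the judge's decision."""
    history: List[Tuple[str, str]] = []
    for turn in turns:
        turn.intervene = judge(history, turn.student_response, turn.env_feedback)
        history.append((turn.student_response, turn.env_feedback))
    return turns

def toy_judge(history, response, feedback) -> bool:
    # Hypothetical rule standing in for teacher judgment: intervene when the
    # environment reports that the action had no effect.
    return "nothing happens" in feedback.lower()

rollout = [
    Turn("go to desk 1", "You arrive at desk 1."),
    Turn("take lamp 3", "Nothing happens."),
]
flags = [t.intervene for t in mark_interventions(rollout, toy_judge)]
```

Only turns flagged this way would receive teacher supervision; the rest are skipped.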

Topics

On-Policy Distillation
LLM Agents
Multi-Turn Interaction
Selective Intervention
Teacher Confidence Weighting
Loss Normalization

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.