SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

SAGE-OPD is a verifier-free selective intervention framework for multi-turn on-policy distillation (OPD) in LLM agents, addressing limitations of standard dense token-level OPD. Unlike single-turn settings, multi-turn interactions can lead to compounding errors and unreliable teacher supervision. SAGE-OPD applies teacher supervision selectively: it observes environment feedback and uses teacher judgment to decide whether to skip a student response or intervene on it. It further weights token-level distillation by teacher confidence to mitigate unreliable signals on corrupted histories, and it normalizes the loss to maintain the overall loss scale. Evaluated on ALFWorld, ScienceWorld, and SearchQA, SAGE-OPD consistently outperforms baselines, with up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Ablation studies confirm that each of the three core components (selective intervention, confidence weighting, and loss normalization) contributes to the result.

Key takeaway

If you are a machine learning engineer building multi-turn LLM agents and you are struggling with compounding errors and brittle supervision in on-policy distillation, consider selective teacher intervention. Combine a turn-level judgment (Skip, Weak, Strong) with teacher confidence weighting so that supervision is applied only where it is needed and reliable. In the SAGE-OPD evaluation, this strategy improved generalization, with up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD.

Key insights

Multi-turn on-policy distillation benefits from selective, confidence-weighted teacher intervention and loss normalization, improving generalization.

Principles

Uniform token-level OPD is brittle in multi-turn agents.
Selective intervention improves generalization over dense supervision.
Teacher confidence weighting enhances supervision reliability.

Method

SAGE-OPD first performs multi-turn on-policy generation, then uses environment feedback and teacher judgment (Skip, Weak, Strong) to determine the intervention for each turn. The turn-level decision is combined with teacher confidence weighting and loss normalization to form the OPD loss.
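
The turn-level decision can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the summary does not say how the teacher reaches its judgment or how Skip, Weak, and Strong translate into supervision strength, so the `Turn` record, the keyword-based `toy_teacher_judge`, and the weights in `TURN_WEIGHT` are hypothetical stand-ins rather than the paper's method.

```python
from dataclasses import dataclass
from enum import Enum


class Judgment(Enum):
    SKIP = "skip"      # leave the student's turn unsupervised
    WEAK = "weak"      # apply light teacher supervision
    STRONG = "strong"  # apply full teacher supervision


# Hypothetical mapping from judgment to turn-level weight (not from the paper).
TURN_WEIGHT = {Judgment.SKIP: 0.0, Judgment.WEAK: 0.5, Judgment.STRONG: 1.0}


@dataclass
class Turn:
    student_response: str
    env_feedback: str


def toy_teacher_judge(turn: Turn) -> Judgment:
    """Keyword stand-in for the teacher's judgment.

    A real system would prompt the teacher LLM with the history, the student
    response, and the environment feedback; this heuristic only mimics that.
    """
    feedback = turn.env_feedback.lower()
    if "nothing happens" in feedback or "error" in feedback:
        return Judgment.STRONG
    if "already" in feedback:
        return Judgment.WEAK
    return Judgment.SKIP


def turn_weights(trajectory, judge=toy_teacher_judge):
    """Return one supervision weight per turn of an on-policy trajectory."""
    return [TURN_WEIGHT[judge(turn)] for turn in trajectory]


trajectory = [
    Turn("go to desk 1", "You arrive at desk 1."),
    Turn("take mug 2", "Nothing happens."),
    Turn("open drawer 1", "The drawer is already open."),
]
weights = turn_weights(trajectory)
```

In this toy trajectory only the second and third turns receive teacher supervision, which is the sense in which intervention is selective rather than dense.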

In practice

Implement turn-level intervention via teacher judgment.
Weight distillation loss by teacher predictive confidence.
Apply loss normalization to preserve training signal scale.
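
A minimal Python sketch of how the last two items might combine into one loss follows. The summary gives no formulas, so the specifics are assumptions: confidence is taken to be the teacher's maximum token probability, the per-token loss is a reverse KL divergence (a common choice in on-policy distillation), and normalization divides by the total weight so the result stays on the scale of an ordinary per-token mean. The function name `weighted_opd_loss` and its arguments are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def weighted_opd_loss(student_logits, teacher_logits, turn_ids, turn_weights):
    """Confidence-weighted, turn-gated distillation loss (illustrative only).

    student_logits, teacher_logits: one list of vocabulary logits per token.
    turn_ids: index of the turn each token belongs to.
    turn_weights: one weight per turn, e.g. 0.0 for a skipped turn.
    """
    losses, weights = [], []
    for s, t, turn in zip(student_logits, teacher_logits, turn_ids):
        p_student, p_teacher = softmax(s), softmax(t)
        confidence = max(p_teacher)  # assumption: max teacher probability
        losses.append(kl_divergence(p_student, p_teacher))  # reverse KL
        weights.append(turn_weights[turn] * confidence)
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0  # every turn was skipped, so there is no teacher signal
    # Assumed normalization: dividing by the total weight yields a weighted
    # mean, so down-weighting tokens does not shrink the overall loss scale.
    return sum(w * l for w, l in zip(weights, losses)) / total_weight
```

One consequence of this normalization choice is that scaling every turn weight by the same factor leaves the loss unchanged; only the relative emphasis between tokens matters.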

Topics

On-Policy Distillation
Multi-Turn LLM Agents
Selective Intervention
Teacher Confidence Weighting
Loss Normalization
ALFWorld Benchmark

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.