What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

2026-05-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, long

Summary

SERL, a selective environment-reweighted learning framework, targets the credit assignment bottleneck in training multi-turn LLM agents with reinforcement learning. The authors systematically study five feedback sources and two insertion granularities, and find that grounded, action-relevant signals placed at semantically meaningful points are the most effective. SERL reaches 90.0% success on ALFWorld and 80.1% on WebShop, outperforming existing RL and distillation baselines. Its core principle is that task reward determines the direction of the update, while environment feedback selectively adjusts where the update is applied and how large it is, which avoids the instability of unconstrained distillation. The framework restricts reweighting to executable action tokens and decays the teacher signal over training to mitigate privileged information leakage.

Key takeaway

For machine learning engineers who build multi-turn LLM agents and struggle with sparse rewards or credit assignment, SERL offers a way to integrate environment feedback without letting it override the task reward. Consider implementing its selective reweighting mechanism: apply anchor-level feedback at meaningful state changes, and decay the teacher signal over training to limit privileged information leakage. The reported results point to faster convergence and higher task success rates on long-horizon tasks such as ALFWorld and WebShop.

Key insights

SERL selectively reweights RL objectives using environment feedback to improve credit assignment in multi-turn LLM agents.

Principles

Task reward dictates update direction; feedback adjusts placement/magnitude.
Effective feedback is grounded, action-relevant, and semantically placed.
Decay the hindsight signal over training to limit bias from privileged information.

Method

SERL uses an environment-conditioned teacher to score student actions with hindsight, converting the log-probability gap into a bounded, sign-aware reweight of the GRPO advantage, restricted to executable action tokens.
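
The reweighting described above can be illustrated with a minimal sketch. The summary does not give SERL's exact formula, so everything here is an assumption made for illustration: the function name `reweight_advantages`, the use of `tanh` as the bounding function, and the parameters `beta` and `scale` are hypothetical and are not taken from the paper or the OliverLeeXZ/SERL repository. The sketch only shows the three properties the summary names: the weight is bounded, it is sign-aware, and it touches action tokens only.

```python
import math


def reweight_advantages(advantage, student_logps, teacher_logps, action_mask,
                        beta=0.5, scale=1.0):
    """Return per-token advantages reweighted by a hindsight log-prob gap.

    advantage     : scalar GRPO advantage of the trajectory (sign set by task reward)
    student_logps : per-token log-probabilities under the student policy
    teacher_logps : per-token log-probabilities under the hindsight teacher
    action_mask   : True where a token belongs to an executable action
    beta, scale   : hypothetical knobs; beta < 1 keeps every weight positive
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1) so the advantage sign cannot flip")

    # Sign-aware: with a positive advantage, teacher agreement amplifies the
    # update; with a negative advantage, teacher disagreement amplifies it.
    direction = 1.0 if advantage >= 0 else -1.0

    reweighted = []
    for s_lp, t_lp, is_action in zip(student_logps, teacher_logps, action_mask):
        if not is_action:
            # Non-action tokens keep the plain GRPO advantage.
            reweighted.append(advantage)
            continue
        gap = t_lp - s_lp  # > 0 when the teacher favors the token more
        # tanh bounds the weight to the interval [1 - beta, 1 + beta].
        weight = 1.0 + beta * math.tanh(scale * direction * gap)
        reweighted.append(advantage * weight)
    return reweighted


# Toy example: one reasoning token followed by two action tokens.
student = [-1.0, -2.0, -0.5]
teacher = [-1.0, -0.5, -2.0]
mask = [False, True, True]
positive = reweight_advantages(1.0, student, teacher, mask)
negative = reweight_advantages(-1.0, student, teacher, mask)
```

Because the weight stays strictly positive, the teacher can only shrink or grow each token's update; the sign of the update still comes from the task reward, which matches the principle stated in the summary.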

In practice

Apply anchor-level feedback for faster convergence.
Restrict distillation to executable action spans.
Decay teacher signal to mitigate privileged information leakage.
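
Two of the practices above, restricting distillation to action spans and decaying the teacher signal, can be sketched as small helpers. The summary does not state which decay schedule or span format SERL uses, so the linear schedule, the half-open `(start, end)` span convention, and the names `teacher_strength` and `action_token_mask` are all hypothetical choices made for illustration.

```python
def teacher_strength(step, total_steps, beta_start=0.5, beta_end=0.0):
    """Linearly decay the teacher weight from beta_start to beta_end.

    A linear schedule is an assumption; the source only says the teacher
    signal is decayed over training.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return beta_start + (beta_end - beta_start) * frac


def action_token_mask(num_tokens, action_spans):
    """Mark tokens inside executable action spans.

    action_spans is a list of half-open (start, end) token index pairs;
    this span format is a hypothetical convention for the sketch.
    """
    mask = [False] * num_tokens
    for start, end in action_spans:
        for i in range(max(start, 0), min(end, num_tokens)):
            mask[i] = True
    return mask


# Toy example: a 6-token turn whose action occupies tokens 2 and 3.
mask = action_token_mask(6, [(2, 4)])
halfway = teacher_strength(50, 100)
finished = teacher_strength(100, 100)
```

With a schedule like this, hindsight guidance shapes early updates most strongly and fades out by the end of training, so the final policy depends less on information it will not have at inference time.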

Topics

Multi-turn LLM Agents
Reinforcement Learning
Credit Assignment
Distillation
ALFWorld
WebShop
SERL

Code references

OliverLeeXZ/SERL

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.