Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Sibling-Guided Credit Distillation (SGCD), developed at Amazon Web Services, targets coarse token-level credit assignment in long-horizon tool-use reinforcement learning. The research finds that direct self-distillation (SD) can "silently destroy" tool use: models such as Qwen3.5-4B abandon state-changing actions in favor of information-only tasks, and on the τ³-airline benchmark SDPO achieved 0% success on action tasks. SGCD instead uses distillation for credit assignment. It dynamically samples sibling rollouts that mix successes and failures, has an external LLM summarize them into a training-only stepwise credit reference, and uses the dense teacher/student divergence to reshape GRPO token advantages through bounded, detached credit weights. SGCD improved AppWorld Task Goal Completion from 42.9 to 45.6 on test_normal and from 24.7 to 27.0 on test_challenge, and τ³-airline pass@1 from 0.583 to 0.602. The deployed student agent runs without the external LLM or sibling evidence.

Key takeaway

For Machine Learning Engineers developing long-horizon tool-use agents, directly applying self-distillation can inadvertently degrade agent performance by reinforcing non-tool-using shortcuts. You should prioritize policy gradient as the primary update driver, using distillation techniques like Sibling-Guided Credit Distillation (SGCD) to refine token-level credit assignment rather than as a competing actor loss. This approach ensures your agent's learning remains grounded in verified outcomes, preserving and enhancing complex tool-use capabilities.

Key insights

Direct self-distillation can degrade tool-use by misaligning with verifier rewards; distillation should guide credit, not replace policy gradient.

Principles

Policy gradient must drive model updates.
Distillation can improve credit assignment.
Avoid direct teacher imitation losses.

Method

SGCD dynamically samples mixed sibling rollouts, uses an external LLM for a stepwise credit reference, and applies detached teacher/student divergence to reweight GRPO token advantages.
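The reweighting step described above can be illustrated with a minimal sketch. It is not the paper's implementation. The divergence proxy (absolute log-probability gap), the mean normalization, the clipping bounds `w_min`/`w_max`, and all function names are assumptions chosen for illustration. The summary only states that the weights are bounded, detached, and applied to GRPO token advantages.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: one scalar per sibling rollout."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def credit_weights(student_logps, teacher_logps, w_min=0.5, w_max=1.5):
    """Per-token credit weights from teacher/student divergence.

    ASSUMPTION: divergence is approximated by |teacher_logp - student_logp|
    on the sampled token, normalized by its mean over the sequence and
    clipped to [w_min, w_max]. The paper's exact form may differ.
    The weights are plain floats, so they act as constants ("detached"):
    no gradient would flow through them into the teacher or student.
    """
    div = [abs(t - s) for s, t in zip(student_logps, teacher_logps)]
    m = mean(div)
    if m == 0:
        # No disagreement anywhere: fall back to uniform credit.
        return [1.0] * len(div)
    return [min(w_max, max(w_min, d / m)) for d in div]


def reshape_token_advantages(seq_advantage, weights):
    """Spread one sequence-level advantage over tokens using the weights.

    Because w_min > 0, the sign of the advantage is never flipped, so the
    verifier reward still decides the direction of the update.
    """
    return [seq_advantage * w for w in weights]
```

Under these assumptions, a rollout with advantage -1.0 and weights [0.5, 1.5, 0.5] yields token advantages [-0.5, -1.5, -0.5]: the token where teacher and student disagree most receives the most blame, while policy gradient alone still sets the sign and overall direction.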

In practice

Use mixed successful/failed rollouts for credit.
Employ an external LLM for credit summarization.
Apply bounded credit weights to policy gradient.
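The first practice above, keeping only sibling groups that contain both successes and failures, can be sketched as a simple filter. The rollout layout (a dict with a `"reward"` key), the `success_threshold` value, and the function names are hypothetical. The source does not specify how SGCD represents rollouts or defines success.

```python
def is_mixed(rollouts, success_threshold=0.5):
    """True if the sibling group has at least one success and one failure.

    ASSUMPTION: a rollout is a dict with a numeric "reward" from the
    verifier, and reward >= success_threshold counts as success.
    """
    flags = [r["reward"] >= success_threshold for r in rollouts]
    return any(flags) and not all(flags)


def select_mixed_groups(groups, success_threshold=0.5):
    """Keep only tasks whose sibling rollouts disagree on the outcome.

    Uniform groups (all pass or all fail) carry no contrast between
    siblings, so in this sketch they are dropped from the credit batch.
    """
    return {
        task: rollouts
        for task, rollouts in groups.items()
        if is_mixed(rollouts, success_threshold)
    }


def split_siblings(rollouts, success_threshold=0.5):
    """Partition one group into (successes, failures) for summarization."""
    successes = [r for r in rollouts if r["reward"] >= success_threshold]
    failures = [r for r in rollouts if r["reward"] < success_threshold]
    return successes, failures
```

In a training loop, the two lists returned by `split_siblings` would be what gets passed to the external LLM for the stepwise credit summary. That call is omitted here because the source does not describe its interface.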

Topics

Tool-Use Agents
Policy Gradient
Credit Assignment
Self-Distillation
Reinforcement Learning
Large Language Models

Code references

sierra-research/tau2-bench

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.