Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents
Summary
Sibling-Guided Credit Distillation (SGCD), developed at Amazon Web Services, addresses coarse token-level credit assignment in long-horizon tool-use reinforcement learning. The research finds that direct self-distillation (SD) can "silently destroy" tool use, causing models like Qwen3.5-4B to abandon state-changing actions in favor of information-only tasks; on the τ³-airline benchmark, SDPO achieved 0% success on action tasks. SGCD instead uses distillation only for credit assignment. It dynamically samples mixed successful and failed sibling rollouts, has an external LLM summarize them into a training-only stepwise credit reference, and uses the dense teacher/student divergence to reshape GRPO token advantages through bounded, detached credit weights. SGCD improved AppWorld Task Goal Completion from 42.9 to 45.6 on test_normal and from 24.7 to 27.0 on test_challenge, and τ³-airline pass@1 from 0.583 to 0.602. The deployed student agent runs without the external LLM or any sibling evidence.
Key takeaway
If you are a machine learning engineer building long-horizon tool-use agents, be aware that applying self-distillation directly can degrade agent performance by reinforcing non-tool-using shortcuts. Keep policy gradient as the primary update driver, and use distillation techniques like Sibling-Guided Credit Distillation (SGCD) to refine token-level credit assignment rather than as a competing actor loss. This keeps your agent's learning grounded in verified outcomes, preserving and enhancing complex tool-use capabilities.
Key insights
Direct self-distillation can degrade tool-use by misaligning with verifier rewards; distillation should guide credit, not replace policy gradient.
Principles
- Policy gradient must drive model updates.
- Distillation can improve credit assignment.
- Avoid direct teacher imitation losses.
Method
SGCD dynamically samples mixed sibling rollouts, uses an external LLM for a stepwise credit reference, and applies detached teacher/student divergence to reweight GRPO token advantages.
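The reweighting step described above can be illustrated with a minimal sketch. The summary does not give SGCD's exact formula, so everything here is an assumption made for illustration: the absolute log-probability gap as the divergence measure, the mean normalization, the clipping bounds `w_min` and `w_max`, and all function names. The point it shows is structural: the weights are positive and bounded, so they rescale each token's advantage but never flip its sign, and the verifier-derived GRPO advantage still decides the update direction.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one sibling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def credit_weights(student_logps, teacher_logps, w_min=0.5, w_max=2.0):
    """Bounded per-token weights from the teacher/student gap (assumed form).

    These are plain floats, so no gradient can flow through them. In an
    autograd framework the equivalent step would be detaching the weights.
    """
    gaps = [abs(t - s) for s, t in zip(student_logps, teacher_logps)]
    avg = mean(gaps)
    if avg == 0.0:
        # Teacher and student agree everywhere: fall back to uniform credit.
        return [1.0] * len(gaps)
    # Normalize so the average raw weight is 1, then clip to [w_min, w_max].
    return [min(w_max, max(w_min, g / avg)) for g in gaps]


def reshaped_token_advantages(seq_advantage, student_logps, teacher_logps):
    """Spread one sequence-level advantage over tokens using credit weights."""
    weights = credit_weights(student_logps, teacher_logps)
    return [seq_advantage * w for w in weights]
```

In this sketch, a token where the credit-informed teacher disagrees strongly with the student receives a larger share of the sequence's advantage, while tokens where they agree are down-weighted to the lower bound.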
In practice
- Use mixed successful/failed rollouts for credit.
- Employ an external LLM for credit summarization.
- Apply bounded credit weights to policy gradient.
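The first practice point, using mixed successful and failed rollouts, can be sketched as a simple group filter. This is one plausible reading of "dynamic sampling", not the paper's confirmed procedure: the `Rollout` type, its fields, and the `mixed_sibling_groups` helper are hypothetical names introduced for illustration. The sketch keeps only tasks whose sibling rollouts contain both outcomes, since a group with a single outcome gives no contrast for credit assignment.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    task_id: str      # which task this rollout attempted
    transcript: str   # the agent's tool-use trajectory as text
    success: bool     # verifier outcome for the rollout


def mixed_sibling_groups(rollouts):
    """Group rollouts by task and keep only groups with both outcomes."""
    groups = {}
    for r in rollouts:
        groups.setdefault(r.task_id, []).append(r)
    # A group is "mixed" when its set of outcomes is {True, False}.
    return {
        task_id: siblings
        for task_id, siblings in groups.items()
        if len({r.success for r in siblings}) == 2
    }
```

A training loop could keep sampling tasks until enough mixed groups are collected, then pass each group's successes and failures to the summarizer that builds the credit reference.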
Topics
- Tool-Use Agents
- Policy Gradient
- Credit Assignment
- Self-Distillation
- Reinforcement Learning
- Large Language Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.