Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, extended

Summary

Sibling-Guided Credit Distillation (SGCD), developed at Amazon Web Services, targets coarse token-level credit assignment in long-horizon tool-use reinforcement learning. The research finds that direct self-distillation (SD) can "silently destroy" tool use: models such as Qwen3.5-4B abandon state-changing actions in favor of information-only tasks, and on the τ³-airline benchmark SDPO achieved 0% success on action tasks. SGCD instead uses distillation only for credit assignment. It dynamically samples sibling rollouts that mix successes and failures, has an external LLM summarize them into a training-only stepwise credit reference, and uses the dense teacher/student divergence to reshape GRPO token advantages through bounded, detached credit weights. SGCD improved AppWorld Task Goal Completion from 42.9 to 45.6 on test_normal and from 24.7 to 27.0 on test_challenge, and τ³-airline pass@1 from 0.583 to 0.602. The deployed student agent runs without the external LLM or the sibling evidence.
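The dynamic-sampling step in the summary can be pictured as a filter over rollout groups. The sketch below is an illustration only, not the authors' implementation: it assumes each task has a list of sibling rollout rewards and that a reward at or above `success_threshold` counts as a success. The function name `select_mixed_groups` and its parameters are hypothetical.

```python
from typing import Dict, List


def select_mixed_groups(
    groups: Dict[str, List[float]],
    success_threshold: float = 1.0,  # assumed: reward >= threshold means success
) -> Dict[str, List[float]]:
    """Keep only task groups whose sibling rollouts include both
    at least one success and at least one failure."""
    mixed: Dict[str, List[float]] = {}
    for task_id, rewards in groups.items():
        n_success = sum(1 for r in rewards if r >= success_threshold)
        # All-success or all-failure groups offer no sibling contrast,
        # so this sketch drops them.
        if 0 < n_success < len(rewards):
            mixed[task_id] = rewards
    return mixed


if __name__ == "__main__":
    rollouts = {
        "task_a": [1.0, 0.0, 0.0],  # mixed: kept
        "task_b": [0.0, 0.0],       # all failures: dropped
        "task_c": [1.0, 1.0],       # all successes: dropped
    }
    print(select_mixed_groups(rollouts))
```

Only mixed groups give the summarizer a contrast between what worked and what failed on the same task, which is presumably why the method samples for them.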

Key takeaway

For Machine Learning Engineers building long-horizon tool-use agents: applying self-distillation directly can degrade agent performance by reinforcing shortcuts that avoid tool use. Keep policy gradient as the primary update driver, and use distillation techniques such as Sibling-Guided Credit Distillation (SGCD) to refine token-level credit assignment rather than as a competing actor loss. This keeps the agent's learning grounded in verified outcomes while preserving and enhancing complex tool-use capabilities.

Key insights

Direct self-distillation can degrade tool-use by misaligning with verifier rewards; distillation should guide credit, not replace policy gradient.

Method

SGCD dynamically samples mixed sibling rollouts, uses an external LLM for a stepwise credit reference, and applies detached teacher/student divergence to reweight GRPO token advantages.
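A minimal sketch of how bounded credit weights might reshape GRPO token advantages follows. The summary does not give the exact formula, so everything here is an assumption: the divergence is approximated by the absolute gap between teacher and student log-probabilities of each sampled token, the weight is a clipped linear function of that gap, and the names `credit_weights`, `beta`, `w_min`, and `w_max` are hypothetical. The weights are plain floats here; in an autograd framework they would be detached so that no gradient flows through them.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: each rollout's reward standardized within its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def credit_weights(
    teacher_logp: List[float],
    student_logp: List[float],
    beta: float = 1.0,    # assumed sensitivity to the divergence
    w_min: float = 0.5,   # assumed lower bound, must stay positive
    w_max: float = 1.5,   # assumed upper bound
) -> List[float]:
    """Per-token weights from the teacher/student gap, centered and clipped."""
    gaps = [abs(t - s) for t, s in zip(teacher_logp, student_logp)]
    if not gaps:
        return []
    mean_gap = sum(gaps) / len(gaps)
    # Tokens where teacher and student disagree more than average get more credit.
    return [min(w_max, max(w_min, 1.0 + beta * (g - mean_gap))) for g in gaps]


def reweight_token_advantages(advantage: float, weights: List[float]) -> List[float]:
    """Scale one rollout-level advantage into per-token advantages."""
    return [advantage * w for w in weights]


if __name__ == "__main__":
    adv = group_relative_advantages([1.0, 0.0])  # one success, one failure
    w = credit_weights([-0.1, -2.0, -0.5], [-0.1, -0.2, -0.5])
    print(w)
    print(reweight_token_advantages(adv[0], w))
```

Because the weights in this sketch are positive and bounded, they change how much each token contributes but never flip the sign set by the verified outcome, which matches the idea of distillation guiding credit while policy gradient stays the update driver.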

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.