Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

Source: cs.AI updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

T-STAR (Tree-structured Self-Taught Agent Rectification) is a framework for improving reinforcement learning of Large Language Model agents on multi-step reasoning tasks with sparse rewards. It addresses limitations of existing methods such as Group Relative Policy Optimization (GRPO) by recovering the latent, correlated reward structure shared across sampled trajectories. T-STAR consolidates those trajectories into a unified Cognitive Tree, merging functionally similar steps. The tree enables Introspective Valuation, which back-propagates trajectory-level rewards through the tree to produce variance-reduced relative advantages at the step level. T-STAR also introduces In-Context Thought Grafting, which synthesizes corrective reasoning by contrasting successful and failed branches at critical divergence points. Finally, Surgical Policy Optimization applies a Bradley-Terry-style surgical loss that concentrates the policy gradient signal on those critical points. Experiments on embodied, interactive, reasoning, and planning benchmarks show consistent improvements over strong baselines, with the largest gains on tasks that require extended reasoning chains.
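To make the Bradley-Terry idea concrete, here is a minimal sketch of the generic pairwise Bradley-Terry loss, written under assumptions. The summary does not give T-STAR's exact surgical loss, so the function name `bradley_terry_loss`, the `beta` scale, and the use of plain scalar scores (for example, policy log-probabilities of a preferred and a rejected step at a divergence point) are all hypothetical choices for illustration.

```python
import math


def bradley_terry_loss(score_preferred, score_rejected, beta=1.0):
    """Generic pairwise Bradley-Terry loss: -log(sigmoid(beta * (s_pref - s_rej))).

    Assumption: the scores are scalars such as policy log-probabilities of the
    preferred (grafted or successful) step and the rejected (failed) step.
    This is the textbook form, not necessarily the loss used by T-STAR.
    """
    margin = beta * (score_preferred - score_rejected)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss shrinks as the preferred step's score pulls ahead of the rejected one.
for pref, rej in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]:
    print(pref, rej, bradley_terry_loss(pref, rej))
```

Applying such a loss only at divergence points, rather than at every step, is one plausible reading of "surgical"; the paper itself should be consulted for the actual formulation.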

Key takeaway

Research scientists optimizing Large Language Model agents for multi-turn reasoning tasks should consider T-STAR's tree-structured approach. It improves policy optimization by reducing gradient variance and providing targeted step-level supervision, which matters most on tasks with long reasoning chains. The result is more stable learning and improved performance without requiring additional reward models or rollouts.

Key insights

T-STAR improves LLM agent RL by structuring trajectories into a Cognitive Tree for variance reduction and self-rectification.

Principles

Method

T-STAR constructs a Cognitive Tree by merging functionally and historically compatible nodes, computes Q-tree values for variance-reduced advantages, identifies divergence points for In-Context Thought Grafting, and applies Surgical Policy Optimization with a Bradley-Terry loss.
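The tree construction, value back-propagation, and divergence detection described above can be sketched roughly as follows. This is an illustrative simplification, not the paper's implementation. It assumes that two steps merge only when their normalized text and their full history match (a stand-in for the paper's functional and historical compatibility test), that a node's value is the mean terminal reward of the trajectories passing through it, and that a step's advantage is the child's value minus the parent's. The names `Node`, `build_tree`, `step_advantages`, and `divergence_points` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    key: str
    children: dict = field(default_factory=dict)
    reward_sum: float = 0.0
    visits: int = 0

    @property
    def value(self) -> float:
        # Mean terminal reward of all trajectories that pass through this node.
        return self.reward_sum / self.visits if self.visits else 0.0


def _norm(step: str) -> str:
    # Assumption: exact match after trimming and lowercasing stands in for
    # a real functional-similarity check.
    return step.strip().lower()


def build_tree(trajectories):
    """Merge (steps, reward) trajectories into one tree, accumulating rewards."""
    root = Node("<root>")
    for steps, reward in trajectories:
        node = root
        node.reward_sum += reward
        node.visits += 1
        for step in steps:
            key = _norm(step)
            node = node.children.setdefault(key, Node(key))
            node.reward_sum += reward
            node.visits += 1
    return root


def step_advantages(root, steps):
    """Step-level advantage: value of the chosen child minus value of its parent."""
    node, advantages = root, []
    for step in steps:
        child = node.children[_norm(step)]
        advantages.append(child.value - node.value)
        node = child
    return advantages


def divergence_points(node, path=()):
    """Yield (path, best_child_key, worst_child_key) wherever sibling values differ."""
    if len(node.children) > 1:
        ranked = sorted(node.children.values(), key=lambda c: c.value)
        if ranked[-1].value > ranked[0].value:
            yield path, ranked[-1].key, ranked[0].key
    for child in node.children.values():
        yield from divergence_points(child, path + (child.key,))


trajectories = [
    (["open drawer", "take key", "unlock door"], 1.0),
    (["Open drawer", "take key", "push door"], 0.0),
    (["open drawer", "look around"], 0.0),
    (["go north"], 0.0),
]
tree = build_tree(trajectories)
print(step_advantages(tree, trajectories[0][0]))
for point in divergence_points(tree):
    print(point)
```

Because trajectories that share a prefix pool their rewards at the shared nodes, each step's advantage is estimated from more samples than a single chain provides, which is the intuition behind the variance reduction. The per-step advantages along a trajectory telescope to its reward minus the root value. In a grafting step, the best and worst children at each divergence point would presumably be contrasted in context to produce a corrective thought.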

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.