Self-Evolution for Multi-Turn Tool-Calling Agents via Divergence-Point Preference Learning
Summary
Multi-turn tool-using agents often struggle to coordinate long-horizon tool sequences and to maintain dialogue state. This paper addresses both problems with ToolGraph, a system that integrates schema-derived topology, transition weights estimated from successful rollouts, and history-aware controls to improve tool selection. For self-improvement, 161 preference pairs are constructed by identifying divergence points through state-based matching and prefix-based alignment, then filtered by action-correctness annotations. These pairs are used for Direct Preference Optimization (DPO) training within the ToolGraph context. Evaluated on 375 tau2-bench tasks, ToolGraph alone increased the weighted average reward from 0.304 to 0.338 (+11.2% relative). Combined with DPO, the system reached 0.355 (+16.8% relative to the baseline), with notable gains on airline and retail tasks. Diagnostics revealed that roughly half of telecom trajectories exhausted their step budget, and that chosen reward positivity was the most effective checkpoint signal across 16 DPO configurations.
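To make the "chosen reward positivity" signal concrete, here is a minimal sketch of the standard DPO per-pair loss and its implicit rewards in plain Python. The function names, the `beta` default, and the reading of positivity as "the share of pairs whose implicit chosen reward is above zero" are assumptions for illustration; the summary does not state how the paper computes the signal.

```python
import math

def dpo_pair_stats(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Return (loss, chosen_reward, rejected_reward) for one preference pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under the frozen reference model.
    """
    # Implicit DPO rewards: beta * log(pi(y|x) / pi_ref(y|x))
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # loss = -log(sigmoid(margin)), computed in an overflow-safe form
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, chosen_reward, rejected_reward

def chosen_reward_positive_rate(pairs, beta=0.1):
    """Fraction of pairs whose implicit chosen reward is positive.

    Assumed interpretation of 'chosen reward positivity'; a checkpoint
    with a higher rate has raised the likelihood of more chosen actions
    relative to the reference model.
    """
    rewards = [dpo_pair_stats(*p, beta=beta)[1] for p in pairs]
    return sum(r > 0 for r in rewards) / len(rewards)
```

Under this reading, a checkpoint can lower the loss by pushing the rejected reward down while the chosen reward also falls, which is one plausible reason to track the chosen reward separately from the loss.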
Key takeaway
If you are a machine learning engineer building multi-turn tool-calling agents, consider pairing structured orchestration such as ToolGraph with preference learning. In the reported evaluation, generating preference pairs from divergence points in agent trajectories and training with DPO improved agent performance, particularly on airline and retail tasks. The approach offers a pathway for bootstrapping complex tool-using behaviors without extensive human annotation, and it targets the agent's ability to coordinate long-horizon tool sequences.
Key insights
The paper combines ToolGraph with DPO, using divergence points for preference learning to enhance multi-turn tool-calling agents.
Principles
- Tool selection benefits from structured topology.
- Divergence points offer strong preference signals.
Method
ToolGraph combines schema-derived topology, transition weights estimated from successful rollouts, and history-aware controls. Preference pairs are generated from divergence points, filtered by action correctness, and used for DPO training.
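The pair-construction step can be sketched as follows. This is a simplified illustration that covers only prefix-based alignment: it walks a successful and a failed trajectory for the same task in lockstep and stops at the first differing action. The `Step` type, the `correct` field, and the filter rule (keep a pair only when the chosen action is annotated correct and the rejected one is not) are assumptions; the paper's state-based matching is not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    tool: str
    args: tuple      # hashable (key, value) pairs
    correct: bool    # hypothetical action-correctness annotation

def first_divergence(a, b):
    """Index of the first step where the two action sequences differ.

    Returns None when one trajectory is a prefix of the other.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if (x.tool, x.args) != (y.tool, y.args):
            return i
    return None

def build_pair(success, failure):
    """Build one preference pair at the divergence point, or None."""
    i = first_divergence(success, failure)
    if i is None:
        return None
    chosen, rejected = success[i], failure[i]
    # Assumed filter: the preferred action must be annotated correct
    # and the dispreferred action must be annotated incorrect.
    if not chosen.correct or rejected.correct:
        return None
    return {
        "prompt": [(s.tool, s.args) for s in success[:i]],  # shared prefix
        "chosen": (chosen.tool, chosen.args),
        "rejected": (rejected.tool, rejected.args),
    }
```

Because both actions follow an identical prefix, the pair isolates a single decision, which is what makes the resulting preference signal easy to attribute.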
In practice
- Apply ToolGraph for structured tool orchestration.
- Use divergence points to generate DPO preference data.
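As a rough illustration of the first point, the sketch below combines the three ingredients the summary attributes to ToolGraph. Every detail here is an assumption: the class name, the edge rule (tool A links to tool B when an output field of A is an input field of B), the add-one smoothing, and the repeat penalty standing in for history-aware controls. It is not the paper's implementation.

```python
from collections import Counter, defaultdict

class ToolGraphSketch:
    def __init__(self, schemas, repeat_penalty=0.5):
        # schemas: {tool: {"inputs": set_of_fields, "outputs": set_of_fields}}
        # Schema-derived topology: edge a -> b if a produces a field b consumes.
        self.edges = {
            a: {b for b in schemas
                if b != a and schemas[a]["outputs"] & schemas[b]["inputs"]}
            for a in schemas
        }
        self.counts = defaultdict(Counter)  # transition counts from rollouts
        self.repeat_penalty = repeat_penalty

    def observe_success(self, tool_sequence):
        """Update transition counts from one successful rollout."""
        for a, b in zip(tool_sequence, tool_sequence[1:]):
            self.counts[a][b] += 1

    def rank_next(self, history):
        """Rank candidate next tools given the tools called so far."""
        last = history[-1]
        candidates = self.edges[last]
        used = Counter(history)
        total = sum(self.counts[last][b] for b in candidates)
        scores = {}
        for b in candidates:
            # Add-one smoothed transition weight over schema-valid successors
            weight = (self.counts[last][b] + 1) / (total + len(candidates))
            # History-aware control: discount tools already called
            scores[b] = weight * (self.repeat_penalty ** used[b])
        return sorted(scores, key=lambda t: (-scores[t], t))
```

A short usage example with made-up tool names and fields:

```python
schemas = {
    "search_flights": {"inputs": {"origin", "destination"}, "outputs": {"flight_id"}},
    "book_flight": {"inputs": {"flight_id", "user_id"}, "outputs": {"reservation_id"}},
    "get_flight_status": {"inputs": {"flight_id"}, "outputs": {"status"}},
}
graph = ToolGraphSketch(schemas)
graph.observe_success(["search_flights", "book_flight"])
graph.observe_success(["search_flights", "book_flight"])
graph.observe_success(["search_flights", "get_flight_status"])
ranked = graph.rank_next(["search_flights"])
```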
Topics
- Multi-turn Agents
- Tool-Calling
- Preference Learning
- Direct Preference Optimization
- ToolGraph
- Agent Self-Evolution
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.