Self-Evolution for Multi-Turn Tool-Calling Agents via Divergence-Point Preference Learning

2026-06-22 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, medium

Summary

The paper "Self-Evolution for Multi-Turn Tool-Calling Agents via Divergence-Point Preference Learning" targets two weaknesses of multi-turn tool-using agents: coordinating long-horizon tool sequences and maintaining dialogue state. Its ToolGraph system improves tool selection by combining three signals: a topology derived from tool schemas, transition weights estimated from successful rollouts, and history-aware controls. For self-improvement, the authors construct 161 preference pairs by locating divergence points through state-based matching and prefix-based alignment, then filtering the candidates with action-correctness annotations. These pairs are used for Direct Preference Optimization (DPO) training within the ToolGraph context. On 375 tau2-bench tasks, ToolGraph alone raised the weighted average reward from 0.304 to 0.338 (+11.2% relative). Adding DPO raised it to 0.355 (+16.8% over the baseline), with significant gains in the airline and retail domains. Diagnostics showed that roughly half of the telecom trajectories exhausted their step budget, and that chosen-reward positivity was the most effective checkpoint-selection signal across 16 DPO configurations.
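The relative gains quoted above follow directly from the reported rewards. A quick check, using only the three numbers given in the summary (the function name is illustrative):

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Weighted average rewards reported in the summary.
baseline, toolgraph, toolgraph_dpo = 0.304, 0.338, 0.355

print(f"ToolGraph:       {relative_gain(baseline, toolgraph):+.1f}%")
print(f"ToolGraph + DPO: {relative_gain(baseline, toolgraph_dpo):+.1f}%")
```

Both percentages are measured against the same 0.304 baseline, so they are not additive steps.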

Key takeaway

If you build multi-turn tool-calling agents, consider pairing structured orchestration such as ToolGraph with preference learning. Generating preference pairs from divergence points in agent trajectories and training with DPO lifted the weighted average reward on tau2-bench from 0.304 to 0.355, with significant gains in the airline and retail domains. The approach offers a scalable way to bootstrap complex tool-using behavior without extensive human annotation, and it strengthens an agent's ability to coordinate long-horizon tool sequences.

Key insights

The paper combines ToolGraph with DPO, using divergence points for preference learning to enhance multi-turn tool-calling agents.

Principles

Tool selection benefits from structured topology.
Divergence points offer strong preference signals.
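To make the first principle concrete, here is a minimal sketch of how a schema topology, rollout-estimated transition weights, and a history-aware control could be combined to rank the next tool. This is not the paper's implementation: the function names, the tool names, the normalization, and the `repeat_penalty` rule are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def estimate_transition_weights(successful_rollouts):
    """Count tool-to-tool transitions in successful rollouts, normalized per source tool."""
    counts = defaultdict(Counter)
    for rollout in successful_rollouts:
        for src, dst in zip(rollout, rollout[1:]):
            counts[src][dst] += 1
    weights = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        weights[src] = {dst: n / total for dst, n in dsts.items()}
    return weights


def rank_next_tools(current_tool, schema_edges, weights, history, repeat_penalty=0.5):
    """Rank candidate next tools.

    schema_edges limits the candidates (topology), weights scores them
    (rollout statistics), and tools already present in `history` are
    discounted (a hypothetical history-aware control).
    """
    scored = []
    for candidate in schema_edges.get(current_tool, []):
        score = weights.get(current_tool, {}).get(candidate, 0.0)
        if candidate in history:
            score *= repeat_penalty
        scored.append((candidate, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data with made-up tool names.
schema_edges = {
    "get_user": ["get_reservation", "search_flights"],
    "get_reservation": ["cancel_reservation", "update_reservation"],
}
rollouts = [
    ["get_user", "get_reservation", "cancel_reservation"],
    ["get_user", "get_reservation", "update_reservation"],
    ["get_user", "search_flights"],
]
weights = estimate_transition_weights(rollouts)
ranking = rank_next_tools("get_user", schema_edges, weights, history=["get_user"])
print(ranking)
```

In this toy setup the schema decides which tools are reachable at all, while the rollout statistics only reorder the reachable ones, so a tool never seen in a successful rollout still appears with a score of zero rather than being silently dropped.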

Method

ToolGraph combines schema-derived topology, rollout-estimated transition weights, and history-aware controls. Preference pairs are generated at divergence points, filtered by action correctness, and used as training data for DPO.
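The pair-construction step can be sketched as follows. This is an illustrative simplification, not the authors' pipeline: it covers only prefix-based alignment of two trajectories for the same task (state-based matching is omitted), and the `Step` and `PreferencePair` types, the action strings, and the filtering rule are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    action: str    # serialized tool call (hypothetical format)
    correct: bool  # action-correctness annotation, assumed to be supplied


@dataclass(frozen=True)
class PreferencePair:
    prefix: Tuple[str, ...]  # shared actions before the divergence point
    chosen: str
    rejected: str


def first_divergence(success: Sequence[Step], failure: Sequence[Step]) -> Optional[int]:
    """Return the first index at which the two trajectories take different actions."""
    for i, (s, f) in enumerate(zip(success, failure)):
        if s.action != f.action:
            return i
    return None  # no difference within the shared length


def build_pair(success: Sequence[Step], failure: Sequence[Step]) -> Optional[PreferencePair]:
    """Build one preference pair at the divergence point, or None if it is filtered out."""
    i = first_divergence(success, failure)
    if i is None:
        return None
    good, bad = success[i], failure[i]
    # Keep the pair only when the annotations mark the chosen action as
    # correct and the rejected action as incorrect.
    if not good.correct or bad.correct:
        return None
    prefix = tuple(step.action for step in success[:i])
    return PreferencePair(prefix, good.action, bad.action)


success = [Step("get_order(id=7)", True), Step("refund(id=7)", True)]
failure = [Step("get_order(id=7)", True), Step("cancel(id=7)", False)]
pair = build_pair(success, failure)
print(pair)
```

Anchoring the pair at the first differing action means both responses share an identical context, which is what lets the preference signal isolate a single decision rather than a whole trajectory.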

In practice

Apply ToolGraph for structured tool orchestration.
Use divergence points to generate DPO preference data.
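For readers who want to see what a preference pair contributes during training, the standard per-pair DPO objective can be written in a few lines. The sketch below uses plain Python floats for the summed log-probabilities of each response; the function name and the default `beta` are assumptions, and the summary does not specify how the paper turns chosen-reward positivity into a checkpoint-selection rule.

```python
import math


def dpo_terms(policy_chosen_logp, policy_rejected_logp,
              ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (loss, chosen_reward, rejected_reward) for one preference pair.

    Inputs are log-probabilities of the full chosen/rejected responses under
    the trained policy and under the frozen reference model.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    x = -margin
    loss = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return loss, chosen_reward, rejected_reward


# Policy has moved toward the chosen response and away from the rejected one.
loss, chosen_reward, rejected_reward = dpo_terms(-1.0, -3.0, -2.0, -2.0)
print(loss, chosen_reward, rejected_reward)
```

A positive chosen reward means the policy assigns the chosen response a higher probability than the reference model does, which is the quantity the summary's checkpoint diagnostic refers to.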

Topics

Multi-turn Agents
Tool-Calling
Preference Learning
Direct Preference Optimization
ToolGraph
Agent Self-Evolution

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.