DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use

2026-01-29 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

DIVE is an evidence-driven framework that improves the generalization of tool-using large language models (LLMs) by synthesizing diverse, verifiable, and executable agentic tasks. It targets the brittleness of current LLMs under shifts in tasks and toolsets, which is attributed to insufficient diversity in synthesized training data. DIVE inverts the usual synthesis order: it first executes diverse real-world tools, then reverse-derives tasks strictly entailed by the resulting traces, so every task is grounded by construction. The framework scales structural diversity along two axes, tool-pool coverage (373 tools across five domains) and per-task toolset variety, which induces rich multi-step tool-use patterns. Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) improved results by an average of +22 points across 9 out-of-distribution (OOD) benchmarks and outperformed the strongest 8B baseline by +68%. Notably, diversity scaling consistently outperformed quantity scaling for OOD generalization, even with 4x less data.

Key takeaway

Research scientists developing tool-using LLMs should prioritize diverse training data, expanding tool-pool coverage and per-task toolset variety rather than merely increasing data quantity. An evidence-first synthesis approach keeps tasks verifiable and executable, which is crucial for robust generalization across varied real-world tasks and toolsets. Combined with reinforcement learning, this strategy can yield more adaptable and reliable agentic models.

Key insights

Scaling diversity in agentic task synthesis through evidence-first execution significantly improves LLM generalization for tool use.

Principles

Diversity scaling outperforms quantity scaling for OOD generalization.
Grounded validity and structural diversity are critical for agentic task synthesis.
RL amplifies generalization by exploring diverse tool-use structures.

Method

DIVE inverts task synthesis: it executes real-world tools first to collect grounded evidence, then reverse-derives verifiable query-answer pairs from that evidence. This iterative process expands tool-pool coverage and toolset variety, inducing complex tool-use patterns.
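
A minimal sketch of the evidence-first ordering described above, under loud assumptions: the two toy tools (`lookup_population`, `add`), the `Step` record, and the `collect_trace` and `synthesize_task` helpers are all hypothetical and do not come from the paper, which executes real-world tools and presumably relies on a model to phrase the query. The sketch only illustrates the ordering: run tools first, record the trace, then build a query-answer pair whose answer is read directly from that trace.

```python
import random
from dataclasses import dataclass

# Hypothetical toy data and tools standing in for real-world tools (assumption).
POPULATION = {"Oslo": 700_000, "Bergen": 290_000}

def lookup_population(city):
    return POPULATION[city]

def add(a, b):
    return a + b

@dataclass
class Step:
    tool: str       # name of the tool that was called
    args: tuple     # arguments it was called with
    result: object  # what the tool actually returned

def collect_trace(rng):
    """Step 1: execute tools first and record what actually happened."""
    cities = rng.sample(sorted(POPULATION), 2)
    trace = [Step("lookup_population", (c,), lookup_population(c)) for c in cities]
    total = add(trace[0].result, trace[1].result)
    trace.append(Step("add", (trace[0].result, trace[1].result), total))
    return trace

def synthesize_task(trace):
    """Step 2: reverse-derive a query whose answer is the final trace result."""
    cities = [s.args[0] for s in trace if s.tool == "lookup_population"]
    query = f"What is the combined population of {cities[0]} and {cities[1]}?"
    return {
        "query": query,
        "answer": trace[-1].result,  # taken from the evidence, never guessed
        "tools": sorted({s.tool for s in trace}),
    }

task = synthesize_task(collect_trace(random.Random(0)))
```

Because the answer is copied from the executed trace rather than generated alongside the question, the task is verifiable by re-running the same tool calls, which is the sense of "grounded by construction" used in the summary.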

In practice

Prioritize tool-pool expansion over data quantity for OOD generalization.
Use evidence-first synthesis to ensure task verifiability and executability.
Combine SFT with RL to reinforce diverse tool-use patterns.
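
One way to make the first recommendation measurable is to track the two diversity axes on a candidate dataset. The sketch below is an assumption about how they could be operationalized, not the paper's definition: tool-pool coverage is taken as the fraction of pool tools used at least once, and toolset variety as the number of distinct per-task toolsets. The tool names and the `diversity_stats` helper are hypothetical.

```python
def diversity_stats(tasks, tool_pool):
    """Compute two simple diversity measures over a list of per-task tool lists."""
    used = set()
    toolsets = set()
    for tools in tasks:
        used.update(tools)
        toolsets.add(frozenset(tools))  # order-insensitive toolset identity
    return {
        "pool_coverage": len(used & set(tool_pool)) / len(tool_pool),
        "distinct_toolsets": len(toolsets),
    }

# Hypothetical tool pool and two toy datasets (assumptions for illustration).
pool = ["search", "calculator", "weather", "sql"]
many_but_narrow = [["search", "calculator"]] * 8
few_but_diverse = [["search", "calculator"], ["weather", "sql"]]

narrow = diversity_stats(many_but_narrow, pool)
diverse = diversity_stats(few_but_diverse, pool)
```

In this toy comparison the eight-task dataset covers half the pool with a single toolset, while the two-task dataset covers the whole pool with two toolsets, a small-scale picture of trading quantity for diversity.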

Topics

Agentic Task Synthesis
Tool-Using LLMs
Out-of-Distribution Generalization
Supervised Fine-Tuning
Reinforcement Learning

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.