DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
Summary
DIVE is an evidence-driven framework for improving the generalization of tool-using Large Language Models (LLMs) by synthesizing diverse, verifiable, and executable agentic tasks. It targets the brittleness of current LLMs under shifts in tasks and toolsets, which the authors attribute to insufficient diversity in synthesized training data. DIVE inverts the usual synthesis order: it first executes diverse, real-world tools, then reverse-derives tasks strictly entailed by the resulting traces, so grounding holds by construction. The framework scales structural diversity along two axes, tool-pool coverage (373 tools across five domains) and per-task toolset variety, which induces rich multi-step tool-use patterns. Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) yielded an average improvement of 22 points across 9 out-of-distribution (OOD) benchmarks and outperformed the strongest 8B baseline by 68%. Notably, diversity scaling consistently outperformed quantity scaling for OOD generalization, even with 4x less data.
Key takeaway
Research scientists developing tool-using LLMs should prioritize diversity in synthesized training data, expanding tool-pool coverage and toolset variety rather than merely increasing data quantity. An evidence-first synthesis approach keeps tasks verifiable and executable, which is crucial for robust generalization across varied real-world tasks and toolsets. Combined with reinforcement learning, this strategy should yield more adaptable and reliable agentic models.
Key insights
Scaling diversity in agentic task synthesis through evidence-first execution significantly improves LLM generalization for tool use.
Principles
- Diversity scaling outperforms quantity scaling for OOD generalization.
- Grounded validity and structural diversity are critical for agentic task synthesis.
- RL amplifies generalization by exploring diverse tool-use structures.
Method
DIVE inverts task synthesis: it executes real-world tools first to collect grounded evidence, then reverse-derives verifiable query-answer pairs from that evidence. Repeating this process expands tool-pool coverage and toolset variety, inducing complex tool-use patterns.
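To make the evidence-first ordering concrete, here is a minimal sketch, not the authors' implementation. The tools (`lookup_population`, `add`), the city names, the population figures, and the `Step` record are all hypothetical stand-ins. The point is only the order of operations: tools run first, and the query and answer are derived afterwards from what the tools actually returned.

```python
from dataclasses import dataclass

# Hypothetical toy tools standing in for real-world APIs (illustrative data only).
def lookup_population(city: str) -> int:
    table = {"CityA": 1200, "CityB": 3400}
    return table[city]

def add(a: int, b: int) -> int:
    return a + b

@dataclass
class Step:
    tool: str      # name of the tool that was called
    args: tuple    # arguments it was called with
    result: object # value the tool actually returned

def collect_trace(cities):
    """Step 1: execute tools and record the observed results as evidence."""
    trace, pops = [], []
    for city in cities:
        pop = lookup_population(city)
        trace.append(Step("lookup_population", (city,), pop))
        pops.append(pop)
    total = add(pops[0], pops[1])
    trace.append(Step("add", (pops[0], pops[1]), total))
    return trace

def derive_task(trace):
    """Step 2: reverse-derive a query whose answer is entailed by the trace."""
    cities = [s.args[0] for s in trace if s.tool == "lookup_population"]
    return {
        "query": f"What is the combined population of {cities[0]} and {cities[1]}?",
        "answer": trace[-1].result,  # taken from execution, never guessed
        "tools": sorted({s.tool for s in trace}),
    }

trace = collect_trace(["CityA", "CityB"])
task = derive_task(trace)
```

Because the answer is read off the recorded trace, the task is verifiable by replaying the same tool calls, which is the sense in which grounding holds by construction.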
In practice
- Prioritize tool-pool expansion over data quantity for OOD generalization.
- Use evidence-first synthesis to ensure task verifiability and executability.
- Combine SFT with RL to reinforce diverse tool-use patterns.
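One way to track the two diversity axes in a synthesized dataset is sketched below. The metric definitions are assumptions made for illustration, not the paper's: coverage is taken as the fraction of the tool pool used at least once, and variety as the number of distinct per-task toolsets. The task records and tool names are hypothetical.

```python
def tool_pool_coverage(tasks, tool_pool):
    """Fraction of tools in the pool that appear in at least one task (assumed definition)."""
    used = set()
    for task in tasks:
        used.update(task["tools"])
    return len(used & set(tool_pool)) / len(tool_pool)

def toolset_variety(tasks):
    """Number of distinct toolsets across tasks, ignoring tool order (assumed definition)."""
    return len({frozenset(task["tools"]) for task in tasks})

# Hypothetical dataset: two tasks share the same toolset, one uses a different tool.
tasks = [
    {"tools": ["search", "calc"]},
    {"tools": ["calc", "search"]},
    {"tools": ["fetch"]},
]
pool = ["search", "calc", "fetch", "translate"]

coverage = tool_pool_coverage(tasks, pool)
variety = toolset_variety(tasks)
```

Under these definitions, adding more tasks with the same toolsets raises data quantity but leaves both numbers unchanged, which is the distinction the summary draws between quantity scaling and diversity scaling.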
Topics
- Agentic Task Synthesis
- Tool-Using LLMs
- Out-of-Distribution Generalization
- Supervised Fine-Tuning
- Reinforcement Learning
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.