ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

ToolGrad is an agentic framework for generating high-quality tool-use datasets for large language models (LLMs) efficiently. It inverts the traditional "query-first" paradigm, in which a user query is written first and then annotated with tool calls using strategies such as depth-first search (DFS). ToolGrad instead takes an "answer-first" approach: it iteratively constructs valid tool-use chains, guided by textual "gradients", and only then synthesizes the corresponding user queries. This avoids the inefficiency and annotation failures of query-first pipelines. Using the framework, the researchers created ToolGrad-5K, a dataset of 5,000 samples that features more complex tool usage, costs less to generate, and has a 100% pass rate. In their experiments, models such as Gemma-3 and Llama-3 fine-tuned on ToolGrad-5K outperform both models trained on expensive baseline datasets such as ToolBench and proprietary LLMs. The gains appear in tool recall, success rate, and Quality of Response, and they hold on out-of-distribution benchmarks.

Key takeaway

Machine learning engineers developing tool-use LLMs should consider adding ToolGrad-5K to their training pipeline. The dataset was generated with a 100% pass rate and at lower cost, and it enables smaller models such as Llama-3 to achieve better tool recall and response quality than models trained on traditional, failure-prone datasets. Adopting ToolGrad's "answer-first" approach can reduce data generation costs and improve model performance, even on out-of-distribution tasks, which makes fine-tuned smaller models a compelling alternative to costly proprietary LLMs.

Key insights

ToolGrad inverts tool-use dataset generation: it builds valid API chains first and writes the user queries afterwards, which yields a 100% pass rate at lower cost.

Principles

"Answer-first" data generation improves efficiency and guarantees solvability.
Textual "gradients" can guide iterative API chain construction.
High-quality synthetic data can enable smaller LLMs to outperform larger proprietary models.

Method

ToolGrad iteratively chains APIs using four LLM-powered modules: an API Proposer (suggests candidate APIs), an API Executor (runs them), an API Selector (chooses the best API via textual "gradients"), and a Workflow Updater (integrates the selected API into the workflow).
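
The four-module loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (TOY_APIS, Step, Workflow, propose, execute, select, update, synthesize_query, build_sample) is hypothetical, and the LLM-powered modules are replaced by deterministic stubs so the example runs on its own. In particular, the selector simply takes the first successful call, where the real framework uses a textual "gradient".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical toy API registry; real tool APIs would go here.
TOY_APIS = {
    "get_weather": lambda city: f"Sunny in {city}",
    "get_time": lambda city: f"12:00 in {city}",
    "broken_api": lambda city: 1 / 0,  # always raises, to simulate a failing API
}

Call = Tuple[str, Dict[str, str]]  # (api name, arguments)


@dataclass
class Step:
    api: str
    args: Dict[str, str]
    output: str


@dataclass
class Workflow:
    steps: List[Step] = field(default_factory=list)


def propose(workflow: Workflow, apis: List[str]) -> List[Call]:
    """Stub for the API Proposer: suggest every API not used yet."""
    used = {s.api for s in workflow.steps}
    return [(name, {"city": "Paris"}) for name in apis if name not in used]


def execute(call: Call) -> Optional[str]:
    """Stub for the API Executor: run the call, return None on failure."""
    name, args = call
    try:
        return str(TOY_APIS[name](**args))
    except Exception:
        return None


def select(workflow: Workflow, results: List[Tuple[Call, str]]):
    """Stub for the API Selector: take the first successful call.
    An LLM-produced textual 'gradient' would rank candidates here."""
    return results[0] if results else None


def update(workflow: Workflow, chosen: Tuple[Call, str]) -> Workflow:
    """Stub for the Workflow Updater: append the chosen call as a new step."""
    (name, args), output = chosen
    return Workflow(steps=workflow.steps + [Step(name, args, output)])


def synthesize_query(workflow: Workflow) -> str:
    """Answer-first: the query is written only after the chain exists."""
    names = ", ".join(s.api for s in workflow.steps)
    return f"User request answerable with: {names}"


def build_sample(apis: List[str], max_iters: int = 3):
    workflow = Workflow()
    for _ in range(max_iters):
        candidates = propose(workflow, apis)
        executed = [(c, execute(c)) for c in candidates]
        results = [(c, out) for c, out in executed if out is not None]
        chosen = select(workflow, results)
        if chosen is None:  # nothing executable is left to add
            break
        workflow = update(workflow, chosen)
    return synthesize_query(workflow), workflow


query, workflow = build_sample(list(TOY_APIS))
print(query)
```

Because only calls that actually executed can enter the workflow, every step in the finished chain is valid by construction, which is the intuition behind the reported 100% pass rate.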

In practice

Fine-tune smaller LLMs on ToolGrad-5K for superior tool-use capabilities.
Adopt the "answer-first" paradigm for synthetic dataset creation to ensure data quality.
Utilize textual "gradients" for iterative optimization in agentic frameworks.
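
When evaluating a model fine-tuned as suggested above, tool recall can be approximated as the fraction of reference tools that the model actually called. The sketch below assumes that definition; the paper's exact metric may differ, and the function name tool_recall and the example tool names are hypothetical.

```python
from typing import Iterable


def tool_recall(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Fraction of reference tools that appear among the predicted tools.

    Assumed definition for illustration; duplicates and call order are ignored.
    """
    ref = set(reference)
    if not ref:
        return 1.0  # nothing to recall, treated as a perfect score by convention
    return len(ref & set(predicted)) / len(ref)


# The model called one of the two reference tools, plus one extra tool.
score = tool_recall(["get_weather", "search"], ["get_weather", "get_time"])
print(score)  # 0.5
```

Recall alone does not penalize unnecessary calls, so it is usually read alongside success rate and response quality.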

Topics

Tool-use LLMs
Dataset Generation
Textual Gradients
API Chaining
Synthetic Data
LLM Fine-tuning
Out-of-Distribution Evaluation

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.