DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

2026-05-28 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

DeepTool is a framework for scaling deliberate thinking in Tool-Integrated Reasoning (TIR), where a Large Language Model (LLM) interleaves reasoning with sequential tool calls. Existing TIR methods often lack the strategic planning and self-correction that complex tasks require, in part because conventional reinforcement learning (RL) relies on sparse, outcome-based rewards that give little signal for intermediate steps. DeepTool addresses this in two ways. First, a synthesis pipeline evolves extended thinking into interleaved trajectories and adds adversarial perturbations for robustness. Second, Process-Supervised Reinforcement Learning based on GRPO uses an Action-Centric Process Reward to reinforce intermediate thinking and precise tool invocation. In experiments, DeepTool substantially improves Qwen2.5-7B across six benchmarks, for example raising AIME24 from 3.2% to 40.4% and HMMT25 from 0.0% to 28.6%, while remaining cost-effective in tokens.

Key takeaway

For Machine Learning Engineers building LLM-powered agents that need complex, multi-step tool use, DeepTool reports a large performance gain. Consider adding process-supervised reinforcement learning to strengthen intermediate reasoning and the precision of tool calls. The reported results show large accuracy gains for Qwen2.5-7B on challenging benchmarks while token cost stays effective.

Key insights

DeepTool combines interleaved deliberation with process-supervised reinforcement learning to substantially improve LLM tool-integrated reasoning while keeping token cost effective.

Principles

Deliberate thinking enhances sequential tool use.
Process supervision improves learning of intermediate steps.
Adversarial perturbations build robustness.

Method

DeepTool employs a synthesis pipeline for interleaved trajectories with adversarial perturbations, then uses GRPO-based Process-Supervised Reinforcement Learning with an Action-Centric Process Reward for precise tool invocation.
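
To make the reward idea concrete, here is a minimal sketch of how a process reward can be folded into GRPO-style group-relative advantages. It is an illustration under stated assumptions, not DeepTool's implementation: the summary does not define the Action-Centric Process Reward, so the per-step scores, the `Rollout` type, the `process_weight` blend, and the use of a population standard deviation are all hypothetical choices. The only part taken from GRPO itself is normalizing each rollout's reward against the mean and standard deviation of its sampled group.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled trajectory for a prompt (hypothetical container)."""
    outcome_reward: float      # e.g. 1.0 if the final answer is correct, else 0.0
    step_rewards: list[float]  # assumed per-tool-call scores in [0, 1]


def total_reward(rollout: Rollout, process_weight: float = 0.5) -> float:
    """Blend the sparse outcome reward with a dense per-step signal.

    The additive blend and its weight are assumptions for illustration.
    """
    process = mean(rollout.step_rewards) if rollout.step_rewards else 0.0
    return rollout.outcome_reward + process_weight * process


def group_relative_advantages(
    rollouts: list[Rollout],
    process_weight: float = 0.5,
    eps: float = 1e-6,
) -> list[float]:
    """GRPO-style advantage: standardize each reward within its group."""
    rewards = [total_reward(r, process_weight) for r in rollouts]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    return [(x - mu) / (sigma + eps) for x in rewards]


if __name__ == "__main__":
    group = [
        Rollout(1.0, [1.0, 1.0]),  # correct answer, clean tool calls
        Rollout(0.0, [1.0, 0.0]),  # wrong answer, one good tool call
        Rollout(0.0, [0.0, 0.0]),  # wrong answer, bad tool calls
    ]
    for rollout, adv in zip(group, group_relative_advantages(group)):
        print(rollout, round(adv, 3))
```

With an outcome-only reward, the two incorrect rollouts above would receive the same advantage. Adding a per-step term separates them, which is the kind of intermediate signal that process supervision is meant to provide.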

In practice

Boost Qwen2.5-7B performance on math benchmarks.
Improve LLM strategic planning with tools.
Keep token cost effective during tool use.

Topics

Tool-Integrated Reasoning
Process-Supervised RL
Large Language Models
Qwen2.5-7B
Deliberative AI
Token Efficiency

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.