DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning
Summary
DeepTool is a framework for scaling deliberate thinking in Tool-Integrated Reasoning (TIR), aimed at improving how Large Language Models (LLMs) invoke tools in sequence. Existing TIR methods often lack the strategic planning and self-correction that complex tasks require, because conventional reinforcement learning (RL) relies on sparse outcome-based rewards that give little signal for intermediate steps. DeepTool introduces a synthesis pipeline that evolves extended thinking into interleaved trajectories and adds adversarial perturbations for robustness. It then applies Process-Supervised Reinforcement Learning based on GRPO, using an Action-Centric Process Reward to reinforce intermediate thinking and precise tool invocation. In the reported experiments, DeepTool raises Qwen2.5-7B performance across six benchmarks, including AIME24 (from 3.2% to 40.4%) and HMMT25 (from 0.0% to 28.6%), while maintaining token cost-effectiveness.
Key takeaway
For Machine Learning Engineers building LLM-powered agents that require complex, multi-step tool use, DeepTool reports a large performance gain. Consider process-supervised reinforcement learning as a way to strengthen intermediate reasoning and tool invocation precision. In the reported results, the approach improves Qwen2.5-7B accuracy on challenging benchmarks such as AIME24 and HMMT25 while maintaining token cost-effectiveness.
Key insights
DeepTool combines interleaved deliberation with process-supervised reinforcement learning to improve the accuracy of LLM tool-integrated reasoning while maintaining token cost-effectiveness.
Principles
- Deliberate thinking enhances sequential tool use.
- Process-supervision improves intermediate step learning.
- Adversarial perturbations build robustness.
Method
DeepTool employs a synthesis pipeline for interleaved trajectories with adversarial perturbations, then uses GRPO-based Process-Supervised Reinforcement Learning with an Action-Centric Process Reward for precise tool invocation.
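To make the reward idea concrete, here is a minimal sketch of how a per-action process reward could be folded into a GRPO-style group-relative advantage. It is not DeepTool's implementation: the `Rollout` structure, the `process_weight` value, the averaging of step scores, and the use of the population standard deviation are all assumptions made for illustration. Only the group-wise standardization of rewards reflects how GRPO is generally described.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled trajectory for a prompt (hypothetical structure)."""
    outcome_reward: float      # e.g. 1.0 if the final answer is correct, else 0.0
    step_rewards: list[float]  # hypothetical per-tool-call scores in [0, 1]


def total_reward(rollout: Rollout, process_weight: float = 0.5) -> float:
    """Outcome reward plus a weighted mean of per-action process rewards.

    The additive form and the 0.5 weight are illustrative assumptions.
    """
    process = mean(rollout.step_rewards) if rollout.step_rewards else 0.0
    return rollout.outcome_reward + process_weight * process


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize each reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts sampled for the same prompt (made-up scores).
group = [
    Rollout(outcome_reward=1.0, step_rewards=[1.0, 1.0]),  # correct, clean tool calls
    Rollout(outcome_reward=1.0, step_rewards=[1.0, 0.0]),  # correct, one bad tool call
    Rollout(outcome_reward=0.0, step_rewards=[1.0, 0.0]),  # wrong, one good tool call
    Rollout(outcome_reward=0.0, step_rewards=[]),          # wrong, no tool calls
]
rewards = [total_reward(r) for r in group]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

In this sketch, two rollouts with the same final outcome receive different advantages when their tool calls differ in quality, which is the kind of intermediate-step signal that a purely outcome-based reward cannot provide.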
In practice
- Boost Qwen2.5-7B performance on math benchmarks.
- Improve LLM strategic planning with tools.
- Optimize token efficiency in tool use.
Topics
- Tool-Integrated Reasoning
- Process-Supervised RL
- Large Language Models
- Qwen2.5-7B
- Deliberative AI
- Token Efficiency
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.