The Tool-Overuse Illusion: Why Does LLM Prefer External Tools over Internal Knowledge?
Summary
A study reveals "tool overuse" in Large Language Models (LLMs): models invoke external tools even when their internal knowledge suffices, which wastes resources and degrades performance. The phenomenon is pervasive across diverse LLMs, including frontier, open-source, and RLVR-trained models, with an average of 0.93 unnecessary tool calls per query and an accuracy drop of 3.29% to 14.48% on internally solvable questions. The research identifies two primary mechanisms: a "knowledge epistemic illusion," in which LLMs misjudge the boundaries of their internal knowledge, and an "outcome-only reward trap" in Reinforcement Learning with Verifiable Rewards (RLVR) training, which rewards final correctness and ignores tool efficiency. Two mitigation strategies address these mechanisms: knowledge-aware direct preference optimization (K-DPO) reduces tool calls by 82.8%, and a balanced outcome-efficiency reward reduces them by 66.7% (7B model), in both cases while improving or maintaining accuracy.
Key takeaway
For NLP Engineers and Research Scientists developing tool-augmented LLMs, understanding and mitigating tool overuse is critical. Your models may be inefficiently calling external tools due to miscalibrated internal knowledge and reward structures that prioritize correctness over efficiency. Implement knowledge-aware optimization and balanced reward functions to reduce unnecessary tool invocations, thereby improving performance and resource efficiency without sacrificing accuracy.
Key insights
LLMs frequently overuse external tools because they misjudge the limits of their internal knowledge and because outcome-only training rewards ignore the cost of tool calls.
Principles
- Higher internal knowledge does not correlate with fewer tool calls.
- Outcome-only rewards reinforce tool overuse.
- Tool invocation is rational if marginal reliability gain exceeds marginal cost.
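The last principle above can be read as a simple decision rule. The sketch below is illustrative only: the function name, the probability inputs, and the idea of expressing tool cost in the same units as accuracy are assumptions made for the example, not details taken from the paper.

```python
def should_call_tool(p_internal: float, p_with_tool: float, tool_cost: float) -> bool:
    """Hypothetical rule: call the tool only if the marginal gain beats the marginal cost.

    p_internal   -- estimated probability of answering correctly without tools
    p_with_tool  -- estimated probability of answering correctly with the tool
    tool_cost    -- cost of one call, expressed in the same units as accuracy (assumed)
    """
    for name, p in (("p_internal", p_internal), ("p_with_tool", p_with_tool)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")
    marginal_gain = p_with_tool - p_internal
    return marginal_gain > tool_cost


# A question the model already answers reliably: the small gain does not justify a call.
easy_case = should_call_tool(p_internal=0.95, p_with_tool=0.97, tool_cost=0.05)

# A question the model mostly gets wrong on its own: the call is worth its cost.
hard_case = should_call_tool(p_internal=0.40, p_with_tool=0.90, tool_cost=0.05)
```

Under this reading, tool overuse corresponds to a model that behaves as if `p_internal` were lower than it really is, so the rule fires on questions like the first case.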
Method
Knowledge-aware Direct Preference Optimization (K-DPO) aligns perceived knowledge with actual capacity. A balanced outcome-efficiency reward penalizes tool calls during RLVR training.
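To make the idea of a balanced outcome-efficiency reward concrete, here is a minimal sketch. The summary does not give the paper's exact formula, so the functional form, the parameter names `penalty_per_call` and `max_penalty`, and the choice to penalize only correct answers are all assumptions for illustration.

```python
def balanced_reward(
    is_correct: bool,
    num_tool_calls: int,
    penalty_per_call: float = 0.1,  # assumed value, not from the paper
    max_penalty: float = 0.5,       # assumed cap so correctness always dominates
) -> float:
    """Hypothetical reward: outcome term minus a capped per-call efficiency penalty."""
    if num_tool_calls < 0:
        raise ValueError("num_tool_calls must be non-negative")
    if not is_correct:
        # An outcome-only reward would also return 0.0 here; the two rewards
        # differ only in how they treat correct answers.
        return 0.0
    penalty = min(penalty_per_call * num_tool_calls, max_penalty)
    return 1.0 - penalty


# Two correct rollouts: an outcome-only reward scores them identically,
# while this sketch prefers the one that used no tools.
reward_no_tools = balanced_reward(is_correct=True, num_tool_calls=0)
reward_three_calls = balanced_reward(is_correct=True, num_tool_calls=3)
```

Capping the penalty keeps any correct answer ranked above any incorrect one, so the efficiency term only breaks ties among correct rollouts instead of trading accuracy for fewer calls.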
In practice
- Implement K-DPO to reduce unnecessary tool calls.
- Integrate tool efficiency penalties into RLVR reward functions.
- Measure internal knowledge availability using avg@1024.
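The measurement step in the list above could be sketched as follows, reading avg@1024 as the mean accuracy over 1024 sampled tool-free answers to the same question. That reading, the exact-match scoring, and the `sample_answer` callable are assumptions; in a real setup `sample_answer` would wrap a model call with tools disabled.

```python
from itertools import cycle
from typing import Callable


def avg_at_k(
    sample_answer: Callable[[str], str],  # hypothetical: one tool-free sample per call
    question: str,
    reference: str,
    k: int = 1024,
) -> float:
    """Fraction of k sampled answers that exactly match the reference answer."""
    if k <= 0:
        raise ValueError("k must be positive")
    target = reference.strip()
    hits = sum(1 for _ in range(k) if sample_answer(question).strip() == target)
    return hits / k


# Stub sampler standing in for a model: it is right three times out of four.
_fake_outputs = cycle(["4", "4", "4", "5"])


def stub_sampler(question: str) -> str:
    return next(_fake_outputs)


score = avg_at_k(stub_sampler, "What is 2 + 2?", "4", k=1024)
```

A score near 1.0 suggests the question is internally solvable, so tool calls on it are candidates for "unnecessary"; such scores could also be used to pick the questions on which tool-free answers are preferred when building preference pairs, though how K-DPO actually constructs its pairs is not stated in this summary.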
Topics
- Tool Overuse
- Knowledge Epistemic Illusion
- Outcome-Only Rewards
- Tool-Integrated Reasoning
- Direct Preference Optimization
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.