The Tool-Overuse Illusion: Why Does LLM Prefer External Tools over Internal Knowledge?

2026-04-24 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A study identifies "tool overuse" in Large Language Models (LLMs): models invoke external tools even when their internal knowledge suffices, which wastes resources and degrades performance. The phenomenon is pervasive across frontier, open-source, and RLVR-trained models, with an average of 0.93 unnecessary tool calls per query and an accuracy drop of 3.29% to 14.48% on internally solvable questions. The research identifies two primary mechanisms: a "knowledge epistemic illusion," where LLMs misjudge the boundaries of their internal knowledge, and an "outcome-only reward trap" in Reinforcement Learning with Verifiable Rewards (RLVR) training, which rewards final correctness without regard to tool efficiency. Two mitigation strategies are proposed: knowledge-aware direct preference optimization (K-DPO), which reduces tool calls by 82.8%, and a balanced outcome-efficiency reward, which reduces them by 66.7% on a 7B model. Both improve or maintain accuracy.

Key takeaway

For NLP Engineers and Research Scientists developing tool-augmented LLMs, understanding and mitigating tool overuse is critical. Your models may be inefficiently calling external tools due to miscalibrated internal knowledge and reward structures that prioritize correctness over efficiency. Implement knowledge-aware optimization and balanced reward functions to reduce unnecessary tool invocations, thereby improving performance and resource efficiency without sacrificing accuracy.

Key insights

LLMs frequently overuse external tools due to misjudging internal knowledge and outcome-only training rewards.

Principles

Higher internal knowledge does not correlate with fewer tool calls.
Outcome-only rewards reinforce tool overuse.
Tool invocation is rational if marginal reliability gain exceeds marginal cost.
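
The third principle can be read as a simple decision rule. The sketch below is illustrative only: the function name, the probability inputs, and the shared unit for value and cost are assumptions made for this example, not definitions from the paper.

```python
def should_call_tool(
    p_internal: float,        # assumed: probability of a correct answer without tools
    p_with_tool: float,       # assumed: probability of a correct answer with the tool
    value_of_correct: float,  # assumed: value of a correct answer, same unit as tool_cost
    tool_cost: float,         # assumed: cost of one call (latency, tokens, fees)
) -> bool:
    """Return True only when the marginal reliability gain exceeds the marginal cost."""
    marginal_gain = (p_with_tool - p_internal) * value_of_correct
    return marginal_gain > tool_cost


# A question the model already answers well: the small gain does not cover the cost.
print(should_call_tool(0.95, 0.97, 1.0, 0.05))
# A question the model mostly gets wrong on its own: the call pays for itself.
print(should_call_tool(0.30, 0.90, 1.0, 0.05))
```

Under this reading, tool overuse corresponds to calling the tool in the first case, where the model underestimates `p_internal`.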

Method

Knowledge-aware Direct Preference Optimization (K-DPO) aligns perceived knowledge with actual capacity. A balanced outcome-efficiency reward penalizes tool calls during RLVR training.
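
A minimal sketch of what a balanced outcome-efficiency reward might look like. The summary does not give the paper's exact formula, so the linear per-call penalty, the cap, and the default values here are assumptions chosen for illustration.

```python
def balanced_reward(
    is_correct: bool,
    num_tool_calls: int,
    penalty_per_call: float = 0.25,  # hypothetical weight, not from the paper
    max_penalty: float = 0.5,        # hypothetical cap, not from the paper
) -> float:
    """Outcome reward minus a capped efficiency penalty on tool calls."""
    outcome = 1.0 if is_correct else 0.0
    # Cap the penalty below 1.0 so a correct answer always scores above an incorrect one.
    penalty = min(penalty_per_call * num_tool_calls, max_penalty)
    return outcome - penalty


def outcome_only_reward(is_correct: bool, num_tool_calls: int) -> float:
    """Baseline for comparison: ignores how many tool calls were made."""
    return 1.0 if is_correct else 0.0


for calls in (0, 1, 3):
    print(calls, outcome_only_reward(True, calls), balanced_reward(True, calls))
```

With the outcome-only baseline, a correct answer earns the same reward regardless of the number of calls, so training has no pressure against extra calls. The balanced variant ranks the tool-free correct answer highest.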

In practice

Implement K-DPO to reduce unnecessary tool calls.
Integrate tool efficiency penalties into RLVR reward functions.
Measure internal knowledge availability using avg@1024.
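
One way to estimate internal knowledge availability is sketched below. It assumes avg@k means the mean accuracy over k sampled tool-free answers to the same question, which the summary does not spell out. The `sample_answer` callable and the exact-match check are hypothetical stand-ins for a real model call and a real grader.

```python
from itertools import cycle
from typing import Callable


def avg_at_k(
    sample_answer: Callable[[str], str],  # hypothetical: returns one tool-free answer
    question: str,
    gold: str,
    k: int = 1024,
) -> float:
    """Fraction of k sampled answers that match the gold answer (case-insensitive)."""
    target = gold.strip().lower()
    correct = sum(
        sample_answer(question).strip().lower() == target for _ in range(k)
    )
    return correct / k


# Stub sampler for demonstration: cycles through fixed answers instead of calling a model.
_answers = cycle(["Paris", "Lyon", "Paris", "Paris"])


def stub_sampler(question: str) -> str:
    return next(_answers)


print(avg_at_k(stub_sampler, "What is the capital of France?", "Paris", k=8))
```

A question with a high score under this estimate is one the model can usually answer on its own, so tool calls on it are candidates for the "unnecessary" label.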

Topics

Tool Overuse
Knowledge Epistemic Illusion
Outcome-Only Rewards
Tool-Integrated Reasoning
Direct Preference Optimization

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.