Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents
Summary
LLM agents frequently mis-call tools, a failure often attributed to the model being unable to find the correct tool in a crowded tool list. New research shows the opposite: on real BFCL failures, the correct (gold) tool is the most-attended tool 80% of the time (versus a 21% chance baseline) and is under-attended in only 10% of cases. This refutes the "crowded-harness" theory and locates the failure at the decision readout. Two lines of evidence support this. First, prompt repairs recover at most 23% of failures, while readout-side interventions recover 59-91%. Second, two gold-pointed interventions that act on different representations recover largely the same failures (pooled Jaccard 0.865). A training-free, gold-free selector based on per-segment attention closes most of the gold-free-vs-oracle gap on BFCL (+11.9 pts) and adds +14.9 pts on Seal-Tools. The causal attention-bias dose-response was tested on 10 models (3-32B), and the deployable selector on 5 single-turn models.
Key takeaway
AI engineers debugging tool-selection failures in LLM agents should shift their focus from prompt engineering to the readout mechanism. The research indicates that models usually "see" the correct tool but fail when turning that attention into a decision. Prioritize readout-side interventions, which recover 59-91% of failures, over prompt reordering or duplication, which recover at most 23%. Consider an attention-based selector, which in the reported results gained +11.9 pts on BFCL and +14.9 pts on Seal-Tools.
Key insights
LLM agent tool-selection failures stem from decision readout, not insufficient attention to the correct tool.
Principles
- LLM agents often attend to the correct tool but fail at decision readout.
- Readout-side interventions are more effective than prompt repairs.
- Attention-logit bias and residual-stream steering recover similar failures.
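The overlap between two interventions is reported as a Jaccard index over the sets of failures each one recovers. The following is a minimal sketch of that measure only; the failure IDs and variable names are made up for illustration and are not data from the research.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| for two collections of failure IDs."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical failure IDs recovered by each intervention (illustrative only).
recovered_by_attention_bias = {"f01", "f02", "f03", "f05"}
recovered_by_steering = {"f01", "f02", "f03", "f04"}

overlap = jaccard(recovered_by_attention_bias, recovered_by_steering)
print(f"Jaccard overlap: {overlap:.3f}")
```

A value near 1 means the two interventions fix almost the same failures, which is the pattern the pooled 0.865 figure describes.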
Method
A training-free, gold-free selector uses per-segment attention to improve tool selection, closing most of the gold-free-vs-oracle gap.
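This summary does not spell out how the selector aggregates attention, so the sketch below is an illustrative stand-in rather than the published method. It assumes a single vector of attention weights from the decision position to each prompt token and a hypothetical mapping from tool names to token spans, then picks the tool whose segment has the highest mean attention. No gold label is used at any point.

```python
def attention_per_segment(attn, segments):
    """Mean attention weight inside each tool's token span.

    attn:     list of floats, one attention weight per prompt token
              (assumed to come from the decision position).
    segments: dict mapping tool name -> (start, end) half-open token span.
    """
    scores = {}
    for name, (start, end) in segments.items():
        span = attn[start:end]
        # Length-normalise so long tool descriptions are not favoured.
        scores[name] = sum(span) / len(span) if span else 0.0
    return scores


def select_tool(attn, segments):
    """Return the tool whose segment receives the most attention per token."""
    scores = attention_per_segment(attn, segments)
    return max(scores, key=scores.get)


# Toy inputs: nine prompt tokens split across three hypothetical tools.
attn = [0.02, 0.03, 0.05, 0.10, 0.30, 0.25, 0.05, 0.05, 0.15]
segments = {
    "get_weather": (0, 3),
    "search_flights": (3, 6),
    "book_hotel": (6, 9),
}

chosen = select_tool(attn, segments)
print(chosen)
```

In a real model the attention vector would have to be pooled over layers and heads; which ones to use is a design choice this sketch leaves open.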
In practice
- Implement readout-side interventions for LLM tool-selection issues.
- Utilize attention-based selectors to enhance tool-calling accuracy.
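As a rough picture of what a readout-side intervention such as an attention-logit bias does, the toy sketch below adds a constant to the pre-softmax logits of one segment in a single attention row and reports how much attention mass that segment receives as the bias grows. It is an assumption-laden illustration: the function names, the uniform logits, and the bias values are hypothetical, and the research applies the intervention inside actual models rather than to a standalone vector.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(logits, segment, bias):
    """Add `bias` to the logits inside `segment` (half-open span), then softmax."""
    start, end = segment
    shifted = [x + bias if start <= i < end else x for i, x in enumerate(logits)]
    return softmax(shifted)


def segment_mass(weights, segment):
    """Total attention weight that falls inside the segment."""
    start, end = segment
    return sum(weights[start:end])


# Toy attention row: six tokens with equal logits; tokens 2-3 are the target tool.
logits = [0.0] * 6
target = (2, 4)

# Dose-response: attention mass on the target segment at increasing bias.
masses = [
    segment_mass(biased_attention(logits, target, b), target)
    for b in (0.0, 1.0, 2.0)
]
for b, mass in zip((0.0, 1.0, 2.0), masses):
    print(f"bias={b:.1f} -> target mass={mass:.3f}")
```

The mass on the target segment rises monotonically with the bias, which is the kind of dose-response relationship the causal tests examine.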
Topics
- LLM Agents
- Tool Selection
- Attention Mechanisms
- Decision Readout
- Model Debugging
- BFCL Benchmark
- Seal-Tools
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.