ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

2026-05-23 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

ToolSense is an open-source diagnostic framework for evaluating the parametric tool knowledge of large language models (LLMs). It targets a limitation of existing benchmarks, which can report strong retrieval scores without revealing whether a model actually understands its tools. The framework automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries of varied ambiguity, a multiple-choice (MCQ) probing benchmark, and a QA probing benchmark. Applied to ToolBench's ~47k tools and five parametric model configurations, ToolSense uncovered a "knowledge-retrieval dissociation": models that performed strongly on ToolBench's verbose, fully specified queries dropped by approximately 50-64 percentage points on RRB queries, often falling below embedding-model baselines. The same models frequently scored near-random on the factual probes, which suggests that effective in-distribution retrieval does not imply genuine semantic understanding of the tools.
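To make "near-random" concrete, the sketch below checks whether a multiple-choice accuracy is statistically distinguishable from guessing, using an exact one-sided binomial tail. The question count, number of options, and significance threshold are hypothetical values chosen for illustration; none of them come from the source, and this is not presented as ToolSense's own scoring procedure.

```python
from math import comb


def binomial_tail(n: int, successes: int, p: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(successes, n + 1)
    )


def above_chance(correct: int, total: int, n_options: int, alpha: float = 0.05) -> bool:
    """True if the observed accuracy is unlikely under uniform random guessing.

    Chance accuracy is 1 / n_options. alpha is an illustrative threshold.
    """
    chance = 1.0 / n_options
    return binomial_tail(total, correct, chance) < alpha


# Hypothetical probe: 200 four-option questions (chance accuracy = 25%).
near_random = above_chance(correct=56, total=200, n_options=4)   # 28% accuracy
clearly_informed = above_chance(correct=120, total=200, n_options=4)  # 60% accuracy
```

Under these assumed numbers, 28% accuracy is not separable from guessing while 60% is, which is the kind of distinction a factual probe needs to draw before a score is read as evidence of tool knowledge.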

Key takeaway

For machine learning engineers deploying LLM agents over large tool catalogs, standard retrieval benchmarks alone are insufficient. A model may exhibit a knowledge-retrieval dissociation: high in-distribution scores but poor performance on realistic queries. Diagnostic frameworks such as ToolSense can expose these gaps in semantic understanding. To mitigate knowledge erosion during retrieval training, consider flat token formats and LoRA fine-tuning, so that agents retain an understanding of their tools rather than only a mapping from query patterns to tool identifiers.

Key insights

Parametric tool retrieval in LLMs often shows a knowledge-retrieval dissociation, failing on realistic queries despite high benchmark scores.

Principles

Parametric tool retrieval often learns surface-level query patterns.
Constrained decoding can mask genuine tool knowledge deficits.
Stage 2 retrieval fine-tuning can degrade semantic tool understanding.
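The second principle can be illustrated with a toy decoder. In the sketch below, a trie of valid tool identifiers restricts which token may be chosen at each step, so the output is always a valid tool even when the model's unconstrained preference is not a tool at all. The token names and scores are invented for illustration, and the functions are a minimal sketch, not the decoding code used by ToolSense or ToolBench.

```python
from typing import Dict, List, Optional, Sequence


def build_trie(sequences: Sequence[Sequence[str]]) -> dict:
    """Build a nested-dict trie over valid tool-token sequences."""
    root: dict = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root


def greedy_decode(scores: List[Dict[str, float]], trie: Optional[dict] = None) -> List[str]:
    """Pick the highest-scoring token at each step.

    If a trie is given, only tokens that continue a valid tool
    identifier are eligible; otherwise every token is eligible.
    """
    out: List[str] = []
    node = trie
    for step_scores in scores:
        if node is None:
            candidates = step_scores
        else:
            candidates = {t: s for t, s in step_scores.items() if t in node}
        if not candidates:
            break  # reached a trie leaf or no valid continuation
        tok = max(candidates, key=candidates.get)
        out.append(tok)
        if node is not None:
            node = node[tok]
    return out


# Hypothetical tool identifiers, each a two-token sequence.
trie = build_trie([["weather", "forecast"], ["weather", "alerts"], ["stocks", "quote"]])

# Hypothetical per-step model scores: the top choices are not tool tokens.
scores = [
    {"get": 0.5, "weather": 0.3, "stocks": 0.2},
    {"info": 0.6, "forecast": 0.3, "quote": 0.1},
]

unconstrained = greedy_decode(scores)        # ['get', 'info']
constrained = greedy_decode(scores, trie)    # ['weather', 'forecast']
```

The constrained run looks like a correct retrieval, yet the model never ranked a tool token first. Evaluating only under the trie therefore hides how little probability mass the model assigns to valid tools on its own.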

Method

ToolSense generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with tiered queries, MCQ probing for discriminative factual knowledge, and QA probing for inferential factual knowledge. It also defines an Internalization Score (IS@k) that quantifies how much a model's retrieval depends on the decoding trie.
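This summary does not give the formula for IS@k. Purely as an illustration of what a trie-dependency measure could look like, the sketch below assumes one plausible reading: the ratio of Recall@k without the trie to Recall@k with it. That definition, the function names, and the toy rankings are all assumptions and may differ from the paper's actual metric.

```python
from typing import Sequence


def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose gold tool appears in the top-k predictions."""
    hits = sum(1 for preds, g in zip(ranked, gold) if g in preds[:k])
    return hits / len(gold)


def assumed_internalization_score(
    unconstrained: Sequence[Sequence[str]],
    constrained: Sequence[Sequence[str]],
    gold: Sequence[str],
    k: int,
) -> float:
    """ASSUMED definition: unconstrained Recall@k divided by constrained Recall@k.

    A value near 1 would mean the model retrieves about as well without
    the trie; a value near 0 would mean it leans heavily on the trie.
    """
    constrained_recall = recall_at_k(constrained, gold, k)
    if constrained_recall == 0:
        return 0.0
    return recall_at_k(unconstrained, gold, k) / constrained_recall


# Toy data: four queries with hypothetical tool ids.
gold = ["t1", "t2", "t3", "t4"]
with_trie = [["t1", "t9"], ["t2", "t8"], ["t3", "t7"], ["t5", "t4"]]
without_trie = [["t1", "x"], ["x", "t2"], ["y", "z"], ["t4", "q"]]

is_at_1 = assumed_internalization_score(without_trie, with_trie, gold, k=1)
is_at_2 = assumed_internalization_score(without_trie, with_trie, gold, k=2)
```

Whatever the exact formula, the useful property is the comparison itself: the same model is scored with and without the trie, so the gap isolates what the constraint contributes.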

In practice

Audit LLM tool knowledge using ToolSense's diagnostic benchmarks.
Prefer flat token formats for parametric tool encoding.
Apply LoRA during Stage 2 fine-tuning to preserve semantic knowledge.
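For readers unfamiliar with LoRA, the sketch below shows the standard update form, W_eff = W + (alpha / r) * B A, in plain Python. The base matrix W stays frozen and only the low-rank factors A and B are trained, which is one plausible intuition for why earlier knowledge is disturbed less; the source does not state the mechanism. The matrix values are hypothetical, and this is not the training code from the SAP/toolsense repository.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x n) and b (n x p)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float, r: int) -> Matrix:
    """Return W + (alpha / r) * (B @ A).

    w: frozen base weight (out x in)
    a: trainable factor (r x in)
    b: trainable factor (out x r), conventionally initialised to zeros
    """
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Hypothetical 2x2 layer with rank r = 1.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]

# With B at zero, the adapted layer equals the base layer exactly.
b_init = [[0.0], [0.0]]
w_start = lora_effective_weight(w, a, b_init, alpha=2.0, r=1)

# After (hypothetical) training, only the low-rank delta has changed.
b_trained = [[0.5], [-0.5]]
w_after = lora_effective_weight(w, a, b_trained, alpha=2.0, r=1)
```

Because the update lives entirely in A and B, the adapter can be removed or rescaled to recover the base model, which makes it straightforward to compare probe scores before and after retrieval fine-tuning.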

Topics

LLM Agents
Parametric Tool Retrieval
Diagnostic Benchmarking
Knowledge-Retrieval Dissociation
LoRA Fine-tuning
Virtual Tokens

Code references

SAP/toolsense

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.