ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs
Summary
ToolSense is an open-source diagnostic framework that evaluates the parametric tool knowledge of large language models (LLMs). It addresses a limitation of existing benchmarks, which may not reveal whether a model truly understands its tools. It automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries of varying ambiguity, an MCQ probing benchmark, and a QA probing benchmark. Applied to ToolBench's ~47k tools and five parametric model configurations, ToolSense uncovered a "knowledge-retrieval dissociation": models that performed strongly on verbose, fully specified ToolBench benchmarks dropped by approximately 50-64 percentage points on RRB queries, often falling below embedding-model baselines. The same models frequently scored near random on factual probes, indicating a lack of genuine semantic understanding despite effective in-distribution retrieval.
Key takeaway
For machine learning engineers deploying LLM agents over large tool catalogs, standard retrieval benchmarks alone are insufficient. Your models may exhibit a knowledge-retrieval dissociation, performing poorly on realistic queries despite high in-distribution scores. Integrate diagnostic frameworks like ToolSense to uncover gaps in semantic understanding. Consider flat token formats and LoRA fine-tuning to mitigate knowledge erosion during retrieval training, so that your agents genuinely comprehend their tools and perform robustly in real-world use.
Key insights
Parametric tool retrieval in LLMs often shows a knowledge-retrieval dissociation, failing on realistic queries despite high benchmark scores.
Principles
- Parametric tool retrieval often learns surface-level query patterns.
- Constrained decoding can mask genuine tool knowledge deficits.
- Stage 2 retrieval fine-tuning can degrade semantic tool understanding.
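To make the second principle concrete, here is a minimal sketch of trie-constrained greedy decoding. It is an illustration only, not ToolSense's implementation: the tool names, the per-step score dictionaries, and the `<end>` marker are all hypothetical, and the scores are fixed per step rather than conditioned on earlier tokens as a real model's would be.

```python
from typing import Dict, List, Optional, Sequence

END = "<end>"  # hypothetical end-of-tool marker


def build_trie(tools: Sequence[Sequence[str]]) -> dict:
    """Build a nested-dict trie over the token sequences of valid tools."""
    root: dict = {}
    for tokens in tools:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def greedy_decode(
    step_scores: Sequence[Dict[str, float]], trie: Optional[dict] = None
) -> List[str]:
    """Pick the best-scoring token per step; with a trie, only valid continuations count."""
    out: List[str] = []
    node = trie
    for scores in step_scores:
        candidates = scores
        if node is not None:
            # Mask every token that is not a valid continuation in the trie.
            candidates = {t: s for t, s in scores.items() if t in node}
            if not candidates:
                break
        tok = max(candidates, key=candidates.get)
        if tok == END:
            break
        out.append(tok)
        if node is not None:
            node = node[tok]
    return out


# Toy catalog and toy model scores (both assumptions for illustration).
trie = build_trie([["weather", "forecast"], ["weather", "alerts"], ["stocks", "quote"]])
steps = [
    {"weath": 0.5, "weather": 0.3, "stocks": 0.2},
    {"today": 0.6, "forecast": 0.3, "alerts": 0.1},
    {END: 0.9, "x": 0.1},
]

constrained = greedy_decode(steps, trie)  # ["weather", "forecast"]
unconstrained = greedy_decode(steps)      # ["weath", "today"]
```

In this toy case the model's top choices are not valid tool tokens at all, yet the constrained output is a well-formed tool name. Comparing the two outputs is one simple way to see how a trie can hide what the model would have produced on its own.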
Method
ToolSense generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with tiered queries, MCQ probing for discriminative factual knowledge, and QA probing for inferential factual knowledge. It also defines an Internalization Score (IS@k) to quantify trie-dependency, that is, how much a model's retrieval depends on trie-constrained decoding.
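As a rough illustration of how a collapse between query styles can be measured, the sketch below computes recall@k on two sets of predictions and reports the gap in percentage points. It is not the paper's IS@k definition, which this summary does not give; the function names, tool names, and predictions are hypothetical toy data.

```python
from typing import Sequence


def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose gold tool appears among the top-k predictions."""
    if not gold:
        raise ValueError("need at least one query")
    hits = sum(1 for preds, g in zip(ranked, gold) if g in preds[:k])
    return hits / len(gold)


def gap_in_points(a: float, b: float) -> float:
    """Difference between two rates, expressed in percentage points."""
    return round((a - b) * 100, 1)


# Toy gold labels and ranked predictions (assumptions for illustration).
gold = ["weather.forecast", "stocks.quote", "maps.route", "mail.send"]
verbose_preds = [
    ["weather.forecast", "weather.alerts"],
    ["stocks.quote", "stocks.news"],
    ["maps.route", "maps.search"],
    ["mail.send", "mail.draft"],
]
realistic_preds = [
    ["weather.alerts", "weather.forecast"],
    ["stocks.news", "maps.search"],
    ["maps.route", "mail.send"],
    ["mail.draft", "weather.alerts"],
]

verbose_r1 = recall_at_k(verbose_preds, gold, k=1)      # 4 of 4 hits -> 1.0
realistic_r1 = recall_at_k(realistic_preds, gold, k=1)  # 1 of 4 hits -> 0.25
drop = gap_in_points(verbose_r1, realistic_r1)          # 75.0 percentage points
```

The same `gap_in_points` helper could be applied to recall with and without constrained decoding to get a simple trie-dependency gap, again as a stand-in rather than the paper's metric.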
In practice
- Audit LLM tool knowledge using ToolSense's diagnostic benchmarks.
- Prefer flat token formats for parametric tool encoding.
- Apply LoRA during Stage 2 fine-tuning to preserve semantic knowledge.
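For readers unfamiliar with LoRA, the sketch below shows the core idea in plain Python: the pretrained weight matrix stays frozen, and only a low-rank update is trained. The dimensions, rank, and scaling factor are hypothetical and are not taken from the paper's configuration.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float, r: int) -> Matrix:
    """Return W + (alpha / r) * B @ A, with B of shape d_out x r and A of shape r x d_in."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Hypothetical sizes chosen only to keep the example small.
d_out, d_in, r, alpha = 16, 16, 2, 4
rng = random.Random(0)

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen pretrained weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, small random init
b = [[0.0] * r for _ in range(d_out)]                               # trainable, zero init

full_params = d_out * d_in        # 256 weights if W itself were fine-tuned
lora_params = r * (d_in + d_out)  # 64 trainable weights in A and B

# With B initialised to zero, the adapted layer starts out identical to the frozen one.
starts_unchanged = lora_effective_weight(w, a, b, alpha, r) == w
```

Because the original weights are never overwritten and the update is restricted to rank `r`, training has less room to drift away from what the base model already encodes, which is the intuition behind using LoRA to limit knowledge erosion.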
Topics
- LLM Agents
- Parametric Tool Retrieval
- Diagnostic Benchmarking
- Knowledge-Retrieval Dissociation
- LoRA Fine-tuning
- Virtual Tokens
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.