ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

ToolSense is an open-source diagnostic framework for evaluating the parametric tool knowledge of Large Language Models (LLMs). It addresses a limitation of existing benchmarks, which can report high retrieval scores without revealing whether a model actually understands its tools. The framework automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries of varying ambiguity, an MCQ probing benchmark, and a QA probing benchmark. Applied to ToolBench's ~47k tools and five parametric model configurations, ToolSense uncovered a "knowledge-retrieval dissociation": models that performed strongly on the verbose, fully specified ToolBench benchmarks dropped by roughly 50-64 percentage points on RRB queries, often falling below embedding-model baselines. The same models frequently scored near random on the factual probes, indicating that effective in-distribution retrieval did not reflect genuine semantic understanding of the tools.
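
To make the size of that gap concrete, the following is a minimal sketch of how a drop in percentage points and a comparison against an embedding baseline could be computed for one model configuration. The class name, field names, and all numbers are hypothetical illustrations, not ToolSense's code or results from the paper.

```python
from dataclasses import dataclass


@dataclass
class RetrievalScores:
    """Retrieval scores in percent (0-100) for one model configuration."""
    in_distribution: float     # score on the verbose, fully specified benchmark
    realistic: float           # score on realistic (RRB-style) queries
    embedding_baseline: float  # embedding-model score on the same realistic queries


def dissociation_report(s: RetrievalScores) -> dict:
    """Summarize the gap between benchmark and realistic-query performance."""
    # Absolute difference between two percentages, i.e. percentage points.
    drop_pp = s.in_distribution - s.realistic
    return {
        "drop_pp": round(drop_pp, 1),
        "below_embedding_baseline": s.realistic < s.embedding_baseline,
    }


# Hypothetical numbers for illustration only; not results from the paper.
example = RetrievalScores(in_distribution=90.0, realistic=30.0, embedding_baseline=45.0)
report = dissociation_report(example)
print(report)
```

Reporting the drop in percentage points rather than as a relative change keeps it directly comparable with the 50-64 point range quoted above.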

Key takeaway

For Machine Learning Engineers deploying LLM agents over large tool catalogs, standard retrieval benchmarks alone are not enough. Your models may exhibit a knowledge-retrieval dissociation: high in-distribution scores alongside poor performance on realistic queries. Use a diagnostic framework such as ToolSense to expose gaps in semantic understanding, and consider flat token formats and LoRA fine-tuning to mitigate knowledge erosion during retrieval training, so that your agents retain an understanding of their tools in real-world use.

Key insights

Parametric tool retrieval in LLMs often shows a knowledge-retrieval dissociation, failing on realistic queries despite high benchmark scores.

Method

ToolSense generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with tiered queries, an MCQ probing benchmark for discriminative factual knowledge, and a QA probing benchmark for inferential factual knowledge. It also defines an Internalization Score (IS@k) that quantifies trie-dependency.
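
As a rough illustration of what an MCQ probe and a "near-random" check might look like, here is a minimal sketch. It is not ToolSense's actual generation procedure: the item format (pick the correct description for a named tool), the toy catalog, and the helper names `build_mcq` and `p_value_above_chance` are all assumptions made for this example.

```python
import math
import random


def build_mcq(tools: dict, target: str, n_options: int = 4, rng=None) -> dict:
    """Build one multiple-choice item: pick the correct description for `target`.

    Assumes tool descriptions are unique, so the answer index is unambiguous.
    """
    rng = rng or random.Random(0)
    distractors = rng.sample([name for name in tools if name != target], n_options - 1)
    options = [tools[target]] + [tools[d] for d in distractors]
    rng.shuffle(options)
    return {"tool": target, "options": options, "answer": options.index(tools[target])}


def p_value_above_chance(correct: int, total: int, n_options: int = 4) -> float:
    """One-sided exact binomial tail: P(X >= correct) under uniform random guessing."""
    p = 1.0 / n_options
    return sum(
        math.comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical toy catalog; real catalogs such as ToolBench are far larger.
catalog = {
    "weather_lookup": "Returns the current weather for a city.",
    "currency_convert": "Converts an amount between two currencies.",
    "flight_search": "Searches flights between two airports.",
    "stock_quote": "Returns the latest price for a ticker symbol.",
    "geo_ip": "Maps an IP address to a location.",
}
item = build_mcq(catalog, "weather_lookup", rng=random.Random(42))
p_near_chance = p_value_above_chance(27, 100)  # 27% accuracy on 4-option items
p_clearly_above = p_value_above_chance(60, 100)
```

A large p-value means the observed accuracy is statistically indistinguishable from guessing, which is one way to operationalize "near-random" on a factual probe.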

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.