Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

NovelAPIBench is a fully automated, dynamic benchmark for diagnosing knowledge gaps in large language models (LLMs) when they use APIs absent from their pretraining data. It discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failures to six diagnostic categories. Across approximately 1.9K tasks, four base models, and five domains, the benchmark shows that knowledge components are not interchangeable: usage examples are the strongest standalone signal, and the best two-component settings pair signatures with either mechanisms or examples. Adding source code context can hurt performance because it leads to import-path errors. Fine-tuning primarily teaches models to use the provided bundles, a skill that transfers to held-out libraries. This suggests that retrieval supplies API content while tuning improves procedural integration.

Key takeaway

Machine learning engineers developing LLM agents that interact with external APIs should prioritize explicit usage examples, the most effective standalone knowledge component. When structuring API context, pair signatures with either mechanisms or examples, depending on the domain. Fine-tuning improves the model's ability to integrate provided API bundles, and it complements retrieval, which supplies the changing API content.
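As an illustration of the takeaway above, here is a minimal Python sketch of assembling API context from selected knowledge components. The `ApiBundle` class, its field names, and `build_context` are hypothetical names chosen for this example; they are not the benchmark's schema or code. The default selection pairs signatures with usage examples, one of the two pairings the summary reports as best.

```python
from dataclasses import dataclass, field


@dataclass
class ApiBundle:
    """Hypothetical container for decomposed API knowledge (assumed schema)."""
    name: str
    signature: str = ""
    mechanism: str = ""
    examples: list[str] = field(default_factory=list)


def build_context(bundle: ApiBundle,
                  components: tuple[str, ...] = ("signature", "examples")) -> str:
    """Render only the requested components, in the order given."""
    sections = []
    for component in components:
        if component == "signature" and bundle.signature:
            sections.append("Signature:\n" + bundle.signature)
        elif component == "mechanism" and bundle.mechanism:
            sections.append("Mechanism:\n" + bundle.mechanism)
        elif component == "examples" and bundle.examples:
            sections.append("Usage examples:\n" + "\n\n".join(bundle.examples))
    return "API: " + bundle.name + "\n\n" + "\n\n".join(sections)


# Illustrative data only; `resize` is a made-up API, not one from the benchmark.
bundle = ApiBundle(
    name="resize",
    signature="resize(image, width: int, height: int) -> Image",
    mechanism="Resamples the image to the target size.",
    examples=["thumb = resize(img, 128, 128)"],
)
context = build_context(bundle)
```

Passing `components=("signature", "mechanism")` would produce the other pairing, which makes it straightforward to compare the two on a given domain.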

Key insights

Using novel APIs well requires specific knowledge components, of which usage examples are the most critical. Retrieval and fine-tuning play complementary roles.

Method

NovelAPIBench discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failed samples to six diagnostic categories for LLM evaluation.
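Comparing knowledge components implies evaluating tasks under different component subsets. The sketch below enumerates such settings in Python. The four component names are taken from the summary (signatures, mechanisms, examples, source code), but treating them as the complete set, capping subsets at two components, and including an empty no-knowledge baseline are all assumptions made for illustration, not details confirmed by the source.

```python
from itertools import combinations

# Assumed component set, named after the components the summary mentions.
COMPONENTS = ("signature", "mechanism", "example", "source")


def ablation_settings(components=COMPONENTS, max_size=2):
    """List every component subset up to max_size, smallest first."""
    settings = [()]  # assumed baseline: no API knowledge provided
    for size in range(1, max_size + 1):
        settings.extend(combinations(components, size))
    return settings


settings = ablation_settings()
```

Each tuple in `settings` could then be used to decide which parts of a knowledge bundle are placed in the prompt for a given evaluation run.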

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.