Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

2026-06-02 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

NovelAPIBench is a new, fully automated dynamic benchmark for diagnosing knowledge gaps in large language models (LLMs) when they use APIs absent from their pretraining data. It discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failures to six diagnostic categories. Across approximately 1.9K tasks, four base models, and five domains, the benchmark shows that knowledge components are not interchangeable: usage examples are the strongest standalone signal, and the best two-component settings pair signatures with either mechanisms or examples. Adding source code context can hurt performance by introducing import-path errors. Fine-tuning primarily teaches models to use the bundles they are given, a skill that transfers to held-out libraries. This suggests that retrieval supplies API content while tuning improves procedural integration.

Key takeaway

Machine learning engineers building LLM agents that call external APIs should prioritize explicit usage examples, the most effective single knowledge component. When structuring API context, pair signatures with either mechanisms or examples, depending on the domain. Fine-tuning improves the model's ability to integrate the API bundles it is given, so it complements retrieval, which supplies the API content itself.

Key insights

Using novel APIs requires specific knowledge components: usage examples matter most, while retrieval and tuning play complementary roles.

Principles

Knowledge components for API use are not interchangeable.
Usage examples are the strongest standalone signal for novel API acquisition.
Retrieval and tuning play complementary roles in LLM API integration.

Method

NovelAPIBench discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failed samples to six diagnostic categories for LLM evaluation.
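
To make the idea of a decomposed knowledge bundle concrete, here is a minimal sketch of how the single-component and two-component settings mentioned in the summary could be enumerated. The class name `KnowledgeBundle`, its four field names, and the helper `component_settings` are illustrative assumptions based on the components named above (signatures, mechanisms, examples, source code); they are not taken from the benchmark's code.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from itertools import combinations


@dataclass(frozen=True)
class KnowledgeBundle:
    """Hypothetical container for the decomposed knowledge about one API."""

    signature: str    # function or class signature
    mechanism: str    # prose description of how the API behaves
    example: str      # a short usage example
    source_code: str  # implementation source, if available


def component_settings(max_size: int = 2) -> list[tuple[str, ...]]:
    """List every subset of bundle components with 1..max_size members."""
    names = [f.name for f in fields(KnowledgeBundle)]
    settings: list[tuple[str, ...]] = []
    for size in range(1, max_size + 1):
        settings.extend(combinations(names, size))
    return settings


if __name__ == "__main__":
    for setting in component_settings():
        print(" + ".join(setting))
```

Each setting would then be evaluated by giving the model only the listed components, which is one way to observe that the components are not interchangeable.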

In practice

Prioritize usage examples when providing API context to LLMs.
Combine API signatures with mechanisms or examples based on domain.
Use fine-tuning to improve LLM procedural integration of API bundles.
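
The first two recommendations above can be sketched as a small context builder that always includes the signature, pairs it with either the usage example or the mechanism description, and leaves source code out. This is a hedged illustration only: the function `build_api_context`, the section titles, and the `fastgrid.resample` API in the demo bundle are all made up for this example.

```python
from __future__ import annotations

# Hypothetical section titles for the prompt context.
SECTION_TITLES = {
    "signature": "Signature",
    "mechanism": "How it works",
    "example": "Usage example",
}


def build_api_context(bundle: dict[str, str], pair_with: str = "example") -> str:
    """Assemble prompt context from the signature plus one partner component.

    Source code is deliberately never included, following the finding that
    it can hurt performance.
    """
    if pair_with not in ("example", "mechanism"):
        raise ValueError("pair_with must be 'example' or 'mechanism'")
    parts = []
    for key in ("signature", pair_with):
        text = bundle.get(key, "").strip()
        if text:  # skip components that are missing or empty
            parts.append(f"{SECTION_TITLES[key]}:\n{text}")
    return "\n\n".join(parts)


# Made-up bundle for a fictional library, used only for demonstration.
DEMO_BUNDLE = {
    "signature": "fastgrid.resample(grid: Grid, factor: int) -> Grid",
    "mechanism": "Averages each factor-by-factor block of cells into one cell.",
    "example": "small = fastgrid.resample(big, factor=2)",
    "source_code": "def resample(grid, factor):\n    ...",
}

if __name__ == "__main__":
    print(build_api_context(DEMO_BUNDLE))
```

Which partner component works better is something to measure per domain, so `pair_with` is left as a parameter here.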

Topics

Large Language Models
API Tool Use
NovelAPIBench
Code Generation
Retrieval-Augmented Generation
LLM Benchmarking

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.