Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition
Summary
NovelAPIBench is a fully automated, dynamic benchmark for diagnosing knowledge gaps in large language models (LLMs) that must use APIs absent from their pretraining data. It discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns each failure to one of six diagnostic categories. Across approximately 1.9K tasks, four base models, and five domains, the results show that knowledge components are not interchangeable: usage examples are the strongest standalone signal, and the best two-component settings pair signatures with either mechanisms or examples. Adding source code context can lower performance because of import-path errors. Fine-tuning primarily teaches models to use the provided bundles, a skill that transfers to held-out libraries, which suggests that retrieval supplies API content while tuning improves procedural integration.
Key takeaway
Machine learning engineers building LLM agents that call external APIs should prioritize explicit usage examples, the most effective standalone knowledge component. When structuring API context, pair signatures with either mechanisms or examples, depending on the domain. Fine-tuning improves the model's ability to integrate the API bundles it is given, while retrieval supplies the API content itself, so the two are complementary.
Key insights
Using novel APIs requires specific knowledge components, of which usage examples matter most; retrieval and fine-tuning play complementary roles.
Principles
- Knowledge components for API use are not interchangeable.
- Usage examples are the strongest standalone signal for novel API acquisition.
- Retrieval and tuning play complementary roles in LLM API integration.
Method
NovelAPIBench discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failed samples to six diagnostic categories for LLM evaluation.
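The summary does not give the bundle schema, so the following is only a rough sketch of what a decomposed knowledge bundle and its ablation settings might look like. It assumes a bundle holds the four components named in the summary (signatures, mechanisms, examples, source code) as plain text. The class `KnowledgeBundle` and the helpers `component_settings` and `render` are hypothetical names, not NovelAPIBench's own API.

```python
from dataclasses import dataclass, fields
from itertools import combinations


@dataclass(frozen=True)
class KnowledgeBundle:
    """Hypothetical bundle schema; field names are illustrative only."""
    signatures: str = ""
    mechanisms: str = ""
    examples: str = ""
    source_code: str = ""


def component_settings(max_size: int = 2) -> list[tuple[str, ...]]:
    """List every combination of up to max_size component names."""
    names = [f.name for f in fields(KnowledgeBundle)]
    settings: list[tuple[str, ...]] = []
    for size in range(1, max_size + 1):
        settings.extend(combinations(names, size))
    return settings


def render(bundle: KnowledgeBundle, setting: tuple[str, ...]) -> str:
    """Build the context string for one setting, skipping empty components."""
    parts = []
    for name in setting:
        text = getattr(bundle, name)
        if text:
            parts.append(f"## {name}\n{text}")
    return "\n\n".join(parts)
```

With four components, `component_settings()` yields four single-component and six two-component settings, including the signatures-plus-mechanisms and signatures-plus-examples pairs that the summary highlights.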
In practice
- Prioritize usage examples when providing API context to LLMs.
- Combine API signatures with mechanisms or examples based on domain.
- Use fine-tuning to improve LLM procedural integration of API bundles.
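One way to act on the first two recommendations is to assemble API context in a fixed priority order. The sketch below is a minimal illustration under stated assumptions: the function `build_api_context`, the component keys, the priority order, and the character budget are all hypothetical choices, and source code is left out by default only because the summary reports that it can hurt performance.

```python
# Assumed priority, loosely following the summary: examples first,
# then signatures, then mechanisms. Source code is opt-in.
DEFAULT_PRIORITY = ("examples", "signatures", "mechanisms")


def build_api_context(components: dict[str, str],
                      char_budget: int = 4000,
                      include_source: bool = False) -> str:
    """Assemble API context in priority order within a character budget.

    The budget counts section text only; the blank lines that separate
    sections are not counted.
    """
    order = list(DEFAULT_PRIORITY)
    if include_source:
        order.append("source_code")

    sections = []
    remaining = char_budget
    for name in order:
        text = components.get(name, "").strip()
        if not text:
            continue
        section = f"### {name}\n{text}"
        if len(section) > remaining:
            continue  # skip a component that does not fit instead of truncating it
        sections.append(section)
        remaining -= len(section)
    return "\n\n".join(sections)
```

Whether signatures are better paired with mechanisms or with examples depends on the domain, so the priority tuple is something to tune per domain.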
Topics
- Large Language Models
- API Tool Use
- NovelAPIBench
- Code Generation
- Retrieval-Augmented Generation
- LLM Benchmarking
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.