AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
Summary
AutoResearchBench is a new benchmark for evaluating AI agents on complex scientific literature discovery, addressing a critical gap in autonomous scientific research. It comprises 1,000 expert-curated queries across eight computer science domains, run against a controlled corpus of more than three million full-text arXiv papers. It features two task types: "Deep Research," which requires identifying a specific target paper through multi-step probing, and "Wide Research," which requires collecting every paper that meets a given set of conditions. Unlike general web-browsing benchmarks, AutoResearchBench is research-oriented, literature-focused, and open-ended, and it requires in-depth comprehension and fine-grained use of full-text information. Current state-of-the-art LLMs and end-to-end systems perform poorly, with top scores of only 9.39% accuracy on Deep Research and 9.31% IoU (intersection over union) on Wide Research, which highlights significant challenges in scientific reasoning and comprehensive evidence aggregation.
Key takeaway
For AI scientists and machine learning engineers developing autonomous research agents: current LLMs are severely limited in scientific literature discovery. To close the performance gap identified by AutoResearchBench, focus development on deeper scientific reasoning, comprehensive evidence aggregation from full-text documents, and robust tool use, rather than on simply increasing the search budget or the number of turns.
Key insights
AI agents struggle significantly with complex scientific literature discovery: the best systems score below 10% on both tasks (9.39% accuracy on Deep Research, 9.31% IoU on Wide Research).
Principles
- Scientific literature discovery requires deep comprehension, not shallow matching.
- Effective agents must reason about correctness and completeness.
- Full-text analysis is crucial for verifying fine-grained technical conditions.
Method
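AutoResearchBench uses a human-machine pipeline to construct 1,000 problems over a corpus of more than three million arXiv papers. "Deep Research" problems test precise identification of a single target paper, while "Wide Research" problems test exhaustive coverage of all papers that satisfy the stated conditions.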
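The two task types are reported with different metrics: accuracy for Deep Research and IoU for Wide Research. The following is a minimal sketch of how such scores could be computed, assuming exact-match on a paper identifier for Deep Research and set-based intersection over union (the Jaccard index) for Wide Research. The function names and the treatment of empty sets are illustrative assumptions, and the benchmark's official scoring may differ.

```python
from typing import Iterable, Sequence


def deep_research_accuracy(predicted: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of queries whose predicted paper ID equals the target ID.

    Hypothetical helper: assumes one predicted ID and one target ID per query.
    """
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(p == t for p, t in zip(predicted, targets))
    return hits / len(targets)


def wide_research_iou(predicted: Iterable[str], relevant: Iterable[str]) -> float:
    """Intersection over union of the returned and ground-truth paper sets.

    Hypothetical helper: duplicates in either input are ignored.
    """
    pred_set, gold_set = set(predicted), set(relevant)
    union = pred_set | gold_set
    if not union:
        # Assumption: two empty sets count as perfect agreement.
        return 1.0
    return len(pred_set & gold_set) / len(union)
```

Because IoU divides by the union, it penalizes both missed papers and spurious ones, so an agent cannot raise its Wide Research score just by returning more results.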
In practice
- Evaluate agents on full-text scientific corpora, not just abstracts.
- Prioritize scientific reasoning over increased search turns.
- Implement robust tool use and evidence aggregation mechanisms.
Topics
- AutoResearchBench
- AI Agents
- Scientific Literature Discovery
- Deep Research Tasks
- Wide Research Tasks
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.