AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

AutoResearchBench is a benchmark for evaluating AI agents on complex scientific literature discovery, a capability that autonomous scientific research depends on and that existing benchmarks leave largely untested. It comprises 1,000 expert-curated queries across eight computer science domains and runs against a controlled corpus of more than three million full-text arXiv papers. It has two task types: "Deep Research," in which the agent must identify one specific target paper through multi-step probing, and "Wide Research," in which the agent must collect every paper that meets a given set of conditions. Unlike general web browsing benchmarks, AutoResearchBench is research-oriented, literature-focused, and open-ended, and it requires in-depth comprehension and fine-grained use of full-text information. Current state-of-the-art LLMs and end-to-end systems perform poorly: the top scores are only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research, which points to significant challenges in scientific reasoning and comprehensive evidence aggregation.
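
The two reported metrics can be illustrated with a minimal sketch. It assumes Deep Research accuracy is exact match on a single paper ID per query and that Wide Research IoU is the standard set intersection-over-union between the returned and the relevant paper IDs. This summary does not describe the benchmark's exact scoring procedure, so the function names, the handling of empty sets, and the paper IDs below are hypothetical.

```python
from typing import Dict, Iterable


def deep_research_accuracy(predictions: Dict[str, str], targets: Dict[str, str]) -> float:
    """Fraction of queries whose predicted paper ID exactly matches the target.

    Assumption: one target paper per query; a missing prediction counts as wrong.
    """
    if not targets:
        raise ValueError("targets must contain at least one query")
    hits = sum(1 for query_id, target in targets.items()
               if predictions.get(query_id) == target)
    return hits / len(targets)


def wide_research_iou(predicted: Iterable[str], relevant: Iterable[str]) -> float:
    """Set intersection-over-union between returned and relevant paper IDs.

    Assumption: two empty sets are treated as a perfect match (IoU = 1.0).
    """
    predicted_set, relevant_set = set(predicted), set(relevant)
    union = predicted_set | relevant_set
    if not union:
        return 1.0
    return len(predicted_set & relevant_set) / len(union)


# Hypothetical paper IDs, for illustration only.
returned = ["2401.00001", "2401.00002", "2401.00003"]
relevant = ["2401.00002", "2401.00003", "2401.00004", "2401.00005"]
print(wide_research_iou(returned, relevant))  # 2 shared / 5 in the union = 0.4
```

Under this definition, IoU penalizes both missed papers and irrelevant papers in the returned set, so an agent cannot raise its score simply by returning more results.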

Key takeaway

If you are an AI scientist or machine learning engineer building autonomous research agents, treat current LLMs as severely limited at scientific literature discovery. To narrow the performance gap that AutoResearchBench identifies, focus on deeper scientific reasoning, comprehensive evidence aggregation from full-text documents, and robust tool use, rather than on simply increasing the search budget or the number of turns.

Key insights

AI agents struggle with complex scientific literature discovery: the best systems score below 10% on both AutoResearchBench tasks (9.39% accuracy on Deep Research and 9.31% IoU on Wide Research).

Principles

Method

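AutoResearchBench's 1,000 problems are constructed with a human-machine pipeline over a corpus of more than three million arXiv papers. They come in two types: "Deep Research," which tests precise identification of a single target paper, and "Wide Research," which tests exhaustive coverage of the papers that meet given conditions.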

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.