AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

2026-04-29 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

AutoResearchBench is a new benchmark for evaluating AI agents on complex scientific literature discovery, a critical gap in autonomous scientific research. It comprises 1,000 expert-curated queries across eight computer science domains and draws on a controlled corpus of over three million full-text arXiv papers. It features two task types: "Deep Research," which requires identifying a specific target paper through multi-step probing, and "Wide Research," which requires collecting every paper that meets the given conditions. Unlike general web-browsing benchmarks, AutoResearchBench is research-oriented, literature-focused, and open-ended, and it demands in-depth comprehension and fine-grained use of full-text information. Current state-of-the-art LLMs and end-to-end systems perform poorly, with top scores of only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research, which highlights significant challenges in scientific reasoning and comprehensive evidence aggregation.

Key takeaway

AI scientists and machine learning engineers developing autonomous research agents should recognize that current LLMs are severely limited in scientific literature discovery. To close the performance gap identified by AutoResearchBench, focus development on deep scientific reasoning, comprehensive evidence aggregation from full-text documents, and robust tool use, rather than on simply increasing the search budget or the number of turns.

Key insights

AI agents struggle significantly with complex scientific literature discovery: the best systems reach only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research.

Principles

Scientific literature discovery requires deep comprehension, not shallow matching.
Effective agents must reason about correctness and completeness.
Full-text analysis is crucial for verifying fine-grained technical conditions.

Method

AutoResearchBench uses a human-machine pipeline to construct 1,000 problems, including "Deep Research" for precise identification and "Wide Research" for exhaustive coverage, over a 3M+ arXiv corpus.
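
The two reported metrics can be illustrated with a minimal sketch. It assumes Deep Research is scored as exact-match accuracy against a single target paper ID and Wide Research as intersection over union (IoU) between the predicted and ground-truth sets of paper IDs. The function names, the empty-set convention, and the arXiv-style IDs are hypothetical; the benchmark's own scoring code may differ.

```python
def deep_research_accuracy(predictions, targets):
    """Fraction of queries whose predicted paper ID equals the target ID."""
    if not targets:
        return 0.0
    hits = sum(1 for qid, target in targets.items()
               if predictions.get(qid) == target)
    return hits / len(targets)


def wide_research_iou(predicted, relevant):
    """Intersection over union between predicted and ground-truth ID sets."""
    predicted, relevant = set(predicted), set(relevant)
    union = predicted | relevant
    if not union:
        return 1.0  # assumption: two empty sets count as a perfect match
    return len(predicted & relevant) / len(union)


# Toy data with made-up IDs (illustrative only, not benchmark data).
targets = {"q1": "2401.00001", "q2": "2401.00002"}
predictions = {"q1": "2401.00001", "q2": "2401.09999"}
accuracy = deep_research_accuracy(predictions, targets)  # 1 of 2 correct

iou = wide_research_iou(
    predicted=["2401.00001", "2401.00003", "2401.00004"],
    relevant=["2401.00001", "2401.00003", "2401.00005", "2401.00006"],
)  # 2 shared IDs out of 5 distinct IDs
```

IoU penalizes both missed papers and spurious ones, so an agent cannot raise its Wide Research score simply by returning more candidates.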

In practice

Evaluate agents on full-text scientific corpora, not just abstracts.
Prioritize scientific reasoning over increased search turns.
Implement robust tool use and evidence aggregation mechanisms.
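
The points above can be sketched as a small aggregation step: merge candidates from several search turns, then keep only papers for which every condition is verified against a chosen text field. Everything here is hypothetical, including the toy corpus, the field names, and `keyword_verifier`, which is a deliberately shallow placeholder; a real agent would presumably substitute a model-based judgment for it.

```python
def keyword_verifier(text, condition):
    """Toy stand-in for a model judgment: case-insensitive phrase match."""
    return condition.lower() in text.lower()


def aggregate_verified(search_turns, corpus, conditions,
                       field="full_text", verifier=keyword_verifier):
    """Merge candidates across turns; keep papers meeting every condition."""
    candidates = []
    for turn in search_turns:
        for paper_id in turn:
            # Skip unknown IDs and duplicates seen in earlier turns.
            if paper_id in corpus and paper_id not in candidates:
                candidates.append(paper_id)
    return [
        pid for pid in candidates
        if all(verifier(corpus[pid][field], cond) for cond in conditions)
    ]


# Made-up two-paper corpus (illustrative only).
corpus = {
    "p1": {
        "abstract": "We study retrieval agents.",
        "full_text": "We study retrieval agents. Training uses a batch "
                     "size of 256 with cosine decay.",
    },
    "p2": {
        "abstract": "A batch size study.",
        "full_text": "A batch size study. We sweep a batch size of 64 only.",
    },
}
conditions = ["batch size of 256", "cosine decay"]
turns = [["p1", "p2"], ["p2", "p1", "p3"]]  # "p3" is not in the corpus

from_full_text = aggregate_verified(turns, corpus, conditions)
from_abstract = aggregate_verified(turns, corpus, conditions, field="abstract")
```

In this toy case the matching paper is found only when the full text is checked, because its abstract never mentions the training details named in the conditions.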

Topics

AutoResearchBench
AI Agents
Scientific Literature Discovery
Deep Research Tasks
Wide Research Tasks

Code references

CherYou/AutoResearchBench

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.