ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research
Summary
ResearchClawBench (RCBench) is a new benchmark for evaluating the end-to-end autonomous scientific research capabilities of AI agents and large language models. It comprises 40 tasks across 10 scientific domains. Each task is derived from a real published paper and supplies the associated literature and raw data, while the target paper itself stays hidden during evaluation. Expert-curated multimodal rubrics assess re-discovery of the paper's findings and leave room for new discoveries. Evaluations of seven autonomous research agents, and of seventeen native LLMs run through the lightweight ResearchHarness, show that current systems are far from reliable re-discovery. The top autonomous agent, Claude Code, achieved an average score of 21.5, and the best ResearchHarness LLM, Claude-Opus-4.7, scored 20.7, against a target-paper-level score of 50. Error analysis indicates that failures stem primarily from mismatches in experimental protocol and evidence, and from a missing scientific core.
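To put the reported averages in proportion, the short sketch below expresses each one as a fraction of the target-paper-level score. The three numbers come from the summary above; the helper name `fraction_of_target` is illustrative, and the sketch assumes a simple ratio is a fair way to read the gap, which is not confirmed by the original article.

```python
# Scores as reported in the summary; 50 is the target-paper-level score.
TARGET_PAPER_LEVEL = 50.0

reported_scores = {
    "Claude Code (autonomous agent)": 21.5,
    "Claude-Opus-4.7 (ResearchHarness LLM)": 20.7,
}


def fraction_of_target(score: float, target: float = TARGET_PAPER_LEVEL) -> float:
    """Return the score as a fraction of the target-paper-level score."""
    return score / target


# Hypothetical summary view: fraction of target level reached by each system.
ratios = {name: round(fraction_of_target(score), 3)
          for name, score in reported_scores.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%} of target-paper level")
```

Read this way, both leading systems reach well under half of the level assigned to the target paper.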
Key takeaway
For AI scientists and machine learning engineers building autonomous research agents, this benchmark exposes a significant gap: current systems average below 27 out of 100 on re-discovery. Prioritize robust adherence to experimental protocols, precise evidence generation, and a deep grasp of each task's scientific core. Mismatches in these areas are the critical failure points, so reducing them matters more than report polish or iterative trial and error.
Key insights
ResearchClawBench shows that current AI agents and LLMs are far from reliably performing end-to-end scientific re-discovery.
Principles
- Autonomous research capability requires comprehensive, verifiable evaluation.
- Open-ended scientific outputs necessitate expert-curated, multimodal rubrics.
- Current AI systems struggle with experimental protocol and evidence matching.
Method
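RCBench tasks are built from real papers. Each task provides raw data and literature, and expert rubrics score the agent's output against the hidden target paper. ResearchHarness gives LLMs tool use through a ReAct-style loop, in which the model alternates between choosing an action and reading the resulting observation until it produces a final answer.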
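As a rough illustration of what a ReAct-style loop looks like, here is a minimal sketch. It is not the ResearchHarness implementation: the tool registry, the `Action: name[argument]` text format, and the `scripted_model` stand-in for an LLM call are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple


def word_count(text: str) -> str:
    """Hypothetical tool: count whitespace-separated words."""
    return str(len(text.split()))


# Hypothetical tool registry; a real harness would expose richer tools.
TOOLS: Dict[str, Callable[[str], str]] = {"word_count": word_count}


def scripted_model(history: List[str]) -> str:
    """Stand-in for an LLM: act once, then answer from the last observation."""
    observations = [h for h in history if h.startswith("Observation: ")]
    if not observations:
        return "Action: word_count[raw data and literature]"
    return "Final: " + observations[-1][len("Observation: "):]


def react_loop(task: str,
               model: Callable[[List[str]], str],
               max_steps: int = 5) -> Tuple[str, List[str]]:
    """Alternate model actions and tool observations until a final answer."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        reply = model(history)
        history.append(reply)
        if reply.startswith("Final: "):
            return reply[len("Final: "):], history
        if reply.startswith("Action: "):
            # Parse the assumed "name[argument]" action format.
            name, _, arg = reply[len("Action: "):].partition("[")
            tool = TOOLS.get(name.strip())
            result = tool(arg.rstrip("]")) if tool else f"unknown tool {name!r}"
            history.append(f"Observation: {result}")
    return "", history  # step budget exhausted without a final answer


answer, trace = react_loop("count the words in the prompt", scripted_model)
print(answer)
print("\n".join(trace))
```

The step budget (`max_steps`) is a design choice of this sketch: it keeps a misbehaving model from looping forever, at the cost of returning an empty answer when the budget runs out.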
In practice
- Evaluate AI agents against RCBench's 40 tasks to benchmark progress.
- Focus agent development on precise experimental protocol and evidence generation.
Topics
- Autonomous Scientific Research
- AI Agents
- Large Language Models
- Scientific Benchmarking
- Research Evaluation
- Multimodal Rubrics
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.