ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

2026-02-04 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Research Methodology & Innovation · Depth: Expert, extended

Summary

ResearchClawBench (RCBench) is a new benchmark that evaluates the end-to-end autonomous scientific research capabilities of AI agents and large language models. It comprises 40 tasks across 10 scientific domains. Each task is derived from a real published paper and supplies the associated literature and raw data, while the target paper itself stays hidden during evaluation. Expert-curated multimodal rubrics assess re-discovery of the paper's findings and leave room to credit new discoveries. Evaluations of seven autonomous research agents and seventeen native LLMs, the latter run through the lightweight ResearchHarness, show that current systems are far from reliable re-discovery. The top autonomous agent, Claude Code, achieved an average score of 21.5, and the best ResearchHarness LLM, Claude-Opus-4.7, scored 20.7, against a target-paper-level score of 50. Error analysis attributes most failures to mismatches in experimental protocol and evidence, and to outputs that miss the scientific core of the target paper.
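To put the headline scores in perspective, the short sketch below computes how far each reported average sits from the target-paper-level score of 50. It assumes, purely for illustration, that the scale can be read linearly, which the source does not state; the function and variable names are hypothetical and not part of the benchmark.

```python
# Average scores reported in the summary; 50 is the target-paper level.
TARGET_PAPER_LEVEL = 50.0
reported = {
    "Claude Code (autonomous agent)": 21.5,
    "Claude-Opus-4.7 (ResearchHarness)": 20.7,
}


def gap_to_target(score, target=TARGET_PAPER_LEVEL):
    """Return (points below target, fraction of target reached).

    Assumption: the score scale is treated as linear for illustration only.
    """
    return round(target - score, 1), round(score / target, 3)


for name, score in reported.items():
    gap, fraction = gap_to_target(score)
    print(f"{name}: {gap} points below target, {fraction:.1%} of target level")
```

Under that linear reading, both systems reach well under half of the target-paper level.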

Key takeaway

For AI Scientists and Machine Learning Engineers developing autonomous research agents, this benchmark highlights a significant gap: current systems average below 27 out of 100 for re-discovery, well short of the target-paper-level score of 50. Prioritize robust adherence to experimental protocol, precise evidence generation, and a firm grasp of the scientific core of the problem. These are the main failure points, so reducing mismatches there matters more than report polish or iterative trial and error alone.

Key insights

ResearchClawBench shows that current AI agents and LLMs are far from reliably performing end-to-end scientific re-discovery.

Principles

Autonomous research capability requires comprehensive, verifiable evaluation.
Open-ended scientific outputs necessitate expert-curated, multimodal rubrics.
Current AI systems struggle to match the target paper's experimental protocol and evidence.

Method

RCBench tasks are built from real papers. Each task provides raw data and literature, and expert rubrics evaluate the outputs against the hidden target paper. ResearchHarness gives LLMs tool use through a ReAct-style loop.
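A ReAct-style loop alternates between a model step (a thought followed by an action) and a tool observation until the model emits a final answer. The sketch below is a minimal, generic illustration of that pattern, not the ResearchHarness implementation: the `word_count` tool, the `Action: name[argument]` text format, and the `scripted_model` stand-in for an LLM call are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple


# Hypothetical tool; the actual ResearchHarness tool set is not described here.
def word_count(text: str) -> str:
    return str(len(text.split()))


TOOLS: Dict[str, Callable[[str], str]] = {"word_count": word_count}


def scripted_model(transcript: List[str]) -> str:
    """Stand-in for an LLM call: picks the next step from the transcript so far."""
    if not any(line.startswith("Observation:") for line in transcript):
        return (
            "Thought: I need the length of the input text.\n"
            "Action: word_count[raw data and literature]"
        )
    count = transcript[-1].split(": ", 1)[1]
    return f"Thought: I have what I need.\nFinal: the text has {count} words"


def react_loop(
    model: Callable[[List[str]], str], question: str, max_steps: int = 5
) -> Tuple[Optional[str], List[str]]:
    """Run thought/action/observation rounds until a final answer or the step limit."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model(transcript)
        transcript.append(step)
        last_line = step.strip().splitlines()[-1]
        if last_line.startswith("Final:"):
            return last_line[len("Final:"):].strip(), transcript
        if last_line.startswith("Action:"):
            # Parse "Action: tool_name[argument]" and run the named tool.
            call = last_line[len("Action:"):].strip()
            name, _, arg = call.partition("[")
            tool = TOOLS.get(name.strip())
            observation = tool(arg.rstrip("]")) if tool else f"unknown tool {name!r}"
            transcript.append(f"Observation: {observation}")
    return None, transcript  # step budget exhausted without a final answer


answer, transcript = react_loop(scripted_model, "How long is the input?")
print(answer)
```

In a real harness the scripted function would be replaced by a model call and the tool registry would hold research tools such as file access or code execution; the step limit keeps a model that never finishes from looping forever.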

In practice

Evaluate AI agents against RCBench's 40 tasks to benchmark progress.
Focus agent development on precise experimental protocol and evidence generation.

Topics

Autonomous Scientific Research
AI Agents
Large Language Models
Scientific Benchmarking
Research Evaluation
Multimodal Rubrics

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.