ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Research Methodology & Innovation · Depth: Expert, extended

Summary

ResearchClawBench (RCBench) is a new benchmark for evaluating the end-to-end autonomous scientific research capabilities of AI agents and large language models. It comprises 40 tasks across 10 scientific domains. Each task is derived from a real published paper and supplies the associated literature and raw data, while the target paper itself stays hidden during evaluation. Expert-curated multimodal rubrics score re-discovery of the target paper's findings and leave room for new discoveries. Evaluations of seven autonomous research agents and seventeen native LLMs, the latter run through the lightweight ResearchHarness, show that current systems are far from reliable re-discovery. The top autonomous agent, Claude Code, achieved an average score of 21.5, and the best ResearchHarness LLM, Claude-Opus-4.7, scored 20.7, against a target-paper-level score of 50. Error analysis indicates that failures stem primarily from mismatches in experimental protocol and evidence, and from a missing scientific core.

Key takeaway

For AI scientists and machine learning engineers developing autonomous research agents, this benchmark highlights a significant gap: current systems average below 27 out of 100 for re-discovery. Prioritize robust adherence to experimental protocol, precise evidence generation, and a deep grasp of each task's scientific core. These are the critical failure points, so reducing mismatches there matters more than report polish or iterative trial and error.

Key insights

ResearchClawBench shows that current AI agents and LLMs are far from reliably performing end-to-end scientific re-discovery.

Principles

Method

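RCBench tasks are built from real papers. Each task provides raw data and literature, and expert rubrics evaluate the agent's outputs against the hidden target paper. ResearchHarness gives LLMs tool use through a ReAct-style loop, in which the model alternates between reasoning, calling a tool, and reading the tool's result.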
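
To make the ReAct-style loop concrete, here is a minimal sketch in Python. It is not ResearchHarness code: the text format (`Action: tool[input]`, `Final Answer:`), the `word_count` tool, the `scripted_model` stand-in for an LLM, and the `max_steps` cap are all assumptions chosen for illustration. The loop only shows the general pattern of feeding tool observations back into the model's context until it commits to an answer.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical tool; the actual ResearchHarness tool set is not described here.
def word_count(text: str) -> str:
    return str(len(text.split()))

TOOLS: Dict[str, Callable[[str], str]] = {"word_count": word_count}

# Assumed text conventions for actions and final answers.
ACTION_RE = re.compile(r"^Action:\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)
FINAL_RE = re.compile(r"^Final Answer:\s*(.*)$", re.MULTILINE | re.DOTALL)


def react_loop(
    model: Callable[[str], str],
    question: str,
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Optional[str]:
    """Alternate model replies and tool observations until a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = model(transcript)
        transcript += reply + "\n"

        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()

        action = ACTION_RE.search(reply)
        if action is None:
            # Tell the model its reply could not be parsed and let it retry.
            transcript += "Observation: no valid action found\n"
            continue

        name, arg = action.groups()
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name!r}"
        transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without a final answer


def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM: calls one tool, then reports what it observed."""
    if "Observation:" not in transcript:
        return (
            "Thought: I should count the words.\n"
            "Action: word_count[raw data and literature]"
        )
    last_observation = transcript.rsplit("Observation: ", 1)[1].strip()
    return f"Thought: I have the count.\nFinal Answer: {last_observation}"


if __name__ == "__main__":
    print(react_loop(scripted_model, "How many words are in the phrase?", TOOLS))
```

Replacing `scripted_model` with a call to a real LLM and registering tools for file access or code execution would yield a basic research harness. The step cap is one simple way to bound cost when the model never reaches a final answer.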

In practice

Topics

Best for: AI Scientist, Machine Learning Engineer, Research Scientist


Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.