Automated reproducibility assessments in the social and behavioral sciences using large language models

Source: cs.AI updates on arXiv.org · Field: Science & Research (Social Sciences & Behavioral Studies, Research Methodology & Innovation, Artificial Intelligence & Machine Learning) · Depth: Expert, extended

Summary

A study reports that large language models (LLMs) can automate reproducibility assessments in the social and behavioral sciences. The researchers built an agentic LLM workflow on Claude Opus 4.7 and used it to reanalyze 76 published studies from the Multi100 project, comparing the LLM-generated statistical analyses against both the original findings and human reanalyses. For the 69 studies in which the pipeline produced a viable effect size, it recovered the original effect size within a tolerance of ±0.05 Cohen's d in 41% of cases and matched the original study's qualitative conclusion in 96% of cases. Human reanalysts, by comparison, recovered the effect size in 34% of cases and matched the qualitative conclusion in 74%. On these two metrics the LLM pipeline performed comparably to or better than human reanalysts, which suggests LLMs can serve as a scalable tool for systematically auditing empirical results.
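The two headline metrics can be made concrete with a short sketch. It assumes each study is reduced to an original and a reproduced Cohen's d plus a yes/no qualitative conclusion. The `Reanalysis` record, its field names, and the boolean encoding of the conclusion are illustrative assumptions, not the study's actual data format. Only the ±0.05 tolerance is taken from the summary.

```python
from dataclasses import dataclass

# Tolerance stated in the summary: +/-0.05 on the Cohen's d scale.
TOLERANCE_D = 0.05


@dataclass
class Reanalysis:
    """One reanalyzed claim (hypothetical record layout)."""
    original_d: float          # effect size reported in the original study
    reproduced_d: float        # effect size obtained by the reanalysis
    original_supported: bool   # assumed encoding: did the original support the claim?
    reproduced_supported: bool # assumed encoding: does the reanalysis support it?


def effect_recovered(r: Reanalysis, tol: float = TOLERANCE_D) -> bool:
    """True when the reproduced effect size lies within tol of the original."""
    return abs(r.reproduced_d - r.original_d) <= tol


def conclusion_matched(r: Reanalysis) -> bool:
    """True when both analyses reach the same qualitative conclusion."""
    return r.original_supported == r.reproduced_supported


def summarize(results: list[Reanalysis]) -> dict[str, float]:
    """Share of claims meeting each criterion, among claims with a viable effect size."""
    if not results:
        raise ValueError("no reanalyses with a viable effect size")
    n = len(results)
    return {
        "effect_recovery_rate": sum(effect_recovered(r) for r in results) / n,
        "conclusion_match_rate": sum(conclusion_matched(r) for r in results) / n,
    }
```

Under this reading, a reanalysis can match the qualitative conclusion while missing the effect-size tolerance, which is one way the two reported percentages can differ so widely.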

Key takeaway

Research scientists and journal editors who evaluate research claims or implement quality control should consider integrating automated LLM-based reanalysis into their workflows. LLMs such as Claude Opus 4.7 offer a viable, scalable first-pass screen for computational reproducibility, reducing manual burden and making systematic audits of the empirical literature more practical. This approach complements rather than replaces expert judgment: it flags claims that warrant closer human scrutiny.

Key insights

Large language models can automate scientific reproducibility checks. In this sample, they matched or exceeded human reanalysts on both effect-size recovery and agreement with the original qualitative conclusion.

Method

An agentic LLM workflow receives study data, a statistical claim, and article context. It independently writes and executes statistical code in a sandbox to reproduce claims, aggregating results over five runs.
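The aggregation step might look like the sketch below. The summary does not say how the five runs are combined, so taking the median of the viable estimates is an assumption, as are the function name `aggregate_runs` and the `run_once` callable, which stands in for one full agent run (write code, execute it in the sandbox, return an effect size or nothing).

```python
from statistics import median
from typing import Callable, Optional

N_RUNS = 5  # number of runs stated in the summary


def aggregate_runs(
    run_once: Callable[[int], Optional[float]],
    n_runs: int = N_RUNS,
) -> Optional[float]:
    """Run the (hypothetical) agent n_runs times and combine the estimates.

    run_once(run_index) should return an effect size, or None when the run
    does not produce a viable estimate. Runs that raise are treated as
    non-viable. Returns the median of the viable estimates (an assumed
    aggregation rule), or None if no run produced one.
    """
    estimates: list[float] = []
    for run_index in range(n_runs):
        try:
            estimate = run_once(run_index)
        except Exception:
            estimate = None  # a crashed run contributes nothing
        if estimate is not None:
            estimates.append(estimate)
    return median(estimates) if estimates else None
```

Returning `None` when every run fails mirrors the summary's distinction between the 76 studies attempted and the 69 for which a viable effect size was produced. A median is used here because it is less sensitive than a mean to a single run that takes an unusual analytic path.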

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.