Automated reproducibility assessments in the social and behavioral sciences using large language models
Summary
A study demonstrates that large language models (LLMs) can automate reproducibility assessments in the social and behavioral sciences. Researchers developed an agentic LLM workflow using Claude Opus 4.7 to reanalyze N=76 published studies from the Multi100 project, comparing LLM-generated statistical analyses against original findings and human reanalyses. For 69 studies where a viable effect size was produced, the LLM pipeline recovered original effect sizes within a ±0.05 Cohen’s d tolerance in 41% of cases. It matched the original study's qualitative conclusion in 96% of cases. For comparison, human reanalysts achieved 34% effect size recovery and 74% qualitative conclusion matches. These results indicate LLMs offer a scalable tool for systematic auditing of empirical results, performing comparably to or better than human reanalysts on these metrics.
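The two headline metrics can be illustrated with a minimal sketch. This is not the authors' code: the `Reanalysis` record, its field names, and the reduction of a "qualitative conclusion" to a supported/not-supported flag are assumptions made for illustration. Only the ±0.05 Cohen's d tolerance comes from the summary above.

```python
from dataclasses import dataclass

TOLERANCE_D = 0.05  # tolerance in Cohen's d units, as stated in the summary


@dataclass
class Reanalysis:
    """Hypothetical record pairing one original result with one reanalysis."""
    original_d: float           # effect size reported by the original study
    reanalysis_d: float         # effect size produced by the reanalysis
    original_supported: bool    # assumption: conclusion reduced to a yes/no flag
    reanalysis_supported: bool


def effect_recovered(r: Reanalysis, tol: float = TOLERANCE_D) -> bool:
    """True when the reanalysis lands within +/- tol of the original effect size."""
    return abs(r.reanalysis_d - r.original_d) <= tol


def summarize(results: list) -> dict:
    """Share of studies with a recovered effect size and a matching conclusion."""
    if not results:
        raise ValueError("need at least one reanalysis result")
    n = len(results)
    recovered = sum(effect_recovered(r) for r in results)
    matched = sum(r.original_supported == r.reanalysis_supported for r in results)
    return {
        "effect_recovery_rate": recovered / n,
        "conclusion_match_rate": matched / n,
    }
```

Calling `summarize` on a list of such records returns both rates as fractions. The two metrics are computed independently, so a reanalysis can match the original conclusion while missing the effect size tolerance, which is consistent with the gap between the 96% and 41% figures reported above.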
Key takeaway
Research scientists and journal editors who evaluate research claims or implement quality control should consider integrating automated LLM-based reanalysis into their workflows. LLMs like Claude Opus 4.7 offer a viable, scalable first-pass screen for computational reproducibility, reducing manual burden and supporting systematic auditing of the empirical literature. This approach complements, rather than replaces, expert judgment by flagging claims that warrant closer human scrutiny.
Key insights
Large language models can automate scientific reproducibility checks, matching or exceeding human reanalysts on effect size recovery and qualitative conclusion agreement in this study.
Principles
- Reproducibility assessments are resource-intensive.
- LLMs can interpret study materials for analysis.
- Automated reanalysis can scale quality control.
Method
An agentic LLM workflow receives study data, a statistical claim, and article context. It independently writes and executes statistical code in a sandbox to reproduce claims, aggregating results over five runs.
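The summary does not say how the five runs are combined, so the following is only a sketch under stated assumptions: each run either yields a Cohen's d or fails to produce a viable effect size (represented here as `None`), and the viable values are combined with a median. The function name `aggregate_runs` and the choice of median are hypothetical.

```python
from statistics import median
from typing import Optional, Sequence


def aggregate_runs(run_effects: Sequence[Optional[float]]) -> Optional[float]:
    """Combine per-run effect sizes into one estimate.

    Assumption: a run that produced no viable effect size is recorded as None,
    and the median of the remaining runs is used as the aggregate.
    """
    viable = [d for d in run_effects if d is not None]
    if not viable:
        return None  # no run produced a usable effect size
    return median(viable)
```

A median is used here because it is less sensitive than a mean to a single run that goes badly wrong. Returning `None` when every run fails mirrors the summary's distinction between the 76 studies analyzed and the 69 for which a viable effect size was produced.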
In practice
- Use LLMs for initial reproducibility screening.
- Integrate LLM checks into journal workflows.
- Audit empirical literature systematically.
Topics
- Large Language Models
- Reproducibility Assessment
- Social Sciences Research
- Behavioral Sciences
- Automated Data Analysis
- Scientific Quality Control
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.