Automated reproducibility assessments in the social and behavioral sciences using large language models

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Science & Research — Social Sciences & Behavioral Studies, Research Methodology & Innovation, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A study demonstrates that large language models (LLMs) can automate reproducibility assessments in the social and behavioral sciences. Researchers developed an agentic LLM workflow using Claude Opus 4.7 to reanalyze N=76 published studies from the Multi100 project, comparing LLM-generated statistical analyses against the original findings and against human reanalyses. For the 69 studies in which the pipeline produced a viable effect size, it recovered the original effect size within a ±0.05 Cohen’s d tolerance in 41% of cases and matched the original study's qualitative conclusion in 96% of cases. For comparison, human reanalysts achieved 34% effect size recovery and 74% qualitative conclusion matches. These results indicate that LLMs offer a scalable tool for systematic auditing of empirical results, performing comparably to or better than human reanalysts on these two metrics.
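
The effect size recovery metric described above can be illustrated with a minimal Python sketch. It assumes that "recovered" means the absolute difference between the reproduced and original Cohen's d is at most 0.05; the `Reanalysis` type, the function names, and the example values are hypothetical and are not taken from the study or its data.

```python
from dataclasses import dataclass

TOLERANCE_D = 0.05  # tolerance in Cohen's d units, as reported in the summary


@dataclass
class Reanalysis:
    study_id: str        # hypothetical identifier, for illustration only
    original_d: float    # effect size reported in the original article
    reproduced_d: float  # effect size produced by the reanalysis


def is_recovered(r: Reanalysis, tol: float = TOLERANCE_D) -> bool:
    """Assumed criterion: reproduced d lies within +/- tol of the original d."""
    return abs(r.reproduced_d - r.original_d) <= tol


def recovery_rate(results: list[Reanalysis], tol: float = TOLERANCE_D) -> float:
    """Share of reanalyses whose effect size falls inside the tolerance band."""
    if not results:
        raise ValueError("no reanalyses with a viable effect size")
    return sum(is_recovered(r, tol) for r in results) / len(results)


# Made-up values, NOT results from the study.
examples = [
    Reanalysis("study_a", 0.40, 0.42),    # difference 0.02: recovered
    Reanalysis("study_b", 0.10, 0.31),    # difference 0.21: not recovered
    Reanalysis("study_c", -0.25, -0.24),  # difference 0.01: recovered
    Reanalysis("study_d", 0.55, 0.20),    # difference 0.35: not recovered
]

if __name__ == "__main__":
    print(f"recovery rate: {recovery_rate(examples):.0%}")
```

Studies for which no viable effect size was produced would be excluded before calling `recovery_rate`, which mirrors the summary's use of 69 rather than 76 studies as the denominator.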

Key takeaway

Research scientists and journal editors who evaluate research claims or implement quality control should consider integrating automated LLM-based reanalysis into their workflows. LLMs like Claude Opus 4.7 offer a viable, scalable first-pass screen for computational reproducibility, reducing manual burden and supporting systematic auditing of the empirical literature. This approach complements, rather than replaces, expert judgment by flagging claims that warrant closer human scrutiny.

Key insights

Large language models can automate scientific reproducibility checks, matching or exceeding human reanalysis performance.

Principles

Reproducibility assessments are resource-intensive.
LLMs can interpret study materials for analysis.
Automated reanalysis can scale quality control.

Method

An agentic LLM workflow receives study data, a statistical claim, and article context. It independently writes and executes statistical code in a sandbox to reproduce claims, aggregating results over five runs.
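
The summary does not say how results from the five runs are combined. As one possible reading, the sketch below takes the median of the viable effect sizes and a majority vote over the qualitative conclusions. Both aggregation rules, the function name, and the conclusion labels are assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from statistics import median
from typing import Optional


def aggregate_runs(
    effect_sizes: list[Optional[float]],
    conclusions: list[str],
) -> tuple[Optional[float], str]:
    """Combine repeated runs of one reanalysis (assumed rules, see lead-in).

    effect_sizes: one Cohen's d per run, or None when a run produced
                  no viable effect size.
    conclusions:  one qualitative label per run (labels are hypothetical).
    """
    if not conclusions:
        raise ValueError("at least one run is required")

    # Assumption: ignore runs without a viable effect size, then take the median.
    viable = [d for d in effect_sizes if d is not None]
    aggregated_d = median(viable) if viable else None

    # Assumption: the most frequent qualitative conclusion wins.
    majority_conclusion = Counter(conclusions).most_common(1)[0][0]
    return aggregated_d, majority_conclusion


# Made-up outputs from five runs, NOT data from the study.
run_effect_sizes = [0.41, 0.39, 0.40, None, None]
run_conclusions = ["supports", "supports", "supports",
                   "inconclusive", "inconclusive"]

if __name__ == "__main__":
    d, conclusion = aggregate_runs(run_effect_sizes, run_conclusions)
    print(d, conclusion)
```

A median is used here because it is less sensitive than a mean to a single run that picks a different model specification; other rules, such as requiring agreement across all runs, would be equally defensible.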

In practice

Use LLMs for initial reproducibility screening.
Integrate LLM checks into journal workflows.
Audit empirical literature systematically.

Topics

Large Language Models
Reproducibility Assessment
Social Sciences Research
Behavioral Sciences
Automated Data Analysis
Scientific Quality Control

Code references

UKGovernmentBEIS/inspect_ai

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.