Automated reproducibility assessments in the social and behavioral sciences using large language models

2026-06-11 · Source: Artificial Intelligence · Field: Science & Research — Social Sciences & Behavioral Studies, Research Methodology & Innovation, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new study reports that large language models (LLMs) can automate reproducibility assessments in the social and behavioral sciences, a task that has traditionally required resource-intensive human reanalysis. The researchers ran an LLM pipeline on N=76 published studies with predefined claims and compared its analyses with both the original findings and human reanalyses. For 7 studies the LLM produced no viable effect size. Across the remaining 69 studies, the pipeline recovered the original effect size in 41% of cases, using a tolerance of ±0.05 Cohen's d. The LLM also matched the original study's qualitative conclusion in 96% of instances. Human reanalysts, by comparison, recovered the original effect size in 34% of studies and reached the same qualitative conclusion in 74% of cases. These results position LLMs as a scalable way to audit empirical results.

Key takeaway

If you evaluate the reproducibility of published findings, LLM-based tools offer a scalable first pass. In this study they agreed with the original qualitative conclusion more often (96%) than human reanalysts did (74%), although exact effect-size recovery remained modest (41% within ±0.05 Cohen's d). Using them for initial assessments can reduce the resources each audit requires, so your team can systematically audit a larger volume of empirical results and reserve human expertise for the more nuanced or challenging cases.

Key insights

Large language models can automate reproducibility assessments in the social and behavioral sciences, and in this study they agreed with the original qualitative conclusions more often than human reanalysts did.

Principles

LLMs offer scalable reproducibility assessment.
Qualitative agreement is high with LLM reanalysis.
LLM performance can exceed human reanalysts.

Method

An LLM pipeline reanalyzes N=76 published studies, and its results are compared with the original findings and with human reanalyses on two measures: effect size recovery (within ±0.05 Cohen's d) and agreement with the original study's qualitative conclusion.
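
The two measures can be illustrated with a minimal Python sketch. This is not the study's code: the StudyResult record, its field names, and the choice to compute both rates only over studies that yielded a viable effect size are assumptions made for illustration. Only the 0.05 tolerance on Cohen's d comes from the summary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

TOLERANCE_D = 0.05  # tolerance on Cohen's d reported in the summary


@dataclass
class StudyResult:
    """Hypothetical record for one study; field names are illustrative."""
    study_id: str
    original_d: float                      # effect size in the original paper
    reanalysis_d: Optional[float]          # None if no viable effect size
    original_supported: bool               # original qualitative conclusion
    reanalysis_supported: Optional[bool]   # reanalysis qualitative conclusion


def effect_size_recovered(r: StudyResult, tol: float = TOLERANCE_D) -> bool:
    """True when the reanalysis effect size lies within tol of the original."""
    if r.reanalysis_d is None:
        return False
    return abs(r.reanalysis_d - r.original_d) <= tol


def summarize(results: List[StudyResult]) -> Dict[str, Optional[float]]:
    """Compute recovery and agreement rates over studies with a usable effect size.

    Assumption: studies without a viable effect size are excluded from
    both denominators.
    """
    usable = [r for r in results if r.reanalysis_d is not None]
    n = len(usable)
    if n == 0:
        return {"n_total": len(results), "n_usable": 0,
                "recovery_rate": None, "agreement_rate": None}
    recovered = sum(effect_size_recovered(r) for r in usable)
    agreed = sum(r.reanalysis_supported == r.original_supported for r in usable)
    return {"n_total": len(results), "n_usable": n,
            "recovery_rate": recovered / n, "agreement_rate": agreed / n}
```

Calling summarize on a list of such records returns the two rates alongside the count of usable studies, which makes the denominator explicit when comparing LLM and human reanalyses.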

In practice

Implement LLMs for initial reproducibility checks.
Use LLMs to audit large sets of empirical results.
Focus human effort on complex, LLM-flagged cases.
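
One way to route cases to human reviewers is a simple triage rule. The sketch below is a hypothetical policy, not something described in the study: the function names, the reason labels, and the decision to flag any study where the LLM gave no effect size, missed the tolerance, or disagreed with the original conclusion are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

# (original_d, llm_d, original_supported, llm_supported); layout is illustrative
StudyRow = Tuple[float, Optional[float], bool, Optional[bool]]


def triage(original_d: float, llm_d: Optional[float],
           original_supported: bool, llm_supported: Optional[bool],
           tol: float = 0.05) -> List[str]:
    """Return the reasons a study should go to human review (empty if none)."""
    if llm_d is None:
        # The LLM produced no viable effect size, so nothing else can be checked.
        return ["no_effect_size"]
    reasons = []
    if abs(llm_d - original_d) > tol:
        reasons.append("effect_size_mismatch")
    if llm_supported != original_supported:
        reasons.append("conclusion_mismatch")
    return reasons


def review_queue(studies: Dict[str, StudyRow]) -> List[Tuple[str, List[str]]]:
    """List flagged studies, those with the most reasons first."""
    flagged = []
    for study_id, row in studies.items():
        reasons = triage(*row)
        if reasons:
            flagged.append((study_id, reasons))
    return sorted(flagged, key=lambda item: -len(item[1]))
```

Studies that return an empty list pass the automated check. Everything else lands in the queue with its reasons attached, so reviewers can start with the cases that diverge on both measures.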

Topics

Reproducibility Assessment
Large Language Models
Social Sciences Research
Behavioral Sciences
Automated Auditing
Effect Size Estimation

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.