Automated reproducibility assessments in the social and behavioral sciences using large language models
Summary
A new study demonstrates that large language models (LLMs) can automate reproducibility assessments in the social and behavioral sciences, a task traditionally carried out through resource-intensive human reanalysis. The researchers evaluated an LLM pipeline on N=76 published studies with predefined claims, comparing the LLM-generated analyses to both the original findings and human reanalyses. The LLM produced no viable effect size for 7 studies. For the remaining 69, the pipeline recovered the original effect size in 41% of cases, within a tolerance of +/-0.05 Cohen's d. The LLM also matched the original study's qualitative conclusion in 96% of instances. Human reanalysts, by comparison, recovered original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases, which positions LLMs as a scalable option for auditing empirical results.
Key takeaway
For research scientists evaluating the reproducibility of published findings, LLMs offer a scalable and efficient addition to the workflow. Consider deploying LLM-based tools for initial assessments: in this study they agreed with the original qualitative conclusion more often (96%) than human reanalysts did (74%). This can reduce the resources each assessment requires, letting your team systematically audit a larger volume of empirical results and reserve human expertise for more nuanced or challenging cases.
Key insights
Large language models can automate reproducibility assessments in social sciences, outperforming human reanalysts in qualitative agreement.
Principles
- LLMs offer scalable reproducibility assessment.
- Qualitative agreement is high with LLM reanalysis.
- LLM performance can exceed human reanalysts.
Method
The study runs an LLM pipeline on N=76 published studies and compares its analyses with the original findings and with human reanalyses, measuring effect size recovery and agreement on the qualitative conclusion.
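The two metrics can be illustrated with a minimal sketch. It is not the study's code: the `Study` record, its field names, and the `summarize` helper are hypothetical. It also assumes that both rates are computed only over studies where the reanalysis produced an effect size; the summary states this for effect size recovery but does not say so for qualitative agreement.

```python
from dataclasses import dataclass
from typing import Optional

TOLERANCE = 0.05  # the +/-0.05 Cohen's d tolerance quoted in the summary


@dataclass
class Study:
    # All field names are hypothetical, chosen for illustration only.
    original_d: float                          # effect size reported in the original paper
    reanalysis_d: Optional[float]              # effect size from the reanalysis, None if none was produced
    original_supports_claim: bool              # qualitative conclusion of the original paper
    reanalysis_supports_claim: Optional[bool]  # qualitative conclusion of the reanalysis


def effect_size_recovered(study: Study, tol: float = TOLERANCE) -> bool:
    """True when the reanalysis effect size lies within tol of the original."""
    if study.reanalysis_d is None:
        return False
    return abs(study.reanalysis_d - study.original_d) <= tol


def summarize(studies: list[Study]) -> dict[str, float]:
    """Recovery and agreement rates over studies with a viable effect size."""
    viable = [s for s in studies if s.reanalysis_d is not None]
    if not viable:
        raise ValueError("no study produced a viable effect size")
    recovered = sum(effect_size_recovered(s) for s in viable)
    agreed = sum(
        s.reanalysis_supports_claim == s.original_supports_claim for s in viable
    )
    return {
        "n_viable": float(len(viable)),
        "recovery_rate": recovered / len(viable),
        "agreement_rate": agreed / len(viable),
    }
```

Calling `summarize` on a list of such records returns the number of viable studies together with the two rates, so the same function can score an LLM pipeline and a set of human reanalyses on equal terms.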
In practice
- Implement LLMs for initial reproducibility checks.
- Use LLMs to audit large sets of empirical results.
- Focus human effort on complex, LLM-flagged cases.
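One way to route cases to human reviewers is a simple triage rule, sketched below. The `triage` function, its labels, and its ordering of checks are assumptions made for illustration; the source does not describe how cases should be flagged. The default tolerance reuses the +/-0.05 Cohen's d figure from the summary.

```python
from typing import Optional


def triage(
    original_d: float,
    llm_d: Optional[float],
    same_conclusion: Optional[bool],
    tol: float = 0.05,
) -> str:
    """Return a hypothetical routing label for one LLM reproducibility check."""
    # The LLM produced no usable estimate: a human has to redo the analysis.
    if llm_d is None or same_conclusion is None:
        return "no_estimate"
    # The LLM's qualitative conclusion differs from the original paper's.
    if not same_conclusion:
        return "conclusion_mismatch"
    # Same conclusion, but the effect size falls outside the tolerance.
    if abs(llm_d - original_d) > tol:
        return "effect_size_mismatch"
    return "pass"


# Example: keep only the cases that need a human look.
checks = [
    ("study_a", 0.30, 0.31, True),
    ("study_b", 0.30, None, None),
    ("study_c", 0.30, 0.50, True),
]
flagged = [name for name, d0, d1, same in checks if triage(d0, d1, same) != "pass"]
```

A team could treat the three non-passing labels differently, for example by prioritizing conclusion mismatches over effect size mismatches, but that ordering is a workflow decision rather than something the study prescribes.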
Topics
- Reproducibility Assessment
- Large Language Models
- Social Sciences Research
- Behavioral Sciences
- Automated Auditing
- Effect Size Estimation
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.