ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

2026-06-16 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

ReproRepo is a scalable framework for evaluating research reproducibility. It addresses a limitation of existing benchmarks, which require extensive manual effort, by using human-raised GitHub issues as natural supervision for identifying realistic reproduction blockers. Instantiated on 1,149 recent machine learning papers from major conferences, ReproRepo was used to evaluate four frontier model-agent configurations. The study found that LLM agents, specifically Codex with GPT-5.5, can identify real-world reproducibility problems from paper-repository pairs, surfacing at least one semantically related human-reported blocker for approximately 90% of the papers. Agents are most effective at identifying visible failures and the general semantic region of a problem, and less precise at exact localization.

Key takeaway

Research scientists and AI engineers who evaluate the reproducibility of machine learning research should consider integrating LLM agents into their auditing workflow. ReproRepo shows that agents such as Codex with GPT-5.5 can identify visible failures and semantic problem regions from paper-repository pairs, surfacing at least one semantically related human-reported blocker for approximately 90% of papers. Using agents for the initial assessment can reduce manual effort and leave human expertise for exact localization and deeper problem-solving, where agents are less precise.

Key insights

ReproRepo scales reproducibility audits using human-raised GitHub issues as supervision for LLM agents.

Principles

Human-raised GitHub issues provide effective supervision for reproducibility blockers.
LLM agents can identify reproducibility problems without executing code.

Method

ReproRepo runs LLM agents on paper-repository pairs and evaluates the blockers they report against human-raised GitHub issues, which serve as natural supervision for real-world reproducibility problems.
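The evaluation idea in the Method can be illustrated with a minimal sketch: for each paper, check whether any agent-reported finding is related to any human-raised issue, then report the fraction of papers with at least one match. Everything here is an assumption made for illustration and is not taken from the ReproRepo codebase: the `Paper` record, the function names, the sample data, and especially the token-overlap (Jaccard) test, which is a crude stand-in for whatever semantic matching the study actually uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical record for one paper-repository pair."""
    paper_id: str
    human_issues: list[str]    # text of human-raised GitHub issues
    agent_findings: list[str]  # blockers reported by the LLM agent


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def related(finding: str, issue: str, threshold: float = 0.3) -> bool:
    """Assumed stand-in for semantic matching: Jaccard token overlap."""
    a, b = tokens(finding), tokens(issue)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold


def paper_hit(paper: Paper, threshold: float = 0.3) -> bool:
    """True if any agent finding is related to any human-raised issue."""
    return any(
        related(finding, issue, threshold)
        for finding in paper.agent_findings
        for issue in paper.human_issues
    )


def hit_rate(papers: list[Paper], threshold: float = 0.3) -> float:
    """Fraction of papers with at least one related finding."""
    if not papers:
        return 0.0
    return sum(paper_hit(p, threshold) for p in papers) / len(papers)


# Made-up example data, not drawn from the study.
papers = [
    Paper(
        "paper-a",
        ["Missing pretrained checkpoint download link"],
        ["Pretrained checkpoint download link is missing from README"],
    ),
    Paper(
        "paper-b",
        ["Training crashes with CUDA out of memory"],
        ["requirements.txt pins no versions"],
    ),
]

print(hit_rate(papers))  # 0.5
```

A paper-level "at least one match" metric of this kind rewards finding the right problem area rather than pinpointing it, which fits the summary's observation that agents are stronger on semantic regions than on exact localization.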

In practice

Evaluate LLM agents for real-world reproducibility auditing.
Identify visible failures in research code and documentation.

Topics

ReproRepo
Reproducibility Audits
LLM Agents
GitHub Issues
Machine Learning Research
Code Reproducibility

Code references

LithiumDA/ReproRepo

Best for: AI Scientist, Research Scientist, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.