ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
Summary
ReproRepo is a scalable framework for evaluating research reproducibility. It addresses a limitation of existing benchmarks, which require extensive manual effort, by using human-raised GitHub issues as natural supervision to identify realistic reproduction blockers. ReproRepo was instantiated on 1,149 recent machine learning papers from major conferences and used to evaluate four frontier model-agent configurations. The study found that LLM agents, specifically Codex with GPT-5.5, can identify real-world reproducibility problems from paper-repository pairs, surfacing at least one semantically related human-reported blocker for approximately 90% of the papers. Agents are particularly effective at identifying visible failures and the semantic region of a problem, but are less precise at exact localization.
Key takeaway
If you are a research scientist or AI engineer evaluating the reproducibility of machine learning research, consider integrating LLM agents into your auditing workflow. ReproRepo shows that agents such as Codex with GPT-5.5 can identify visible failures and semantic problem regions from paper-repository pairs, surfacing at least one semantically related human-reported blocker for approximately 90% of papers. This can reduce the manual effort of initial assessments, letting you focus human expertise on exact localization and deeper problem-solving.
Key insights
ReproRepo scales reproducibility audits using human-raised GitHub issues as supervision for LLM agents.
Principles
- Human-raised GitHub issues provide effective supervision for reproducibility blockers.
- LLM agents can identify reproducibility problems without executing code.
Method
ReproRepo runs LLM agents on paper-repository pairs and uses the human-raised GitHub issues as supervision to evaluate whether the agents identify real-world reproducibility blockers.
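To make the headline metric concrete, here is a minimal sketch of a paper-level hit rate: the share of papers for which at least one agent finding is related to at least one human-reported issue. This is an illustration under stated assumptions, not ReproRepo's implementation. The names `PaperAudit`, `paper_level_hit_rate`, and `keyword_overlap` are hypothetical, and the word-overlap matcher is only a toy stand-in for whatever semantic matching the study actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PaperAudit:
    """One paper-repository pair (hypothetical structure for illustration)."""
    paper_id: str
    agent_findings: List[str]  # problems the agent reported for this pair
    human_issues: List[str]    # blockers humans raised as GitHub issues


def has_related_hit(audit: PaperAudit,
                    is_related: Callable[[str, str], bool]) -> bool:
    """True if any agent finding is related to any human-reported issue."""
    return any(
        is_related(finding, issue)
        for finding in audit.agent_findings
        for issue in audit.human_issues
    )


def paper_level_hit_rate(audits: List[PaperAudit],
                         is_related: Callable[[str, str], bool]) -> float:
    """Fraction of papers with at least one related finding (0.0 if empty)."""
    if not audits:
        return 0.0
    hits = sum(has_related_hit(a, is_related) for a in audits)
    return hits / len(audits)


def keyword_overlap(finding: str, issue: str, min_shared: int = 2) -> bool:
    """Toy matcher (assumption): related if enough lowercase words are shared."""
    shared = set(finding.lower().split()) & set(issue.lower().split())
    return len(shared) >= min_shared


# Made-up example data, not taken from the study.
audits = [
    PaperAudit(
        paper_id="paper-a",
        agent_findings=["missing requirements file for training"],
        human_issues=["requirements file missing"],
    ),
    PaperAudit(
        paper_id="paper-b",
        agent_findings=["undocumented dataset path"],
        human_issues=["cuda out of memory on eval"],
    ),
]

rate = paper_level_hit_rate(audits, keyword_overlap)
print(f"paper-level hit rate: {rate:.2f}")
```

Passing the matcher in as a function keeps the metric separate from the relatedness judgment, so the toy word overlap could be swapped for a stricter or model-based comparison without changing the rate calculation.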
In practice
- Evaluate LLM agents for real-world reproducibility auditing.
- Identify visible failures in research code and documentation.
Topics
- ReproRepo
- Reproducibility Audits
- LLM Agents
- GitHub Issues
- Machine Learning Research
- Code Reproducibility
Best for: AI Scientist, Research Scientist, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.