Benchmarking Agentic Review Systems
Summary
A new study benchmarks agentic review systems, which are emerging as AI-assisted research strains traditional peer review. The evaluation covers two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six large language models that include both frontier and efficient models. The study uses two benchmarks. The first measures how well AI reviews of ICLR and NeurIPS papers agree with external quality signals such as citations and acceptance decisions; here OpenAIReview + GPT-5.5 achieved 83.0% pairwise accuracy. The second is a perturbation benchmark that injects four categories of error into papers from eight arXiv subject classes and measures detection recall; OpenAIReview + GPT-5.5 caught 71.6% of the injected errors, and the union of the six models reached 83.3%. A public deployment of OpenAIReview drew more positive than negative votes (1.44 to 1), though users commonly complained about false positives and minor nitpicks. Overall, the findings indicate that AI reviews can track human quality judgments, detect errors, and earn net-positive user feedback, with room for improvement.
Key takeaway
For research scientists evaluating or deploying AI-assisted peer review tools, this benchmarking study shows that current agentic systems, particularly OpenAIReview + GPT-5.5, reach 83.0% pairwise accuracy against external quality signals and 71.6% recall on injected errors. Consider combining several LLMs, since the union of the six models tested raised error recall to 83.3%, and focus system design on reducing false positives and minor nitpicks. These systems already track human judgments and earn net-positive user feedback, which makes them a viable component of future review processes.
Key insights
Agentic AI review systems can track human quality judgments and detect injected errors, showing promise while leaving room for improvement.
Principles
- AI reviews can exceed chance in tracking paper quality.
- Different LLMs detect distinct types of errors.
- User feedback on AI reviews is generally positive.
Method
Evaluate AI review systems in two ways: measure pairwise accuracy against external quality signals, and measure error detection recall on a perturbation benchmark with errors injected into papers across multiple subject classes.
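A minimal sketch of the two metrics named above, assuming each paper has one numeric AI review score and one numeric external signal (for example a citation count, or 1/0 for accepted/rejected). The function names, the toy data, and the choice to skip tied pairs are illustrative assumptions; the study's exact protocol is not described in this summary.

```python
from itertools import combinations


def pairwise_accuracy(ai_scores, quality_signal):
    """Fraction of paper pairs that the AI score orders the same way
    as the external quality signal.

    Assumption: pairs tied on either value carry no ordering
    information and are skipped.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(ai_scores)), 2):
        score_diff = ai_scores[i] - ai_scores[j]
        signal_diff = quality_signal[i] - quality_signal[j]
        if score_diff == 0 or signal_diff == 0:
            continue  # tie: not a comparable pair
        total += 1
        if (score_diff > 0) == (signal_diff > 0):
            agree += 1
    return agree / total if total else float("nan")


def detection_recall(injected_errors, flagged_errors):
    """Share of injected errors that the review flagged."""
    injected = set(injected_errors)
    if not injected:
        return float("nan")
    return len(injected & set(flagged_errors)) / len(injected)


# Hypothetical toy data: four papers, AI scores vs. citation counts.
toy_scores = [7.0, 5.5, 6.0, 3.0]
toy_citations = [120, 40, 15, 5]

# Hypothetical error identifiers for one perturbed paper.
toy_injected = {"eq3-sign", "tab2-unit", "sec4-claim"}
toy_flagged = {"eq3-sign", "sec4-claim", "typo-p2"}

print(pairwise_accuracy(toy_scores, toy_citations))
print(detection_recall(toy_injected, toy_flagged))
```

On the toy data, five of the six paper pairs are ordered consistently, and two of the three injected errors are flagged. The extra flag (`typo-p2`) does not lower recall; it would count against precision, which this sketch does not compute.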
In practice
- Combine multiple LLMs for higher error detection recall.
- Design harnesses to improve model performance.
- Address false positives and minor nitpicks in AI review outputs.
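The first point in this list can be sketched as a set union over per-model detections. This is an illustration under stated assumptions: the model names, error identifiers, and function names are hypothetical, and the sketch assumes each model's output has already been matched to the injected errors.

```python
def recall(injected, detected):
    """Share of injected errors present in a set of detections."""
    injected = set(injected)
    if not injected:
        return float("nan")
    return len(injected & set(detected)) / len(injected)


def union_recall(injected, detections_by_model):
    """Recall when an error counts as caught if any model caught it."""
    pooled = set()
    for detected in detections_by_model.values():
        pooled |= set(detected)
    return recall(injected, pooled)


# Hypothetical example: six injected errors, two models.
injected = {"e1", "e2", "e3", "e4", "e5", "e6"}
detections = {
    "model_a": {"e1", "e2", "e3", "e4"},
    "model_b": {"e3", "e4", "e5"},
}

for name, detected in detections.items():
    print(name, recall(injected, detected))
print("union", union_recall(injected, detections))
```

In this toy case the union catches five of six errors, more than either model alone, because each model catches something the other misses. Pooling detections also pools every model's false positives, so the third point in the list becomes more pressing as models are added.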
Topics
- Agentic AI Systems
- Peer Review Automation
- Large Language Models
- Benchmarking
- Error Detection
- OpenAIReview
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.