Benchmarking Agentic Review Systems

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

A new study benchmarks agentic review systems, which are emerging as AI-assisted research strains traditional peer review. The evaluation covers two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six large language models including frontier and efficient models. It uses two benchmarks. The first measures how well AI reviews of ICLR and NeurIPS papers agree with external quality signals such as citations and acceptance decisions; here OpenAIReview + GPT-5.5 achieved 83.0% pairwise accuracy. The second is a perturbation benchmark that injects four categories of error into papers from eight arXiv subject classes and measures detection recall; OpenAIReview + GPT-5.5 caught 71.6% of the injected errors, and the union of detections across the six models reached 83.3%. A public deployment of OpenAIReview drew net-positive user feedback (votes ran 1.44 to 1 in favor), though users commonly complained about false positives and minor nitpicks. Overall, the findings indicate that AI reviews can track human quality judgments, detect errors, and earn positive user feedback, with room left for improvement.
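To make the first metric concrete, here is a minimal sketch of pairwise accuracy: the share of paper pairs that an AI review score orders the same way as an external signal such as citation count. The function name, the tie-handling rule, and the sample data are all assumptions for illustration; the study's exact procedure may differ.

```python
from itertools import combinations


def pairwise_accuracy(review_scores, quality_signal):
    """Fraction of paper pairs ordered the same way by both lists.

    Pairs tied on the external signal are skipped. A tie in the review
    scores counts as half-correct (an assumption, not the study's rule).
    """
    correct = 0.0
    total = 0
    for i, j in combinations(range(len(review_scores)), 2):
        signal_diff = quality_signal[i] - quality_signal[j]
        if signal_diff == 0:
            continue  # no reference ordering for this pair
        total += 1
        score_diff = review_scores[i] - review_scores[j]
        if score_diff == 0:
            correct += 0.5
        elif (score_diff > 0) == (signal_diff > 0):
            correct += 1.0
    return correct / total if total else float("nan")


# Hypothetical data: AI review scores and citation counts for four papers.
ai_scores = [7.5, 4.0, 6.0, 5.0]
citations = [120, 10, 15, 40]
print(pairwise_accuracy(ai_scores, citations))
```

In this toy data, five of the six pairs agree; the review scores rank the third paper above the fourth while the citation counts rank them the other way.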

Key takeaway

For research scientists evaluating or deploying AI-assisted peer review tools, this benchmarking study shows that current agentic systems, particularly OpenAIReview + GPT-5.5, reach 83.0% pairwise accuracy against external quality signals and 71.6% recall on injected errors. Pooling detections from all six LLMs raised recall to 83.3%, so your team should consider combining diverse models while refining system design to reduce false positives and minor nitpicks. These systems already track human judgments and earn net-positive user feedback, making them a viable component of future review processes.
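The gain from combining models can be sketched as a set union: an injected error counts as caught if any model flags it. The model names, error identifiers, and detections below are hypothetical and chosen only to show the calculation.

```python
def recall(detected, injected):
    """Share of injected errors that appear in the detected set."""
    return len(detected & injected) / len(injected)


# Hypothetical benchmark: eight injected errors, three models.
injected = {f"e{i}" for i in range(1, 9)}
detections = {
    "model_a": {"e1", "e2", "e3", "e4"},
    "model_b": {"e2", "e5"},
    "model_c": {"e4", "e6", "e7"},
}

per_model = {name: recall(found, injected) for name, found in detections.items()}
union_found = set().union(*detections.values())
union_recall = recall(union_found, injected)

print(per_model)
print(union_recall)
```

Union recall can never fall below the best single model's recall, but the same pooling also accumulates every model's false positives, so an ensemble of this kind likely needs a filtering or ranking step before its comments reach authors.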

Key insights

Agentic AI review systems can track human quality judgments and detect injected errors. They show promise, but false positives and missed errors leave room for improvement.

Principles

Method

Evaluate AI review systems in two ways: benchmark pairwise accuracy against external quality signals, and measure error-detection recall on a perturbation benchmark that injects known errors into papers across multiple subject classes.
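As a rough sketch of how perturbation results might be scored, the snippet below groups injected errors by a chosen field and reports detection recall per group. The record layout, the category labels, and the subject classes shown are placeholders; the summary does not name the study's four error categories or its eight subject classes.

```python
from collections import defaultdict


def recall_by(records, key):
    """Detection recall for each distinct value of `key` in the records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        hits[record[key]] += int(record["detected"])
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical records: one per injected error, with whether the reviewer flagged it.
records = [
    {"subject": "cs.LG", "category": "category_a", "detected": True},
    {"subject": "cs.LG", "category": "category_b", "detected": False},
    {"subject": "math.ST", "category": "category_a", "detected": True},
    {"subject": "math.ST", "category": "category_b", "detected": True},
    {"subject": "cs.LG", "category": "category_a", "detected": False},
]

by_category = recall_by(records, "category")
by_subject = recall_by(records, "subject")
overall = sum(r["detected"] for r in records) / len(records)

print(by_category)
print(by_subject)
print(overall)
```

Breaking recall down this way shows whether a headline figure hides weak spots in particular error types or fields.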

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.