Benchmarking Agentic Review Systems

2026-06-18 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

A new study benchmarks agentic review systems, which are emerging as AI-assisted research strains traditional peer review. The evaluation covers two open-source systems (OpenAIReview, coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six large language models spanning frontier and efficient tiers. The benchmark has two parts. The first measures how well AI reviews of ICLR/NeurIPS papers agree with external quality signals such as citations and acceptance decisions; here OpenAIReview + GPT-5.5 achieved 83.0% pairwise accuracy. The second is a perturbation benchmark that injects four categories of error into papers from eight arXiv subject classes and measures detection recall; OpenAIReview + GPT-5.5 caught 71.6% of the injected errors, and the union of six models reached 83.3%. In a public deployment of OpenAIReview, user feedback was net positive (1.44 to 1 votes), though false positives and minor nitpicks were common complaints. Overall, the findings indicate that AI reviews can track human quality judgments, detect errors, and earn positive user feedback, with room for improvement remaining.

Key takeaway

For research scientists evaluating or deploying AI-assisted peer review tools, this benchmarking study shows that current agentic systems, particularly OpenAIReview + GPT-5.5, already reach 83.0% pairwise accuracy against external quality signals and 71.6% recall on injected errors. Pooling detections across the six evaluated LLMs raised recall to 83.3%, so your team should consider combining diverse models and should focus system design on reducing false positives and minor nitpicks. These systems already track human judgments and earn net-positive user feedback, making them a viable component of future review processes.

Key insights

Agentic AI review systems can track human quality judgments and detect errors, showing promise while leaving room for improvement.

Principles

AI reviews can exceed chance in tracking paper quality.
Different LLMs catch partly different errors, so pooling their detections raises recall.
User feedback on AI reviews is generally positive.

Method

Evaluate AI review systems in two ways: measure the pairwise accuracy of their reviews against external quality signals (citations, acceptance decisions), and measure error-detection recall on a perturbation benchmark that injects errors into papers across multiple arXiv subject classes.
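
The pairwise-accuracy idea can be illustrated with a minimal sketch. It assumes each paper has one numeric AI review score and one numeric external signal (for example a citation count), and it assumes tied pairs are skipped; the function name, the tie handling, and the toy numbers are illustrative and are not taken from the study.

```python
from itertools import combinations


def pairwise_accuracy(ai_scores, quality_signal):
    """Fraction of paper pairs that the AI scores order the same way
    as the external quality signal.

    Assumption: pairs tied on either measure carry no ordering
    information and are skipped.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(ai_scores)), 2):
        ai_diff = ai_scores[i] - ai_scores[j]
        ref_diff = quality_signal[i] - quality_signal[j]
        if ai_diff == 0 or ref_diff == 0:
            continue  # skip ties (assumption)
        total += 1
        if (ai_diff > 0) == (ref_diff > 0):
            agree += 1
    return agree / total if total else float("nan")


# Toy data (hypothetical): four papers, AI scores vs. citation counts.
ai_scores = [7, 5, 6, 3]
citations = [120, 40, 10, 5]
print(f"pairwise accuracy: {pairwise_accuracy(ai_scores, citations):.3f}")
```

Under this definition a random scorer lands near 50%, which is the chance level against which a figure such as 83.0% would be read.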

In practice

Combine multiple LLMs for higher error detection recall.
Treat the review harness around the model as a design lever for performance, not only the choice of LLM.
Address false positives and minor nitpicks in AI review outputs.
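
Combining models for recall can be sketched as a set union over detections. This is a minimal illustration that assumes each injected error has an identifier and that each model's flagged issues have already been matched to those identifiers; the model names, error IDs, and helper functions are hypothetical and do not reflect the study's actual matching procedure.

```python
def recall(detected, injected):
    """Share of injected errors that appear among the detections."""
    return len(detected & injected) / len(injected)


def union_recall(detections_by_model, injected):
    """Recall when an error counts as caught if any model flagged it."""
    caught = set().union(*detections_by_model.values())
    return recall(caught, injected)


# Toy data (hypothetical): six injected errors, three models.
injected_errors = {"e1", "e2", "e3", "e4", "e5", "e6"}
detections_by_model = {
    "model_a": {"e1", "e2", "e3"},
    "model_b": {"e3", "e4"},
    "model_c": {"e1", "e5", "fp1"},  # "fp1" is a false positive
}

for name, detected in detections_by_model.items():
    print(f"{name}: recall {recall(detected, injected_errors):.3f}")
print(f"union: recall {union_recall(detections_by_model, injected_errors):.3f}")
```

The union can only match or exceed the best single model's recall, but it also pools every model's false positives, so deduplication and filtering matter more as models are added.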

Topics

Agentic AI Systems
Peer Review Automation
Large Language Models
Benchmarking
Error Detection
OpenAIReview

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.