PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing
Summary
The Peer Review AI Benchmark (PRAIB) is a framework for assessing how Large Language Models (LLMs) engage with scientific manuscripts during peer review. Motivated by the growing volume of paper submissions, it defines metrics for review specificity, style, and engagement. The accompanying large-scale empirical study analyzes 11,000 reviews generated by five proprietary and open-source LLMs for 1,000 ICLR and NeurIPS papers from 2021 to 2025 and compares them against human feedback. The comparison reveals significant divergences: LLM ratings are less variable, positively biased, and overconfident, and cross-reference patterns vary by model. LLMs also tend to produce longer, more complex reviews while frequently overlooking atomic weaknesses identified by human reviewers.
Key takeaway
AI scientists evaluating LLMs for peer review automation should recognize that current models exhibit systematic biases, such as positive rating bias and overconfidence. Use diagnostic tools like PRAIB to identify specific LLM limitations and the areas that require human oversight or further model refinement before deployment, so that LLM assistance genuinely augments, rather than compromises, review quality and fairness.
Key insights
LLMs diverge significantly from human peer review behavior, necessitating specialized benchmarks for reliable integration.
Principles
- LLM ratings are less variable, positively biased, and overconfident.
- LLM cross-reference patterns are model-dependent and distinct from human norms.
- LLMs generate longer, more complex reviews but often miss atomic weaknesses.
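The first principle can be made concrete with a small calculation. The sketch below is not PRAIB's own metric; it assumes ratings are plain numbers on a shared scale, and the function name `rating_divergence` and the sample ratings are invented purely for illustration. It reports the mean shift (positive values suggest a positive bias) and the ratio of standard deviations (values below 1 suggest the LLM ratings are less variable).

```python
from statistics import mean, pstdev


def rating_divergence(human, llm):
    """Compare two lists of numeric ratings given to the same set of papers.

    Hypothetical helper for illustration; not part of PRAIB.
    """
    if not human or not llm:
        raise ValueError("both rating lists must be non-empty")
    human_sd = pstdev(human)
    llm_sd = pstdev(llm)
    return {
        # Positive value: LLM ratings are higher on average than human ratings.
        "bias": mean(llm) - mean(human),
        "human_sd": human_sd,
        "llm_sd": llm_sd,
        # Below 1: LLM ratings vary less than human ratings.
        "spread_ratio": llm_sd / human_sd if human_sd > 0 else float("nan"),
    }


# Made-up ratings, not data from the study.
human_ratings = [3, 5, 6, 8, 4, 7]
llm_ratings = [6, 6, 7, 7, 6, 7]
result = rating_divergence(human_ratings, llm_ratings)
print(result)
```

Overconfidence could be examined the same way by passing reviewer confidence scores instead of ratings, assuming both sources report confidence on the same scale.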
Method
The PRAIB framework defines metrics for review specificity, style, and engagement, and compares machine-generated reviews against human feedback across diverse prompting strategies to identify behavioral divergences.
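As a rough illustration of what a style comparison might look like, the sketch below computes word count and average sentence length for a review. These are crude proxies chosen for this example, not the metrics PRAIB defines; the function name `style_metrics` and both sample reviews are hypothetical.

```python
import re


def style_metrics(review_text):
    """Return simple length and sentence-complexity proxies for one review.

    Hypothetical helper for illustration; not part of PRAIB.
    """
    # Naive sentence split on terminal punctuation; decimals such as "0.5"
    # would be split too, so treat the counts as approximate.
    sentences = [s for s in re.split(r"[.!?]+\s*", review_text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9'-]+", review_text)
    n_sentences = len(sentences)
    return {
        "words": len(words),
        "sentences": n_sentences,
        "words_per_sentence": len(words) / n_sentences if n_sentences else 0.0,
    }


# Invented example reviews, not taken from the study.
human_review = "The ablation is missing. Table 2 lacks error bars."
llm_review = (
    "The paper presents a well-motivated and clearly written contribution, "
    "although the experimental section could benefit from additional ablations. "
    "Overall, the work is promising."
)

human_style = style_metrics(human_review)
llm_style = style_metrics(llm_review)
print(human_style)
print(llm_style)
```

Averaging such values over many human and machine reviews of the same papers would give one simple view of the length and complexity gap described in the summary.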
In practice
- Use PRAIB to diagnose how well an LLM can support the review process.
- Identify aspects of LLM reviewing needing further development before deployment.
Topics
- PRAIB
- Large Language Models
- Peer Review
- AI Benchmarking
- Review Automation
- LLM Bias
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.