ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents
Summary
ReviewGrounder is a rubric-guided, tool-integrated multi-agent framework designed to make LLM-generated peer reviews more substantive. It targets the common problem of superficial LLM reviews by incorporating explicit rubrics and contextual grounding, two components that existing LLM-based review systems often underuse. The framework splits reviewing into a drafting stage and a grounding stage, in which the initial draft is enriched through targeted evidence consolidation. To evaluate the framework, the authors introduce REVIEWBENCH, a benchmark that assesses review text against paper-specific rubrics derived from official guidelines, paper content, and human reviews. In the reported experiments, ReviewGrounder, using a Phi-4-14B drafter and a GPT-OSS-120B grounding stage, outperforms baselines built on stronger models, such as GPT-4.1 and DeepSeek-R1-670B, in both alignment with human judgments and rubric-based review quality across eight dimensions.
Key takeaway
For AI scientists and NLP engineers building peer review support systems, ReviewGrounder shows one way to move past superficial LLM output: pair explicit rubrics with a multi-stage, evidence-grounded pipeline. In the reported experiments, this improved review quality and alignment with human judgments even when the drafter was a smaller LLM. Consider adopting the drafting and grounding paradigm to make automated feedback deeper and more useful.
Key insights
Integrating rubrics and contextual grounding significantly enhances LLM-generated peer review quality and substantiveness.
Principles
- Decompose complex tasks into distinct stages.
- Ground LLM outputs with external evidence.
- Utilize explicit rubrics for quality assessment.
Method
ReviewGrounder employs a multi-agent framework, separating review generation into a drafting stage and a grounding stage to enrich initial drafts with targeted evidence, guided by paper-specific rubrics.
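The two-stage split can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `ReviewPoint` type, the function names, the toy drafter, and the keyword-overlap retriever are all hypothetical stand-ins, assumed here only to show how a draft might first be produced per rubric item and then enriched with evidence from the paper. In a real system the drafter and the grounding step would be LLM or tool calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReviewPoint:
    """One review comment plus the paper passages that support it."""
    claim: str
    evidence: List[str]


def draft_review(paper_text: str, rubric: List[str],
                 drafter: Callable[[str, str], str]) -> List[ReviewPoint]:
    """Drafting stage: produce one ungrounded comment per rubric item."""
    return [ReviewPoint(claim=drafter(paper_text, item), evidence=[])
            for item in rubric]


def ground_review(points: List[ReviewPoint],
                  retrieve: Callable[[str], List[str]]) -> List[ReviewPoint]:
    """Grounding stage: attach retrieved evidence to each drafted comment."""
    return [ReviewPoint(p.claim, retrieve(p.claim)) for p in points]


# Hypothetical stand-in for an LLM drafter (assumption, not from the paper).
def toy_drafter(paper_text: str, rubric_item: str) -> str:
    return f"Assess {rubric_item}"


# Hypothetical stand-in for a retrieval tool: keyword overlap per sentence.
def make_toy_retriever(paper_text: str) -> Callable[[str], List[str]]:
    sentences = [s.strip() for s in paper_text.split(".") if s.strip()]

    def retrieve(claim: str) -> List[str]:
        keywords = {w.lower() for w in claim.split() if len(w) > 3}
        return [s for s in sentences
                if keywords & {w.lower() for w in s.split()}]

    return retrieve


paper = ("We evaluate on three datasets. "
         "The baselines include two prior systems. "
         "Limitations are not discussed")
rubric = ["datasets", "baselines", "reproducibility"]

points = ground_review(draft_review(paper, rubric, toy_drafter),
                       make_toy_retriever(paper))
for p in points:
    print(p.claim, "->", p.evidence or "no supporting passage found")
```

Keeping the stages as separate functions that take the model or tool as a parameter means the drafter and the grounding component can be swapped independently, which mirrors the pairing of a smaller drafter with a larger grounding model described above.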
In practice
- Use Phi-4-14B for initial draft generation.
- Employ GPT-OSS-120B for evidence grounding.
- Develop paper-specific rubrics for evaluation.
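As a rough illustration of rubric-based evaluation, the sketch below scores a review against a small paper-specific rubric and reports the fraction of criteria met per dimension. It is an assumption-laden toy, not the REVIEWBENCH protocol: the `RubricItem` type, the dimension names, the criteria, and the substring-matching `keyword_judge` are hypothetical, and a real judge would more plausibly be an LLM or a human annotator.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class RubricItem(NamedTuple):
    dimension: str   # hypothetical dimension label
    criterion: str   # what the review is expected to address


def score_review(review: str, rubric: List[RubricItem],
                 judge: Callable[[str, str], bool]) -> Dict[str, float]:
    """Return, per dimension, the fraction of rubric criteria the review meets."""
    hits = defaultdict(list)
    for item in rubric:
        hits[item.dimension].append(bool(judge(review, item.criterion)))
    return {dim: sum(flags) / len(flags) for dim, flags in hits.items()}


# Hypothetical judge: case-insensitive substring match (assumption only).
def keyword_judge(review: str, criterion: str) -> bool:
    return criterion.lower() in review.lower()


rubric = [
    RubricItem("soundness", "ablation"),
    RubricItem("soundness", "statistical significance"),
    RubricItem("clarity", "notation"),
]
review = ("The ablation study is thorough, but the notation in Section 3 "
          "is inconsistent.")

scores = score_review(review, rubric, keyword_judge)
print(scores)
```

Scoring per dimension rather than with one overall number makes it easier to see where a review is thin, for example strong on clarity comments but missing checks on statistical rigor.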
Topics
- ReviewGrounder
- Peer Review Automation
- LLM-based Review
- Rubric-Guided Agents
- Multi-Agent Systems
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.