How strongly do you believe LLM judges for ML papers? [D]
Summary
A Reddit discussion explores how far Large Language Models (LLMs) can be trusted as judges of Machine Learning papers. Participants report mixed experiences. Some find LLM judges, such as Gemma-3, useful for catching common issues like missing ablations or methodological gaps. Others are skeptical, noting that LLMs struggle to assess novelty or incremental advances and often miss subtle theoretical contributions and contextual nuance. The consensus is that LLMs are valuable for initial surface-level checks and for flagging potential weaknesses, but human validation and expert judgment remain crucial for final publication decisions, especially when assessing deeper contributions.
Key takeaway
AI Scientists and Research Scientists evaluating ML papers can use LLM judges for preliminary checks on obvious issues such as missing baselines or clarity problems. The final assessment should still involve human expert review to gauge novelty, subtle theoretical contributions, and contextual relevance, which LLMs frequently miss. Consider statistically validating the agreement between human and LLM assessments before relying on the LLM's output.
Key insights
LLM judges are useful for initial ML paper review checks but require human validation for nuanced contributions.
Principles
- LLMs excel at identifying obvious paper issues.
- Human experts are essential for nuanced judgment.
- Calibration improves LLM review utility.
Method
Use statistical validation to establish inter-rater reliability between human ratings and LLM-derived (semantic) measurements, so that the risk of relying on LLM-assisted reviews is quantified rather than assumed.
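One common way to quantify inter-rater reliability is Cohen's kappa, which measures agreement between two raters after correcting for agreement expected by chance. The discussion does not name a specific statistic, so the choice of kappa is an assumption here. The sketch below is minimal and uses only the standard library. The function name `cohen_kappa` and the accept/reject labels are hypothetical, and the ratings are made-up illustrative values, not results from the thread.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items.

    Illustrative helper (not from the original discussion).
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must label the same non-empty set of items")

    n = len(ratings_a)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: probability both raters pick the same label
    # if each labelled independently according to their own label frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    # Degenerate case: both raters used one identical label for every item.
    # Kappa is undefined there; returning 1.0 is a convention chosen for this sketch.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Toy data, invented for illustration: one decision per paper.
human = ["accept", "reject", "reject", "accept", "reject", "accept", "reject", "reject"]
llm = ["accept", "reject", "accept", "accept", "reject", "reject", "reject", "reject"]

kappa = cohen_kappa(human, llm)
print(f"Cohen's kappa (human vs. LLM): {kappa:.3f}")
```

A kappa near 1 indicates strong agreement beyond chance, and a value near 0 indicates agreement no better than chance. What threshold counts as acceptable for a review workflow is a judgment call that depends on the venue and the stakes.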
In practice
- Use LLMs for initial triage of ML papers.
- Employ open-weight LLMs like Gemma-3 for judging.
- Always include human validation in publication decisions.
Topics
- LLM Paper Review
- Machine Learning Papers
- Human Validation
- Gemma-3
- Stanford Agentic Reviewer
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.