How strongly do you believe in LLM judges for ML papers? [D]

2026-04-29 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, quick

Summary

A Reddit discussion explores the usefulness and limits of Large Language Models (LLMs) as judges for Machine Learning paper reviews. Experiences vary: some participants report positive results with LLM judges such as Gemma-3, finding them useful for catching common issues like missing ablations or methodological gaps. Others are skeptical, noting that LLMs struggle to assess novelty or incremental advances and often miss subtle theoretical contributions or contextual nuances. The consensus is that LLMs are valuable for initial surface-level checks and for flagging potential weaknesses, but human validation and expert judgment remain crucial for final publication decisions, especially when assessing deeper contributions and context.

Key takeaway

If you are an AI Scientist or Research Scientist evaluating ML papers, use LLM judges for preliminary checks on obvious issues such as missing baselines or clarity problems. Your final assessment must still involve human expert review to gauge novelty, subtle theoretical contributions, and contextual relevance, because LLMs frequently miss these aspects. Consider using statistical validation to check how reliably human and LLM assessments agree.

Key insights

LLM judges are useful for initial ML paper review checks but require human validation for nuanced contributions.

Principles

LLMs excel at identifying obvious paper issues.
Human experts are essential for nuanced judgment.
Calibration improves LLM review utility.

Method

Use statistical validation to establish inter-rater reliability between human judgments and LLM-derived (semantic) measurements when managing risk in LLM-assisted reviews.
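
One common way to quantify agreement between two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The discussion does not name a specific statistic, so the choice of kappa here is an assumption, as are the function name, the label values, and the sample data. A minimal sketch in plain Python:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], llm: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected if both raters labeled at random
    according to their own label frequencies.
    """
    if len(human) != len(llm) or len(human) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")

    n = len(human)
    observed = sum(h == m for h, m in zip(human, llm)) / n

    human_counts = Counter(human)
    llm_counts = Counter(llm)
    labels = set(human_counts) | set(llm_counts)
    expected = sum(human_counts[c] * llm_counts[c] for c in labels) / (n * n)

    if expected == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        # Kappa is formally undefined here; returning 1.0 is an assumption.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts on the same eight papers (illustrative data only).
human_labels = ["accept", "reject", "accept", "reject",
                "accept", "accept", "reject", "accept"]
llm_labels = ["accept", "reject", "reject", "reject",
              "accept", "accept", "accept", "accept"]

kappa = cohens_kappa(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.3f}")  # 6/8 raw agreement, kappa = 7/15
```

Raw agreement in this sample is 75%, yet kappa is only about 0.47 once chance agreement is removed, which is why a chance-corrected statistic is usually preferred over simple percent agreement. For ordinal review scores, a weighted kappa or a rank correlation may be a better fit.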

In practice

Use LLMs for initial triage of ML papers.
Employ open-weight LLMs like Gemma-3 for judging.
Always include human validation for publications.
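
The workflow above can be sketched as a small triage loop: an LLM judge flags surface-level issues, papers are ordered by how many flags they receive, and every paper is still routed to a human. All names here (`TriageResult`, `triage`, `keyword_stub_judge`) are hypothetical, and the keyword-based stub merely stands in for a real LLM call (for example to an open-weight model such as Gemma-3), which is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TriageResult:
    paper_id: str
    flags: List[str]
    # Never set to False by this sketch: the LLM only prioritizes,
    # it does not make the final decision.
    needs_human_review: bool = True


def triage(papers: Dict[str, str],
           judge: Callable[[str], List[str]]) -> List[TriageResult]:
    """Run the judge on each paper and sort by number of flags, most first."""
    results = [TriageResult(pid, judge(text)) for pid, text in papers.items()]
    return sorted(results, key=lambda r: len(r.flags), reverse=True)


def keyword_stub_judge(text: str) -> List[str]:
    """Placeholder for an LLM judge; checks only for two obvious omissions."""
    lowered = text.lower()
    flags = []
    if "ablation" not in lowered:
        flags.append("missing ablations")
    if "baseline" not in lowered:
        flags.append("missing baselines")
    return flags


# Illustrative inputs only.
papers = {
    "p1": "We compare against a strong baseline and run an ablation study.",
    "p2": "We propose a new method and report results.",
}

queue = triage(papers, keyword_stub_judge)
for item in queue:
    print(item.paper_id, item.flags, "human review:", item.needs_human_review)
```

Passing the judge in as a callable keeps the triage logic independent of any particular model or API, so the stub can be swapped for a real LLM client without changing the routing.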

Topics

LLM Paper Review
Machine Learning Papers
Human Validation
Gemma-3
Stanford Agentic Reviewer

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.