Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

2026-05-19 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

This paper introduces a framework for augmenting human evaluation of AI systems, particularly large language models (LLMs), that treats LLM-generated ratings as auxiliary data rather than as a substitute for human judgment. It recasts evaluation as a two-stage sampling design: LLM ratings are collected for every observation in the first stage, and human ratings are obtained for a subsample in the second. A doubly robust estimator from the missing-data literature then gives a formal basis for study design, including sample size calculations for both human and LLM ratings at a target statistical power. The paper also shows how to allocate human ratings efficiently, directing more of them to evaluation types that LLM ratings predict less well. The intended setting is high-stakes validation such as clinical quality monitoring and regulatory assessment.

Key takeaway

If you design LLM evaluation studies in high-stakes domains as an AI scientist or MLOps engineer, adopt a two-stage sampling approach that uses LLM ratings as auxiliary data. It lets you calculate how many human and LLM reviews are needed to reach a desired statistical precision, which keeps validation rigorous while containing annotation costs. Direct human review first to areas where LLM ratings predict human ratings poorly; that is where each additional human rating buys the most precision.

Key insights

Augment human evaluation with LLM ratings using a two-stage sampling design and a doubly robust estimator.

Principles

Human ratings are the gold standard.
LLM ratings serve as auxiliary data.
Validity of inference is guaranteed by design, because the probability of selecting each unit for human rating is known.

Method

Employ a two-stage sampling design: collect LLM ratings for all units, then human ratings for a subsample. Treat the units without human ratings as missing data and estimate with a doubly robust estimator, using the known probability that each unit was selected for human review.
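The two-stage design and estimator can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the target is the mean human rating, uses a weighted linear fit of human on LLM ratings as the working prediction model, and takes the selection probabilities `pi` as known by design. The function name `dr_mean` and its arguments are hypothetical.

```python
import numpy as np

def dr_mean(llm, human, sampled, pi):
    """Doubly robust (augmented inverse-probability-weighted) mean of human ratings.

    llm     : (N,) LLM ratings, available for every unit (stage 1)
    human   : (N,) human ratings; entries where `sampled` is False are ignored
    sampled : (N,) bool, True where a human rating was collected (stage 2)
    pi      : (N,) known probability that each unit was selected for human review
    """
    llm = np.asarray(llm, dtype=float)
    sampled = np.asarray(sampled, dtype=bool)
    pi = np.asarray(pi, dtype=float)
    # Zero out unobserved human ratings (they may be NaN); they get weight 0 below.
    y = np.where(sampled, np.asarray(human, dtype=float), 0.0)

    # Inverse-probability weights: 1/pi for rated units, 0 otherwise.
    w = sampled / pi

    # Working model (an assumption): weighted least squares of human on LLM rating.
    X = np.column_stack([np.ones_like(llm), llm])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    m = X @ beta  # predicted human rating for every unit

    # Prediction for all units plus a weighted residual correction on rated units.
    phi = m + w * (y - m)
    estimate = phi.mean()
    # Approximate standard error; ignores the uncertainty from estimating beta.
    std_err = phi.std(ddof=1) / np.sqrt(len(phi))
    return estimate, std_err
```

The correction term is what keeps the estimate anchored to human judgment: if the working model predicts poorly, the weighted residuals compensate, and because `pi` is fixed by the study design the weights themselves are not estimated.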

In practice

Determine human/LLM sample sizes for target power.
Allocate more human reviews where LLM predictability is low.
Use pilot data to estimate R-squared (how well LLM ratings predict human ratings) for sample size planning.
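The planning steps above can be illustrated with a rough calculation. The sketch below is not taken from the paper. It assumes simple random subsampling, a test on a mean rating, and the standard approximation Var ≈ σ²[R²/N + (1 − R²)/n], where N is the number of LLM-rated units, n the number of human-rated units, and R² the pilot estimate. The allocation helper assumes a Neyman-style rule that weights each evaluation type by its size times its residual standard deviation, σ·sqrt(1 − R²). All function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist

def human_sample_size(sigma, delta, r2, n_llm, alpha=0.05, power=0.8):
    """Human ratings needed to detect a mean difference `delta` (two-sided test).

    sigma : standard deviation of human ratings (e.g. from pilot data)
    r2    : R-squared of LLM ratings for predicting human ratings (pilot data)
    n_llm : number of units with LLM ratings (N)
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    target_var = (delta / z) ** 2
    # Part of the variance budget is used up by the finite number of LLM-rated units.
    slack = target_var - sigma**2 * r2 / n_llm
    if slack <= 0:
        raise ValueError("n_llm is too small for this power; collect more LLM ratings")
    n = math.ceil(sigma**2 * (1 - r2) / slack)
    return min(n, n_llm)  # cannot human-rate more units than exist

def allocate_humans(n_total, sizes, sigmas, r2s):
    """Split n_total human ratings across evaluation types.

    Weights are size * residual SD, so types with low R-squared get more reviews.
    Rounded counts may differ from n_total by a few units.
    """
    weights = [N * s * math.sqrt(1 - r2) for N, s, r2 in zip(sizes, sigmas, r2s)]
    total = sum(weights)
    return [round(n_total * w / total) for w in weights]
```

Under this approximation a higher R² lowers the required number of human ratings, while a small pool of LLM-rated units can make the target power unreachable no matter how many humans are added, which is why the calculation covers both sample sizes.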

Topics

LLM Evaluation
Two-stage Sampling Design
Doubly Robust Estimator
Sample Size Calculation
Human Oversight

Best for: AI Scientist, MLOps Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.