Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?
Summary
This paper introduces a framework for augmenting human evaluation of AI systems, particularly Large Language Models (LLMs), that treats LLM-generated ratings as auxiliary data rather than as a substitute for human judgment. It recasts evaluation as a two-stage sampling design: LLM ratings are collected for all observations in the first stage, and human ratings are obtained for a subsample in the second. Using a doubly robust estimator from the missing-data literature, the framework gives a formal basis for study design, including sample size calculations for both human and LLM ratings to reach a target statistical power. It also shows how to allocate human ratings efficiently, directing more of them to evaluation types where LLM ratings predict human ratings less well. The intended setting is high-stakes validation of AI systems, such as clinical quality monitoring and regulatory assessment.
Key takeaway
AI scientists and MLOps engineers designing evaluation studies for LLMs in high-stakes domains should adopt a two-stage sampling approach that uses LLM ratings as auxiliary data. This approach lets you calculate how many human and LLM reviews are needed to reach a target statistical power, so validation stays rigorous while annotation costs stay under control. Allocate more human reviews to areas where LLM ratings predict human ratings poorly, since that is where each human review adds the most information.
Key insights
Augment human evaluation with LLM ratings using a two-stage sampling design and a doubly robust estimator.
Principles
- Human ratings are the gold standard.
- LLM ratings serve as auxiliary data.
- Validity of inference is guaranteed by design, because the probability that a unit receives a human review is known.
Method
Employ a two-stage sampling design: collect LLM ratings for all units, then human ratings for a subsample. Use a doubly robust estimator to account for the units without human ratings, exploiting the fact that each unit's probability of receiving a human review is set by the design and therefore known.
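As an illustration of the method described above, here is a minimal sketch of a doubly robust (augmented inverse-probability-weighted) estimate of the mean human rating. It is not taken from the paper: the function name `dr_mean`, its arguments, and the simple standard-error formula are assumptions made for this example. The sketch assumes that `llm_pred` already holds a prediction of the human rating derived from the LLM rating, and that human reviews are assigned independently with known probability `pi`.

```python
import numpy as np

def dr_mean(llm_pred, human, sampled, pi):
    """Doubly robust (AIPW) estimate of the mean human rating.

    llm_pred : prediction of the human rating from the LLM rating, all N units
    human    : human ratings; entries for unsampled units are ignored (may be NaN)
    sampled  : 0/1 indicator that a unit received a human review
    pi       : known probability of human review (scalar or one value per unit)
    All names are hypothetical, chosen for this sketch.
    """
    llm_pred = np.asarray(llm_pred, dtype=float)
    sampled = np.asarray(sampled, dtype=bool)
    pi = np.asarray(pi, dtype=float)
    # Replace unobserved human ratings with 0 so NaNs never enter the sum;
    # these entries are multiplied by zero below.
    y = np.where(sampled, np.asarray(human, dtype=float), 0.0)
    # Per-unit contribution: LLM-based prediction plus an inverse-probability
    # weighted correction on the units that humans actually rated.
    phi = llm_pred + sampled / pi * (y - llm_pred)
    estimate = phi.mean()
    # Standard error assuming independent units and independent sampling.
    std_error = phi.std(ddof=1) / np.sqrt(phi.size)
    return estimate, std_error

# Toy data: four units, two of them reviewed by humans with probability 0.5.
est, se = dr_mean(
    llm_pred=[1.0, 2.0, 3.0, 4.0],
    human=[1.5, float("nan"), 2.5, float("nan")],
    sampled=[1, 0, 1, 0],
    pi=0.5,
)
```

Because `pi` is fixed by the study design, the correction term has mean zero error even when the LLM-based prediction is poor; a better prediction only shrinks the variance.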
In practice
- Determine human/LLM sample sizes for target power.
- Allocate more human reviews where LLM predictability is low.
- Use pilot data to estimate R-squared (how well LLM ratings predict human ratings) for sample size planning.
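The planning steps above can be sketched in code. This is not the paper's formula; it uses the textbook two-phase (double sampling) approximation Var = sigma^2 * ((1 - R^2) / n + R^2 / N), which assumes a simple random subsample of n human-rated units out of N LLM-rated units and a linear prediction of the human rating. The function names, arguments, and the Neyman-style allocation rule are assumptions made for this example.

```python
import math
from statistics import NormalDist

def human_sample_size(delta, sigma, r2, n_llm, alpha=0.05, power=0.8):
    """Human reviews needed to detect a shift `delta` in the mean rating.

    sigma : standard deviation of human ratings (e.g. from pilot data)
    r2    : pilot R-squared of human ratings regressed on LLM ratings
    n_llm : number of units with LLM ratings (N)
    Hypothetical helper; assumes a two-sided one-sample z-test.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    target_var = (delta / z) ** 2
    # Part of the variance that human reviews cannot reduce.
    floor = sigma**2 * r2 / n_llm
    if target_var <= floor:
        raise ValueError("Too few LLM-rated units for this target; increase n_llm.")
    n = sigma**2 * (1 - r2) / (target_var - floor)
    return min(math.ceil(n), n_llm)

def allocate_human_reviews(n_total, sizes, sigmas, r2s):
    """Neyman-style split of human reviews across evaluation types.

    Each type gets a share proportional to size * sigma * sqrt(1 - R^2),
    so types the LLM predicts poorly receive more human reviews.
    Rounded shares may not sum exactly to n_total.
    """
    weights = [N * s * math.sqrt(1 - r) for N, s, r in zip(sizes, sigmas, r2s)]
    total = sum(weights)
    return [round(n_total * w / total) for w in weights]

# With no LLM signal (R^2 = 0) versus moderate signal (R^2 = 0.5).
n_human_only = human_sample_size(delta=0.5, sigma=1.0, r2=0.0, n_llm=10**9)
n_augmented = human_sample_size(delta=0.5, sigma=1.0, r2=0.5, n_llm=10**9)

# Two equally sized evaluation types; the LLM predicts the first one well.
split = allocate_human_reviews(100, sizes=[1000, 1000], sigmas=[1.0, 1.0], r2s=[0.75, 0.0])
```

Under this approximation, when N is very large the required number of human reviews shrinks by roughly a factor of (1 - R^2) relative to a human-only study, which is why a pilot estimate of R-squared drives the plan.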
Topics
- LLM Evaluation
- Two-stage Sampling Design
- Doubly Robust Estimator
- Sample Size Calculation
- Human Oversight
Best for: AI Scientist, MLOps Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.