Building better AI benchmarks: How many raters are enough?
Summary
Google Research scientists Flip Korn and Chris Welty introduced an evaluation framework for machine learning models that optimizes the trade-off between the number of items and the number of raters per item, with the goal of highly reproducible AI benchmarks. Their paper, "Forest vs Tree: The (*N*,*K*) Trade-off in Reproducible ML Evaluation," published on March 31, 2026, challenges the common practice of using 1-5 raters per item, which often fails to capture human disagreement. They developed an open-source simulator to stress-test combinations of item scale (*N*, from 100 to 50,000 items) and crowd size (*K*, from 1 to 500 raters per item) across datasets such as Toxicity, DICES, D3code, and Jobs. Two findings stand out. First, 3-5 raters per item are often insufficient. Second, the optimal rater-to-item ratio depends on the evaluation metric: a "forest" design (more items, fewer raters) suits majority-vote accuracy, while a "tree" design (fewer items, more raters) suits metrics that capture the full range of human opinion and nuance. The research suggests that reproducible results are achievable with around 1,000 total annotations, provided the budget is split between items and raters in a way that suits the metric.
Key takeaway
If you are an AI engineer designing evaluation benchmarks for subjective tasks, re-evaluate your raters-per-item strategy. If your goal is to capture the full spectrum of human opinion rather than just a majority vote, increase your raters per item beyond the typical 3-5. The open-source simulator can help you optimize your annotation budget for reproducibility, so that your models are evaluated against a more realistic and nuanced "ground truth."
Key insights
Optimizing rater-to-item ratios is crucial for reproducible AI benchmarks that capture human disagreement.
Principles
- Low rater counts (1-5) are often insufficient for nuanced human evaluation.
- The optimal rater-to-item ratio depends on the evaluation metric.
- Reproducible results are achievable with modest annotation budgets.
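To make the budget trade-off concrete, the sketch below lists the (*N*, *K*) splits that fit a fixed annotation budget. It is plain arithmetic, not part of the authors' tooling; the function name `budget_splits` and the candidate *K* values are illustrative assumptions.

```python
def budget_splits(total_annotations, k_values):
    """Return (N, K) pairs such that N items x K raters fits within the budget.

    Hypothetical helper for illustration only.
    """
    splits = []
    for k in k_values:
        n = total_annotations // k  # whole items affordable at K raters each
        if n >= 1:
            splits.append((n, k))
    return splits


if __name__ == "__main__":
    # Roughly 1,000 total annotations, the budget scale mentioned in the summary.
    for n, k in budget_splits(1000, [1, 3, 5, 10, 20, 50]):
        print(f"N={n:5d} items x K={k:3d} raters = {n * k} annotations")
```

Each row is a candidate design; which one is best depends on the metric being reported.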
Method
A simulator based on real-world subjective datasets (e.g., toxicity) was used to test thousands of combinations of item scale (*N*) and raters per item (*K*) to find statistically reliable configurations.
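The general idea of stress-testing an (*N*, *K*) configuration can be sketched as a toy resampling loop. This is a minimal illustration under stated assumptions, not the authors' simulator (see google-research/vet under Code references) and not its API: the item pool is synthetic, labels are binary, raters are independent, and all function names are hypothetical. It repeats an evaluation many times and reports how much a majority-vote score varies from run to run; a smaller spread means a more reproducible result.

```python
import random
import statistics


def make_population(num_items, rng):
    """Synthetic item pool (assumption): each item has a latent probability
    that a randomly chosen rater labels it positive."""
    return [rng.betavariate(0.5, 0.5) for _ in range(num_items)]


def majority_positive_rate(population, n, k, rng):
    """Sample n items and k ratings per item; return the fraction of items
    whose majority vote is positive (ties count as not positive)."""
    items = rng.sample(population, n)
    positives = 0
    for p in items:
        votes = sum(rng.random() < p for _ in range(k))
        if votes * 2 > k:
            positives += 1
    return positives / n


def spread_across_trials(population, n, k, trials, seed):
    """Standard deviation of the score over repeated simulated evaluations."""
    rng = random.Random(seed)
    scores = [majority_positive_rate(population, n, k, rng) for _ in range(trials)]
    return statistics.stdev(scores)


if __name__ == "__main__":
    pool = make_population(5000, random.Random(0))
    # Three ways to spend about 1,000 annotations.
    for n, k in [(1000, 1), (200, 5), (100, 10)]:
        spread = spread_across_trials(pool, n, k, trials=50, seed=1)
        print(f"N={n:4d} K={k:2d} spread={spread:.4f}")
```

Swapping `majority_positive_rate` for a metric that uses the full per-item response distribution would let the same loop compare "forest" and "tree" designs on that metric.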
In practice
- Use >10 raters per item for capturing human nuance.
- Prioritize more items for majority-vote accuracy metrics.
- Prioritize more raters per item for capturing opinion variation.
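One way to see why per-item opinion estimates need more raters: under the simplifying assumption of binary labels and independent raters, the standard error of an item's estimated positive rate shrinks only with the square root of *K*. The sketch below computes that textbook quantity; it is an illustration, not a metric taken from the paper, and the function name is hypothetical.

```python
import math


def proportion_standard_error(p, k):
    """Standard error of a positive-rate estimate from k independent binary
    ratings of one item whose true positive rate is p."""
    return math.sqrt(p * (1 - p) / k)


if __name__ == "__main__":
    # p = 0.5 is the most contested case (raters split evenly).
    for k in (3, 5, 10, 25, 100):
        print(f"K={k:3d} standard error={proportion_standard_error(0.5, k):.3f}")
```

Halving the error on a contested item requires roughly four times as many raters, which is why distribution-focused metrics push the budget toward larger *K*.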
Topics
- AI Benchmarking
- Human Disagreement
- Reproducibility
- Rater Optimization
- Evaluation Framework
Code references
- google-research/vet
- google-research-datasets/dices-dataset
- google-research-datasets/D3code
- Homan-Lab/pldl_data_internal
Best for: AI Engineer, NLP Engineer, Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The latest research from Google.