Building better AI benchmarks: How many raters are enough?

2026-03-31 · Source: The latest research from Google · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, medium

Summary

Google Research scientists Flip Korn and Chris Welty introduced an evaluation framework for machine learning models that optimizes the trade-off between the number of items (*N*) and the number of raters per item (*K*), with the goal of highly reproducible AI benchmarks. Their paper, "Forest vs Tree: The (*N*,*K*) Trade-off in Reproducible ML Evaluation," published on March 31, 2026, challenges the common practice of using 1-5 raters per item, which often fails to capture human disagreement. They developed an open-source simulator to stress-test combinations of item scale (*N*, from 100 to 50,000) and crowd size (*K*, from 1 to 500 raters per item) across datasets such as Toxicity, DICES, D3code, and Jobs. Two findings stand out: 3-5 raters are often insufficient, and the optimal rater-to-item ratio depends on the evaluation metric. A "forest" design (more items, fewer raters) suits majority-vote accuracy, while a "tree" design (fewer items, more raters) suits metrics that capture the full range of human opinion and nuance. The research suggests that reproducible results are achievable with around 1,000 total annotations if the split between *N* and *K* is chosen well.

Key takeaway

If you are an AI engineer designing evaluation benchmarks for subjective tasks, re-evaluate your current raters-per-item strategy. If your goal is to capture the full spectrum of human opinion rather than just a majority vote, increase your raters per item beyond the typical 3-5. The open-source simulator can help you allocate your annotation budget for reproducibility, so that your models are evaluated against a more realistic and nuanced "ground truth."

Key insights

Optimizing rater-to-item ratios is crucial for reproducible AI benchmarks that capture human disagreement.

Principles

Low rater counts (1-5) are often insufficient for nuanced human evaluation.
The optimal rater-to-item ratio depends on the evaluation metric.
Reproducible results are achievable with modest annotation budgets.
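
One way to see why a handful of raters struggles to represent disagreement is the sampling error of a single item's vote share. The sketch below is an illustration, not something taken from the paper: it assumes each rating is an independent yes/no draw with the same underlying probability, and the function name `vote_share_standard_error` is hypothetical.

```python
import math


def vote_share_standard_error(p: float, k: int) -> float:
    """Standard error of the observed 'yes' share from k independent raters.

    Assumption (illustration only): ratings are binary and drawn
    independently with true 'yes' probability p.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return math.sqrt(p * (1.0 - p) / k)


# Compare a few crowd sizes on an evenly split item (p = 0.5).
for k in (3, 5, 10, 50):
    print(k, round(vote_share_standard_error(0.5, k), 3))
```

Under these assumptions, an evenly split item has a standard error of about 0.22 with 5 raters and about 0.07 with 50, which is why small crowds give only a coarse picture of how opinion is distributed on any one item.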

Method

A simulator based on real-world subjective datasets (e.g., toxicity) was used to test thousands of combinations of item scale (*N*) and raters per item (*K*) to find statistically reliable configurations.
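
The general idea of stress-testing an (*N*, *K*) configuration can be sketched in a few lines. This is not the authors' open-source simulator: it is a toy under stated assumptions, using synthetic items whose latent "positive" rate is drawn from a Beta(2, 2) distribution instead of the real datasets, and measuring reproducibility as the spread of one metric (the share of items with a positive majority vote) across repeated trials. The function name and all parameters are hypothetical.

```python
import random
import statistics


def simulate_metric_spread(n_items, k_raters, trials=200, seed=0):
    """Std. dev. across trials of the majority-vote positive rate.

    Each trial draws n_items synthetic items and k_raters binary
    ratings per item. A lower spread means the metric is more
    reproducible from one annotation run to the next.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        positives = 0
        for _ in range(n_items):
            # Assumed latent share of raters who would label this item positive.
            p = rng.betavariate(2.0, 2.0)
            votes = sum(rng.random() < p for _ in range(k_raters))
            # Strict majority; ties (possible when k_raters is even) count as negative.
            if 2 * votes > k_raters:
                positives += 1
        estimates.append(positives / n_items)
    return statistics.pstdev(estimates)


# Four ways to spend the same 1,000-annotation budget.
for n, k in [(1000, 1), (200, 5), (100, 10), (20, 50)]:
    print(f"N={n:>4} K={k:>2} spread={simulate_metric_spread(n, k):.4f}")
```

In this toy setup the spread of a majority-vote metric is driven mainly by *N*, which matches the "forest" intuition. A metric that depends on the per-item distribution of opinion would need a different statistic inside the loop.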

In practice

Use more than 10 raters per item when the goal is to capture human nuance.
Prioritize more items for majority-vote accuracy metrics.
Prioritize more raters per item for capturing opinion variation.
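
To make the trade-off in the list above concrete, a fixed annotation budget can be enumerated into candidate (*N*, *K*) splits. The helper below is a hypothetical sketch: the 1,000-annotation budget echoes the figure in the summary, while the specific *K* values are chosen only for illustration.

```python
def candidate_splits(budget: int, k_values: list[int]) -> list[tuple[int, int]]:
    """Return (n_items, k_raters) pairs whose product stays within budget."""
    splits = []
    for k in k_values:
        if k < 1:
            raise ValueError("each k must be at least 1")
        n = budget // k  # largest item count affordable at this crowd size
        if n > 0:
            splits.append((n, k))
    return splits


# "Forest" splits come first (many items, few raters), "tree" splits last.
for n, k in candidate_splits(1000, [1, 5, 10, 20, 50]):
    print(f"N={n:>4} items x K={k:>2} raters = {n * k} annotations")
```

Each resulting pair can then be evaluated against the metric you care about, for example with the open-source simulator, before any annotation money is spent.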

Topics

AI Benchmarking
Human Disagreement
Reproducibility
Rater Optimization
Evaluation Framework

Best for: AI Engineer, NLP Engineer, Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The latest research from Google.