Building better AI benchmarks: How many raters are enough?

Source: The latest research from Google · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Advanced, medium

Summary

Google Research scientists Flip Korn and Chris Welty introduced an evaluation framework for machine learning models that optimizes the trade-off between the number of items (*N*) and the number of raters per item (*K*), with the goal of highly reproducible AI benchmarks. Their paper, "Forest vs Tree: The (*N*,*K*) Trade-off in Reproducible ML Evaluation," published on March 31, 2026, challenges the common practice of using 1-5 raters per item, which often fails to capture human disagreement. They developed an open-source simulator to stress-test combinations of item scale (*N*, from 100 to 50,000) and crowd size (*K*, from 1 to 500 raters per item) across datasets such as Toxicity, DICES, D3code, and Jobs. Key findings indicate that 3-5 raters are often insufficient and that the best split between items and raters depends on the evaluation metric: a "forest" design (more items, fewer raters) suits majority-vote accuracy, while a "tree" design (fewer items, more raters) suits metrics that capture the full range of human opinion and nuance. The research suggests that reproducible results are achievable with around 1,000 total annotations, provided that budget is divided between items and raters appropriately.

Key takeaway

If you are an AI engineer designing evaluation benchmarks for subjective tasks, re-evaluate your raters-per-item strategy. If your goal is to capture the full spectrum of human opinion rather than just a majority vote, use more raters per item than the typical 3-5. The open-source simulator can help you allocate your annotation budget for reproducibility, so that your models are evaluated against a more realistic and nuanced "ground truth."
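One way to build intuition for why 3-5 raters can be too few to describe disagreement on a single item is the binomial standard error of a per-item proportion. The sketch below is illustrative only and is not taken from the paper: it assumes raters respond independently, and the function name `proportion_standard_error` and the evenly split item (p = 0.5) are hypothetical choices.

```python
import math


def proportion_standard_error(p, k_raters):
    """Binomial standard error of a per-item proportion estimated from K raters.

    Assumes independent raters; `p` is the (hypothetical) true share of raters
    who would label the item positive.
    """
    return math.sqrt(p * (1 - p) / k_raters)


# Hypothetical item on which raters are evenly split (p = 0.5).
for k in (3, 5, 25, 100):
    se = proportion_standard_error(0.5, k)
    print(f"K={k:>3}  standard error={se:.3f}")
```

Under these assumptions, the uncertainty of the estimated split shrinks only with the square root of *K*, so a handful of raters leaves the per-item distribution poorly resolved.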

Key insights

Optimizing rater-to-item ratios is crucial for reproducible AI benchmarks that capture human disagreement.

Method

A simulator based on real-world subjective datasets (e.g., toxicity) was used to test thousands of combinations of item scale (*N*) and raters per item (*K*) to find statistically reliable configurations.
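The general idea of sweeping (*N*, *K*) combinations can be sketched as follows. This is a toy illustration, not the authors' open-source simulator: the item pool is synthetic rather than drawn from a real dataset, the statistic (share of items with a positive majority vote) is a stand-in for a real evaluation metric, and the names `majority_positive_rate` and `run_to_run_spread` are hypothetical.

```python
import random
import statistics


def majority_positive_rate(item_probs, k_raters, rng):
    """Share of items whose simulated K-rater majority vote is positive."""
    positives = 0
    for p in item_probs:
        votes = sum(rng.random() < p for _ in range(k_raters))
        if 2 * votes > k_raters:  # strict majority; ties count as negative
            positives += 1
    return positives / len(item_probs)


def run_to_run_spread(pool, n_items, k_raters, trials=100, seed=0):
    """Standard deviation of the statistic across repeated (N, K) samples."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        sampled = rng.choices(pool, k=n_items)  # draw N items with replacement
        scores.append(majority_positive_rate(sampled, k_raters, rng))
    return statistics.pstdev(scores)


# Synthetic pool (assumption): each value is the chance a random rater says "positive".
pool_rng = random.Random(42)
pool = [pool_rng.random() for _ in range(5000)]

# Four ways to spend the same budget of 1,000 annotations (N * K = 1000).
designs = [(1000, 1), (200, 5), (100, 10), (20, 50)]
spread = {(n, k): run_to_run_spread(pool, n, k) for n, k in designs}

for (n, k), s in spread.items():
    print(f"N={n:>4}  K={k:>2}  spread={s:.4f}")
```

A smaller spread means the statistic is more reproducible across repeated annotation runs. In this toy setup the many-items designs tend to give a tighter spread for the majority-vote statistic; a statistic that depends on the per-item response distribution would need its own scoring function and could favor a different split.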

Best for: AI Engineer, NLP Engineer, Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The latest research from Google.