Metric-Dependent Annotation Saturation for Learning from Label Distributions

2026-06-23 · Source: Apple Machine Learning Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

A study of metric-dependent annotation saturation finds that the number of annotators needed to capture disagreement signal depends strongly on the evaluation metric. The researchers fine-tuned NLI models on label distributions from ChaosNLI, a dataset with 100 annotator judgments per item. Distributional match (KL divergence) saturated early, by N ≈ 10, reaching 87–95% of the potential improvement across five model seeds. Entropy correlation, which measures a model's ability to identify the items that elicit disagreement, did not converge until N ≈ 20–50. The study also confirms that soft labels carry item-specific signal that label smoothing lacks: soft labels reached r = 0.643 (p < 0.001), compared with r ≈ 0.45–0.49 for smoothing. This advantage held across DeBERTa and RoBERTa architectures and in a cross-domain content safety evaluation.
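The two metrics named above can be sketched in a few lines. This is a minimal illustration, not the study's evaluation code: it assumes entropy correlation is the Pearson correlation between per-item human and model entropies, and that distributional match is the mean KL(human || model). The arrays `human` and `model` are made-up values for three items.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of each row of a probability matrix."""
    p = np.clip(p, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def entropy_correlation(human, model):
    """Pearson r between per-item human entropy and model entropy (assumed definition)."""
    return float(np.corrcoef(entropy(human), entropy(model))[0, 1])

def mean_kl(human, model, eps=1e-12):
    """Mean KL(human || model) over items (direction is an assumption)."""
    h = np.clip(human, eps, 1.0)
    m = np.clip(model, eps, 1.0)
    return float((h * np.log(h / m)).sum(axis=1).mean())

# Hypothetical distributions over (entailment, neutral, contradiction).
human = np.array([[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.60, 0.30, 0.10]])
model = np.array([[0.80, 0.10, 0.10], [0.50, 0.30, 0.20], [0.70, 0.20, 0.10]])

print("entropy correlation:", entropy_correlation(human, model))
print("mean KL:", mean_kl(human, model))
```

The two numbers answer different questions: KL asks how close each predicted distribution is to the human one, while entropy correlation asks only whether the model ranks items by ambiguity the way annotators do.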

Key takeaway

Machine learning engineers managing annotation projects should size the annotation budget to the target evaluation metric. Capturing which items elicit disagreement (e.g., via entropy correlation) takes more annotators (N ≈ 20–50) than matching the label distribution (N ≈ 10). Prefer soft labels over label smoothing to preserve item-specific signal, especially for ambiguous items.

Key insights

Annotation saturation is metric-dependent, with soft labels outperforming label smoothing for capturing disagreement signal.

Principles

Disagreement signal capture is metric-dependent
Soft labels preserve item-specific signal
Label smoothing cannot distinguish ambiguous items
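A small numeric sketch of the last two principles, using invented vote counts for two items with the same majority label. With a fixed smoothing factor, label smoothing produces the same target for both items, so the target entropy is identical; the soft labels keep the difference. The `eps=0.1` value and the vote counts are assumptions for illustration only.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of each row; rows must be strictly positive."""
    return -(p * np.log(p)).sum(axis=1)

def smooth(one_hot, eps=0.1):
    """Standard label smoothing: mix the one-hot target with a uniform distribution."""
    num_classes = one_hot.shape[1]
    return one_hot * (1.0 - eps) + eps / num_classes

# Hypothetical annotator votes: item 0 is near-unanimous, item 1 is contested.
votes = np.array([[95, 3, 2], [40, 35, 25]])

soft = votes / votes.sum(axis=1, keepdims=True)   # item-specific soft labels
one_hot = np.eye(3)[votes.argmax(axis=1)]         # majority-vote labels
smoothed = smooth(one_hot)                        # same target for both items

soft_entropy = entropy(soft)
smoothed_entropy = entropy(smoothed)
print("soft-label entropy:    ", soft_entropy)
print("smoothed-label entropy:", smoothed_entropy)
```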

Method

Fine-tune NLI models on subsampled ChaosNLI label distributions, evaluating convergence using entropy correlation and KL divergence.
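The subsampling step can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: the data is synthetic (a stand-in for ChaosNLI's 100 judgments per item), no model is fine-tuned, and the two metrics are computed between the subsampled and full label distributions rather than between model predictions and human labels. KL is taken as KL(subsample || full) so that it stays finite when a class receives no votes in the subsample.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ITEMS, NUM_ANNOTATORS, NUM_CLASSES = 200, 100, 3

# Synthetic stand-in for ChaosNLI (assumption): one label per annotator per item.
true_dist = rng.dirichlet(np.ones(NUM_CLASSES), size=NUM_ITEMS)
labels = np.stack([rng.choice(NUM_CLASSES, size=NUM_ANNOTATORS, p=p) for p in true_dist])

def to_distribution(lab):
    """Turn an (items, annotators) label array into per-item class frequencies."""
    counts = np.stack([(lab == c).sum(axis=1) for c in range(NUM_CLASSES)], axis=1)
    return counts / counts.sum(axis=1, keepdims=True)

def subsample(lab, n):
    """Keep n annotators per item, drawn without replacement."""
    order = np.argsort(rng.random(lab.shape), axis=1)[:, :n]
    return np.take_along_axis(lab, order, axis=1)

def entropy(p):
    """Shannon entropy per row, treating 0 * log(0) as 0."""
    safe = np.where(p > 0, p, 1.0)
    return -(p * np.log(safe)).sum(axis=1)

def mean_kl(p, q):
    """Mean KL(p || q) over items; terms with p == 0 contribute zero."""
    safe_p = np.where(p > 0, p, 1.0)
    safe_q = np.where(p > 0, q, 1.0)
    return float((p * np.log(safe_p / safe_q)).sum(axis=1).mean())

full = to_distribution(labels)
results = {}
for n in (3, 10, 20, 50, 100):
    sub = to_distribution(subsample(labels, n))
    r = float(np.corrcoef(entropy(sub), entropy(full))[0, 1])
    results[n] = (mean_kl(sub, full), r)
    print(f"N={n:3d}  KL(sub || full)={results[n][0]:.4f}  entropy r={results[n][1]:.3f}")
```

In the study's setup, each subsampled distribution would instead serve as the training target for a fine-tuned NLI model, with the metrics computed on that model's predictions.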

In practice

Inform annotation budgets by target metric
Prefer soft labels over label smoothing
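One way to act on the second point is to train against the annotator distribution directly, using cross-entropy with soft targets. The sketch below is a framework-free NumPy version of that loss; the function names and target values are hypothetical, and the source does not state which loss the study used.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def soft_cross_entropy(logits, targets):
    """Mean cross-entropy between soft targets and the model's softmax output."""
    return float(-(targets * log_softmax(logits)).sum(axis=1).mean())

# Hypothetical soft labels for two items (rows sum to 1).
targets = np.array([[0.70, 0.20, 0.10], [0.40, 0.35, 0.25]])

# Logits that reproduce the targets exactly, versus uninformative uniform logits.
matched = soft_cross_entropy(np.log(targets), targets)
uniform = soft_cross_entropy(np.zeros_like(targets), targets)
target_entropy = float(-(targets * np.log(targets)).sum(axis=1).mean())

print("loss when prediction matches targets:", matched)
print("loss for uniform prediction:         ", uniform)
```

The loss is minimized when the predicted distribution equals the soft label, at which point it equals the target entropy, so the model is rewarded for reproducing per-item ambiguity rather than for maximal confidence.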

Topics

Annotation Saturation
Label Distributions
NLI Models
ChaosNLI
Soft Labels
Evaluation Metrics
Label Smoothing

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.