Metric-Dependent Annotation Saturation for Learning from Label Distributions
Summary
The number of annotators needed to capture disagreement signal depends strongly on the evaluation metric, according to this study of annotation saturation. Researchers fine-tuned NLI models on label distributions from ChaosNLI, a dataset with 100 annotator judgments per item. Entropy correlation, which measures a model's ability to identify the items that elicit disagreement, converged only at N ≈ 20–50 annotators. In contrast, distributional match (KL divergence) saturated much earlier, by N ≈ 10, achieving 87–95% of the potential improvement across five model seeds. The work also confirms that soft labels provide item-specific signal that label smoothing does not, with soft labels reaching r = 0.643 (p < 0.001) compared to r ≈ 0.45–0.49 for smoothing. This advantage was consistent across DeBERTa and RoBERTa architectures and in a cross-domain content safety evaluation.
Key takeaway
Machine learning engineers managing data annotation projects should align the annotation budget with the target evaluation metric. Capturing nuanced disagreement (e.g., via entropy correlation) takes more annotators (N ≈ 20–50) than matching the label distribution (N ≈ 10). Prefer soft labels over label smoothing to preserve item-specific signal, especially for ambiguous items.
Key insights
Annotation saturation is metric-dependent, with soft labels outperforming label smoothing for capturing disagreement signal.
Principles
- Disagreement signal capture is metric-dependent
- Soft labels preserve item-specific signal
- Label smoothing cannot distinguish ambiguous items
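The last two principles can be illustrated with a small sketch. It assumes the usual formulation of label smoothing (the majority class gets 1 - eps plus a uniform share of eps) and uses two made-up vote counts; the function names and numbers are illustrative and do not come from the study.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def soft_label(counts):
    """Normalize per-class annotator vote counts into a distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def smoothed_label(counts, eps=0.1):
    """Label smoothing around the majority class: (1 - eps) * one_hot + eps / k."""
    k = len(counts)
    majority = max(range(k), key=lambda i: counts[i])
    return [(1 - eps) * (1.0 if i == majority else 0.0) + eps / k for i in range(k)]

# Hypothetical vote counts over (entailment, neutral, contradiction).
clear_item = [95, 3, 2]        # annotators mostly agree
ambiguous_item = [40, 35, 25]  # annotators are split

for name, counts in [("clear", clear_item), ("ambiguous", ambiguous_item)]:
    print(name,
          "soft entropy:", round(entropy(soft_label(counts)), 3),
          "smoothed entropy:", round(entropy(smoothed_label(counts)), 3))
```

Both items share the same majority class, so their smoothed targets are identical and carry the same entropy, while the soft labels keep the two items apart.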
Method
Fine-tune NLI models on subsampled ChaosNLI label distributions, evaluating convergence using entropy correlation and KL divergence.
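The subsampling step can be sketched as follows. This is a label-level illustration only: it compares subsampled label distributions against the full 100-vote distribution, whereas the study evaluates fine-tuned models. The synthetic votes, the additive smoothing constant `alpha`, and all function names are assumptions made for this example.

```python
import math
import random

LABELS = 3  # entailment, neutral, contradiction

def distribution(votes, k=LABELS, alpha=0.0):
    """Turn a list of class indices into a distribution (optional additive smoothing)."""
    counts = [0] * k
    for v in votes:
        counts[v] += 1
    total = len(votes) + alpha * k
    return [(c + alpha) / total for c in counts]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q):
    """KL(p || q); q must be positive wherever p is positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def synthetic_item(rng, annotators=100):
    """Assumed stand-in for one ChaosNLI item: 100 votes from random class weights."""
    weights = [rng.random() for _ in range(LABELS)]
    return rng.choices(range(LABELS), weights=weights, k=annotators)

def subsample_metrics(items, n, rng):
    """Mean KL to the full distribution and entropy correlation at budget n."""
    kls, h_full, h_sub = [], [], []
    for votes in items:
        full = distribution(votes)
        # alpha > 0 keeps KL finite when a class is missing from the subsample
        sub = distribution(rng.sample(votes, n), alpha=0.5)
        kls.append(kl(full, sub))
        h_full.append(entropy(full))
        h_sub.append(entropy(sub))
    return sum(kls) / len(kls), pearson(h_full, h_sub)

rng = random.Random(0)
items = [synthetic_item(rng) for _ in range(200)]
for n in (5, 10, 20, 50, 100):
    mean_kl, r = subsample_metrics(items, n, rng)
    print(f"N={n:3d}  mean KL={mean_kl:.4f}  entropy r={r:.3f}")
```

Plotting the two columns against N is one way to see where each metric flattens for a given dataset; the saturation points reported in the study come from model evaluation and will not be reproduced by this synthetic data.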
In practice
- Inform annotation budgets by target metric
- Prefer soft labels over label smoothing
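Training on soft labels usually means replacing the one-hot target in cross-entropy with the annotator distribution. The sketch below shows that loss in plain Python; it is a common formulation and an assumption here, since this summary does not state the exact training objective used in the study.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def soft_cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)) if t > 0)

# Hypothetical item: 40/35/25 split over (entailment, neutral, contradiction).
target = [0.40, 0.35, 0.25]
uniform_logits = [0.0, 0.0, 0.0]
matched_logits = [math.log(t) for t in target]  # softmax of these equals the target

print("loss at uniform prediction:", soft_cross_entropy(uniform_logits, target))
print("loss at matched prediction:", soft_cross_entropy(matched_logits, target))
```

The loss is smallest when the predicted distribution equals the annotator distribution, which is why this objective pushes the model toward item-specific uncertainty rather than a fixed smoothing amount.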
Topics
- Annotation Saturation
- Label Distributions
- NLI Models
- ChaosNLI
- Soft Labels
- Evaluation Metrics
- Label Smoothing
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.