Metric-Dependent Annotation Saturation for Learning from Label Distributions
Summary
A study of annotation saturation finds that the number of human annotators needed to capture the signal in label disagreement depends strongly on the evaluation metric. The researchers fine-tuned Natural Language Inference (NLI) models on label distributions subsampled from ChaosNLI, a dataset with 100 independent annotator judgments per item. On the 3-class NLI task, entropy correlation, which measures how well a model identifies the items that elicit disagreement, needed N ~ 20-50 annotators to converge. Distributional match, measured by KL divergence, saturated much earlier: N ~ 10 annotators captured 87-95% of the potential improvement across five model seeds. The work also confirms that soft labels carry item-specific signal that label smoothing does not, with soft labels reaching an entropy correlation of r = 0.643 compared to r ~ 0.45-0.49 for smoothing. This advantage holds across DeBERTa and RoBERTa architectures and in a cross-domain content safety evaluation.
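The two metrics named above can be sketched in a few lines. This is a minimal illustration, not the study's evaluation code: it assumes that entropy correlation is the Pearson correlation between the entropies of the human label distributions and the entropies of the model's predicted distributions, and that distributional match is KL(human || model) averaged over items. The function names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q, floor=1e-12):
    """KL(p || q) in nats; q is floored to avoid log(0)."""
    return sum(x * math.log(x / max(y, floor)) for x, y in zip(p, q) if x > 0)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def entropy_correlation(human_dists, model_dists):
    """Assumed definition: correlation of per-item entropies (human vs. model)."""
    return pearson([entropy(p) for p in human_dists],
                   [entropy(q) for q in model_dists])

def mean_kl(human_dists, model_dists):
    """Assumed definition: KL(human || model) averaged over items."""
    pairs = list(zip(human_dists, model_dists))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)
```

The two metrics answer different questions: mean KL rewards matching each distribution closely, while entropy correlation only asks whether the model is more uncertain on the items where humans disagree more.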
Key takeaway
Machine learning engineers designing annotation strategies should fix the target evaluation metric before setting an annotation budget. In this study, KL divergence saturates with fewer annotators (N ~ 10) than entropy correlation (N ~ 20-50), which directly affects resource allocation. Prefer soft labels over traditional label smoothing when the goal is to capture item-specific signal, especially on ambiguous items.
Key insights
Annotation budget needs depend on the target evaluation metric, as different metrics saturate at varying annotator counts.
Principles
- Annotator disagreement provides valuable signal.
- Soft labels outperform label smoothing for ambiguity.
- Evaluation metrics dictate annotation saturation points.
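The difference between soft labels and label smoothing can be made concrete with a small sketch. It assumes the standard label smoothing formula (mix the one-hot majority label with a uniform distribution using weight eps) and treats a soft label as the normalized annotator vote counts. The function names, the eps value, and the vote counts are illustrative, not taken from the study.

```python
import math

def label_smoothing_target(majority_index, num_classes=3, eps=0.1):
    """One-hot majority label mixed with a uniform distribution (weight eps)."""
    base = eps / num_classes
    return [base + (1.0 - eps) if k == majority_index else base
            for k in range(num_classes)]

def soft_label_target(vote_counts):
    """Empirical label distribution from per-class annotator vote counts."""
    total = sum(vote_counts)
    return [c / total for c in vote_counts]

def soft_cross_entropy(target, predicted, floor=1e-12):
    """Cross-entropy of a predicted distribution against a soft target."""
    return -sum(t * math.log(max(p, floor))
                for t, p in zip(target, predicted) if t > 0)

# Two made-up items with the same majority class but different agreement.
clear_item = [95, 3, 2]        # near-unanimous votes
ambiguous_item = [40, 35, 25]  # contested votes

# Smoothing yields the same target for both items.
print(label_smoothing_target(0), label_smoothing_target(0))
# Soft labels differ per item, so the target encodes the disagreement.
print(soft_label_target(clear_item), soft_label_target(ambiguous_item))
```

Because the smoothed target depends only on the majority class, it cannot tell the model which items are contested. That is the item-specific signal the soft label keeps.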
Method
Fine-tune NLI models on label distributions subsampled from a dataset with 100 annotator judgments per item, then identify metric-dependent saturation points.
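The subsampling step can be sketched as follows. This assumes votes are drawn without replacement from each item's pool of 100 judgments; the summary does not say how the study sampled, so treat that as an assumption. The label names, vote pool, and function name are made up for illustration.

```python
import random
from collections import Counter

LABELS = ("entailment", "neutral", "contradiction")

def subsample_soft_label(votes, n, rng):
    """Draw n votes without replacement and return the label distribution."""
    sample = rng.sample(votes, n)
    counts = Counter(sample)
    return [counts[label] / n for label in LABELS]

# Hypothetical item with 100 annotator votes.
votes = ["entailment"] * 60 + ["neutral"] * 30 + ["contradiction"] * 10
rng = random.Random(0)  # fixed seed so the sketch is repeatable

for n in (1, 5, 10, 20, 50, 100):
    print(n, subsample_soft_label(votes, n, rng))
```

With n = 100 the function returns the full-pool distribution. Smaller n gives noisier estimates, and that noise is what the saturation analysis measures against each metric.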
In practice
- Align annotation budgets with target metrics.
- Prioritize soft labels over smoothing for NLI.
- Test saturation for specific evaluation goals.
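One way to test saturation for a given metric is to measure it at several annotator counts, then find the smallest N that captures a chosen fraction of the total improvement. The sketch below assumes the improvement is measured between the smallest and largest N tested. The curve values are invented placeholders rather than results from the study, and the function names and the 0.9 threshold are hypothetical.

```python
def fraction_of_gain(curve, n):
    """Share of the total change (smallest N to largest N) achieved at n."""
    ns = sorted(curve)
    start, end = curve[ns[0]], curve[ns[-1]]
    if end == start:
        raise ValueError("metric does not change across annotator counts")
    # The ratio is sign-agnostic, so it works for metrics where lower is
    # better (KL divergence) and where higher is better (correlation).
    return (curve[n] - start) / (end - start)

def saturation_point(curve, threshold=0.9):
    """Smallest tested N whose fraction of gain reaches the threshold."""
    for n in sorted(curve):
        if fraction_of_gain(curve, n) >= threshold:
            return n
    return None

# Placeholder numbers for illustration only: metric value per annotator count.
toy_curve = {1: 0.30, 5: 0.16, 10: 0.11, 20: 0.10, 50: 0.10, 100: 0.10}
print(saturation_point(toy_curve))
```

Running the same check separately for each metric of interest shows whether the annotation budget can stop early or needs to extend to larger N.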
Topics
- Annotation Saturation
- Label Distributions
- Natural Language Inference
- Soft Labels
- Evaluation Metrics
- ChaosNLI
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.