Metric-Dependent Annotation Saturation for Learning from Label Distributions

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

A study of metric-dependent annotation saturation finds that the number of human annotators needed to capture the signal in label disagreement depends on the evaluation metric. The researchers fine-tuned Natural Language Inference (NLI) models on label distributions subsampled from ChaosNLI, a dataset with 100 independent annotator judgments per item. On this 3-class NLI task, entropy correlation, which measures how well a model identifies the items that elicit disagreement, converged only at N ~ 20-50 annotators. Distributional match, measured by KL divergence, saturated much earlier: N ~ 10 annotators achieved 87-95% of the potential improvement across five model seeds. The study also confirms that soft labels carry item-specific signal that label smoothing does not: soft labels reached an entropy correlation of r = 0.643, versus r ~ 0.45-0.49 for smoothing. This advantage holds across DeBERTa and RoBERTa architectures and in a cross-domain content safety evaluation.

Key takeaway

Machine learning engineers designing annotation strategies should fix the target evaluation metric before setting an annotation budget. KL divergence saturates with fewer annotators (N ~ 10) than entropy correlation (N ~ 20-50), so the choice of metric directly determines how many annotations per item are worth paying for. Prefer soft labels to label smoothing: soft labels carry item-specific signal about which cases are ambiguous, which uniform smoothing cannot provide.
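
The contrast between soft labels and label smoothing can be illustrated with a minimal NumPy sketch. The vote counts, the smoothing strength `alpha=0.1`, and all function names here are hypothetical and chosen for illustration; they are not taken from the study.

```python
import numpy as np

def soft_label_targets(counts):
    """Normalize per-item annotator vote counts into a target distribution."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=-1, keepdims=True)

def label_smoothing_targets(counts, alpha=0.1):
    """Smooth the one-hot majority label with a uniform distribution.

    alpha is an assumed smoothing strength, not a value from the source.
    """
    counts = np.asarray(counts)
    k = counts.shape[-1]
    one_hot = np.eye(k)[counts.argmax(axis=-1)]
    return (1.0 - alpha) * one_hot + alpha / k

def cross_entropy(logits, targets):
    """Mean cross-entropy between softmax(logits) and target distributions."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -(targets * log_probs).sum(axis=-1).mean()

# Two hypothetical items with the same majority class:
# one near-unanimous, one highly contested.
votes = np.array([[95, 3, 2],
                  [40, 35, 25]])

soft = soft_label_targets(votes)         # differs per item
smooth = label_smoothing_targets(votes)  # identical for both items

# Loss of a uniform prediction against the soft targets.
loss = cross_entropy(np.zeros((2, 3)), soft)
```

In this sketch both items receive the same smoothed target because they share a majority class, whereas the soft targets preserve how contested each item is. That per-item difference is the signal the summary attributes to soft labels.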

Key insights

Annotation budget needs depend on the target evaluation metric, as different metrics saturate at varying annotator counts.

Method

Fine-tune NLI models on label distributions subsampled from a dataset with 100 annotator judgments per item, then identify metric-dependent saturation points.
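
The subsampling step and the two metrics can be sketched as follows. This is a rough illustration under stated assumptions, not the study's pipeline: the vote counts are synthetic rather than drawn from ChaosNLI, no model is trained, and the metrics compare the subsampled label distributions directly against the full 100-vote distributions. The function names and the grid of N values are hypothetical.

```python
import numpy as np

def subsample_soft_labels(counts, n, rng):
    """Draw n votes per item without replacement; return label distributions."""
    out = np.empty(counts.shape, dtype=float)
    for i, row in enumerate(counts):
        out[i] = rng.multivariate_hypergeometric(row, n) / n
    return out

def entropy(p, eps=1e-12):
    """Shannon entropy (nats) of each row of p."""
    p = np.clip(p, eps, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def kl_divergence(p, q, eps=1e-12):
    """Row-wise KL(p || q); assumes q > 0 wherever p > 0."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return (p * (np.log(p) - np.log(q))).sum(axis=-1)

rng = np.random.default_rng(0)

# Synthetic stand-in for a 3-class dataset with 100 votes per item.
true_p = rng.dirichlet([0.5, 0.5, 0.5], size=500)
full_counts = np.array([rng.multinomial(100, p) for p in true_p])
full = full_counts / 100.0

results = {}
for n in (5, 10, 20, 50, 100):
    sub = subsample_soft_labels(full_counts, n, rng)
    # KL(sub || full) is finite because every subsampled vote exists in the full pool.
    mean_kl = kl_divergence(sub, full).mean()
    # Pearson correlation between subsampled and full-pool entropies.
    r = np.corrcoef(entropy(sub), entropy(full))[0, 1]
    results[n] = (mean_kl, r)
    print(f"N={n:3d}  mean KL={mean_kl:.4f}  entropy r={r:.3f}")
```

Plotting both columns against N shows where each metric flattens on whatever data is supplied. The saturation points reported in the study come from evaluating fine-tuned models, so this label-only sketch should not be expected to reproduce them.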

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.