Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025
Summary
A large-scale assessment of human annotation reporting in Natural Language Processing (NLP) between 2018 and 2025 reveals significant gaps in documentation. The study covers 2,667 annotation tasks from 1,603 ACL-venue papers and uses an LLM-assisted extraction pipeline validated against a human-adjudicated gold standard. The pipeline reaches a Krippendorff's alpha of 0.606, comparable to human-human agreement (0.585). The authors found that operational details such as recruitment strategies, annotator expertise, and annotation volume are frequently reported. Information that is crucial for assessing annotation validity (training, language proficiency, compensation, socio-demographics, adjudication, and agreement values) is often omitted, particularly in model-evaluation studies. Reporting practices have improved over time but remain inconsistent. The work introduces a unified taxonomy and provides bare-minimum reporting recommendations to enhance the reliability, reproducibility, and interpretability of human annotation in NLP.
Key takeaway
If you are an NLP engineer or data scientist relying on human-annotated datasets, make sure your projects rigorously document their annotation processes. When details such as annotator training, language proficiency, compensation, and agreement values are omitted, it becomes harder to assess data validity and reproduce results. Adopting the paper's bare-minimum reporting recommendations improves the reliability and interpretability of your human-labeled data and makes your model evaluations more robust and transparent.
Key insights
Human annotation reporting in NLP often omits the details needed to judge validity, which makes the reliability of the annotations hard to assess.
Principles
- Annotation reporting is uneven across NLP research.
- Validity details are frequently omitted.
- Standardized reporting improves reproducibility.
Method
An LLM-assisted pipeline extracts annotation details using a unified taxonomy. It was validated against a human-adjudicated gold standard, then applied to 1,603 ACL-venue papers (2018-2025) to construct Annotated-llm.
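The validation step compares pipeline output with gold labels using Krippendorff's alpha. As a rough illustration of what that statistic measures, here is a minimal sketch of alpha for nominal (categorical) labels. The function name and input layout are assumptions made for this example; this is not the authors' evaluation code, and the study may have used a different distance metric.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of items; each item is a list with one label per
    annotator, using None where an annotator did not label the item.
    (Hypothetical input layout chosen for this sketch.)
    """
    # Build the coincidence matrix: every ordered pair of labels given to
    # the same item by different annotators, weighted by 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        present = [v for v in labels if v is not None]
        m = len(present)
        if m < 2:
            continue  # items with fewer than two labels carry no pairing information
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the total number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")

    return 1.0 - observed / expected


if __name__ == "__main__":
    # Two annotators (e.g. pipeline vs. gold) on three items, one label missing.
    example = [["yes", "yes"], ["no", "no"], ["yes", None]]
    print(krippendorff_alpha_nominal(example))
```

An alpha of 1 means perfect agreement, 0 means agreement at chance level, and negative values mean systematic disagreement.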
In practice
- Use the proposed annotation-reporting taxonomy.
- Implement bare-minimum reporting recommendations.
- Document annotator training and compensation.
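One way to act on these points is to keep a small structured record per annotation task and check it for gaps before publication. The sketch below is illustrative only: the class name and field names are hypothetical, taken from the reporting dimensions named in the summary, and do not reproduce the paper's actual taxonomy or recommendation list.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AnnotationReport:
    """Hypothetical per-task reporting record (not the paper's schema)."""

    # Operational details that the summary says are frequently reported.
    recruitment: Optional[str] = None
    expertise: Optional[str] = None
    volume: Optional[str] = None
    # Validity details that the summary says are often omitted.
    training: Optional[str] = None
    language_proficiency: Optional[str] = None
    compensation: Optional[str] = None
    socio_demographics: Optional[str] = None
    adjudication: Optional[str] = None
    agreement: Optional[str] = None


def missing_fields(report: AnnotationReport) -> List[str]:
    """Return the names of fields that are still empty, in declaration order."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]


if __name__ == "__main__":
    report = AnnotationReport(
        recruitment="crowdsourcing platform",
        volume="1,000 items, 3 labels each",
    )
    for name in missing_fields(report):
        print(f"not yet documented: {name}")
```

Free-text fields keep the record easy to fill in; a stricter version could use enumerations or numeric fields where the reporting taxonomy defines fixed categories.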
Topics
- Human Annotation
- NLP Data Quality
- Reporting Standards
- LLM-assisted Extraction
- Model Evaluation
- Research Reproducibility
Best for: Research Scientist, AI Scientist, NLP Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.