Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

2026-06-01 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A large-scale assessment of how human annotation is reported in Natural Language Processing (NLP) research between 2018 and 2025 reveals significant gaps in documentation. The study covers 2,667 annotation tasks from 1,603 ACL-venue papers and uses an LLM-assisted extraction pipeline validated against a human-adjudicated gold standard. The pipeline reaches a Krippendorff's alpha of 0.606, comparable to human-human agreement (0.585). Operational details such as recruitment strategies, annotator expertise, and annotation volume are frequently reported. Information needed to assess annotation validity, such as training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, is often omitted, particularly in model-evaluation studies. Reporting practices have improved over time but remain inconsistent. The work introduces a unified taxonomy and bare-minimum reporting recommendations intended to improve the reliability, reproducibility, and interpretability of human annotation in NLP.

Key takeaway

NLP engineers and data scientists who rely on human-annotated datasets should document their annotation processes rigorously. When details such as annotator training, language proficiency, compensation, and agreement values are omitted, it becomes harder to assess data validity and to reproduce results. Adopting the bare-minimum reporting recommendations makes human-labeled data more reliable and interpretable, and model evaluations more robust and transparent.

Key insights

Reporting of human annotation in NLP often omits the details needed to judge validity, which undermines reliability.

Principles

Annotation reporting is uneven across NLP research.
Validity details are frequently omitted.
Standardized reporting improves reproducibility.

Method

An LLM-assisted pipeline extracts annotation details using a unified taxonomy. It was validated against a human-adjudicated gold standard, then applied to 1,603 ACL-venue papers (2018-2025) to construct Annotated-llm.
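
The validation step above relies on Krippendorff's alpha as the agreement measure. As a rough illustration of what that number means, here is a minimal sketch of alpha for nominal (categorical) labels. The function name, the input layout, and the choice of the nominal variant are assumptions made for this example; the summary does not say which variant or implementation the authors used.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (illustrative sketch).

    `units` is a list with one inner list per annotated item, holding the
    labels assigned by the different annotators. Use None for a missing
    label. This layout is an assumption of the example, not the paper's.
    """
    # Coincidence matrix: every ordered pair of labels within one item
    # contributes 1 / (m - 1), where m is the number of labels on that item.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two labels cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement (unnormalized, nominal metric).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    )
    if expected == 0:
        raise ValueError("alpha is undefined when fewer than two labels occur")
    return 1.0 - (n - 1) * observed / expected
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the reported 0.606 and 0.585 both indicate moderate agreement on this scale.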

In practice

Use the proposed annotation-reporting taxonomy.
Implement bare-minimum reporting recommendations.
Document annotator training and compensation.
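
One way to act on the items above is to keep a small structured record per annotation task and check it for gaps before publishing. The sketch below is hypothetical: the class name, the field names, and the helper are invented for illustration from the reporting categories named in the summary, and do not reproduce the paper's taxonomy or its recommendations.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AnnotationReport:
    """Hypothetical per-task record; field names are illustrative only."""

    # Operational details the summary says are frequently reported.
    recruitment: Optional[str] = None
    expertise: Optional[str] = None
    volume: Optional[str] = None
    # Validity details the summary says are often omitted.
    training: Optional[str] = None
    language_proficiency: Optional[str] = None
    compensation: Optional[str] = None
    socio_demographics: Optional[str] = None
    adjudication: Optional[str] = None
    agreement: Optional[str] = None


def missing_fields(report: AnnotationReport) -> List[str]:
    """Return the names of fields left empty, in declaration order."""
    return [
        f.name
        for f in fields(report)
        if getattr(report, f.name) in (None, "")
    ]


# Example: a task description that only covers operational details.
draft = AnnotationReport(
    recruitment="crowdsourcing platform",
    expertise="non-expert",
    volume="500 items per annotator",
)
print(missing_fields(draft))
```

Running such a check as part of paper or dataset-card preparation makes omissions visible early, while the information is still easy to collect.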

Topics

Human Annotation
NLP Data Quality
Reporting Standards
LLM-assisted Extraction
Model Evaluation
Research Reproducibility

Best for: Research Scientist, AI Scientist, NLP Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.