Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection
Summary
A study investigated whether labels from instruction-tuned Large Language Models (LLMs) can replace human labels in active learning (AL) loops, and whether AL remains necessary when an entire corpus can be LLM-labeled. To detect anti-immigrant hostility, the researchers used a new dataset of 277,902 German political TikTok comments, of which 25,974 were LLM-labeled and 5,000 were human-annotated. Comparing seven annotation strategies across four encoders, they found that a classifier trained on 25,974 GPT-5.2 labels (cost: $43) achieved F1-Macro scores comparable to one trained on 3,800 human annotations (cost: $316). Active learning showed minimal advantage over random sampling in their pre-enriched pool and yielded lower F1 than full LLM annotation at equivalent cost. However, LLM-trained classifiers systematically over-predicted the positive class, particularly in ambiguous discussions, which indicates that the error profile, not just aggregate F1, should guide the choice of annotation strategy.
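The cost figures in the summary can be turned into a per-label comparison. The short sketch below only divides the totals quoted above; the function and variable names are illustrative, and it assumes the quoted totals cover labeling cost alone.

```python
# Totals quoted in the summary (labels, cost in USD).
LLM_LABELS, LLM_COST_USD = 25_974, 43.0
HUMAN_LABELS, HUMAN_COST_USD = 3_800, 316.0


def cost_per_label(total_cost_usd: float, n_labels: int) -> float:
    """Average cost of a single label in USD."""
    return total_cost_usd / n_labels


llm_unit = cost_per_label(LLM_COST_USD, LLM_LABELS)
human_unit = cost_per_label(HUMAN_COST_USD, HUMAN_LABELS)

# How many times more expensive one human label is than one LLM label.
ratio = human_unit / llm_unit

print(f"LLM:   ${llm_unit:.4f} per label")
print(f"Human: ${human_unit:.4f} per label")
print(f"Human labels cost about {ratio:.0f}x more per label")
```

On these figures a human label costs roughly 50 times as much as an LLM label, which is why the comparison at equal budget favors labeling far more data with the LLM.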
Key takeaway
If you build content moderation systems and train on LLM-generated labels, aggregate F1 alone can mislead you. Analyze the error profile, especially for nuanced or ambiguous categories, to catch systematic biases such as over-prediction of the positive class. Doing so lets you check that the model's errors are of a type and frequency your application can tolerate before you deploy it.
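One way to look beyond aggregate F1 is to break predictions down into error types and compare how often the model predicts the positive class against how often that class actually occurs. The sketch below is a minimal, dependency-free illustration; the label lists are invented toy data, not results from the study, and the function names are hypothetical.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    counts = Counter(tp=0, fp=0, fn=0, tn=0)
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def error_profile(y_true, y_pred, positive=1):
    """Summarize which kinds of errors a classifier makes."""
    c = confusion_counts(y_true, y_pred, positive)
    tp, fp, fn, tn = c["tp"], c["fp"], c["fn"], c["tn"]
    total = tp + fp + fn + tn

    def safe_div(num, den):
        return num / den if den else 0.0

    return {
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "false_positive_rate": safe_div(fp, fp + tn),
        "false_negative_rate": safe_div(fn, fn + tp),
        # Share of items the model flags vs. share that are truly positive.
        "predicted_positive_rate": safe_div(tp + fp, total),
        "base_rate": safe_div(tp + fn, total),
    }


# Toy data (hypothetical): 3 hostile comments out of 10.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # flags 2 benign comments as hostile

profile = error_profile(y_true, y_pred)
for name, value in profile.items():
    print(f"{name}: {value:.2f}")
```

A predicted positive rate well above the base rate, as in this toy example, is the signature of over-prediction. Running the same breakdown separately on ambiguous and clear-cut comments would show where the bias concentrates.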
Key insights
- LLM labels can yield F1-Macro scores comparable to human labels for hostility detection, but classifiers trained on them show a distinct error profile.
Principles
- Aggregate F1 alone is insufficient for evaluating annotation strategies.
- LLM annotation scales data labeling cost-effectively.
Method
The study compared seven annotation strategies and four encoders on a German TikTok comment dataset, using 25,974 LLM labels and 5,000 human labels to detect anti-immigrant hostility.
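To make the active learning versus random sampling comparison concrete, the sketch below contrasts two ways of choosing which comments to label next. It assumes uncertainty sampling as the AL acquisition function; the summary does not say which function the study used, and the pool probabilities and function names are invented for illustration.

```python
import random


def uncertainty_sample(probs, k):
    """Pick the k pool items whose predicted probability is closest to 0.5.

    Assumption: binary classifier, so distance from 0.5 measures uncertainty.
    """
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


def random_sample(pool_size, k, seed=0):
    """Pick k pool items uniformly at random (the baseline)."""
    rng = random.Random(seed)
    return rng.sample(range(pool_size), k)


# Hypothetical predicted probabilities of "hostile" for 8 unlabeled comments.
pool_probs = [0.02, 0.48, 0.91, 0.55, 0.10, 0.73, 0.51, 0.99]

picked_uncertain = uncertainty_sample(pool_probs, k=3)
picked_random = random_sample(len(pool_probs), k=3)

print("uncertainty sampling picks indices:", picked_uncertain)
print("random sampling picks indices:     ", picked_random)
```

In a full loop the selected items would be sent to an annotator (human or LLM), added to the training set, and the classifier retrained before the next round. When the pool has already been enriched for relevant examples, as in the study, the two selection rules can end up drawing similar data, which is consistent with the minimal advantage reported for AL.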
In practice
- Consider LLM labeling for large-scale data annotation.
- Analyze error profiles beyond F1 for LLM-labeled datasets.
Topics
- Active Learning
- LLM Annotation
- Hostility Detection
- German TikTok Comments
- Error Analysis
Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.