Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

2026-04-15 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A study investigated whether labels from instruction-tuned large language models (LLMs) can replace human labels in active learning (AL) loops, and whether AL remains necessary when an entire corpus can be LLM-labeled. The researchers used a new dataset of 277,902 German political TikTok comments, of which 25,974 were LLM-labeled and 5,000 human-annotated, to detect anti-immigrant hostility. Comparing seven annotation strategies across four encoders, they found that a classifier trained on 25,974 GPT-5.2 labels, costing $43, achieved F1-Macro scores comparable to one trained on 3,800 human annotations, costing $316. Active learning showed minimal advantage over random sampling in their pre-enriched pool and yielded lower F1 than full LLM annotation at equivalent cost. However, LLM-trained classifiers systematically over-predicted the positive class, particularly in ambiguous discussions, which indicates that the error profile, not just aggregate F1, should guide the choice of annotation strategy.
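The cost comparison in the summary reduces to per-label arithmetic. A minimal sketch, using only the totals reported above; the variable and function names are illustrative and do not come from the study:

```python
# Totals as reported in the summary; names are illustrative assumptions.
LLM_LABELS, LLM_COST_USD = 25_974, 43.0
HUMAN_LABELS, HUMAN_COST_USD = 3_800, 316.0


def cost_per_label(total_cost_usd: float, n_labels: int) -> float:
    """Average cost of a single label in US dollars."""
    return total_cost_usd / n_labels


llm_unit = cost_per_label(LLM_COST_USD, LLM_LABELS)
human_unit = cost_per_label(HUMAN_COST_USD, HUMAN_LABELS)
ratio = human_unit / llm_unit

print(f"LLM:   ${llm_unit:.4f} per label")
print(f"Human: ${human_unit:.4f} per label")
print(f"Human labels cost about {ratio:.0f}x more per label")
```

On these figures a human label costs roughly 50 times as much as an LLM label, which is why full LLM annotation can cover the whole 25,974-item pool for less than a small human-labeled subset.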

Key takeaway

If you build content moderation systems, do not judge LLM-generated training labels by aggregate F1 alone. Inspect the error profile, especially on nuanced or ambiguous categories, for systematic biases such as over-prediction of the positive class. Then check that the resulting error types are acceptable for your target application before you deploy.
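To see why aggregate F1 can hide the error profile, consider a minimal sketch on toy data. The labels below are invented for illustration and are not results from the study; the helper name `error_profile` is hypothetical. Two sets of predictions reach the same macro-F1, yet one only produces false positives and the other only false negatives:

```python
from typing import Dict, Sequence


def _safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def _f1(precision: float, recall: float) -> float:
    return _safe_div(2 * precision * recall, precision + recall)


def error_profile(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Macro-F1 plus error rates for binary labels (1 = hostile)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    f1_pos = _f1(_safe_div(tp, tp + fp), _safe_div(tp, tp + fn))
    # For the negative class the roles of the cells are mirrored.
    f1_neg = _f1(_safe_div(tn, tn + fn), _safe_div(tn, tn + fp))

    return {
        "f1_macro": (f1_pos + f1_neg) / 2,
        "false_positive_rate": _safe_div(fp, fp + tn),
        "false_negative_rate": _safe_div(fn, fn + tp),
    }


# Toy gold labels and two hypothetical classifiers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
over_predicts = [1, 1, 1, 1, 1, 0, 0, 0]   # one false positive
under_predicts = [1, 1, 1, 0, 0, 0, 0, 0]  # one false negative

over = error_profile(y_true, over_predicts)
under = error_profile(y_true, under_predicts)
print("over-predicting: ", over)
print("under-predicting:", under)
```

In a moderation setting the two profiles have different consequences: false positives flag legitimate comments, while false negatives let hostile ones through. Which one is tolerable depends on the application.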

Key insights

Classifiers trained on LLM labels can match the F1 of classifiers trained on human labels for hostility detection, but they show a distinct error profile: they over-predict the positive class.

Principles

Aggregate F1 alone is insufficient for evaluating annotation strategies.
LLM annotation scales data labeling cost-effectively.

Method

The study compared seven annotation strategies and four encoders on a German TikTok comment dataset, using 25,974 LLM labels and 5,000 human labels to detect anti-immigrant hostility.
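The contrast between active learning and random sampling can be sketched as two ways of choosing which pool items to label next. The summary does not say which acquisition function the study used; uncertainty sampling is assumed here only as a common example, and the function names and probabilities are hypothetical:

```python
import random
from typing import List, Sequence


def uncertainty_sample(probs: Sequence[float], k: int) -> List[int]:
    """Indices of the k pool items whose positive-class probability is closest to 0.5."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


def random_sample(pool_size: int, k: int, seed: int = 0) -> List[int]:
    """Indices of k pool items drawn uniformly without replacement."""
    rng = random.Random(seed)
    return rng.sample(range(pool_size), k)


# Hypothetical positive-class probabilities from the current classifier.
pool_probs = [0.02, 0.48, 0.91, 0.55, 0.10, 0.70]

al_batch = uncertainty_sample(pool_probs, k=2)
random_batch = random_sample(len(pool_probs), k=2)
print("uncertainty sampling picks:", al_batch)
print("random sampling picks:     ", random_batch)
```

In an AL loop the chosen batch is labeled, the classifier is retrained, and the selection repeats. When the pool has already been enriched for relevant examples, as in the study, the two selection rules may pick similarly informative items, which is consistent with the minimal advantage reported for AL.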

In practice

Consider LLM labeling for large-scale data annotation.
Analyze error profiles beyond F1 for LLM-labeled datasets.

Topics

Active Learning
LLM Annotation
Hostility Detection
German TikTok Comments
Error Analysis

Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.