Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Social Sciences & Behavioral Studies · Depth: Expert, extended

Summary

A study investigated whether Large Language Model (LLM) labels can replace human labels in active learning (AL) for hostility detection, and whether AL remains necessary when LLMs can label entire corpora. The researchers used a new dataset of 277,902 German political TikTok comments, with 25,974 LLM-labeled and 5,000 human-annotated instances. They compared seven annotation strategies across four encoder models for detecting anti-immigrant hostility. A classifier trained on 25,974 GPT-5.2 labels, which cost $43, achieved F1-Macro scores comparable to one trained on 3,800 human annotations, which cost $316. In the pre-enriched pool, active learning offered minimal advantage over random sampling and yielded lower F1 than full LLM annotation at the same cost. However, LLM-trained classifiers systematically over-predicted the positive class, particularly in topically ambiguous discussions where the line between anti-immigrant hostility and policy critique is subtle. This suggests that error profiles, not just aggregate F1, should guide annotation strategy.
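The cost comparison reduces to per-label arithmetic. The following is a minimal sketch that uses only the figures quoted in the summary and assumes each dollar amount is the total cost of the corresponding label set; the variable names are illustrative.

```python
# Figures quoted in the summary; each cost is assumed to be the total
# for the corresponding set of labels.
llm_labels, llm_cost_usd = 25_974, 43.0
human_labels, human_cost_usd = 3_800, 316.0

llm_cost_per_label = llm_cost_usd / llm_labels
human_cost_per_label = human_cost_usd / human_labels
cost_ratio = human_cost_per_label / llm_cost_per_label

print(f"LLM:   ${llm_cost_per_label:.4f} per label")
print(f"Human: ${human_cost_per_label:.4f} per label")
print(f"Human labels cost about {cost_ratio:.0f}x more per label")
```

Under that assumption, an LLM label costs a fraction of a cent while a human label costs roughly eight cents, about a 50-fold difference per label.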

Key takeaway

For AI Engineers building content moderation systems, relying solely on LLM-generated labels for hostility detection may introduce a systematic over-prediction of the positive class, especially in nuanced political discourse. You should conduct a thorough error analysis, focusing on false positive rates and confidence distributions, to ensure the chosen annotation strategy aligns with the application's acceptable error profile, even if aggregate F1 scores appear comparable to human-labeled data.
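To make the point about error profiles concrete, here is a minimal sketch in plain Python. The labels and predictions are invented toy data, not results from the study, and `error_profile` is a hypothetical helper name. It shows how two classifiers can have similar macro F1 while differing in false positive rate.

```python
from collections import Counter


def error_profile(y_true, y_pred, positive=1):
    """Return false positive rate, predicted-positive rate, and macro F1
    for binary labels."""
    counts = Counter(zip(y_true, y_pred))
    tp = sum(c for (t, p), c in counts.items() if t == positive and p == positive)
    fp = sum(c for (t, p), c in counts.items() if t != positive and p == positive)
    fn = sum(c for (t, p), c in counts.items() if t == positive and p != positive)
    tn = sum(c for (t, p), c in counts.items() if t != positive and p != positive)

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    return {
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "predicted_positive_rate": (tp + fp) / len(y_true),
        # Macro F1 averages the F1 of the positive and the negative class.
        "f1_macro": (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2,
    }


# Toy data (assumption, for illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
preds_a = [1, 1, 0, 0, 0, 0, 0, 0]  # misses one positive, no false alarms
preds_b = [1, 1, 1, 1, 0, 0, 0, 0]  # catches all positives, one false alarm

profile_a = error_profile(y_true, preds_a)
profile_b = error_profile(y_true, preds_b)
print(profile_a)
print(profile_b)
```

In this toy case the two macro F1 values are close, yet only the second classifier produces false positives, which is the kind of difference an aggregate score hides.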

Key insights

LLM annotation at scale matches human aggregate performance at lower cost, but with distinct error profiles.

Principles

Lower label quality can be offset by larger data volume.
Aggregate F1 alone is insufficient for choosing an annotation strategy.
Pool construction affects how much active learning helps.

Method

A two-stage LLM pipeline (Llama-3.3-70B prefiltering, GPT-5.2 classification) was used to construct a large annotation pool, followed by human annotation and comparison across active learning and full-pool conditions.
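The two-stage structure can be sketched as a prefilter followed by a classifier. This is only an illustration of the control flow: the summary does not give the prompts, thresholds, or API calls used, so `prefilter` and `classify` are hypothetical callables, and the stand-ins below are trivial keyword rules rather than LLM calls.

```python
from typing import Callable, Iterable, List, Tuple


def build_annotation_pool(
    comments: Iterable[str],
    prefilter: Callable[[str], bool],
    classify: Callable[[str], int],
) -> List[Tuple[str, int]]:
    """Stage 1 keeps comments the prefilter marks as on-topic;
    stage 2 assigns a label to each comment that was kept."""
    pool = []
    for text in comments:
        if not prefilter(text):
            continue  # dropped before the (more expensive) second stage
        pool.append((text, classify(text)))
    return pool


# Hypothetical stand-ins for the two models (assumption, not the study's setup).
def toy_prefilter(text: str) -> bool:
    return "migration" in text.lower()


def toy_classify(text: str) -> int:
    return int(text.strip().endswith("!"))


comments = [
    "Nice video",
    "Migration policy needs reform",
    "Stop all migration now!",
]
pool = build_annotation_pool(comments, toy_prefilter, toy_classify)
print(pool)
```

The design point is that the prefilter decides what enters the pool at all, so its behavior shapes the class balance that any later sampling strategy works with.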

In practice

Use LLMs for large-scale, cost-effective data labeling.
Prioritize robust pool construction over complex AL acquisition.
Analyze error profiles for critical applications.
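For readers comparing acquisition strategies, the sketch below contrasts uncertainty sampling with random sampling. The summary does not say which acquisition functions the study used; uncertainty sampling is shown here only as a common example, and the probabilities are invented.

```python
import random


def uncertainty_sample(probs, k):
    """Indices of the k items whose positive-class probability is closest to 0.5."""
    return sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))[:k]


def random_sample(n, k, seed=0):
    """Indices of k items drawn uniformly without replacement."""
    return random.Random(seed).sample(range(n), k)


# Toy classifier probabilities for five unlabeled comments (assumption).
probs = [0.02, 0.48, 0.91, 0.55, 0.30]

uncertain_idx = uncertainty_sample(probs, k=2)
random_idx = random_sample(len(probs), k=2)
print("uncertainty sampling picks:", uncertain_idx)
print("random sampling picks:", random_idx)
```

When a pool has already been enriched for relevant items, the two strategies may select similarly informative examples, which is one plausible reading of why the study saw little benefit from active learning.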

Topics

Active Learning
LLM Annotation
Hostility Detection
German TikTok Comments
Error Profile Analysis

Best for: AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.