Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles
Summary
The article investigates whether topic sentiment causally affects perceived political ideology in news articles, comparing human expert annotations from AllSides with annotations from Llama-3.3-70b, baseline GPT-4o-mini, and fine-tuned GPT-4o-mini. Applying Double Machine Learning (DML) and mediation analysis to a dataset of 1,265 articles, the study finds that human annotations show no significant causal effect of topic sentiment on ideology at the community level. In contrast, fine-tuned GPT-4o-mini, which achieved the highest classification accuracy (F1=72.48), was the only annotator to produce significant community-level treatment effects and natural direct effects (NDEs). This suggests that fine-tuning can lead to "shortcut learning," in which a model internalizes a spurious sentiment-ideology coupling that is absent from human judgment, a difference that standard F1-based evaluation does not reveal. The findings have implications for using LLM annotations as "silver labels" in downstream causal analyses.
Key takeaway
If you are an AI scientist or research scientist planning to use LLMs for social science annotation tasks, evaluate models beyond standard accuracy metrics such as F1. A high F1 score can mask "shortcut learning," in which the model relies on a spurious link between sentiment and ideology that human annotators do not exhibit. Use causal analysis frameworks, such as mediation analysis, to audit the LLM's annotations and check that their causal structure matches human judgment, especially when the labels feed downstream causal inference studies.
Key insights
Fine-tuning LLMs for ideology prediction can create spurious sentiment-ideology causal links not present in human judgment.
Principles
- Output-level accuracy does not guarantee causal fidelity.
- LLM fine-tuning can introduce shortcut learning.
- Causal analysis reveals hidden annotation divergences.
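The first principle can be illustrated with a toy example. The sketch below uses eight invented items (not data from the study) and two hypothetical annotators who make the same number of errors. Annotator A's errors are unrelated to sentiment, while annotator B's errors push positive-sentiment items toward "right" and negative-sentiment items toward "left". The gap computed here is a plain association, not a causal estimate; it only shows that identical accuracy can coexist with very different sentiment coupling.

```python
def accuracy(gold, pred):
    """Fraction of items where the predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def right_rate_gap(sentiment, pred):
    """Share of 'right' labels on positive items minus the share on negative items."""
    pos = [p for s, p in zip(sentiment, pred) if s == "pos"]
    neg = [p for s, p in zip(sentiment, pred) if s == "neg"]
    return pos.count("right") / len(pos) - neg.count("right") / len(neg)


# Invented toy data: sentiment is balanced within each gold class.
gold = ["left", "left", "left", "left", "right", "right", "right", "right"]
sentiment = ["neg", "pos", "neg", "pos", "neg", "pos", "neg", "pos"]

# Annotator A: two errors (items 0 and 4), both on negative-sentiment items,
# one in each direction, so they cancel out with respect to sentiment.
annotator_a = ["right", "left", "left", "left", "left", "right", "right", "right"]

# Annotator B: two errors (items 1 and 4) that follow sentiment:
# a positive item is pushed to "right", a negative item to "left".
annotator_b = ["left", "right", "left", "left", "left", "right", "right", "right"]

for name, pred in [("A", annotator_a), ("B", annotator_b)]:
    print(name, accuracy(gold, pred), right_rate_gap(sentiment, pred))
```

Both annotators score the same accuracy, yet only B's labels move with sentiment, which is the kind of divergence an output-level metric cannot surface.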
Method
Compare human and LLM ideology labels while holding Llama-3.3-70b-versatile sentiment annotations constant. Apply Double Machine Learning and mediation analysis to identify causal effects of topic sentiment.
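As a rough illustration of the Double Machine Learning step, the sketch below implements the generic cross-fitted "partialling-out" estimator on simulated data. It is not the study's pipeline: the single confounder `x`, the simulated treatment `t` (standing in for a sentiment score) and outcome `y` (standing in for an ideology score), and the use of simple linear nuisance models are all assumptions made to keep the example self-contained. In practice the nuisance models are usually flexible machine-learning regressors.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def dml_effect(x, t, y, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of the effect of t on y given x."""
    n = len(x)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    t_res, y_res = [0.0] * n, [0.0] * n
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Nuisance models are fit on the other folds only (cross-fitting).
        a_t, b_t = fit_line([x[i] for i in train], [t[i] for i in train])
        a_y, b_y = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            t_res[i] = t[i] - (a_t + b_t * x[i])
            y_res[i] = y[i] - (a_y + b_y * x[i])
    # Final stage: regress outcome residuals on treatment residuals.
    num = sum(tr * yr for tr, yr in zip(t_res, y_res))
    den = sum(tr * tr for tr in t_res)
    return num / den


def simulate(true_effect, n=2000, seed=1):
    """Simulated data in which x confounds both t and y (illustrative only)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    t = [0.8 * xi + rng.gauss(0, 1) for xi in x]
    y = [true_effect * ti + 1.5 * xi + rng.gauss(0, 1) for ti, xi in zip(t, x)]
    return x, t, y


x, t, y = simulate(true_effect=0.0)
print("naive slope of y on t:", fit_line(t, y)[1])
print("DML estimate:", dml_effect(x, t, y))
```

With a true effect of zero, the naive regression of `y` on `t` picks up the confounding through `x`, while the residual-on-residual estimate should land near zero. The same contrast is what makes the estimator useful for asking whether an annotator's labels respond to sentiment once other article features are accounted for.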
In practice
- Use causal analysis to audit LLM annotators.
- Avoid using unaudited LLM silver labels in causal studies.
- Evaluate LLMs beyond F1 for critical tasks.
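One way to approach such an audit is a simple mediation decomposition. The sketch below covers only the linear case with no treatment-mediator interaction and no unmeasured confounding, where the natural direct effect (NDE) is the coefficient on the treatment after adjusting for the mediator and the natural indirect effect (NIE) is a product of coefficients. The mediator `m`, the simulated data, and all function names are hypothetical; this summary does not specify which mediator or estimator the study used.

```python
import random


def center(v):
    """Subtract the mean from every element."""
    mean = sum(v) / len(v)
    return [val - mean for val in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    xc, yc = center(x), center(y)
    return dot(xc, yc) / dot(xc, xc)


def two_slopes(x1, x2, y):
    """OLS slopes of y on x1 and x2 jointly, via the 2x2 normal equations."""
    a, b, c = center(x1), center(x2), center(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, c), dot(b, c)
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def linear_mediation(t, m, y):
    """NDE, NIE, and total effect under a linear, no-interaction model."""
    a = slope(t, m)                  # treatment -> mediator
    direct, b = two_slopes(t, m, y)  # treatment -> outcome, mediator -> outcome
    return {"nde": direct, "nie": a * b, "total": direct + a * b}


def simulate(direct, n=3000, seed=2):
    """Illustrative data: t affects y directly and through the mediator m."""
    rng = random.Random(seed)
    t = [rng.gauss(0, 1) for _ in range(n)]
    m = [0.6 * ti + rng.gauss(0, 1) for ti in t]
    y = [direct * ti + 0.5 * mi + rng.gauss(0, 1) for ti, mi in zip(t, m)]
    return t, m, y


# Two simulated "annotators": one with no direct effect, one with a direct effect.
for label, direct in [("no direct effect", 0.0), ("direct effect", 0.4)]:
    print(label, linear_mediation(*simulate(direct)))
```

Running the same decomposition on labels from each annotator, human and LLM, gives a like-for-like comparison: an annotator whose NDE is clearly nonzero while the human NDE is not would be a candidate for the shortcut pattern described in the summary.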
Topics
- LLM Annotation Fidelity
- Causal Inference
- Political Ideology Classification
- Sentiment Analysis
- Double Machine Learning
- Shortcut Learning
Best for: NLP Engineer, AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.