Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation
Summary
This study tests whether Large Language Model (LLM) judges are valid proxies for human readers when evaluating LLM-generated disinformation. The researchers compared eight frontier LLM judges against 2,043 human ratings from 392 participants on 290 deceptive articles, along three dimensions: overall scoring, item-level ordering, and textual signal dependence. The gaps were persistent: LLM judges were generally harsher than humans, recovered human item rankings only weakly, and relied on different textual signals. In particular, judges overweighted logical rigor and penalized emotional intensity more strongly than human readers did. LLM judges agreed strongly with one another (average judge-judge rank alignment of 0.81 for credibility and 0.69 for sharing), but their alignment with human responses was much lower (0.45 for credibility and 0.24 for sharing). Prompt variations did not close this judge-human gap.
Key takeaway
If you are an AI scientist or AI ethicist developing or deploying LLM-based evaluation systems for disinformation, check the proxy validity of LLM judges against actual human perception. Relying solely on LLM judges can lead systems to optimize for signals that human readers do not respond to, and so misrepresent real-world propagation risk. Add human-grounded evaluation so that your assessments reflect how deceptive content actually affects readers, especially its perceived credibility and readers' willingness to share it.
Key insights
LLM judges misalign with human readers when evaluating disinformation: they overweight logical rigor and penalize emotional intensity more strongly than readers do.
Principles
- Internal judge agreement does not imply human proxy validity.
- Human-facing evaluation requires human-grounded validation.
- LLM judges act as analytical screeners, not human readers.
Method
The study audited LLM judges against human reader responses using 290 goal-directed deceptive articles and 2,043 paired human ratings, comparing overall scoring, item-level ordering, and textual signal dependence.
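The three comparison dimensions named above can be illustrated with a minimal Python sketch. It is not the study's code: the summary does not say which rank statistic was used, so Spearman correlation is assumed here, and the function names (`average_ranks`, `spearman`, `audit`), the `emotion` feature, and all numbers are hypothetical toy values chosen only to show the shape of the calculation.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def audit(judge, human, signal):
    """Compare one judge with human readers on the same items (hypothetical helper)."""
    return {
        # Overall scoring: a negative gap means the judge is harsher than humans.
        "score_gap": mean(judge) - mean(human),
        # Item-level ordering: how well the judge recovers the human ranking.
        "rank_alignment": spearman(judge, human),
        # Textual signal dependence: how each side tracks one text feature.
        "judge_signal_corr": spearman(signal, judge),
        "human_signal_corr": spearman(signal, human),
    }


# Toy data, NOT from the study: per-article mean human credibility rating,
# one judge's score, and an assumed emotional-intensity feature.
human = [4.2, 3.8, 3.1, 2.5, 4.6]
judge = [2.9, 3.5, 3.0, 1.8, 2.7]
emotion = [0.9, 0.2, 0.4, 0.3, 0.8]

result = audit(judge, human, emotion)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

On this made-up data the judge scores lower than humans on average, its ranking aligns only weakly with the human ranking, and the two sides correlate with the emotion feature in opposite directions. Judge-judge agreement could be measured the same way by calling `spearman` on each pair of judges and averaging the results.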
In practice
- Validate LLM-based evaluations against human responses.
- Factor human perception into disinformation risk assessment.
- Avoid sole reliance on LLM judges for audience-facing tasks.
Topics
- LLM Judges
- Disinformation Evaluation
- Human-Grounded Evaluation
- Proxy Validity
- Credibility Assessment
Best for: AI Scientist, AI Ethicist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.