Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation

Source: cs.AI updates on arXiv.org · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Social Sciences & Behavioral Studies · Depth: Expert, extended

Summary

A study investigated whether Large Language Model (LLM) judges are valid proxies for human readers when evaluating LLM-generated disinformation. The researchers compared eight frontier LLM judges against 2,043 human ratings from 392 participants on 290 deceptive articles, along three dimensions: overall scoring, item-level ordering, and textual signal dependence. The findings show persistent gaps: LLM judges were generally harsher than humans, recovered human item rankings only weakly, and relied on different textual signals. Specifically, judges overweighted logical rigor and penalized emotional intensity more strongly than human readers did. Although the judges agreed strongly with one another (average judge-judge rank alignment of 0.81 for credibility and 0.69 for sharing), their alignment with human responses was substantially lower (0.45 for credibility and 0.24 for sharing). Prompt variations did not close this judge-human gap.

Key takeaway

If you are an AI scientist or AI ethicist developing or deploying LLM-based evaluation systems for disinformation, critically assess how well LLM judges stand in for actual human perception. Relying solely on LLM judges can lead systems to optimize for signals that human readers do not respond to, which may misrepresent real-world propagation risk. Integrate human-grounded evaluations so that your assessments reflect how deceptive content affects human readers, especially their perceived credibility and willingness to share.

Key insights

LLM judges misalign with human readers when evaluating disinformation: they overweight logical rigor and penalize emotional intensity more strongly than humans do.
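One way to probe this kind of signal dependence is to compare how strongly a text feature moves judge scores versus human scores. The sketch below is illustrative only: the feature names, the toy numbers, and the use of a univariate least-squares slope are all assumptions made here, not the paper's actual features, data, or analysis method.

```python
from statistics import mean


def ols_slope(feature, score):
    """Slope of a one-variable least-squares fit of score on feature."""
    mx, my = mean(feature), mean(score)
    cov = sum((x - mx) * (y - my) for x, y in zip(feature, score))
    var = sum((x - mx) ** 2 for x in feature)
    return cov / var


# Hypothetical per-article feature annotations (toy data, not from the study).
features = {
    "logical_rigor": [1, 2, 3, 4, 5],
    "emotional_intensity": [2, 5, 1, 4, 3],
}

# Hypothetical per-article credibility scores (toy data, not from the study).
scores = {
    "human": [3.0, 3.6, 3.1, 3.7, 3.6],
    "judge": [1.5, 1.6, 2.8, 2.9, 3.7],
}

# For each feature, estimate how much each rater type's score changes per unit
# of the feature. A large difference between the two slopes suggests the rater
# types depend on that signal differently.
slopes = {
    name: {rater: ols_slope(values, s) for rater, s in scores.items()}
    for name, values in features.items()
}

for name, by_rater in slopes.items():
    print(name, {rater: round(v, 2) for rater, v in by_rater.items()})
```

In this toy data the judge slope on logical rigor is steeper than the human slope, and the judge slope on emotional intensity is negative while the human slope is positive. That pattern was built in to mirror the qualitative finding; it is not evidence for it.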

Method

The study audited LLM judges against human reader responses using 290 goal-directed deceptive articles and 2,043 paired human ratings, comparing overall scoring, item-level ordering, and textual signal dependence.
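The first two comparisons, overall scoring and item-level ordering, can be sketched as a mean score gap and a rank correlation. The summary says only "rank alignment" and does not name the statistic, so Spearman's rho is an assumption here. The per-article scores are made-up toy values, not the study's data.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-article mean credibility scores (toy data).
human = [4.0, 3.5, 2.0, 4.5, 3.0]
judge_a = [2.5, 3.0, 1.0, 2.0, 1.5]
judge_b = [2.0, 3.5, 1.5, 2.5, 1.0]

# Overall scoring: a negative gap means the judge is harsher than humans.
score_gap = mean(judge_a) - mean(human)

# Item-level ordering: do judges rank articles the way humans do?
judge_human = spearman(judge_a, human)
judge_judge = spearman(judge_a, judge_b)

print(f"score gap (judge - human): {score_gap:.2f}")
print(f"judge-human rank alignment: {judge_human:.2f}")
print(f"judge-judge rank alignment: {judge_judge:.2f}")
```

High judge-judge alignment alongside lower judge-human alignment, as in this toy data, is the pattern the study reports: judges can agree with one another without agreeing with readers.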

Best for: AI Scientist, AI Ethicist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.