Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Source: Computation and Language · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, quick

Summary

A new diagnostic toolkit assesses the per-instance reliability of LLM-as-judge frameworks, which are widely used for automatic Natural Language Generation (NLG) evaluation. Applied to SummEval, its transitivity analysis reveals widespread per-input inconsistency: 33-67% of documents contain at least one directed 3-cycle, even though aggregate violation rates are only 0.8-4.1%. In addition, split conformal prediction sets over 1-5 Likert scores provide theoretically guaranteed coverage of ≥(1-α), and set width serves as a per-instance reliability indicator (r_s = +0.576, N=1,918, p < 10^-100). Set width also agrees across judges (r̄ = 0.32-0.38), indicating that it captures document-level difficulty rather than judge-specific noise. The diagnostics show that the evaluation criterion matters more than the choice of LLM judge: relevance is judged most reliably (average set size ≈ 3.0), coherence moderately so (≈ 3.9), while fluency and consistency remain unreliable (≈ 4.9).
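
To illustrate how a low aggregate violation rate can coexist with many documents containing a directed 3-cycle, here is a minimal sketch on toy data. It assumes the judge's pairwise preferences for each document are stored as a set of (winner, loser) pairs and that the aggregate rate is the fraction of cyclic triples over all triples; both are assumptions for illustration, and the helper names are hypothetical, not taken from the original article.

```python
from itertools import combinations

def count_3cycles(wins, n_items):
    """Count unordered triples whose pairwise preferences form a directed 3-cycle.

    `wins` is a set of (winner, loser) index pairs, one per unordered pair
    (hypothetical storage format, assumed for this sketch).
    """
    cycles = 0
    for a, b, c in combinations(range(n_items), 3):
        forward = (a, b) in wins and (b, c) in wins and (c, a) in wins
        backward = (b, a) in wins and (c, b) in wins and (a, c) in wins
        cycles += int(forward or backward)
    return cycles

def total_order(n_items):
    """Fully transitive preferences: item i beats item j whenever i < j."""
    return {(i, j) for i in range(n_items) for j in range(i + 1, n_items)}

n = 5  # candidate summaries per document (toy value)

doc_a = total_order(n)      # perfectly consistent judge on this document
doc_b = total_order(n)
doc_b.remove((0, 2))        # flip one preference: item 2 now beats item 0,
doc_b.add((2, 0))           # which creates the cycle 0 > 1 > 2 > 0

docs = {"doc_a": doc_a, "doc_b": doc_b}
triples_per_doc = n * (n - 1) * (n - 2) // 6

cycle_counts = {name: count_3cycles(w, n) for name, w in docs.items()}
aggregate_rate = sum(cycle_counts.values()) / (triples_per_doc * len(docs))
docs_with_cycle = sum(c > 0 for c in cycle_counts.values()) / len(docs)

print(f"aggregate violation rate: {aggregate_rate:.1%}")       # 5.0%
print(f"documents with >= 1 cycle: {docs_with_cycle:.0%}")     # 50%
```

In this toy case a single flipped preference yields one cyclic triple out of twenty, yet half of the documents are affected, which is the pattern the per-document view is meant to expose.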

Key takeaway

AI engineers who evaluate NLG models with LLM-as-judge frameworks should integrate per-instance reliability diagnostics such as transitivity analysis and conformal prediction sets. Pay close attention to the evaluation criterion, since relevance judgments are markedly more reliable than fluency or consistency judgments. Use prediction set width to flag difficult documents, and do not rely solely on aggregate metrics, which can obscure per-input inconsistencies.

Key insights

LLM judge reliability varies significantly by evaluation criterion and input difficulty, not just the judge itself.

Method

The toolkit combines transitivity analysis to detect inconsistencies and split conformal prediction sets to quantify per-instance reliability via set width over Likert scores.
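
The split conformal step can be sketched as follows. This is a minimal illustration, not the article's implementation: it assumes the judge exposes a probability for each Likert score 1-5 (for example derived from token log-probabilities), that human scores are available for a held-out calibration split, and that the nonconformity score is one minus the probability of the true score. The source does not specify these choices, and all function names are hypothetical.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a score threshold on held-out (probabilities, human score) pairs."""
    # Nonconformity score (assumed): 1 - probability the judge gave the true score.
    scores = sorted(1.0 - p[y - 1] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample-corrected rank used by split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: no score can be excluded
    return scores[k - 1]

def prediction_set(probs, threshold):
    """All Likert scores whose nonconformity score is within the threshold."""
    return [y for y in range(1, 6) if 1.0 - probs[y - 1] <= threshold]

def toy_probs(true_score, p_true):
    """Toy judge output: p_true on the true score, the rest spread evenly."""
    rest = (1.0 - p_true) / 4
    return [p_true if s == true_score else rest for s in range(1, 6)]

# Toy calibration split: (human score, probability the judge gave that score).
calibration = [(4, 0.7), (3, 0.6), (5, 0.5), (2, 0.5), (4, 0.4),
               (3, 0.3), (1, 0.3), (5, 0.2), (2, 0.1)]
cal_labels = [y for y, _ in calibration]
cal_probs = [toy_probs(y, p) for y, p in calibration]

alpha = 0.2  # target coverage is at least 1 - alpha = 80%
threshold = conformal_threshold(cal_probs, cal_labels, alpha)

confident = prediction_set([0.02, 0.03, 0.05, 0.8, 0.1], threshold)
hesitant = prediction_set([0.05, 0.25, 0.4, 0.25, 0.05], threshold)
print(confident, len(confident))  # [4] 1
print(hesitant, len(hesitant))    # [2, 3, 4] 3
```

The length of each returned set is the set width used as the per-instance reliability indicator: wider sets flag inputs where the judge's score is less trustworthy. The coverage guarantee is marginal and assumes the calibration and test instances are exchangeable.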

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.