Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
Summary
A diagnostic toolkit assesses the per-instance reliability of LLM-as-judge frameworks, which are widely used for automatic Natural Language Generation (NLG) evaluation. Applied to SummEval, its transitivity analysis reveals substantial per-input inconsistency: 33-67% of documents contain at least one directed 3-cycle, even though aggregate violation rates are only 0.8-4.1%. In addition, split conformal prediction sets over 1-5 Likert scores provide a theoretical coverage guarantee of at least 1-α, and set width serves as a per-instance reliability indicator (r_s = +0.576, N = 1,918, p < 10^-100). Set width agrees consistently across judges (r̄ = 0.32-0.38), which suggests it captures document-level difficulty rather than judge-specific noise. The diagnostics indicate that the evaluation criterion matters more than the choice of LLM judge: relevance is judged most reliably (average set size ≈ 3.0), coherence moderately so (≈ 3.9), while fluency and consistency remain unreliable (≈ 4.9).
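To illustrate how a low aggregate violation rate can coexist with many affected documents, here is a minimal sketch, not the authors' implementation. It assumes each judge's output has been reduced to strict pairwise preferences between candidate summaries of the same document (no ties). The `prefers` dictionary format, the function name `count_3cycles`, and the toy data are all hypothetical.

```python
from itertools import combinations

def count_3cycles(items, prefers):
    """Return (cyclic_triples, total_triples) for one document.

    Assumed input format: prefers[(x, y)] is True when the judge prefers
    summary x over summary y, with exactly one key per unordered pair.
    """
    def beats(x, y):
        if (x, y) in prefers:
            return prefers[(x, y)]
        return not prefers[(y, x)]

    cyclic = total = 0
    for a, b, c in combinations(items, 3):
        total += 1
        # A triple violates transitivity if its preferences form a loop
        # in either direction: a > b > c > a, or a < b < c < a.
        forward = beats(a, b) and beats(b, c) and beats(c, a)
        backward = beats(b, a) and beats(c, b) and beats(a, c)
        if forward or backward:
            cyclic += 1
    return cyclic, total

# Toy data (made up): three documents, four candidate summaries each.
items = ["A", "B", "C", "D"]
transitive = {pair: True for pair in combinations(items, 2)}  # A > B > C > D
one_cycle = dict(transitive)
one_cycle[("B", "D")] = False  # now B > C, C > D, D > B

docs = [transitive, transitive, one_cycle]
counts = [count_3cycles(items, p) for p in docs]

# Aggregate rate: cyclic triples over all triples, pooled across documents.
aggregate_rate = sum(c for c, _ in counts) / sum(t for _, t in counts)
# Per-document view: share of documents with at least one cycle.
docs_with_cycle = sum(1 for c, _ in counts if c > 0) / len(docs)
```

In this toy case a single cyclic triple out of twelve gives an aggregate rate of about 8%, yet one document in three is affected. The same arithmetic explains how pooled rates can look small while a large share of documents contain an inconsistency.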
Key takeaway
AI engineers who evaluate NLG models with LLM-as-judge frameworks should integrate per-instance reliability diagnostics such as transitivity analysis and conformal prediction sets. Pay close attention to the evaluation criterion: relevance judgments are markedly more reliable than fluency or consistency judgments. Use prediction set width to identify difficult documents, and avoid over-reliance on aggregate metrics that can obscure per-input inconsistencies.
Key insights
LLM judge reliability varies significantly by evaluation criterion and input difficulty, not just the judge itself.
Principles
- Per-input inconsistency is masked by aggregate metrics.
- Prediction set width indicates document-level difficulty.
- Criterion matters more than judge for reliability.
Method
The toolkit combines transitivity analysis to detect inconsistencies and split conformal prediction sets to quantify per-instance reliability via set width over Likert scores.
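The conformal step can be sketched as follows. This is a generic split conformal recipe under stated assumptions, not necessarily the paper's exact procedure: the nonconformity score is taken to be the absolute difference between the judge's score and a human reference score on a held-out calibration split, and the names `conformal_quantile` and `prediction_set` as well as the toy calibration data are hypothetical.

```python
import math
import numpy as np

def conformal_quantile(cal_judge, cal_human, alpha):
    """Calibrate a residual threshold q on a held-out calibration split.

    Assumed nonconformity score: |judge score - human score|.
    Uses the finite-sample rank ceil((n + 1) * (1 - alpha)).
    """
    scores = np.abs(np.asarray(cal_judge) - np.asarray(cal_human))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the set must be trivial.
        return float("inf")
    return float(np.sort(scores)[k - 1])

def prediction_set(judge_score, q, labels=(1, 2, 3, 4, 5)):
    """All Likert labels within q of the judge's score."""
    return [y for y in labels if abs(y - judge_score) <= q]

# Toy calibration data (made up): judge vs. human scores on 7 items.
cal_judge = [4, 3, 5, 2, 4, 3, 5]
cal_human = [4, 4, 3, 2, 5, 3, 4]

q = conformal_quantile(cal_judge, cal_human, alpha=0.25)
set_for_4 = prediction_set(4, q)  # judge says 4
set_for_1 = prediction_set(1, q)  # judge says 1, set is clipped at the scale edge
```

With a single global threshold like this one, set width varies only through clipping at the ends of the scale. A width that tracks per-document difficulty, as described above, needs a score that adapts to the input, for example one built from the judge's distribution over the five labels; that variant is not shown here.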
In practice
- Use prediction set width as a reliability metric.
- Rely on LLM-as-judge scores most for relevance, the most reliably judged criterion.
- Be cautious with fluency and consistency scores.
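One way to put these points into practice is to aggregate set sizes per criterion and flag wide sets for human review. The sketch below assumes prediction sets have already been computed; the record layout, the function name `summarize_widths`, the threshold `flag_at=4`, and the example records are illustrative assumptions, not values from the paper.

```python
from collections import defaultdict

def summarize_widths(records, flag_at=4):
    """Summarize prediction set sizes.

    records: iterable of (doc_id, criterion, prediction_set) tuples.
    Returns the mean set size per criterion and the (doc_id, criterion)
    pairs whose set contains at least `flag_at` labels.
    """
    sizes = defaultdict(list)
    flagged = []
    for doc_id, criterion, pred_set in records:
        sizes[criterion].append(len(pred_set))
        if len(pred_set) >= flag_at:
            flagged.append((doc_id, criterion))
    mean_size = {c: sum(v) / len(v) for c, v in sizes.items()}
    return mean_size, flagged

# Made-up example records over a 1-5 Likert scale.
records = [
    ("doc1", "relevance", [3, 4]),
    ("doc2", "relevance", [2, 3, 4, 5]),
    ("doc1", "fluency", [1, 2, 3, 4, 5]),
    ("doc2", "fluency", [2, 3, 4, 5]),
]
mean_size, flagged = summarize_widths(records)
```

A mean set size close to 5 on a five-point scale means the sets exclude almost nothing, so judge scores for that criterion carry little information on their own.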
Topics
- LLM-as-Judge
- NLG Evaluation
- Conformal Prediction
- Transitivity Analysis
- SummEval
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.