LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation
Summary
The article introduces a "Judge Datasheet protocol" for evaluating LLM-as-a-judge systems. It argues that a judge should be characterized as a measurement instrument rather than summarized by a single scalar such as accuracy or win-rate. The protocol measures four properties: "dark current" under true-vacuum inputs, stable cross-sensitivity to same-quality surface variation, positional false preference, and target sensitivity on a controlled quality ladder. A case study applies the protocol to Llama-3.1-8B, Qwen2.5-14B, and Qwen2.5-32B. Llama-3.1-8B exhibited high dark current. Qwen2.5-14B was vacuum-clean but showed a mix of stable and positional over-discrimination. Qwen2.5-32B was vacuum-clean, with low stable cross-sensitivity and minimal positional false preference. The research also finds that prompting primarily shifts the evaluation criterion, not the underlying resolution.
Key takeaway
Machine learning engineers evaluating LLM-as-a-judge systems should move beyond scalar metrics and adopt a psychometric approach. Use a Judge Datasheet protocol to characterize the judge's "dark current," cross-sensitivity, and positional false preference. Knowing these instrument properties helps you avoid misreading apparent preferences, so that evaluations reflect model quality rather than judge biases or prompt-induced criterion shifts.
Key insights
LLM judges need to be evaluated psychometrically, as measurement instruments, so that their biases and sensitivities become visible.
Principles
- LLM judges should be reported as measurement instruments.
- Apparent Delta0 preference can mask position bias.
- Prompting shifts evaluation criterion, not resolution.
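The distinction between criterion and resolution can be illustrated with standard signal detection theory. This is a rough sketch and not the article's own computation. It assumes that "resolution" corresponds to the sensitivity index d' and "criterion" to the response bias c. The function name `sdt_indices` and the example rates are hypothetical.

```python
from statistics import NormalDist


def sdt_indices(hit_rate: float, false_alarm_rate: float) -> tuple[float, float]:
    """Return (d_prime, criterion) under the equal-variance Gaussian model.

    Assumed mapping (not taken from the article):
      hit_rate         = share of truly different pairs where the judge reports a preference
      false_alarm_rate = share of same-quality pairs where the judge reports a preference
    Both rates must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    d_prime = z(hit_rate) - z(false_alarm_rate)
    criterion = -0.5 * (z(hit_rate) + z(false_alarm_rate))
    return d_prime, criterion


# Two hypothetical prompts: one discourages ties less than the other.
strict = sdt_indices(0.5, 0.1587)
lenient = sdt_indices(0.8413, 0.5)
```

With these illustrative numbers both prompts give a d' close to 1.0, while the criterion moves from about 0.5 to about -0.5. That is the pattern the principle describes: the prompt changes how readily the judge declares a preference, not how well it separates quality levels.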
Method
The Judge Datasheet protocol measures dark current, cross-sensitivity, positional false preference, and target sensitivity. It also analyzes the criterion induced by tie instructions and decomposes direction-stability.
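Two of these quantities could be estimated from logged verdicts roughly as follows. The sketch reads "dark current" as the rate at which a judge still picks a side on inputs with no evaluable content, and "target sensitivity" as the share of correct picks at each step of the quality ladder. Both readings are assumptions, and the function names and record layout are hypothetical.

```python
from collections import defaultdict


def dark_current_rate(vacuum_verdicts: list[str]) -> float:
    """Share of vacuum comparisons where the judge reports any preference.

    Each verdict is assumed to be "A", "B", or "tie".
    """
    if not vacuum_verdicts:
        raise ValueError("need at least one vacuum verdict")
    return sum(v != "tie" for v in vacuum_verdicts) / len(vacuum_verdicts)


def target_sensitivity(ladder_records: list[tuple[int, bool]]) -> dict[int, float]:
    """Share of comparisons where the judge picked the better response, per quality gap.

    Each record is (quality_gap, judge_picked_better).
    """
    totals: dict[int, int] = defaultdict(int)
    hits: dict[int, int] = defaultdict(int)
    for gap, picked_better in ladder_records:
        totals[gap] += 1
        hits[gap] += int(picked_better)
    return {gap: hits[gap] / totals[gap] for gap in sorted(totals)}


print(dark_current_rate(["A", "tie", "B", "tie"]))
print(target_sensitivity([(1, True), (1, False), (2, True), (2, True)]))
```

A judge with low dark current and a sensitivity curve that rises with the quality gap behaves like a clean instrument. High dark current means some reported preferences arise without any signal in the input.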
In practice
- Evaluate LLM judges for dark current and positional bias.
- Analyze tie instructions' impact on evaluation criteria.
- Decompose Delta0 preference for stability and bias.
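One way to separate stable preference from position bias is to judge each same-quality pair twice, once in each order, and compare the two verdicts. The sketch below assumes verdicts are recorded as slot labels ("first", "second", "tie"). The category names and helper functions are hypothetical and not taken from the article.

```python
from collections import Counter

CATEGORIES = ("stable", "positional", "unstable", "no_preference")


def classify_pair(v_original: str, v_swapped: str) -> str:
    """Classify one same-quality pair from its verdicts in both presentation orders."""
    if v_original == "tie" and v_swapped == "tie":
        return "no_preference"
    if "tie" in (v_original, v_swapped):
        return "unstable"  # a preference in one order only
    if v_original == v_swapped:
        return "positional"  # same slot both times, so a different item each time
    return "stable"  # different slot each time, so the same item both times


def decompose(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Return the share of pairs in each category."""
    if not pairs:
        raise ValueError("need at least one pair")
    counts = Counter(classify_pair(a, b) for a, b in pairs)
    return {name: counts[name] / len(pairs) for name in CATEGORIES}


pairs = [("first", "second"), ("first", "first"), ("tie", "tie"), ("second", "second")]
print(decompose(pairs))
```

In this toy input half of the pairs fall into the positional category. A single-order evaluation would have counted them as genuine preferences, which is how an apparent preference on same-quality pairs can mask position bias.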
Topics
- LLM-as-a-Judge
- Model Evaluation
- Psychometric Protocol
- Judge Datasheet
- Llama-3.1-8B
- Qwen2.5 Models
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.