LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

2026-06-14 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The article introduces a "Judge Datasheet" protocol for evaluating LLM-as-a-judge systems. It argues that judges should be characterized as measurement instruments rather than summarized by a single scalar such as accuracy or win-rate. The protocol measures four properties: "dark current" under true-vacuum inputs, stable cross-sensitivity to same-quality surface variation, positional false preference, and target sensitivity on a controlled quality ladder. In a case study of Llama-3.1-8B, Qwen2.5-14B, and Qwen2.5-32B, Llama-3.1-8B exhibited high dark current; Qwen2.5-14B was vacuum-clean but showed a mix of stable and positional over-discrimination; and Qwen2.5-32B was vacuum-clean, with low stable cross-sensitivity and minimal positional false preference. The research also finds that prompting primarily shifts the evaluation criterion, not the underlying resolution.

Key takeaway

Machine learning engineers evaluating LLM-as-a-judge systems should move beyond scalar metrics and adopt a psychometric approach. Use a Judge Datasheet protocol to characterize a judge's "dark current," cross-sensitivity, and positional false preference. Knowing these instrument properties helps you avoid misreading apparent preferences, so that evaluations reflect model quality rather than judge biases or prompt-induced criterion shifts.

Key insights

Understanding an LLM judge's true biases and sensitivities requires evaluating it psychometrically, as a measurement instrument.

Principles

LLM judges should be reported as measurement instruments.
Apparent Delta0 preference can mask position bias.
Prompting shifts the evaluation criterion, not the resolution.

Method

The Judge Datasheet protocol measures dark current, cross-sensitivity, positional false preference, and target sensitivity. It also analyzes the criterion induced by tie instructions and decomposes direction-stability.
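As a rough illustration of the first measurement, a dark-current rate can be sketched as the share of content-free ("vacuum") trials in which the judge still declares a winner. This is an assumption made for illustration, not the article's definition: the verdict labels "A", "B", and "tie", and the function name `dark_current_rate`, are hypothetical.

```python
from collections import Counter


def dark_current_rate(verdicts):
    """Share of vacuum trials in which the judge still picked a winner.

    `verdicts` holds one label per trial: "A", "B", or "tie".
    These labels are an assumed encoding, not taken from the article.
    """
    if not verdicts:
        raise ValueError("need at least one vacuum trial")
    counts = Counter(verdicts)
    # Any non-tie verdict on a content-free pair counts as spurious signal.
    return 1.0 - counts["tie"] / len(verdicts)


if __name__ == "__main__":
    # Hypothetical verdicts from a judge shown pairs with no real content.
    vacuum_verdicts = ["tie", "A", "tie", "B"]
    print(dark_current_rate(vacuum_verdicts))
```

Under this reading, a "vacuum-clean" judge would score near zero, while a judge with high dark current would report preferences even when there is nothing to prefer.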

In practice

Evaluate LLM judges for dark current and positional bias.
Analyze tie instructions' impact on evaluation criteria.
Decompose Delta0 preference into direction-stable and position-driven components.
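One way to carry out the last step is to judge each equal-quality pair twice, swapping the presentation order, and then sort the outcomes by whether the judge followed the response or the slot. The sketch below assumes verdicts are recorded as "first", "second", or "tie" relative to presentation order. The category names and the function `decompose_swapped` are hypothetical and may not match the article's own decomposition.

```python
def decompose_swapped(pairs):
    """Split order-swapped verdicts into stable, positional, abstain, and mixed.

    `pairs` is a list of (forward, reverse) verdicts for the same two responses.
    In the forward run response X is shown first; in the reverse run it is
    shown second. Each verdict is "first", "second", or "tie" (assumed labels).
    """
    if not pairs:
        raise ValueError("need at least one swapped pair")
    counts = {"stable": 0, "positional": 0, "abstain": 0, "mixed": 0}
    for forward, reverse in pairs:
        if forward == "tie" and reverse == "tie":
            counts["abstain"] += 1      # no preference in either order
        elif "tie" in (forward, reverse):
            counts["mixed"] += 1        # preference in only one order
        elif forward == reverse:
            counts["positional"] += 1   # same slot won twice: follows position
        else:
            counts["stable"] += 1       # same response won twice: follows content
    total = len(pairs)
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical verdicts on four equal-quality pairs.
    runs = [("first", "first"), ("first", "second"), ("tie", "tie"), ("tie", "second")]
    print(decompose_swapped(runs))
```

Keeping the positional share separate from the stable share matters because a judge that always picks the first slot shows an apparent preference in a single-order run, even though that preference reverses when the order is swapped.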

Topics

LLM-as-a-Judge
Model Evaluation
Psychometric Protocol
Judge Datasheet
Llama-3.1-8B
Qwen2.5 Models

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.