LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The article introduces a "Judge Datasheet protocol" for evaluating LLM-as-a-judge systems. It argues that a judge should be characterized as a measurement instrument rather than summarized by a single scalar such as accuracy or win rate. The protocol measures four properties: "dark current" under true-vacuum inputs, stable cross-sensitivity to same-quality surface variation, positional false preference, and target sensitivity on a controlled quality ladder. A case study of Llama-3.1-8B, Qwen2.5-14B, and Qwen2.5-32B illustrates the protocol. Llama-3.1-8B exhibited high dark current. Qwen2.5-14B was vacuum-clean but showed a mix of stable and positional over-discrimination. Qwen2.5-32B was vacuum-clean, with low stable cross-sensitivity and minimal positional false preference. The research also finds that prompting primarily shifts the judge's decision criterion, not its underlying resolution.
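
The distinction between criterion and resolution can be illustrated with equal-variance signal detection theory, which is one standard way to separate the two. This is only a sketch: the summary does not say which estimator the paper uses, and the function name `sdt_indices`, the clipping constant, and the definitions of "hit" and "false alarm" below are assumptions made for illustration.

```python
from statistics import NormalDist


def sdt_indices(hit_rate: float, false_alarm_rate: float, eps: float = 1e-6):
    """Return (d_prime, criterion) under an equal-variance Gaussian model.

    Assumed definitions (not taken from the paper):
      hit_rate         = P(judge reports a preference | a real quality gap exists)
      false_alarm_rate = P(judge reports a preference | no quality gap exists)
    """
    z = NormalDist().inv_cdf
    # Clip rates away from 0 and 1, where the inverse normal CDF is undefined.
    h = min(max(hit_rate, eps), 1.0 - eps)
    f = min(max(false_alarm_rate, eps), 1.0 - eps)
    z_hit, z_fa = z(h), z(f)
    d_prime = z_hit - z_fa              # resolution: separation of signal from noise
    criterion = -0.5 * (z_hit + z_fa)   # bias: positive means reluctant to prefer
    return d_prime, criterion


# Hypothetical rates for two prompt variants of the same judge.
for name, rates in {"default": (0.80, 0.20), "ties encouraged": (0.60, 0.08)}.items():
    d, c = sdt_indices(*rates)
    print(f"{name}: d'={d:.2f}, c={c:.2f}")
```

Under this model, two prompt variants with similar d' but different c would be consistent with the claim that prompting moves the criterion while leaving resolution largely unchanged.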

Key takeaway

Machine learning engineers evaluating LLM-as-a-judge systems should move beyond scalar metrics and adopt a psychometric approach. Use a Judge Datasheet protocol to characterize the judge's "dark current," cross-sensitivity, and positional false preference. Knowing these instrument properties helps avoid mistaking judge bias or a prompt-induced criterion shift for a real difference in model quality.

Key insights

LLM-as-a-judge systems need to be evaluated psychometrically, as measurement instruments, so that their biases and sensitivities are understood.

Method

The Judge Datasheet protocol measures dark current, cross-sensitivity, positional false preference, and target sensitivity. It also analyzes the criterion induced by tie instructions and decomposes direction-stability.
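
A minimal sketch of how some of these quantities might be tallied from raw verdicts follows. The operational definitions are assumptions, since the summary does not give the paper's exact ones: dark current is taken as the share of non-tie verdicts on contentless pairs, and each same-quality pair is judged in both presentation orders so that a preference can be attributed to the slot (positional) or to the response (stable). The function names and the "A"/"B"/"tie" verdict encoding are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

# A verdict names the slot the judge preferred: "A", "B", or "tie".


def dark_current(vacuum_verdicts: Iterable[str]) -> float:
    """Share of non-tie verdicts on contentless ("vacuum") pairs (assumed definition)."""
    verdicts = list(vacuum_verdicts)
    if not verdicts:
        raise ValueError("need at least one vacuum trial")
    return sum(v != "tie" for v in verdicts) / len(verdicts)


def decompose_swapped(pairs: Iterable[Tuple[str, str]]) -> dict:
    """Classify same-quality pairs that were judged in both presentation orders.

    Each item is (verdict_in_original_order, verdict_in_swapped_order).
    Response X sits in slot A in the original order and in slot B after the swap.
    """
    counts = Counter()
    for first, second in pairs:
        if first == "tie" and second == "tie":
            counts["tie"] += 1         # no preference in either order
        elif first == "tie" or second == "tie":
            counts["unstable"] += 1    # preference appears in one order only
        elif first == second:
            counts["positional"] += 1  # same slot both times: follows position
        else:
            counts["stable"] += 1      # same response both times: follows surface form
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one swapped pair")
    return {k: counts[k] / total for k in ("tie", "stable", "positional", "unstable")}


# Hypothetical verdicts, for illustration only.
print(dark_current(["tie", "A", "tie", "B"]))
print(decompose_swapped([("A", "A"), ("A", "B"), ("tie", "tie"), ("tie", "B")]))
```

Since the pairs are same-quality by construction, any non-tie outcome counts as false preference under these assumptions; the stable share corresponds to cross-sensitivity and the positional share to positional false preference.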

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.