SCOPE: Selective Conformal Optimized Pairwise LLM Judging

2026-02-13 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

SCOPE (Selective Conformal Optimized Pairwise Evaluation) is a new framework for making large language models (LLMs) more reliable as judges in pairwise evaluation, where they are often miscalibrated and systematically biased. It provides finite-sample statistical guarantees by calibrating an acceptance threshold so that the error rate among non-abstained judgments does not exceed a user-specified level $\alpha$. SCOPE also introduces Bidirectional Preference Entropy (BPE), a new uncertainty signal that queries the LLM judge with the two responses in both positions, aggregates the resulting preference probabilities so the result is invariant to response order, and converts the aggregate into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality, and SCOPE consistently meets the target risk level, for example $\alpha = 0.10$ (empirical risk $\approx 0.097$ to $0.099$), while maintaining high coverage, reaching up to 0.98 on RewardBench with Qwen-32B. Under the same risk constraint, SCOPE accepts up to 2.4$\times$ more judgments than naive baselines on MT-Bench with Qwen-7B.

Key takeaway

For research scientists who evaluate LLMs with pairwise comparisons, SCOPE offers a way to keep the error rate of accepted judgments below a chosen level, with finite-sample statistical guarantees, while retaining high coverage. Consider integrating SCOPE and its Bidirectional Preference Entropy (BPE) signal to mitigate judge bias and miscalibration, so that your evaluations meet a specified error rate while maximizing the number of usable judgments.

Key insights

SCOPE enhances LLM judge reliability in pairwise evaluation using selective judging and a novel bias-neutral uncertainty signal.

Principles

Calibrate acceptance thresholds for statistical guarantees.
Ensure invariance to response order in LLM judging.
Entropy-based scores improve uncertainty quality.

Method

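SCOPE calibrates an acceptance threshold for selective judging: the judge abstains whenever its uncertainty exceeds the threshold. Bidirectional Preference Entropy (BPE) queries the LLM judge with the two responses in both orders, aggregates the preference probabilities so the result is invariant to response order, and converts the aggregate into an entropy-based uncertainty score.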
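
The BPE computation described above can be illustrated with a minimal Python sketch. It rests on assumptions that the summary does not confirm: the two position-swapped probabilities are aggregated by a simple average, and the uncertainty score is the binary Shannon entropy of that average. The function names and inputs (bpe, binary_entropy, p_a_first, p_a_second) are hypothetical and chosen only for illustration.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a Bernoulli(p) distribution."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def bpe(p_a_first: float, p_a_second: float) -> tuple[float, float]:
    """Return (aggregated P(A preferred), entropy-based uncertainty).

    p_a_first:  judge's probability that response A wins when A is shown first
    p_a_second: judge's probability that response A wins when A is shown second

    Assumption: the two queries are combined by averaging, which makes the
    result independent of which ordering is labelled "first".
    """
    p_a = 0.5 * (p_a_first + p_a_second)
    return p_a, binary_entropy(p_a)


# A judge that always favours whichever response is shown first:
p_biased, h_biased = bpe(0.9, 0.1)
# A judge that favours A regardless of position:
p_stable, h_stable = bpe(0.9, 0.9)
```

In this sketch a purely position-driven judge ends up with an aggregated probability near 0.5 and therefore maximal uncertainty, while a judge whose preference survives the swap gets a low score, which is the behaviour an order-invariant uncertainty signal is meant to have.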

In practice

Use BPE for robust LLM judge uncertainty.
Apply SCOPE to meet target error rates.
Use stronger judges such as Qwen-14B/32B when high coverage matters.
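
Meeting a target error rate, as suggested above, comes down to choosing an uncertainty threshold on a labelled calibration set and abstaining above it. The sketch below is a simplified stand-in, not SCOPE's actual calibration procedure: it uses a conservative plug-in estimate, (errors + 1) / (accepted + 1), and does not by itself establish the finite-sample guarantee. The names calibrate_threshold and accept, and the toy data, are hypothetical.

```python
def calibrate_threshold(scores, errors, alpha):
    """Pick the largest uncertainty threshold whose conservative risk
    estimate among accepted calibration items is at most alpha.

    scores: uncertainty per calibration item (lower means more confident)
    errors: 1 if the judge disagreed with the reference label, else 0
    Returns None when no threshold qualifies (abstain on everything).
    """
    pairs = sorted(zip(scores, errors))
    best = None
    mistakes = 0
    for k, (score, err) in enumerate(pairs, start=1):
        mistakes += err
        # Evaluate only at the last item of a run of tied scores, so the
        # accepted set {score <= threshold} is counted completely.
        if k < len(pairs) and pairs[k][0] == score:
            continue
        # Assumption: a +1 correction as a simple conservative estimate.
        if (mistakes + 1) / (k + 1) <= alpha:
            best = score
    return best


def accept(score, threshold):
    """Keep a judgment only if its uncertainty is within the threshold."""
    return threshold is not None and score <= threshold


# Toy calibration data: the two least certain judgments are wrong.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
cal_errors = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
tau = calibrate_threshold(cal_scores, cal_errors, alpha=0.15)
```

At test time, a judgment with BPE-style uncertainty u would be kept when accept(u, tau) is true and dropped otherwise. Lowering alpha tightens the threshold and reduces coverage, which is the trade-off the reported coverage figures measure.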

Topics

LLM Judging
Pairwise Evaluation
Conformal Prediction
Bidirectional Preference Entropy
Model Calibration

Best for: Research Scientist, AI Researcher, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.