Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines

2026-06-13 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

The paper "Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines" addresses an ambiguity in LLM evaluation pipelines: a drift alarm can mean either that the product got worse or that the LLM judge changed. The proposed monitor re-scores a fixed, human-labeled anchor set with the current judge, runs a betting e-process on the result, and applies a guard-window rule to attribute the drift. The design is anytime-valid and relies on one-way identification: the anchors are fixed, so only a change in the judge can move their scores. In the reported experiments it detects silent judge version bumps in 60/60 runs with zero misattribution and correctly attributes strict-prompt changes in 110/120 or 240/240 runs, whereas the industry-default rolling z-test false-alarms 75% of the time. The monitor's cost is reported as approximately 0.64x or 0.21x that of strong-judging every item.
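
For readers unfamiliar with the term, "anytime-valid" can be made concrete with the standard betting construction. The block below is a general textbook form, not necessarily the exact process used in the paper; the symbols x_t (an observation in [0, 1], such as a judge/human disagreement indicator), m_0 (the baseline mean), lambda_t (a bet chosen from past data only), and alpha (the false-alarm budget) are assumptions introduced here for illustration.

```latex
E_t = \prod_{i=1}^{t} \bigl(1 + \lambda_i (x_i - m_0)\bigr),
\qquad x_i \in [0, 1], \quad 0 \le \lambda_i < \frac{1}{m_0},
\\[6pt]
H_0 : \ \mathbb{E}\bigl[x_t \mid x_1, \dots, x_{t-1}\bigr] \le m_0
\quad \Longrightarrow \quad
\Pr\Bigl(\exists\, t \ge 1 : E_t \ge \tfrac{1}{\alpha}\Bigr) \le \alpha .
```

Under H_0 the wealth E_t is a nonnegative supermartingale starting at 1, so Ville's inequality bounds the chance of ever crossing 1/alpha, no matter how often the monitor is checked. A fixed-level test that is re-run on every new window carries no such guarantee across repeated looks.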

Key takeaway

For MLOps engineers managing LLM evaluation pipelines, this anytime-valid attribution scheme removes much of the ambiguity in drift alarms. Your team can see whether an apparent degradation stems from the product or from the LLM judge itself, which cuts false positives caused by judge changes. That allows more targeted interventions and avoids wasted effort investigating product issues that do not exist.

Key insights

The method attributes drift in LLM evaluation scores to either the product or the judge, using a fixed human-labeled anchor set and a betting e-process.

Principles

Anchors must out-run the main process they guard.
Only the judge can move the anchors (one-way identification).

Method

Re-score a fixed human-labeled anchor set with the current judge, apply a second betting e-process, and use a guard-window rule to return a verdict.
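
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the class BettingEProcess, the function attribute_drift, the fixed bet size lam, the baselines m0_product and m0_anchor, and the specific guard-window rule (attribute to the judge if the anchor process alarms before, or within guard_window steps after, the product alarm) are all hypothetical choices made here. Inputs are assumed to be values in [0, 1], for example a failure indicator per product item and a judge/human disagreement indicator per re-scored anchor.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BettingEProcess:
    """One-sided betting e-process for H0: conditional mean of x stays <= m0."""
    m0: float                 # assumed baseline mean under "no drift"
    alpha: float = 0.05       # false-alarm budget
    lam: float = 0.5          # fixed bet size (hypothetical choice)
    log_wealth: float = 0.0   # log of the e-process value, starts at log(1)
    n: int = 0                # observations consumed so far
    alarm_at: Optional[int] = None

    def __post_init__(self) -> None:
        if not 0.0 < self.m0 < 1.0:
            raise ValueError("m0 must lie strictly between 0 and 1")
        if not 0.0 <= self.lam < 1.0 / self.m0:
            raise ValueError("lam must keep every wealth factor positive")

    def update(self, x: float) -> bool:
        """Consume one observation in [0, 1]; return True once alarmed."""
        if not 0.0 <= x <= 1.0:
            raise ValueError("observations must lie in [0, 1]")
        self.n += 1
        self.log_wealth += math.log1p(self.lam * (x - self.m0))
        if self.alarm_at is None and self.log_wealth >= math.log(1.0 / self.alpha):
            self.alarm_at = self.n
        return self.alarm_at is not None


def attribute_drift(
    product_failures: Sequence[float],
    anchor_rounds: Sequence[Sequence[float]],
    m0_product: float,
    m0_anchor: float,
    guard_window: int = 5,
    alpha: float = 0.05,
) -> str:
    """Return 'judge', 'system', 'undecided', or 'none' (assumed rule)."""
    product = BettingEProcess(m0_product, alpha)
    anchors = BettingEProcess(m0_anchor, alpha)
    product_alarm_step: Optional[int] = None

    for step, (failure, batch) in enumerate(zip(product_failures, anchor_rounds)):
        # Anchors are re-scored first and in batches, so they can alarm sooner.
        for disagreement in batch:
            anchors.update(disagreement)
        if anchors.alarm_at is not None:
            return "judge"      # only the judge can move fixed anchors
        if product_alarm_step is None:
            if product.update(failure):
                product_alarm_step = step
        elif step - product_alarm_step >= guard_window:
            return "system"     # anchors stayed quiet through the guard window
    return "none" if product_alarm_step is None else "undecided"
```

For example, attribute_drift([1] * 30, [[0, 0, 0]] * 30, m0_product=0.1, m0_anchor=0.05) lets the product process alarm while the anchors stay quiet, so the sketch returns "system" once the guard window has elapsed. Feeding three anchor observations per product item is one simple way to let the anchor process accumulate evidence faster than the process it guards.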

In practice

Detect silent LLM judge version bumps.
Accurately attribute prompt changes.
Reduce false alarms in drift detection.

Topics

LLM Evaluation
Model Drift
Attribution
LLM Judges
Continuous Evaluation
MLOps

Best for: AI Scientist, Research Scientist, MLOps Engineer, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.