Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines
Summary
The paper addresses an ambiguity in LLM product evaluation pipelines: a drift alarm can mean either that the product got worse or that the LLM judge changed. The proposed monitor re-scores a fixed, human-labeled anchor set with the current judge, runs a betting e-process on the result, and applies a guard-window rule to attribute the drift. The approach is anytime-valid and relies on one-way identification: because the anchor items and their human labels are fixed, only the judge can move the anchor scores. In the reported experiments it detects silent judge version bumps in 60/60 runs with zero misattribution and correctly attributes strict-prompt changes in 110/120 or 240/240 runs, while the industry-default rolling z-test false-alarms 75% of the time. The monitor also runs at approximately 0.64x or 0.21x the cost of strong-judging every item.
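For readers unfamiliar with the term, anytime-validity for an e-process usually rests on Ville's inequality, a standard result. The notation below is ours and is not taken from the paper: $E_t$ is the e-process value after $t$ observations, $H_0$ is the "no drift" hypothesis, and $\alpha$ is the chosen error level.

```latex
\Pr_{H_0}\!\left( \exists\, t \ge 1 : E_t \ge \frac{1}{\alpha} \right) \le \alpha
```

Raising the alarm the first time $E_t$ reaches $1/\alpha$ therefore keeps the false-alarm probability at or below $\alpha$, no matter how often or for how long the monitor is checked.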
Key takeaway
For MLOps engineers managing LLM evaluation pipelines, this anytime-valid attribution system can remove much of the ambiguity in drift alarms. It tells your team whether an apparent degradation comes from the product or from the LLM judge itself, which cuts false positives caused by judge changes. That allows more targeted interventions and avoids wasted effort investigating product issues that do not exist.
Key insights
The method attributes drift in LLM evaluation to either the product or the judge by re-scoring a fixed, human-labeled anchor set and running a betting e-process on the results.
Principles
- Anchors must out-run the main process they guard: a judge change should trip the anchor monitor before it trips the main drift alarm.
- Only the judge can move the anchors (one-way identification).
Method
Re-score a fixed human-labeled anchor set with the current judge, apply a second betting e-process, and use a guard-window rule to return a verdict.
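The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the class `BettingEProcess`, the fixed bet size `lam`, the helper `first_alarm`, and in particular the guard-window rule inside `verdict` are hypothetical stand-ins, since this summary does not give the paper's exact definitions. Each observation is 1 when the judge agrees with the human anchor label (or the product item passes) and 0 otherwise, and the toy streams are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class BettingEProcess:
    """One-sided betting e-process for a drop in a Bernoulli rate below p0.

    Hypothetical simplification: the bet size `lam` is fixed in advance.
    """
    p0: float                      # baseline rate assumed under "no drift" (p0 < 1)
    lam: float                     # bet size, must lie in [0, 1 / (1 - p0)]
    alpha: float = 0.05            # alarm once wealth reaches 1 / alpha
    wealth: float = 1.0
    t: int = 0
    alarm_t: Optional[int] = None  # first step at which the alarm fired

    def __post_init__(self) -> None:
        if not 0.0 <= self.lam <= 1.0 / (1.0 - self.p0):
            raise ValueError("lam must keep every wealth factor non-negative")

    def update(self, x: int) -> None:
        """Update wealth with outcome x (1 = agree/pass, 0 = disagree/fail)."""
        self.t += 1
        self.wealth *= 1.0 + self.lam * (self.p0 - x)
        if self.alarm_t is None and self.wealth >= 1.0 / self.alpha:
            self.alarm_t = self.t


def first_alarm(stream: Iterable[int], proc: BettingEProcess) -> Optional[int]:
    """Feed a stream through the process and return the first alarm step, if any."""
    for x in stream:
        proc.update(x)
    return proc.alarm_t


def verdict(main_alarm_t: Optional[int], anchor_alarm_t: Optional[int],
            guard_window: int) -> str:
    """Assumed guard-window rule, evaluated once the guard window has elapsed:
    blame the judge if the anchor alarm fired no later than `guard_window`
    steps after the main alarm (or fired with no main alarm at all)."""
    if anchor_alarm_t is not None and (
        main_alarm_t is None or anchor_alarm_t <= main_alarm_t + guard_window
    ):
        return "judge"
    if main_alarm_t is not None:
        return "system"
    return "no drift"


calm, shifted = [1] * 20, [0] * 20  # toy data: 20 good steps, then 20 bad steps

# Scenario A: the judge changes at step 21, so anchor and product scores both move.
anchor_a = first_alarm(calm + shifted, BettingEProcess(p0=0.9, lam=2.0))
main_a = first_alarm(calm + shifted, BettingEProcess(p0=0.8, lam=1.0))
verdict_a = verdict(main_a, anchor_a, guard_window=5)

# Scenario B: the product degrades while the judge still matches the anchor labels.
anchor_b = first_alarm(calm + calm, BettingEProcess(p0=0.9, lam=2.0))
main_b = first_alarm(calm + shifted, BettingEProcess(p0=0.8, lam=1.0))
verdict_b = verdict(main_b, anchor_b, guard_window=5)

print(verdict_a, verdict_b)
```

In scenario A the anchor process alarms before the main process, so the rule blames the judge. In scenario B the anchors never alarm, so the same rule blames the system.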
In practice
- Detect silent LLM judge version bumps.
- Accurately attribute prompt changes.
- Reduce false alarms in drift detection.
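To see why an anytime-valid monitor helps with the last point, the toy simulation below compares a repeatedly checked rolling z-test with a betting e-process on streams that contain no drift at all. It is a sketch under our own assumptions and does not reproduce the paper's experiments or its 75% figure: the window size, the bet size `lam`, the baseline rate `P0`, and the function names are all illustrative choices.

```python
import math
import random


def rolling_z_alarms(xs, p0, window=50, z_crit=1.645):
    """True if any rolling window's rate falls z_crit standard errors below p0."""
    se = math.sqrt(p0 * (1.0 - p0) / window)
    total = sum(xs[:window])
    for end in range(window, len(xs) + 1):
        # `total` is the sum over xs[end - window:end] at this point.
        if (total / window - p0) / se < -z_crit:
            return True
        if end < len(xs):
            total += xs[end] - xs[end - window]
    return False


def eprocess_alarms(xs, p0, lam=1.0, alpha=0.05):
    """True if a fixed-bet e-process for a drop below p0 ever reaches 1 / alpha."""
    wealth = 1.0
    for x in xs:
        wealth *= 1.0 + lam * (p0 - x)
        if wealth >= 1.0 / alpha:
            return True
    return False


random.seed(0)
P0, STEPS, RUNS = 0.8, 1000, 200

# Every stream is drift-free: each item passes with probability P0 throughout.
streams = [
    [1 if random.random() < P0 else 0 for _ in range(STEPS)]
    for _ in range(RUNS)
]

z_rate = sum(rolling_z_alarms(s, P0) for s in streams) / RUNS
e_rate = sum(eprocess_alarms(s, P0) for s in streams) / RUNS

print(f"rolling z-test false-alarm rate:    {z_rate:.2f}")
print(f"betting e-process false-alarm rate: {e_rate:.2f}")
```

Both tests are nominally run at the 5% level, but the z-test is re-evaluated on every new window, so its chance of at least one false alarm grows with the length of the stream. The e-process bound holds over the whole run.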
Topics
- LLM Evaluation
- Model Drift
- Attribution
- LLM Judges
- Continuous Evaluation
- MLOps
Best for: AI Scientist, Research Scientist, MLOps Engineer, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.