When LLMs get significantly worse: A statistical approach to detect model degradations

2026-02-12 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

Amazon researchers propose a statistically sound hypothesis-testing framework for detecting accuracy degradations in Large Language Models (LLMs), including drops as small as 0.3%. The framework is based on McNemar's test and addresses a practical problem: because of numerical errors, LLM generations are not robust even to theoretically lossless optimizations, and the accuracy estimates being compared are correlated. The core insight is to compare per-sample scores rather than aggregated task-level metrics, focusing on the "degradation probability." The authors introduce three approaches (Pooled, Max Drop, Fisher's Method) for combining results across multiple benchmarks and provide an open-source implementation within the LM Evaluation Harness. Case studies on Llama-3.1 8B Instruct, Llama-3.3 70B Instruct, and Mistral Small 3.1 show that the method flags degraded models (e.g., INT4 quantized variants, KV-FP8) while correctly identifying lossless optimizations, outperforming naive thresholds on accuracy difference or flip probability.

Key takeaway

For MLOps engineers and AI researchers evaluating LLM optimizations, relying solely on aggregate accuracy differences or flip probabilities is insufficient: it can miss real degradations or raise false positives. Adopt this McNemar's test-based framework instead, especially when the accuracy shifts are subtle. Integrating the provided open-source tool into your evaluation pipelines helps detect model quality changes reliably, even for theoretically lossless optimizations that introduce numerical errors.

Key insights

A statistical framework using McNemar's test reliably detects LLM accuracy degradation by analyzing per-sample score differences.

Principles

Focus on degradation probability, not just flip probability.
Correlated accuracy estimates require specialized statistical tests.
Discarding non-flipping examples increases test efficiency.
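
The distinction between these quantities can be illustrated with a minimal sketch. It assumes binary per-sample scores (1 = correct, 0 = incorrect) and takes "degradation probability" to mean the share of flips that go from correct to incorrect; that reading, and the names `flip_counts`, `summarize`, and `degradation_share`, are illustrative assumptions rather than the paper's definitions.

```python
from typing import Sequence


def flip_counts(baseline: Sequence[int], optimized: Sequence[int]) -> tuple[int, int]:
    """Return (n_down, n_up): correct->incorrect and incorrect->correct flips."""
    if len(baseline) != len(optimized):
        raise ValueError("score lists must cover the same samples in the same order")
    n_down = sum(1 for b, o in zip(baseline, optimized) if b == 1 and o == 0)
    n_up = sum(1 for b, o in zip(baseline, optimized) if b == 0 and o == 1)
    return n_down, n_up


def summarize(baseline: Sequence[int], optimized: Sequence[int]) -> dict[str, float]:
    """Compare two models scored on the same samples."""
    n = len(baseline)
    n_down, n_up = flip_counts(baseline, optimized)
    flips = n_down + n_up
    return {
        # Aggregate view: change in accuracy (negative means the optimized model is worse).
        "accuracy_diff": (n_up - n_down) / n,
        # How often the per-sample outcome changes at all, in either direction.
        "flip_probability": flips / n,
        # Assumed reading of "degradation probability": share of flips that are losses.
        # Samples that do not flip never enter this ratio.
        "degradation_share": n_down / flips if flips else 0.5,
    }


# Toy data, invented for illustration only.
baseline = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
optimized = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
stats = summarize(baseline, optimized)
```

A high flip probability with a degradation share near 0.5 points to noise, while a share well above 0.5 points to a directional loss, which is what a one-sided test is designed to pick up.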

Method

The framework applies an exact one-sided McNemar's test to per-sample scores and aggregates the results across tasks with the Pooled, Max Drop, or Fisher's Method approach, controlling the false positive rate while detecting significant accuracy drops.
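
As a rough sketch of how such a test could be computed, the code below implements the textbook exact one-sided McNemar test (a binomial tail over the flipped samples) plus two of the three aggregations. It is not the authors' LM Evaluation Harness implementation: the function names and the `(n_down, n_up)` input format are assumptions, and Max Drop is left out because this summary does not define it.

```python
import math
from typing import Sequence

# Each task is summarized as (n_down, n_up):
#   n_down = samples the baseline got right and the optimized model got wrong
#   n_up   = samples the baseline got wrong and the optimized model got right
TaskCounts = Sequence[tuple[int, int]]


def mcnemar_exact_one_sided(n_down: int, n_up: int) -> float:
    """P(X >= n_down) for X ~ Binomial(n_down + n_up, 0.5).

    Under the null hypothesis of no degradation, a flip is equally likely
    to go in either direction; samples that do not flip are ignored.
    """
    n = n_down + n_up
    tail = sum(math.comb(n, k) for k in range(n_down, n + 1))
    return tail / 2 ** n


def pooled_p_value(task_counts: TaskCounts) -> float:
    """Pooled aggregation (assumed form): sum flips over tasks, then test once."""
    total_down = sum(d for d, _ in task_counts)
    total_up = sum(u for _, u in task_counts)
    return mcnemar_exact_one_sided(total_down, total_up)


def fisher_p_value(task_counts: TaskCounts) -> float:
    """Fisher's method: combine per-task p-values via -2 * sum(log p)."""
    # Clamp to avoid log(0) if a p-value underflows for very large flip counts.
    p_values = [max(mcnemar_exact_one_sided(d, u), 1e-300) for d, u in task_counts]
    half_stat = -sum(math.log(p) for p in p_values)  # Fisher statistic divided by 2
    # Chi-square survival function with 2k degrees of freedom has a closed form:
    #   exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 0.0
    for i in range(len(p_values)):
        total += term
        term *= half_stat / (i + 1)
    return min(1.0, math.exp(-half_stat) * total)


# Hypothetical flip counts for three benchmarks, invented for illustration.
task_counts = [(30, 12), (8, 9), (21, 10)]
alpha = 0.05
flag_pooled = pooled_p_value(task_counts) < alpha
flag_fisher = fisher_p_value(task_counts) < alpha
```

Pooling treats all benchmarks as one large sample, so large tasks dominate; Fisher's method weighs each task's evidence separately. Because exact p-values are discrete, Fisher's combination tends to be conservative here.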

In practice

Use the provided LM Evaluation Harness script for LLM degradation testing.
Consider trimming datasets to the examples likely to flip, since non-flipping examples do not contribute to the test.
Apply permutation-based tests for non-binary or continuous evaluation metrics.
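
For the non-binary case, one standard option is a paired sign-flip permutation test on the per-sample score differences. The sketch below shows that generic technique; the summary does not specify which permutation scheme the authors use, so the function name, parameters, and defaults here are assumptions.

```python
import random
from typing import Sequence


def paired_permutation_p_value(
    baseline: Sequence[float],
    optimized: Sequence[float],
    n_permutations: int = 10_000,
    seed: int = 0,
) -> float:
    """One-sided p-value for 'the optimized model scores lower than the baseline'.

    Under the null hypothesis the two models are exchangeable on each sample,
    so the sign of every per-sample difference can be flipped at random.
    """
    if len(baseline) != len(optimized):
        raise ValueError("score lists must cover the same samples in the same order")
    diffs = [b - o for b, o in zip(baseline, optimized)]  # positive = degradation
    observed = sum(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if permuted >= observed:
            hits += 1
    # The +1 terms count the observed arrangement itself, so p is never zero.
    return (1 + hits) / (1 + n_permutations)


# Toy continuous scores (e.g., a similarity metric), invented for illustration.
baseline_scores = [0.82, 0.75, 0.91, 0.66, 0.88, 0.79, 0.93, 0.71]
optimized_scores = [0.80, 0.76, 0.85, 0.60, 0.87, 0.70, 0.90, 0.69]
p_value = paired_permutation_p_value(baseline_scores, optimized_scores)
```

With binary scores this reduces to the same null as McNemar's test: only samples whose score changes affect the statistic.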

Topics

LLM Degradation Detection
McNemar's Test
Model Quantization
Statistical Hypothesis Testing
LLM Evaluation Benchmarks

Best for: Research Scientist, MLOps Engineer, AI Engineer, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.