When LLMs get significantly worse: A statistical approach to detect model degradations
Summary
Amazon researchers propose a statistically sound hypothesis-testing framework for detecting accuracy degradations in Large Language Models (LLMs), including drops as small as 0.3%. The framework is based on McNemar's test. It addresses the problem that LLM generations are not robust to theoretically lossless optimizations because of numerical errors, which leads to correlated accuracy estimates. The core insight is to compare per-sample scores rather than aggregated task-level metrics, focusing on the "degradation probability." The authors introduce three approaches for combining results across multiple benchmarks (Pooled, Max Drop, and Fisher's Method) and provide an open-source implementation within the LM Evaluation Harness. Case studies on Llama-3.1 8B Instruct, Llama-3.3 70B Instruct, and Mistral Small 3.1 show that the method flags degraded models (e.g., INT4 quantized variants, KV-FP8) while correctly identifying lossless optimizations, outperforming naive thresholds on accuracy difference or flip probability.
Key takeaway
For MLOps engineers and AI researchers evaluating LLM optimizations, relying solely on aggregate accuracy differences or flip probabilities is insufficient: it can miss real degradations or raise false positives. Adopt this statistically rigorous framework based on McNemar's test instead, especially when accuracy shifts are subtle. Integrating the provided open-source tool into your evaluation pipelines helps detect model quality changes reliably, even for theoretically lossless optimizations that introduce numerical errors.
Key insights
A statistical framework using McNemar's test reliably detects LLM accuracy degradation by analyzing per-sample score differences.
Principles
- Focus on degradation probability, not just flip probability.
- Correlated accuracy estimates require specialized statistical tests.
- Discarding non-flipping examples increases test efficiency.
Method
The framework uses an exact one-sided McNemar's test on per-sample scores, aggregated across tasks via Pooled, Max Drop, or Fisher's methods, to control false positive rates and detect significant accuracy drops.
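To make the method concrete, here is a minimal sketch of an exact one-sided McNemar's test on paired 0/1 scores, plus two ways of combining tasks. It is an illustration under stated assumptions, not the authors' LM Evaluation Harness implementation: the function names are hypothetical, "pooled" is assumed to mean summing the flip counts across tasks before running a single test, and Max Drop is not shown.

```python
import math

def binomial_tail(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5); returns 1.0 when there are no flips."""
    if n == 0:
        return 1.0
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n

def mcnemar_exact_one_sided(baseline, optimized):
    """Return (worse, better, p_value) from paired per-sample 0/1 scores.

    Only flipped samples carry information: under the null hypothesis of
    no degradation, a flip is equally likely to go in either direction.
    """
    worse = sum(1 for a, b in zip(baseline, optimized) if a == 1 and b == 0)
    better = sum(1 for a, b in zip(baseline, optimized) if a == 0 and b == 1)
    return worse, better, binomial_tail(worse, worse + better)

def pooled_pvalue(flip_counts):
    """Assumed pooling: add up (worse, better) counts over tasks, test once."""
    worse = sum(w for w, _ in flip_counts)
    better = sum(b for _, b in flip_counts)
    return binomial_tail(worse, worse + better)

def fisher_combine(p_values):
    """Fisher's method: -2 * sum(log p) is chi-squared with 2k degrees of freedom."""
    k = len(p_values)
    # Half of the Fisher statistic; the floor guards against log(0) on underflow.
    half = -sum(math.log(max(p, 1e-300)) for p in p_values)
    # Closed-form chi-squared survival function, valid for even degrees of freedom.
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

# Toy data: 6 samples flip from correct to wrong, 1 flips from wrong to correct.
baseline = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
optimized = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
worse, better, p = mcnemar_exact_one_sided(baseline, optimized)
print(worse, better, p)
```

A small p-value suggests that the flips lean toward degradation more than chance would explain. Samples that score the same under both models never enter the calculation, which is why non-flipping examples can be discarded.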
In practice
- Use the provided LM Evaluation Harness script for LLM degradation testing.
- Consider trimming datasets to only include examples likely to flip.
- Apply permutation-based tests for non-binary or continuous evaluation metrics.
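For the permutation-based option in the last bullet, one common choice is a paired sign-flip permutation test on per-sample score differences. The sketch below is an assumption about how such a test could look, not the procedure from the paper; the function name, the number of permutations, and the seed are all hypothetical choices.

```python
import random

def paired_permutation_pvalue(baseline, optimized, n_perm=10_000, seed=0):
    """One-sided p-value for 'optimized scores lower than baseline'.

    Under the null hypothesis the sign of each per-sample difference is
    arbitrary, so signs are flipped at random and the observed total drop
    is compared against the resulting distribution.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(baseline, optimized)]
    observed = sum(diffs)
    hits = 0
    for _ in range(n_perm):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if permuted >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Toy continuous scores (e.g., a similarity metric per sample).
same = paired_permutation_pvalue([0.7, 0.8, 0.9], [0.7, 0.8, 0.9])
dropped = paired_permutation_pvalue([0.9] * 20, [0.5] * 20)
print(same, dropped)
```

Identical scores give a p-value of 1.0, while a consistent per-sample drop gives a small one. Unlike McNemar's test, this works with any numeric per-sample score, at the cost of a Monte Carlo approximation.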
Topics
- LLM Degradation Detection
- McNemar's Test
- Model Quantization
- Statistical Hypothesis Testing
- LLM Evaluation Benchmarks
Best for: Research Scientist, MLOps Engineer, AI Engineer, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.