Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment
Summary
A study evaluated fidelity metrics, such as KL divergence (KLD) and perplexity (PPL), as criteria for selecting quantized large language models (LLMs). The researchers tested 28 quantizations of Qwen3.6-35B-A3B and 41 of Devstral-Small-2-24B. Across the full cohort, KLD correlated strongly with benchmark scores (ρ=-0.72 on Qwen, ρ=-0.86 on Devstral). However, the relationship collapses to non-significance in the "silent zone" of near-baseline models, where ρ=+0.00 on Qwen. This collapse persists across 14 measurement variants and at the per-prompt level. The research attributes it to KLD primarily measuring the *volume* of disagreement with a reference model, not the *direction* of those disagreements.
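The full-cohort versus silent-zone contrast can be illustrated with a small sketch. It assumes ρ denotes a Spearman rank correlation (the summary does not say which coefficient was used), and every number below is synthetic, chosen only to show the shape of the effect. The `SILENT_ZONE_KLD` cutoff is a hypothetical threshold, not one taken from the study.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Synthetic (KLD, benchmark score) pairs, NOT data from the study.
# Six near-baseline models with scrambled scores, four visibly damaged ones.
cohort = [
    (0.001, 0.704), (0.002, 0.710), (0.003, 0.700),
    (0.004, 0.706), (0.005, 0.702), (0.006, 0.708),
    (0.05, 0.60), (0.10, 0.55), (0.20, 0.45), (0.40, 0.30),
]

SILENT_ZONE_KLD = 0.01  # hypothetical cutoff for "near-baseline"

silent = [(k, s) for k, s in cohort if k < SILENT_ZONE_KLD]

rho_full = spearman([k for k, _ in cohort], [s for _, s in cohort])
rho_silent = spearman([k for k, _ in silent], [s for _, s in silent])

print(f"full cohort rho = {rho_full:+.2f}")    # strongly negative
print(f"silent zone rho = {rho_silent:+.2f}")  # close to zero
```

In this toy cohort the damaged models alone produce the strong negative correlation; once they are removed, the remaining ranking carries almost no signal.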
Key takeaway
For MLOps engineers deploying quantized LLMs, relying solely on fidelity metrics like KLD for model selection is risky. A high KLD can flag a severely degraded model, but KLD provides no reliable ranking signal among near-baseline candidates. When selecting production-ready quantized models, prioritize comprehensive downstream task evaluation, throughput, and hardware fit.
Key insights
KL divergence tracks disagreement volume, not direction, making it unreliable for ranking near-baseline quantized LLMs.
Principles
- Fidelity metrics lose ranking power in the "silent zone" of near-baseline models.
- KLD consistently tracks the volume of disagreement with a reference model.
- The direction of model disagreements is task-conditional, not universally tracked by KLD.
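A toy example may help show why a divergence can register volume without direction. The sketch below is an illustration constructed for this summary, not the study's analysis: it assumes a two-token vocabulary in which token 0 is the correct answer, and compares two hypothetical quantized models that move probability mass in opposite directions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Reference model is undecided between the correct token (index 0)
# and a wrong token (index 1). All values are illustrative.
ref = [0.5, 0.5]

q_up = [0.6, 0.4]    # hypothetical quantization that favours the correct token
q_down = [0.4, 0.6]  # hypothetical quantization that favours the wrong token

kld_up = kl_divergence(ref, q_up)
kld_down = kl_divergence(ref, q_down)

print(f"KLD when the shift helps: {kld_up:.5f}")
print(f"KLD when the shift hurts: {kld_down:.5f}")
print("same KLD:", math.isclose(kld_up, kld_down))
```

Both variants disagree with the reference by the same amount, so KLD assigns them the same value even though one would raise a task score and the other would lower it.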
Method
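A volume-direction decomposition explains score differences: score = ref_score + vol * (2f-1)/N, where 'vol' is the total volume of disagreement with the reference model and 'f' is the fraction of those disagreements that are improvements. When f = 0.5, improvements and regressions cancel, so the score equals ref_score no matter how large 'vol' is.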
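The decomposition can be checked numerically. The sketch below rests on assumptions the summary does not spell out: that N is the number of benchmark items, that 'vol' counts items whose correctness differs between the quantized and reference models, and that scores are accuracies in [0, 1]. The function name `decompose` and the per-item results are hypothetical.

```python
import math


def decompose(ref_correct, quant_correct):
    """Split a quantized model's score into reference score, volume, and direction.

    Assumes vol = number of items whose correctness flips versus the reference,
    and f = share of those flips that go from wrong to right.
    """
    n = len(ref_correct)
    improved = sum(1 for r, q in zip(ref_correct, quant_correct) if not r and q)
    regressed = sum(1 for r, q in zip(ref_correct, quant_correct) if r and not q)
    vol = improved + regressed
    # f is undefined when nothing flips; 0.5 is a neutral placeholder
    # because the vol * (2f - 1) term is zero in that case anyway.
    f = improved / vol if vol else 0.5
    ref_score = sum(ref_correct) / n
    return ref_score, vol, f, n


# Made-up per-item correctness (1 = correct) on a 10-item benchmark.
ref = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
quant_a = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]  # 4 flips, half of them improvements
quant_c = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]  # 4 flips, mostly regressions

for name, quant in [("quant_a", quant_a), ("quant_c", quant_c)]:
    ref_score, vol, f, n = decompose(ref, quant)
    reconstructed = ref_score + vol * (2 * f - 1) / n
    direct = sum(quant) / n
    print(name, "vol =", vol, "f =", f, "score =", round(reconstructed, 3))
    assert math.isclose(reconstructed, direct)
```

Here `quant_a` and `quant_c` have the same volume of disagreement but different scores, because only the improvement fraction differs.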
In practice
- Use high KLD/PPL values to flag severely damaged quantized models.
- Avoid using KLD to rank or select among near-baseline quantized candidates.
- Prioritize comprehensive downstream task evaluation for model selection.
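One way to turn these three points into a selection routine is sketched below. It is a minimal illustration under stated assumptions, not a procedure from the study: the `Candidate` fields, the `triage` function, the rejection threshold, and all values are hypothetical. KLD is used only as a coarse reject filter; the survivors are ranked by downstream task score, with throughput as a tie-breaker.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    kld: float             # divergence from the reference model
    task_score: float      # downstream benchmark accuracy
    tokens_per_sec: float  # measured throughput on the target hardware


def triage(candidates, kld_reject_threshold):
    """Reject clearly damaged models by KLD, then rank the rest by task score."""
    rejected = [c for c in candidates if c.kld > kld_reject_threshold]
    kept = [c for c in candidates if c.kld <= kld_reject_threshold]
    # KLD is deliberately NOT part of the sort key for the kept models.
    ranked = sorted(
        kept,
        key=lambda c: (c.task_score, c.tokens_per_sec),
        reverse=True,
    )
    return ranked, rejected


# Illustrative values only.
candidates = [
    Candidate("variant_a", kld=0.002, task_score=0.71, tokens_per_sec=40.0),
    Candidate("variant_b", kld=0.004, task_score=0.72, tokens_per_sec=55.0),
    Candidate("variant_c", kld=0.006, task_score=0.71, tokens_per_sec=70.0),
    Candidate("variant_d", kld=0.350, task_score=0.40, tokens_per_sec=90.0),
]

ranked, rejected = triage(candidates, kld_reject_threshold=0.05)
print("ranked:  ", [c.name for c in ranked])
print("rejected:", [c.name for c in rejected])
```

Note that `variant_a` has the lowest KLD but does not rank first; in this sketch the ordering among near-baseline candidates comes entirely from task evaluation and throughput.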
Topics
- LLM Quantization
- Fidelity Metrics
- KL Divergence
- Silent Zone
- Model Evaluation
- Volume-Direction Decomposition
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.