Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment
Summary
A study of fidelity metrics for quantized Large Language Model (LLM) deployment finds that per-token KL divergence (KLD) correlates strongly with benchmark scores across full quantization cohorts, but that the relationship collapses in the near-baseline "silent zone." The researchers tested KLD on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, finding full-cohort correlations of ρ=-0.72 and ρ=-0.86 respectively (both p<0.001). Within the silent zone, the correlations dropped to ρ=+0.00 for Qwen and ρ=-0.24 (p=0.36) for Devstral. The collapse held across 14 measurement variants, including different KLD aggregations and perplexity formulations. KLD also showed only weak per-prompt failure-prediction power on code, with geometric-mean ratios in the range [1.08, 1.22] on LiveCodeBench, and poor cross-model routing accuracy (42.3% to 49.4%). The analysis attributes the breakdown to the fact that KLD measures the volume of disagreement with the baseline, not its direction.
Key takeaway
For Machine Learning Engineers evaluating quantized LLMs, relying solely on fidelity metrics such as per-token KL divergence (KLD) to predict performance is risky. Do not assume that KLD tracks benchmark scores, especially among quantizations that perform near baseline. Because KLD measures how much a quantized model disagrees with the baseline rather than whether that disagreement helps or hurts, run direct downstream benchmark evaluations before making deployment decisions.
Key insights
Fidelity metrics like KLD are unreliable proxies for LLM benchmark quality, especially in near-baseline performance zones.
Principles
- KLD primarily measures disagreement volume, not direction.
- Correlation between KLD and benchmark scores is strong across a full quantization cohort but collapses among near-baseline quantizations.
- KLD offers weak per-prompt failure prediction on code.
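The volume-versus-direction distinction can be illustrated with a toy example. The sketch below is not taken from the study: the three-token distributions, the names quant_a and quant_b, and the choice of token 0 as the correct token are all hypothetical. It shows two quantized outputs that sit at the same KL distance from the baseline while moving probability mass in opposite directions relative to the correct token.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a 3-token vocabulary.
baseline = [0.4, 0.4, 0.2]   # full-precision model
quant_a = [0.5, 0.3, 0.2]    # shifts mass toward token 0
quant_b = [0.3, 0.5, 0.2]    # shifts mass away from token 0
correct_token = 0            # assumed ground-truth token for this position

kld_a = kl_divergence(baseline, quant_a)
kld_b = kl_divergence(baseline, quant_b)

# Signed change in probability assigned to the correct token.
shift_a = quant_a[correct_token] - baseline[correct_token]
shift_b = quant_b[correct_token] - baseline[correct_token]

print(f"KLD A = {kld_a:.6f}, shift on correct token = {shift_a:+.2f}")
print(f"KLD B = {kld_b:.6f}, shift on correct token = {shift_b:+.2f}")
```

Both quantizations report the same KLD, yet one raises the probability of the correct token and the other lowers it. A metric that only sees the distance cannot tell them apart.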
Method
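The study measured KLD on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, compared it against downstream benchmark scores, and analyzed the correlation both across each full cohort and within the near-baseline "silent zone." It also tested how well KLD predicts failures on individual prompts.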
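A full-cohort versus silent-zone comparison of this kind can be sketched with a rank correlation. The example below is a minimal illustration, not the study's code: the (KLD, score) pairs are invented, and the silent zone is assumed here to be the subset of quantizations below a hypothetical KLD threshold, which may differ from the definition used in the original article.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented (mean KLD, benchmark score) pairs for a cohort of quantizations.
cohort = [
    (0.001, 70.1), (0.002, 69.8), (0.003, 70.3), (0.004, 69.9),
    (0.005, 70.2), (0.006, 70.0), (0.05, 68.0), (0.10, 65.0),
    (0.20, 60.0), (0.40, 52.0),
]
SILENT_ZONE_MAX_KLD = 0.01  # hypothetical threshold, not from the study

silent = [(k, s) for k, s in cohort if k < SILENT_ZONE_MAX_KLD]

rho_full = spearman([k for k, _ in cohort], [s for _, s in cohort])
rho_silent = spearman([k for k, _ in silent], [s for _, s in silent])

print(f"full cohort rho = {rho_full:+.2f} (n={len(cohort)})")
print(f"silent zone rho = {rho_silent:+.2f} (n={len(silent)})")
```

In this toy cohort the heavily degraded quantizations drive the full-cohort correlation, while the near-baseline subset shows almost none. That is the pattern the study reports, though the numbers here are illustrative only.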
In practice
- Do not rely solely on KLD for fine-grained quantization evaluation.
- Validate KLD against actual downstream benchmarks.
- Treat KLD as a weak signal for cross-model routing; the study reports routing accuracy of only 42.3% to 49.4%.
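One way to check per-prompt predictive power before trusting KLD is to compare its typical value on failed and passed prompts. The summary above does not define the geometric-mean ratio it reports, so the sketch below assumes one plausible reading: the geometric mean of per-prompt KLD on failed prompts divided by that on passed prompts. The data and variable names are hypothetical.

```python
from statistics import geometric_mean

# Hypothetical per-prompt mean KLD values for one quantization,
# split by whether the quantized model passed or failed the prompt.
kld_passed = [0.010, 0.020, 0.040]
kld_failed = [0.012, 0.022, 0.045]

gm_passed = geometric_mean(kld_passed)
gm_failed = geometric_mean(kld_failed)

# Assumed definition: a ratio near 1 means KLD barely separates
# failures from successes; a large ratio means it separates them well.
gm_ratio = gm_failed / gm_passed

print(f"geometric-mean KLD (passed) = {gm_passed:.4f}")
print(f"geometric-mean KLD (failed) = {gm_failed:.4f}")
print(f"failed / passed ratio       = {gm_ratio:.2f}")
```

A ratio only slightly above 1 means failed prompts have barely higher KLD than passed ones, so KLD alone would be a poor basis for flagging likely failures.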
Topics
- LLM Quantization
- Fidelity Metrics
- KL Divergence
- Benchmark Evaluation
- Model Deployment
- Qwen3.6-35B-A3B
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.