Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

A study evaluated fidelity metrics, such as KL divergence (KLD) and Perplexity (PPL), for selecting quantized Large Language Models (LLMs). Researchers tested 28 Qwen3.6-35B-A3B and 41 Devstral-Small-2-24B quantizations. They found KLD strongly correlated with benchmark scores across the full cohort, with ρ=-0.72 on Qwen and ρ=-0.86 on Devstral. However, this relationship collapses to non-significance in the "silent zone" of near-baseline models, showing ρ=+0.00 on Qwen. This collapse persists across 14 measurement variants and at the per-prompt level. The research attributes this to KLD primarily measuring the *volume* of disagreement with a reference model, not the *direction* of those disagreements.
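The contrast between the full-cohort correlation and the silent-zone correlation can be illustrated with a small rank-correlation sketch. The numbers below are synthetic toy values, not data from the study. The 0.01 KLD cut-off used to define the silent zone is a hypothetical choice, since this summary does not state where the study draws that boundary.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5  # undefined if either input is constant

# Synthetic toy cohort (NOT the study's data): KLD vs. benchmark score.
kld = [0.001, 0.002, 0.003, 0.004, 0.05, 0.2, 0.8]
score = [0.71, 0.73, 0.70, 0.72, 0.66, 0.55, 0.30]

SILENT_ZONE_KLD = 0.01  # hypothetical cut-off, chosen for illustration only
silent = [(k, s) for k, s in zip(kld, score) if k < SILENT_ZONE_KLD]

rho_full = spearman(kld, score)
rho_silent = spearman([k for k, _ in silent], [s for _, s in silent])
print(f"full cohort rho = {rho_full:+.2f}, silent zone rho = {rho_silent:+.2f}")
```

In this toy cohort the three badly degraded models drive a strongly negative full-cohort correlation, while the four near-baseline models show no rank relationship at all.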

Key takeaway

For MLOps engineers deploying quantized LLMs, relying solely on fidelity metrics like KLD for model selection is risky. While high KLD can flag severely degraded models, it provides no reliable ranking signal among near-baseline candidates. You should prioritize comprehensive downstream task evaluation, throughput, and hardware fit for selecting production-ready quantized models.

Key insights

KL divergence tracks disagreement volume, not direction, making it unreliable for ranking near-baseline quantized LLMs.

Principles

Fidelity metrics lose ranking power in the "silent zone" of near-baseline models.
KLD consistently tracks the volume of disagreement with a reference model.
The direction of model disagreements is task-conditional, not universally tracked by KLD.
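A toy example makes the volume-versus-direction distinction concrete. The sketch below assumes KLD is computed as KL(reference || quantized) over a next-token distribution, which is a common convention but is not confirmed by this summary. The three-token vocabulary and all probabilities are invented for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Invented next-token distributions over a 3-token vocabulary.
ref = [0.4, 0.4, 0.2]      # reference model
quant_a = [0.5, 0.3, 0.2]  # shifts probability mass TOWARD token 0
quant_b = [0.3, 0.5, 0.2]  # shifts the same mass AWAY from token 0
correct = 0                # assume token 0 is the correct continuation

kld_a = kl_divergence(ref, quant_a)
kld_b = kl_divergence(ref, quant_b)

# Both quantizations sit at the same distance from the reference...
print(math.isclose(kld_a, kld_b))
# ...but only one of them moved toward the correct token.
print(quant_a[correct] > ref[correct], quant_b[correct] > ref[correct])
```

Both candidates receive the same KLD, yet one helps and one hurts on this token, which is the kind of information a pure distance-to-reference metric cannot carry.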

Method

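A volume-direction decomposition explains score differences: score = ref_score + vol * (2f - 1) / N, where 'vol' is the total disagreement with the reference model and 'f' is the fraction of those disagreements that are improvements. When disagreements split evenly (f = 0.5), the second term vanishes and the score matches the reference regardless of how large 'vol' is.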
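The decomposition can be checked on per-item results. This is a minimal sketch under two assumptions not stated in the summary: that 'vol' counts benchmark items whose correctness flips between the reference and the quantized model, and that N is the number of benchmark items. The function name and the correctness lists are hypothetical.

```python
def decompose(ref_correct, quant_correct):
    """Split a score change into disagreement volume and direction.

    Assumes vol = number of items whose correctness differs between the
    two models, f = share of those flips that are improvements, and
    N = number of benchmark items.
    """
    n = len(ref_correct)
    improved = sum(1 for r, q in zip(ref_correct, quant_correct) if q and not r)
    regressed = sum(1 for r, q in zip(ref_correct, quant_correct) if r and not q)
    vol = improved + regressed
    f = improved / vol if vol else 0.5  # with no flips, direction is moot
    return {
        "n": n,
        "vol": vol,
        "f": f,
        "ref_score": sum(ref_correct) / n,
        "score": sum(quant_correct) / n,
    }

# Invented per-item correctness (1 = correct) on a 10-item benchmark.
ref_correct = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
quant_correct = [1, 0, 1, 1, 0, 1, 1, 1, 1, 0]

d = decompose(ref_correct, quant_correct)
predicted = d["ref_score"] + d["vol"] * (2 * d["f"] - 1) / d["n"]
print(d, predicted)
```

Here four items flip, but two flips are improvements and two are regressions, so f = 0.5 and the quantized score equals the reference score despite the nonzero volume.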

In practice

Use high KLD/PPL values to flag severely damaged quantized models.
Avoid using KLD to rank or select among near-baseline quantized candidates.
Prioritize comprehensive downstream task evaluation for model selection.
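The three recommendations above can be combined into a simple selection routine: use KLD only as a rejection filter, then rank the survivors on downstream results. This is one possible sketch, not a procedure from the study. The Candidate fields, the candidate names, the 0.1 rejection threshold, and the throughput tie-break are all hypothetical and would need tuning for a real deployment.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    kld: float           # fidelity metric against the reference model
    task_score: float    # downstream benchmark score (higher is better)
    tokens_per_s: float  # measured throughput on the target hardware

def triage(candidates, kld_reject=0.1):
    """Reject high-KLD candidates, then rank the rest by task score.

    KLD is used only as a damage flag; it never influences the ranking
    of the candidates that pass the filter.
    """
    rejected = [c for c in candidates if c.kld > kld_reject]
    kept = [c for c in candidates if c.kld <= kld_reject]
    ranked = sorted(kept, key=lambda c: (c.task_score, c.tokens_per_s), reverse=True)
    return ranked, rejected

# Invented candidates for illustration.
candidates = [
    Candidate("quant-a", kld=0.004, task_score=0.71, tokens_per_s=95.0),
    Candidate("quant-b", kld=0.009, task_score=0.73, tokens_per_s=120.0),
    Candidate("quant-c", kld=0.450, task_score=0.52, tokens_per_s=160.0),
]

ranked, rejected = triage(candidates)
print([c.name for c in ranked], [c.name for c in rejected])
```

Note that quant-b ranks first even though its KLD is higher than quant-a's, because ranking among near-baseline candidates relies on the downstream score rather than on fidelity.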

Topics

LLM Quantization
Fidelity Metrics
KL Divergence
Silent Zone
Model Evaluation
Volume-Direction Decomposition

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.