Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

A study evaluated fidelity metrics, such as KL divergence (KLD) and perplexity (PPL), as criteria for selecting quantized Large Language Models (LLMs). The researchers tested 28 quantizations of Qwen3.6-35B-A3B and 41 of Devstral-Small-2-24B. Across each full cohort, KLD correlated strongly with benchmark scores (ρ=-0.72 on Qwen, ρ=-0.86 on Devstral). Within the "silent zone" of near-baseline models, however, the relationship collapses to non-significance (ρ=+0.00 on Qwen). The collapse persists across 14 measurement variants and at the per-prompt level. The research attributes it to KLD primarily measuring the *volume* of disagreement with a reference model, not the *direction* of those disagreements.
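For readers who want the metric spelled out, the following is a minimal sketch of a token-level KLD computation in plain Python. It assumes the divergence is taken as KL(reference || quantized) over next-token distributions and averaged over positions; the summary does not state the direction or aggregation the study used, and the function names here are illustrative.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits: Sequence[float], quant_logits: Sequence[float]) -> float:
    """KL(P_ref || P_quant) in nats for a single token position.

    Assumption: the reference model is the first argument of the divergence.
    """
    if len(ref_logits) != len(quant_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kld(
    ref_positions: Sequence[Sequence[float]],
    quant_positions: Sequence[Sequence[float]],
) -> float:
    """Average per-position KLD over a sequence of token positions."""
    if len(ref_positions) != len(quant_positions) or not ref_positions:
        raise ValueError("need the same non-zero number of positions")
    total = sum(kl_divergence(r, q) for r, q in zip(ref_positions, quant_positions))
    return total / len(ref_positions)
```

Note that this quantity is non-negative and unsigned: it grows whenever the quantized distribution moves away from the reference, regardless of whether the move helps or hurts a downstream answer.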

Key takeaway

For MLOps engineers deploying quantized LLMs, relying solely on fidelity metrics like KLD for model selection is risky. While high KLD can flag severely degraded models, it provides no reliable ranking signal among near-baseline candidates. You should prioritize comprehensive downstream task evaluation, throughput, and hardware fit for selecting production-ready quantized models.
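One way to act on this takeaway is to use KLD only as a coarse rejection gate and rank the survivors on downstream results. The sketch below is illustrative: the `Candidate` fields, the `kld_reject_above` threshold, and the tie-break on throughput are all assumptions, not values or procedures from the study.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """One quantized build; every field is a hypothetical measurement."""
    name: str
    kld: float             # mean KLD against the reference model
    task_score: float      # downstream benchmark score in [0, 1]
    tokens_per_sec: float  # throughput on the target hardware


def select(candidates: list[Candidate], kld_reject_above: float) -> list[Candidate]:
    """Drop clearly degraded builds by KLD, then rank the rest on task results.

    KLD is used only as a gate. Among survivors it is ignored, since the
    study reports no reliable ranking signal from KLD near the baseline.
    """
    survivors = [c for c in candidates if c.kld <= kld_reject_above]
    # Rank by downstream score first, throughput as a tie-breaker.
    return sorted(
        survivors,
        key=lambda c: (c.task_score, c.tokens_per_sec),
        reverse=True,
    )


# Made-up numbers purely to show the call shape.
ranked = select(
    [
        Candidate("build-a", kld=0.01, task_score=0.71, tokens_per_sec=40.0),
        Candidate("build-b", kld=0.02, task_score=0.73, tokens_per_sec=55.0),
        Candidate("build-c", kld=0.90, task_score=0.40, tokens_per_sec=90.0),
    ],
    kld_reject_above=0.10,
)
```

In this toy input, build-b ranks ahead of build-a despite its higher KLD, which is the kind of ordering a KLD-only selection would miss.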

Key insights

KL divergence tracks disagreement volume, not direction, making it unreliable for ranking near-baseline quantized LLMs.

Principles

Method

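A volume-direction decomposition explains score differences: score = ref_score + vol * (2f-1)/N, where 'vol' is the total number of disagreements with the reference, 'f' is the fraction of those disagreements that are improvements, and N (not defined in this summary) is presumably the number of benchmark items. Each improvement adds one correct answer and each regression removes one, so the net change is vol*f - vol*(1-f) = vol*(2f-1); at f = 0.5 the disagreements cancel and the score matches the reference regardless of volume.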
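The decomposition can be checked on per-item results with a short sketch. It assumes that a "disagreement" is a benchmark item whose correctness flips between the reference and the quantized model, and that N is the number of items; neither definition is confirmed by this summary, and the function name is illustrative.

```python
from typing import Sequence


def volume_direction(
    ref_correct: Sequence[bool],
    quant_correct: Sequence[bool],
) -> dict[str, float]:
    """Split a score difference into disagreement volume and direction.

    Assumption: 'vol' counts items whose correctness differs between the
    two models, and 'f' is the share of those flips that are improvements.
    """
    if len(ref_correct) != len(quant_correct) or not ref_correct:
        raise ValueError("need two non-empty result lists of equal length")

    n = len(ref_correct)
    gains = sum(1 for r, q in zip(ref_correct, quant_correct) if q and not r)
    losses = sum(1 for r, q in zip(ref_correct, quant_correct) if r and not q)

    vol = gains + losses
    # With no disagreements f is undefined; 0.5 makes the (2f - 1) term zero.
    f = gains / vol if vol else 0.5
    ref_score = sum(ref_correct) / n

    return {
        "vol": vol,
        "f": f,
        "ref_score": ref_score,
        "score": ref_score + vol * (2 * f - 1) / n,
    }


ref = [True, True, False, False, True]
quant = [True, False, True, True, True]
result = volume_direction(ref, quant)
# The reconstructed score should equal the directly measured accuracy.
direct_score = sum(quant) / len(quant)
```

Here three of five items flip (vol = 3), two of them in the quantized model's favor, so the reconstructed score agrees with the directly measured accuracy up to floating-point rounding. A model with the same volume but f = 0.5 would score exactly at the reference, which is why a volume-like metric cannot rank such candidates.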

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.