Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

2026-06-17 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

A study of fidelity metrics for quantized large language model (LLM) deployment finds that per-token KL divergence (KLD) correlates strongly with benchmark scores across a full cohort of quantizations, but that the relationship largely vanishes in the near-baseline "silent zone." The researchers tested KLD on 28 quantizations of Qwen3.6-35B-A3B and 41 quantizations of Devstral-Small-2-24B, finding full-cohort correlations of ρ=-0.72 and ρ=-0.86 respectively (both p<0.001). Within the silent zone, the correlations fell to ρ=+0.00 for Qwen and ρ=-0.24 (p=0.36) for Devstral. The collapse held across 14 measurement variants, including different KLD aggregations and perplexity formulations. KLD also showed only weak per-prompt failure-prediction power on code, with geometric-mean ratios in the range [1.08, 1.22] on LiveCodeBench, and poor cross-model routing accuracy (42.3%-49.4%). The analysis attributes the breakdown to KLD measuring primarily the volume of disagreement, not its direction.
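As a rough illustration of the metric under discussion, the sketch below computes per-token KLD between a baseline and a quantized model from raw logits, then reduces the per-token values with a few common aggregations. The function names, the toy logits, and the choice of mean, median, and max are assumptions made for this example; the summary does not specify which 14 measurement variants the study used.

```python
import math
from statistics import mean, median


def log_softmax(logits):
    """Numerically stable log-softmax over one token position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kld(base_logits, quant_logits):
    """KL(P_base || P_quant) in nats at a single token position."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def aggregate_kld(base_seq, quant_seq):
    """Reduce per-token KLD over a sequence (illustrative aggregations only)."""
    per_token = [token_kld(b, q) for b, q in zip(base_seq, quant_seq)]
    return {
        "mean": mean(per_token),
        "median": median(per_token),
        "max": max(per_token),
    }


# Toy logits for a 3-token vocabulary at 3 positions (hypothetical values).
base_seq = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0], [1.0, 1.2, 0.9]]
quant_seq = [[1.9, 1.1, 0.1], [0.4, 0.7, 2.8], [1.3, 1.0, 0.9]]

summary = aggregate_kld(base_seq, quant_seq)
```

Every aggregation here summarizes how far the two distributions are apart; none of them records whether probability mass moved toward or away from the correct token.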

Key takeaway

Machine learning engineers evaluating quantized LLMs should not rely solely on fidelity metrics such as per-token KL divergence (KLD) to predict performance. KLD does not reliably track benchmark scores, especially among quantizations that perform near baseline. Prioritize direct downstream benchmark evaluation when making deployment decisions, because KLD primarily measures the volume of disagreement with the baseline, not its direction.

Key insights

Fidelity metrics like KLD are unreliable proxies for LLM benchmark quality, especially in near-baseline performance zones.

Principles

KLD primarily measures disagreement volume, not direction.
Correlation between KLD and benchmark scores is context-dependent.
KLD offers weak per-prompt failure prediction on code.
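The first principle can be made concrete with a toy example. In the sketch below, two hypothetical quantized models shift the same amount of probability mass away from the baseline distribution, so their KLD is identical, yet one shift favors the correct token and the other favors a wrong one. All distributions and names are invented for illustration and are not taken from the study.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def greedy_token(dist):
    """Index of the most probable token (first index wins on ties)."""
    return max(range(len(dist)), key=dist.__getitem__)


CORRECT = 0  # index of the correct next token in this toy vocabulary

# The baseline is undecided between token 0 (correct) and token 1 (wrong).
baseline = [0.4, 0.4, 0.2]

# Both quantized models move 0.1 of probability mass, in opposite directions.
quant_helpful = [0.5, 0.3, 0.2]  # mass moves toward the correct token
quant_harmful = [0.3, 0.5, 0.2]  # mass moves toward the wrong token

kld_helpful = kl_divergence(baseline, quant_helpful)
kld_harmful = kl_divergence(baseline, quant_harmful)

# Same volume of disagreement, opposite outcome under greedy decoding.
same_volume = math.isclose(kld_helpful, kld_harmful)
helpful_is_correct = greedy_token(quant_helpful) == CORRECT
harmful_is_correct = greedy_token(quant_harmful) == CORRECT
```

A metric that reports only the size of the shift cannot tell these two models apart, which is consistent with KLD losing predictive power once all candidates sit close to baseline.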

Method

The study tested KLD on 28 quantizations of Qwen3.6-35B-A3B and 41 quantizations of Devstral-Small-2-24B across downstream benchmarks, analyzing correlation in the full cohort and in the "silent zone," as well as per-prompt failure prediction.
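The full-cohort versus silent-zone comparison can be sketched as a rank correlation computed twice: once over every quantization and once over the near-baseline subset. The code below is a minimal stdlib-only version. The data are invented toy values, and the silent-zone rule (a hypothetical KLD cutoff named SILENT_ZONE_MAX_KLD) is an assumption, since the summary does not say how the study delimits the zone.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / (var_x * var_y) ** 0.5


# Toy cohort: (mean KLD, benchmark score) per quantization. Not study data.
kld = [0.001, 0.002, 0.003, 0.004, 0.05, 0.2, 0.8]
score = [71.0, 70.0, 72.0, 70.5, 66.0, 55.0, 30.0]

SILENT_ZONE_MAX_KLD = 0.01  # hypothetical cutoff for "near baseline"

full_rho = spearman(kld, score)

zone = [(k, s) for k, s in zip(kld, score) if k < SILENT_ZONE_MAX_KLD]
zone_rho = spearman([k for k, _ in zone], [s for _, s in zone])
```

In this toy cohort the heavily degraded quantizations drive the full-cohort correlation, while the near-baseline subset shows no rank relationship at all. With subsets this small, a significance test is also needed before reading anything into the coefficient.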

In practice

Do not rely solely on KLD for fine-grained quantization evaluation.
Validate KLD against actual downstream benchmarks.
Do not use KLD alone for cross-model routing; reported routing accuracy was only 42.3%-49.4%.

Topics

LLM Quantization
Fidelity Metrics
KL Divergence
Benchmark Evaluation
Model Deployment
Qwen3.6-35B-A3B

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.