KVarN: Variance-Normalized KV-Cache Quantization [R]

2026-06-04 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

KVarN is a KV-cache quantization method for more efficient large language model inference. It achieves 3-4x compression by combining Hadamard rotations with variance normalization applied to both axes of the K and V matrices, followed by round-to-nearest quantization. The reported accuracy drop is small, typically 0-1%, on challenging benchmarks such as AIME24, and the method runs faster than the fp16 baseline in vLLM. The design follows from an analysis showing that large quantization errors, caused mainly by problematic token scales in decode settings, have a disproportionate impact on accuracy. KVarN targets these error sources to maintain performance.

Key takeaway

For machine learning engineers optimizing large language model deployment, KVarN offers a way to reduce KV-cache memory footprint and accelerate inference. Consider its 3-4x KV-cache compression for decode-heavy applications such as reasoning or code generation, where it reports speed-ups in vLLM with an accuracy drop of roughly 0-1%. Evaluate it on your own 1B-4B parameter models to confirm the expected benefits.

Key insights

KVarN efficiently quantizes KV-caches using Hadamard rotations and variance-normalization, achieving high compression with minimal accuracy loss.

Principles

Large quantization errors have disproportionate impact.
Outliers in key vectors disrupt uniform quantization.
Hadamard rotations smear outliers, managing variance.
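
The smearing effect in the last principle can be illustrated with a small sketch. The code below is not taken from the KVarN repository; it assumes an orthonormal Walsh-Hadamard transform in Sylvester ordering and a made-up 64-channel key vector with a single outlier. The helper names `hadamard_rotate` and `peak_to_rms` are hypothetical.

```python
import math


def hadamard_rotate(vec):
    """Orthonormal Walsh-Hadamard transform; len(vec) must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform norm-preserving
    return [x * scale for x in out]


def peak_to_rms(vec):
    """Ratio of the largest magnitude to the root-mean-square value."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec))
    return max(abs(x) for x in vec) / rms


# Illustrative key vector (assumption): small values plus one outlier channel.
key = [0.1] * 64
key[5] = 8.0
rotated = hadamard_rotate(key)

print(f"peak/RMS before rotation: {peak_to_rms(key):.2f}")
print(f"peak/RMS after rotation:  {peak_to_rms(rotated):.2f}")
```

Because the transform is orthonormal, the vector's energy is unchanged while the peak-to-RMS ratio falls, so a uniform quantization grid spends fewer of its levels covering one extreme channel.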

Method

KVarN combines Hadamard rotations with variance normalization along both axes of the K and V matrices, then rounds each value to the nearest quantization level.
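
The normalization and rounding steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the rows have already been rotated, uses RMS (variance about zero) as the scale along each axis, picks a single global step size after normalization, and uses a signed 4-bit grid. The names `quantize_two_axis` and `dequantize_two_axis` are hypothetical.

```python
import math


def rms(values):
    """Root-mean-square of a sequence; falls back to 1.0 for an all-zero input."""
    return math.sqrt(sum(v * v for v in values) / len(values)) or 1.0


def quantize_two_axis(mat, bits=4):
    """Normalize rows (tokens), then columns (channels), then round to nearest.

    mat is a list of rows; each row holds one token's channel values.
    """
    row_scale = [rms(row) for row in mat]
    norm = [[v / s for v in row] for row, s in zip(mat, row_scale)]
    col_scale = [rms(col) for col in zip(*norm)]
    norm = [[v / c for v, c in zip(row, col_scale)] for row in norm]

    qmax = 2 ** (bits - 1) - 1  # symmetric signed grid, e.g. -7..7 for 4 bits
    step = max(abs(v) for row in norm for v in row) / qmax or 1.0
    # Round-to-nearest (Python's round() sends exact ties to the even integer).
    codes = [[max(-qmax, min(qmax, round(v / step))) for v in row] for row in norm]
    return codes, row_scale, col_scale, step


def dequantize_two_axis(codes, row_scale, col_scale, step):
    """Invert the scaling to recover an approximation of the original matrix."""
    return [
        [q * step * c * r for q, c in zip(row, col_scale)]
        for row, r in zip(codes, row_scale)
    ]


# Illustrative input (assumption): four tokens whose scales differ by 1000x.
mat = [[(10 ** t) * math.sin(1 + t + 0.7 * c) for c in range(8)] for t in range(4)]
codes, row_scale, col_scale, step = quantize_two_axis(mat)
approx = dequantize_two_axis(codes, row_scale, col_scale, step)

for row, rec in zip(mat, approx):
    err = rms([a - b for a, b in zip(row, rec)])
    print(f"token RMS {rms(row):10.3f}  relative error {err / rms(row):.3f}")
```

In this sketch the per-token scale keeps the relative error of the smallest-magnitude token comparable to that of the largest, which is the kind of token-scale problem the method description points to.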

In practice

Achieve 3-4x KV-cache memory compression.
Improve LLM decode speed in vLLM.
Apply to reasoning, code-gen, agentic LLMs.
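
The memory saving can be checked with simple arithmetic. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, a 32,768-token context) and ignores the storage needed for scales, so it gives an upper bound on the ratio for 4-bit storage, not a measurement of KVarN.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value):
    """Size of the KV cache for one sequence, in bytes."""
    values = 2 * layers * kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bits_per_value / 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=32768)

fp16_bytes = kv_cache_bytes(bits_per_value=16, **shape)
int4_bytes = kv_cache_bytes(bits_per_value=4, **shape)

print(f"fp16 cache:  {fp16_bytes / 2**30:.2f} GiB")
print(f"4-bit cache: {int4_bytes / 2**30:.2f} GiB")
print(f"ratio:       {fp16_bytes / int4_bytes:.1f}x")
```

Under these assumptions the fp16 cache is 4 GiB per sequence and the 4-bit cache is 1 GiB. Stored scale factors add overhead, which is one reason an effective ratio would land below the ideal 4x.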

Topics

KV-Cache Quantization
LLM Inference Optimization
Hadamard Rotations
Variance Normalization
vLLM
Model Compression

Code references

huawei-csl/KVarN

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.