KVarN: Variance-Normalized KV-Cache Quantization [R]

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

KVarN is a KV-cache quantization method for more efficient large language model inference. It compresses the cache 3-4x by applying a Hadamard rotation, normalizing variance along both axes of the K and V matrices, and then rounding to nearest. The reported accuracy drop is small, typically 0-1%, on hard benchmarks such as AIME24, and the method runs faster than the fp16 baseline in vLLM. The design follows from an error analysis: in decode settings, large quantization errors, caused mainly by problematic token scales, have a disproportionate impact on accuracy. KVarN targets those error sources directly to preserve performance.

Key takeaway

For machine learning engineers deploying large language models, KVarN offers a way to cut KV-cache memory and speed up inference. Consider its 3-4x KV-cache compression for decode-heavy workloads such as reasoning or code generation, where it is reported to run faster than fp16 in vLLM with an accuracy drop of roughly 0-1%. Benchmark it on your own 1B-4B parameter models before relying on those gains.
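
To put the 3-4x figure in concrete terms, the following is a rough back-of-the-envelope sketch of KV-cache size. The model shape is hypothetical, and the 4-bit storage width is an assumption (the summary states only the compression ratio). It also ignores stored scales, which add overhead and would push the effective ratio somewhat below 4x.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bits_per_value):
    """Bytes needed to hold K and V for every layer, KV head, and token."""
    # Factor of 2 accounts for storing both the K and the V matrix.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value / 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768, batch_size=1)

fp16_bytes = kv_cache_bytes(**shape, bits_per_value=16)
int4_bytes = kv_cache_bytes(**shape, bits_per_value=4)  # assumed 4-bit codes, no scale overhead
compression = fp16_bytes / int4_bytes

print(f"fp16: {fp16_bytes / 2**30:.1f} GiB, 4-bit: {int4_bytes / 2**30:.1f} GiB, ratio: {compression:.1f}x")
```

Under these assumed dimensions the fp16 cache occupies 4 GiB per sequence and the 4-bit cache 1 GiB, which is where long decode traces stand to gain the most.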

Key insights

KVarN quantizes KV-caches using Hadamard rotations and variance normalization, reaching 3-4x compression with a typical accuracy drop of 0-1%.

Principles

Method

KVarN applies a Hadamard rotation to the K and V matrices, normalizes variance along both axes of each matrix, and then quantizes the KV-cache by rounding each value to the nearest quantization level.
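
The three steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the single normalization pass (per-token first, then per-channel), the symmetric integer grid with one global step size, and the 4-bit default are all choices made here for the example.

```python
import numpy as np


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H


def quantize(X, bits=4, eps=1e-8):
    """Quantize a (tokens, head_dim) slice of K or V. Returns codes plus the scales needed to invert."""
    d = X.shape[1]
    R = hadamard(d) / np.sqrt(d)                     # orthonormal rotation along head_dim
    Z = X @ R
    row_scale = Z.std(axis=1, keepdims=True) + eps   # per-token standard deviation
    Z = Z / row_scale
    col_scale = Z.std(axis=0, keepdims=True) + eps   # per-channel standard deviation
    Z = Z / col_scale
    qmax = 2 ** (bits - 1) - 1                       # symmetric grid, e.g. [-7, 7] for 4 bits
    step = max(np.abs(Z).max(), eps) / qmax          # assumed: one global step size
    codes = np.clip(np.rint(Z / step), -qmax, qmax).astype(np.int8)  # round to nearest
    return codes, row_scale, col_scale, step


def dequantize(codes, row_scale, col_scale, step):
    """Undo the grid scaling, both normalizations, and the rotation."""
    d = codes.shape[1]
    R = hadamard(d) / np.sqrt(d)
    Z = codes.astype(np.float64) * step
    return (Z * col_scale * row_scale) @ R.T


# Usage: synthetic keys with one outlier token, the case per-token scaling is meant to absorb.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 64))
X[3] *= 50.0
codes, row_scale, col_scale, step = quantize(X, bits=4)
X_hat = dequantize(codes, row_scale, col_scale, step)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.3f}")
```

The rotation spreads channel outliers across the head dimension, and the two normalization passes keep any single token or channel from dictating the rounding grid. A production kernel would fuse these steps and pack the codes, which this sketch does not attempt.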

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.