Google’s TurboQuant Compression May Support Faster Inference, Same Accuracy on Less Capable Hardware
Summary
Google Research has introduced TurboQuant, a quantization algorithm designed to compress the Key-Value (KV) caches of large language models by up to 6x. Unveiled on April 15, 2026, the technique compresses cache values to 3.5 bits with near-zero accuracy loss and requires no retraining, which makes it possible to run models with very long context windows on less powerful hardware. TurboQuant works in two steps: data vectors first undergo a randomized Hadamard transform that evens out the distribution of their values, and a Quantized Johnson-Lindenstrauss (QJL) transform then removes bias so that inner products stay accurate. Google claims that 3.5-bit TurboQuant matches 16-bit precision on benchmarks such as LongBench and Needle in a Haystack for Gemma and Mistral models, while early community analysis suggests more modest but still significant real-world gains of 30-40% in memory reduction and processing speed. The technique targets the memory cost of KV caches, which at long context lengths can exceed that of the model weights: a Llama 70B model, for example, requires 328GB of KV cache for a 1M-token context.
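The 328GB figure can be reproduced with simple arithmetic. The sketch below is illustrative only: the helper name `kv_cache_bytes` is hypothetical, and the model shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) is an assumption based on the published Llama 70B configuration rather than anything stated in the article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    # Every token stores one key and one value vector per layer and KV head.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return values_per_token * seq_len * bits_per_value / 8


# Assumed Llama 70B shape (grouped-query attention); not given in the article.
LLAMA_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes(**LLAMA_70B, seq_len=1_000_000, bits_per_value=16)
q35 = kv_cache_bytes(**LLAMA_70B, seq_len=1_000_000, bits_per_value=3.5)

print(f"16-bit KV cache:  {fp16 / 1e9:.1f} GB")   # 327.7 GB
print(f"3.5-bit KV cache: {q35 / 1e9:.1f} GB")    # 71.7 GB
print(f"ratio:            {fp16 / q35:.2f}x")     # 4.57x
```

Under these assumptions the cache grows linearly with `seq_len`, and 3.5 bits per value is about 4.6x smaller than 16 bits. The estimate ignores any per-block scale or metadata overhead a real quantizer would add.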
Key takeaway
For NLP engineers and research scientists optimizing LLM inference, TurboQuant is a notable advance in managing KV cache memory. If your projects involve long context windows, adopting it could substantially reduce VRAM requirements and improve processing speed, potentially letting you deploy larger models or longer contexts on existing hardware. Evaluate its 3.5-bit compression on models like Gemma and Mistral, keeping in mind that early community analysis puts real-world improvements at around 30-40% rather than the headline 6x.
Key insights
TurboQuant compresses LLM KV caches up to 6x with minimal accuracy loss, enabling long contexts on less capable hardware.
Principles
- KV cache memory cost grows linearly with token sequence length.
- Outliers in KV cache values hinder low-bit quantization.
- Relieving memory bottlenecks is key to efficient LLM inference.
Method
TurboQuant uses a two-step approach: a randomized Hadamard transform normalizes the distribution of vector values, and a Quantized Johnson-Lindenstrauss (QJL) transform then removes bias and maintains inner product accuracy.
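The two steps can be illustrated with a small NumPy sketch. This is not Google's implementation: the function names are hypothetical, the uniform absmax quantizer stands in for whatever codebook TurboQuant actually uses, a dense Hadamard matrix replaces the fast transform for readability, and applying the QJL sign sketch to the quantization residual is an assumption about how the two steps compose.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def randomized_hadamard(d, rng):
    """Random sign flips followed by a Hadamard transform (an orthogonal rotation)."""
    signs = rng.choice([-1.0, 1.0], size=d)
    return hadamard(d) * signs  # column scaling, i.e. H @ diag(signs)


def quantize_uniform(x, bits):
    """Symmetric uniform quantizer with one absmax scale (illustrative stand-in)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / levels
    return np.round(x / scale) * scale


def qjl_inner_product(q, r, S):
    """Estimate <q, r> using only sign(S @ r) and the norm of r.

    For Gaussian rows s of S, E[(s.q) * sign(s.r)] = sqrt(2/pi) * <q, r> / ||r||,
    so rescaling by sqrt(pi/2) * ||r|| gives an unbiased estimate.
    """
    m = S.shape[0]
    return np.sqrt(np.pi / 2) * np.linalg.norm(r) / m * np.dot(S @ q, np.sign(S @ r))


rng = np.random.default_rng(0)
d, m = 128, 4096
key = rng.standard_normal(d)
key[5] = 40.0                      # a single outlier channel
query = rng.standard_normal(d)

# Step 1: rotate, then quantize. The rotation spreads the outlier over all coordinates.
R = randomized_hadamard(d, rng)
key_rot, query_rot = R @ key, R @ query
err_direct = np.linalg.norm(key - quantize_uniform(key, 4))
key_q = quantize_uniform(key_rot, 4)
err_rot = np.linalg.norm(key_rot - key_q)

# Step 2: correct the inner product with a 1-bit QJL sketch of the residual.
S = rng.standard_normal((m, d))
residual = key_rot - key_q
estimate = key_q @ query_rot + qjl_inner_product(query_rot, residual, S)

print("4-bit error without rotation:", err_direct)
print("4-bit error with rotation:   ", err_rot)
print("true <q, k>:", key @ query, " estimate:", estimate)
```

Because the rotation is orthogonal, inner products between rotated queries and keys equal the originals, so attention scores can be computed in the rotated space. In this toy setup the rotated vector quantizes with much lower error because no single coordinate dominates the scale.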
In practice
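- Potentially run Llama 70B with a 1M-token context on a single H100 GPU: at 6x compression the 328GB KV cache shrinks to roughly 55GB, though the model weights still need memory of their own.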
- Reduce VRAM for long context LLM inference by 30-40%.
- Process large documents or codebases more cheaply.
Topics
- TurboQuant
- LLM Quantization
- Key-Value Cache
- Long Context Windows
- Hadamard Transform
Best for: NLP Engineer, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.