STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control
Summary
STAR-KV is an adaptive low-rank KV cache compression framework. It targets a limitation of prior methods, which rely on fixed or heuristic rank selection and therefore struggle to compress aggressively while keeping accuracy degradation minimal. The framework combines three mechanisms: a differentiable thresholding mechanism that selects ranks at both the attention-head and block levels, a hybrid decomposition strategy that applies different low-rank factorizations depending on the sensitivity of the key and value projections, and low-rank-aware mixed-precision quantization that uses data statistics to achieve near-lossless low-bit quantization. Evaluated across multiple large language models (LLMs) and benchmarks, STAR-KV achieves up to 75% KV cache compression, and up to 20x overall KV cache reduction when combined with quantization. With custom Triton-based GPU kernels, it delivers up to 6.9x speedup for the attention module and up to 3.1x end-to-end generation throughput.
Key takeaway
For machine learning engineers optimizing LLM inference, STAR-KV is a way to reduce KV cache memory and raise throughput at the same time. Consider this adaptive low-rank compression framework if you need up to 75% KV cache compression and up to 3.1x end-to-end generation throughput, especially when deploying models on resource-constrained hardware. Its publicly available code shows how to implement fine-grained rank control and mixed-precision quantization.
Key insights
STAR-KV compresses the KV cache adaptively, selecting ranks per attention head and per block rather than fixing them in advance, which reduces cache memory and speeds up attention and generation.
Principles
- Low-rank projection exploits hidden-dimension redundancy.
- Differentiable thresholding optimizes rank selection.
- Hybrid decomposition adapts to projection sensitivity.
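The first principle can be illustrated with a small, dependency-free sketch. It assumes a key projection that factorizes exactly into a down-projection `A` and an up-projection `B`; the matrices, sizes, and helper names are hypothetical and are not taken from STAR-KV's code. In a real model the factorization is only approximate (for example, from a truncated SVD), so the rebuilt key would carry a small error.

```python
# Illustrative sketch: cache a rank-r latent instead of the full key vector.
# All names, sizes, and matrices are assumptions for demonstration only.

def matvec(x, W):
    """Multiply row vector x (length n) by matrix W (n x m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def matmul(A, B):
    """Matrix product of A (n x r) and B (r x m)."""
    return [matvec(row, B) for row in A]

d, r = 4, 2  # hidden size and assumed rank, with r < d

# Assume the key projection factorizes exactly as W_k = A @ B.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]   # d x r down-projection
B = [[0.5, 1.0, -1.0, 0.0], [2.0, 0.0, 1.0, 1.0]]       # r x d up-projection
W_k = matmul(A, B)                                       # d x d, rank at most r

x = [1.0, 2.0, 3.0, 4.0]          # one token's hidden state

full_key = matvec(x, W_k)         # a standard cache stores d numbers per token
latent = matvec(x, A)             # a low-rank cache stores only r numbers
rebuilt_key = matvec(latent, B)   # the key is reconstructed at attention time

saving = 1 - r / d                # fraction of per-token key storage avoided
print(full_key, rebuilt_key, saving)
```

Here the cached latent has half as many entries as the full key, yet the reconstructed key matches it because the assumed projection has rank 2.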
Method
STAR-KV employs differentiable thresholding for adaptive rank selection, a hybrid decomposition strategy for key/value projections, and low-rank-aware mixed precision quantization.
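To make the rank-selection idea concrete, the following minimal sketch applies the standard soft-thresholding operator, max(sigma - tau, 0), to a list of singular values and counts how many survive. This is the textbook operator, shown under the assumption that it resembles what the paper uses; the singular values, the threshold values, and the function names are hypothetical, and the exact STAR-KV formulation may differ.

```python
# Sketch of rank selection by soft thresholding singular values.
# The operator is the standard one; the data and names are illustrative.

def soft_threshold(singular_values, tau):
    """Shrink each non-negative singular value by tau, clamping at zero."""
    return [max(s - tau, 0.0) for s in singular_values]

def effective_rank(singular_values, tau):
    """Count the singular values that remain non-zero after shrinkage."""
    return sum(1 for s in soft_threshold(singular_values, tau) if s > 0.0)

# Hypothetical spectrum of one attention head's projection matrix.
sigma = [9.0, 4.0, 1.5, 0.4, 0.1]

for tau in (0.0, 0.5, 2.0):
    print(tau, soft_threshold(sigma, tau), effective_rank(sigma, tau))
```

Because the shrunken values change continuously with `tau` (piecewise linearly, with a subgradient everywhere), a threshold of this kind could in principle be trained by gradient descent per head or per block, letting each one settle on its own rank. A larger threshold yields a lower rank.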
In practice
- Achieve up to 75% KV cache compression.
- Use the custom Triton-based GPU kernels for up to 6.9x attention speedup.
- Combine with quantization for up to 20x KV cache reduction.
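As a rough sanity check on how these figures might relate, the arithmetic below converts the 75% compression figure into a reduction factor and asks what average bit width would be needed to reach 20x. It rests on assumptions the summary does not confirm: that the two savings multiply, that the 20x figure corresponds to the 75% compression setting, that the baseline cache is stored in 16-bit precision, and that quantization metadata such as scales is negligible.

```python
# Back-of-the-envelope arithmetic under the assumptions stated above.
# None of the derived numbers below are reported by the source.

def reduction_factor(compression_fraction):
    """Convert a fractional size reduction (e.g. 0.75) to an 'N x' factor."""
    return 1.0 / (1.0 - compression_fraction)

BASELINE_BITS = 16           # assumed FP16/BF16 KV cache
TOTAL_REDUCTION = 20.0       # "up to 20x" with quantization

low_rank_factor = reduction_factor(0.75)          # 75% smaller means 4x
quant_factor = TOTAL_REDUCTION / low_rank_factor  # remaining factor from quantization
implied_bits = BASELINE_BITS / quant_factor       # average bits per cached value

print(low_rank_factor, quant_factor, implied_bits)
```

Under these assumptions, low-rank compression alone accounts for 4x, leaving about 5x for quantization, which would correspond to an average of roughly 3.2 bits per cached value.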
Topics
- KV Cache Compression
- Low-Rank Projection
- Large Language Models
- Mixed Precision Quantization
- Triton GPU Kernels
- Adaptive Rank Control
Best for: MLOps Engineer, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.