STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

Source: Machine Learning · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, quick

Summary

STAR-KV is an adaptive low-rank KV cache compression framework. It targets a limitation of prior methods, whose fixed or heuristic rank selection often fails to deliver aggressive compression while keeping accuracy degradation minimal. The framework combines three mechanisms: a differentiable thresholding mechanism for rank selection at both the attention-head and block levels; a hybrid decomposition strategy that applies different low-rank factorizations depending on the sensitivity of the key and value projections; and low-rank-aware mixed-precision quantization that uses data statistics to achieve near-lossless low-bit quantization. Evaluated across multiple large language models (LLMs) and benchmarks, STAR-KV achieves up to 75% KV cache compression, and up to 20x overall KV cache reduction when combined with quantization. With custom Triton-based GPU kernels, it delivers up to a 6.9x speedup for the attention module and up to 3.1x higher end-to-end generation throughput.
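To put the reported ratios in context, the following back-of-the-envelope sketch computes KV cache size for a hypothetical model. The model dimensions, the helper name `kv_cache_bytes`, and the 3.2-bit average width are illustrative assumptions, not figures from the paper; the sketch also assumes the low-rank and quantization savings simply multiply and ignores quantization metadata such as scales and zero points.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value=16.0, kept_rank_fraction=1.0):
    """Approximate KV cache size in bytes for one sequence.

    kept_rank_fraction is the share of the per-head dimension that is
    still cached after low-rank compression (1.0 means uncompressed).
    """
    # Factor 2 accounts for storing both keys and values.
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return num_values * kept_rank_fraction * bits_per_value / 8


# Hypothetical configuration (assumption): 32 layers, 32 KV heads,
# head dimension 128, 4096 cached tokens, 16-bit baseline.
dims = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

baseline = kv_cache_bytes(**dims)
# 75% compression means only a quarter of the dimensions are cached.
low_rank = kv_cache_bytes(**dims, kept_rank_fraction=0.25)
# One combination that would give about 20x: 4x from rank reduction and
# 5x from an assumed average of 3.2 bits per value.
combined = kv_cache_bytes(**dims, kept_rank_fraction=0.25, bits_per_value=3.2)

print(f"baseline: {baseline / 2**30:.2f} GiB")
print(f"low-rank only: {baseline / low_rank:.1f}x smaller")
print(f"low-rank + quantization: {baseline / combined:.1f}x smaller")
```

Under these assumptions, a 75% reduction corresponds to a 4x smaller cache, so reaching 20x overall would require roughly a further 5x from quantization.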

Key takeaway

For machine learning engineers optimizing LLM inference, STAR-KV offers a way to cut KV cache memory and raise throughput. Consider integrating this adaptive low-rank compression framework to achieve up to 75% KV cache compression and up to a 3.1x end-to-end generation speedup, especially when deploying models on resource-constrained hardware. Its code is publicly available and can serve as a starting point for implementing fine-grained rank control and mixed-precision quantization.

Key insights

STAR-KV compresses the KV cache adaptively through fine-grained rank control at the attention-head and block levels, reaching up to 75% compression and up to 3.1x higher end-to-end generation throughput.

Method

STAR-KV employs differentiable thresholding for adaptive rank selection, a hybrid decomposition strategy for the key and value projections, and low-rank-aware mixed-precision quantization.
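As a rough illustration of how thresholding can control rank, here is a minimal NumPy sketch that applies the standard soft-thresholding operator to the singular values of a projection matrix and keeps only the components that survive. This is not the STAR-KV implementation: the function names, the fixed threshold `tau`, and the synthetic weight matrix are assumptions for illustration. How the paper parameterizes and learns its thresholds per head and per block is not covered in this summary.

```python
import numpy as np


def soft_threshold(singular_values, tau):
    """Shrink each singular value by tau and clip at zero."""
    return np.maximum(singular_values - tau, 0.0)


def low_rank_factors(weight, tau):
    """Factor weight (d_in x d_out) as a @ b using thresholded singular values.

    The rank is the number of singular values that remain positive after
    shrinkage, so a larger tau yields a smaller rank.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    s_shrunk = soft_threshold(s, tau)
    # Singular values are sorted in descending order, so the survivors
    # are always the leading entries.
    rank = int(np.count_nonzero(s_shrunk))
    a = u[:, :rank] * s_shrunk[:rank]  # shape (d_in, rank)
    b = vt[:rank, :]                   # shape (rank, d_out)
    return a, b, rank


# Synthetic projection (assumption): rank-8 structure plus small noise.
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
weight += 0.01 * rng.standard_normal((64, 64))

a, b, rank = low_rank_factors(weight, tau=1.0)

# At inference, the low-dimensional latent would be cached instead of the
# full key or value; the full vector is recovered on demand via b.
hidden = rng.standard_normal((5, 64))  # 5 tokens, hidden size 64
latent = hidden @ a                    # cached, shape (5, rank)
reconstructed = latent @ b             # shape (5, 64)
print(rank, latent.shape, reconstructed.shape)
```

Because soft thresholding is continuous in `tau` and differentiable almost everywhere, a threshold of this kind can in principle be trained by gradient descent, which is the property that makes thresholding-based rank selection attractive compared with picking a fixed rank by hand.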

Best for: MLOps Engineer, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.