CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference
Summary
CompressKV is a KV-cache compression framework for GQA-based large language models that targets the large memory footprint and decoding cost of long-context inference. Traditional eviction methods apply heuristic token scoring across all attention heads; CompressKV instead identifies Semantic Retrieval Heads (SRHs), the heads that capture the initial and final prompt tokens as well as semantically important mid-context evidence, and uses them to decide which KV pairs to retain. It also distributes the cache budget across layers according to pre-estimated layer-wise eviction errors. In experiments on LongBench question-answering tasks, CompressKV maintains over 97% of full-cache performance while using only 3% of the KV cache. On Needle-in-a-Haystack, it achieves 90% accuracy with just 0.7% KV storage, improving the resource-performance trade-off compared with existing methods.
Key takeaway
For AI engineers deploying long-context LLMs on memory-constrained hardware, CompressKV offers a substantial efficiency gain. The reported results are over 97% of full-cache performance on LongBench question answering with 3% of the KV cache, and 90% accuracy on Needle-in-a-Haystack with 0.7% KV storage. That makes inference cheaper and more sustainable. Consider integrating this semantic-retrieval-guided compression to cut memory use while keeping accuracy close to full-cache levels on tasks like question answering or information retrieval.
Key insights
CompressKV improves long-context LLM inference efficiency by selectively compressing KV caches using Semantic Retrieval Heads and layer-wise budget allocation.
Principles
- Attention heads exhibit distinct functional roles.
- Semantic importance dictates KV-cache token retention.
- Layer-wise error estimation optimizes cache budget allocation.
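The third principle can be illustrated with a small sketch. It assumes the budget is split in proportion to each layer's estimated eviction error, with a guaranteed minimum per layer; this summary does not state the exact allocation rule CompressKV uses, so the proportional rule, the function name `allocate_layer_budgets`, and the `min_per_layer` parameter are all hypothetical.

```python
def allocate_layer_budgets(layer_errors, total_budget, min_per_layer=1):
    """Split total_budget KV slots across layers.

    Assumption (not from the source): layers with a larger estimated
    eviction error receive a proportionally larger share of the budget.
    """
    n = len(layer_errors)
    if total_budget < n * min_per_layer:
        raise ValueError("total_budget is too small for the per-layer minimum")

    spare = total_budget - n * min_per_layer
    total_error = sum(layer_errors)
    if total_error == 0:
        shares = [spare / n] * n  # no error signal: split evenly
    else:
        shares = [spare * e / total_error for e in layer_errors]

    budgets = [min_per_layer + int(s) for s in shares]

    # Slots lost to rounding down go to the largest fractional remainders.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Hypothetical offline error estimates for a 3-layer toy model.
print(allocate_layer_budgets([1, 3, 6], total_budget=100, min_per_layer=10))
# → [17, 31, 52]
```

The per-layer minimum keeps any single layer from being starved of cache entirely, which is a design choice of this sketch rather than something the source describes.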
Method
CompressKV identifies Semantic Retrieval Heads (SRHs) to select critical tokens (initial, final, semantic mid-context) for KV-pair retention. It then allocates cache budgets across layers based on offline estimates of eviction error.
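The token-selection step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes per-head attention scores over the prompt tokens are already available, that the SRH indices have been identified beforehand, and that SRH scores are combined by a simple mean. The names `select_tokens_to_keep`, `head_scores`, and `srh_indices` are hypothetical.

```python
def select_tokens_to_keep(head_scores, srh_indices, budget):
    """Return the positions of the tokens whose KV pairs are retained.

    head_scores: one list of per-token attention scores per head.
    srh_indices: indices of the heads treated as Semantic Retrieval Heads.
    budget:      number of token positions to keep.

    Assumption (not from the source): SRH scores are averaged, and the
    highest-scoring tokens are kept.
    """
    if not srh_indices:
        raise ValueError("at least one SRH index is required")

    seq_len = len(head_scores[0])
    combined = [
        sum(head_scores[h][t] for h in srh_indices) / len(srh_indices)
        for t in range(seq_len)
    ]

    budget = min(budget, seq_len)
    ranked = sorted(range(seq_len), key=lambda t: combined[t], reverse=True)
    # Return positions in prompt order so the retained cache stays ordered.
    return sorted(ranked[:budget])


# Toy example: heads 0 and 1 act as SRHs, head 2 does not.
scores = [
    [0.5, 0.1, 0.1, 0.1, 0.2],
    [0.4, 0.0, 0.3, 0.0, 0.3],
    [0.0, 0.9, 0.0, 0.1, 0.0],
]
print(select_tokens_to_keep(scores, srh_indices=[0, 1], budget=3))
# → [0, 2, 4]
```

In the toy example, head 2 puts most of its weight on token 1, but because it is not treated as an SRH it has no influence on which tokens are kept. Scoring across all heads would have retained token 1 instead.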
In practice
- Preserve over 97% of full-cache performance on LongBench question answering with 3% of the KV cache.
- Achieve 90% accuracy on Needle-in-a-Haystack using 0.7% KV storage.
- Deploy long-context LLMs on resource-constrained hardware.
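To put the retention ratios in concrete terms, the standard KV-cache size estimate (two tensors, keys and values, per layer, per KV head, per token) can be computed directly. The model configuration below is hypothetical and chosen only for illustration; it is not taken from the CompressKV experiments, and the function name `kv_cache_bytes` is likewise an assumption.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Estimate KV-cache size in bytes for a single sequence.

    The factor of 2 covers one key vector and one value vector per token,
    per KV head, per layer. bytes_per_value=2 corresponds to fp16/bf16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical GQA model: 32 layers, 8 KV heads, head dimension 128,
# 128K-token context, 16-bit cache values.
context_len = 131072
full = kv_cache_bytes(32, 8, 128, context_len)

# Keep 3% of the tokens overall (the LongBench setting quoted above).
kept_tokens = int(context_len * 0.03)
compressed = kv_cache_bytes(32, 8, 128, kept_tokens)

print(f"full cache:       {full / 2**30:.2f} GiB")
print(f"compressed cache: {compressed / 2**30:.2f} GiB")
```

For this configuration the full cache is 16 GiB per sequence. The sketch applies the 3% ratio uniformly, whereas CompressKV distributes the budget unevenly across layers, so only the total is comparable.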
Topics
- KV-Cache Compression
- Long-Context LLMs
- GQA Models
- Semantic Retrieval Heads
- LLM Inference Optimization
- Memory Efficiency
Best for: NLP Engineer, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.