CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference
Summary
CompressKV is a KV-cache compression framework for resource-efficient long-context Large Language Model (LLM) inference, specifically targeting GQA-based architectures. It addresses the memory and decoding-cost bottleneck that KV caches create on constrained hardware, a problem that existing heuristic eviction methods tackle only at the price of degraded performance. CompressKV identifies "Semantic Retrieval Heads" (SRHs), attention heads that capture critical initial, final, and semantically important mid-context tokens. These SRHs, rather than scores aggregated from all attention heads, guide the selection of KV pairs to retain. The framework also allocates cache budgets across layers based on offline eviction error estimates. Experiments on LongBench and Needle-in-a-Haystack show that it preserves over 97% of full-cache performance with only 3% of the KV cache on LongBench QA tasks and reaches 90% accuracy with 0.7% of KV storage on Needle-in-a-Haystack.
Key takeaway
For AI engineers optimizing long-context LLM deployments on resource-constrained hardware, CompressKV's semantic-retrieval-guided KV-cache compression is worth considering. It substantially reduces memory footprint while maintaining performance: on LongBench QA tasks it preserves over 97% of full-cache performance with only 3% of the KV cache. Explore integrating this method to improve throughput and reduce operational costs for your GQA-based LLMs.
Key insights
CompressKV uses Semantic Retrieval Heads and layer-wise budget allocation to efficiently compress KV caches for long-context LLMs.
Principles
- Attention heads play distinct roles, so they are not equally reliable for judging token importance.
- Allocating the cache budget layer by layer, rather than uniformly, makes eviction more efficient.
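The second principle can be illustrated with a small sketch. The summary only says that budgets are allocated "based on offline eviction error estimates"; the proportional rule below, the function name `allocate_layer_budgets`, and the `min_per_layer` floor are all assumptions made for illustration, not the paper's actual allocation scheme.

```python
def allocate_layer_budgets(layer_errors, total_budget, min_per_layer=1):
    """Split total_budget KV slots across layers.

    ASSUMPTION: layers with a larger offline eviction error estimate
    receive a proportionally larger share. The source does not state
    the exact rule; this is one plausible reading.
    """
    n = len(layer_errors)
    if total_budget < n * min_per_layer:
        raise ValueError("total_budget too small for the per-layer minimum")

    # Reserve the floor for every layer, then share out the rest.
    remaining = total_budget - n * min_per_layer
    total_error = sum(layer_errors)
    if total_error == 0:
        shares = [remaining / n] * n
    else:
        shares = [remaining * e / total_error for e in layer_errors]

    budgets = [min_per_layer + int(s) for s in shares]

    # Rounding down may leave a few slots unassigned; give them to the
    # layers with the largest fractional remainders.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Hypothetical error estimates for a 4-layer toy model.
errors = [1.0, 4.0, 3.0, 2.0]
budgets = allocate_layer_budgets(errors, total_budget=1000, min_per_layer=50)
print(budgets)  # [130, 370, 290, 210]
```

The floor keeps every layer from being starved entirely, which seems a sensible safeguard under this assumed rule; the total always sums to `total_budget`.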
Method
CompressKV identifies Semantic Retrieval Heads (SRHs) to select critical tokens for retention, then allocates cache budgets across layers using offline eviction error estimates.
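The token-selection half of this method can be sketched as follows. This is a minimal illustration, not the paper's implementation: how SRHs are identified is not described in this summary, so the indices in `guide_heads` are simply taken as given, and the names `select_tokens`, `n_initial`, and `n_recent` are hypothetical. The sketch keeps the initial and final tokens unconditionally and fills the remaining budget with the mid-context tokens that the guide heads score highest, ignoring all other heads.

```python
def select_tokens(head_scores, guide_heads, budget, n_initial=4, n_recent=8):
    """Return sorted indices of tokens whose KV pairs are kept.

    head_scores[h][t] is the importance head h assigns to token t
    (for example, an attention weight). Only heads listed in
    guide_heads influence the choice of mid-context tokens.
    """
    seq_len = len(head_scores[0])
    if budget >= seq_len:
        return list(range(seq_len))

    # Always keep the first and last tokens of the context.
    keep = set(range(min(n_initial, seq_len)))
    keep.update(range(max(0, seq_len - n_recent), seq_len))
    if len(keep) > budget:
        raise ValueError("budget is smaller than the initial + recent windows")

    # Rank the remaining tokens by the guide heads' scores only.
    candidates = [t for t in range(seq_len) if t not in keep]
    candidates.sort(
        key=lambda t: sum(head_scores[h][t] for h in guide_heads),
        reverse=True,
    )
    keep.update(candidates[: budget - len(keep)])
    return sorted(keep)


# Toy example: 3 heads over 12 tokens. Head 1 is a "noisy" head that
# strongly favors token 3; heads 0 and 2 are the assumed guide heads.
seq_len = 12
scores = [[0.0] * seq_len for _ in range(3)]
scores[0][5], scores[0][7] = 0.9, 0.3
scores[1][3] = 5.0
scores[2][5], scores[2][7] = 0.1, 0.8

kept = select_tokens(scores, guide_heads=[0, 2], budget=5, n_initial=1, n_recent=2)
print(kept)  # [0, 5, 7, 10, 11]
```

Token 3 is dropped even though head 1 scores it far above everything else, which is the point of restricting the ranking to selected heads instead of aggregating over all of them.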
In practice
- Achieve >97% of full-cache performance with 3% of the KV cache on LongBench QA tasks.
- Attain 90% accuracy using 0.7% KV storage on Needle-in-a-Haystack.
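To put these percentages in absolute terms, the back-of-the-envelope calculation below estimates KV-cache size for a GQA model. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 32,768-token context are illustrative assumptions, not figures from the paper, and the sketch applies the 3% retention uniformly for simplicity even though CompressKV distributes its budget unevenly across layers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the KV cache for one sequence, in bytes."""
    # Leading 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# ASSUMED model shape and context length, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
context_len = 32_768

full = kv_cache_bytes(seq_len=context_len, **shape)

# Keep 3% of the tokens' KV pairs (uniform simplification).
retained_tokens = int(context_len * 0.03)
compressed = kv_cache_bytes(seq_len=retained_tokens, **shape)

GIB = 1024 ** 3
print(f"full cache:       {full / GIB:.2f} GiB")
print(f"compressed cache: {compressed / GIB:.2f} GiB ({retained_tokens} tokens)")
```

Under these assumptions the full cache is 4 GiB per sequence, so a 3% budget brings it down to roughly 0.12 GiB.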
Topics
- KV Cache Compression
- Long-Context LLMs
- Semantic Retrieval
- GQA Architectures
- LLM Inference Optimization
- Memory Footprint Reduction
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.