CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

CompressKV is a KV-cache compression framework for large language models that use grouped-query attention (GQA). It targets the memory footprint and decoding cost of long-context inference. Where existing eviction methods apply heuristic token scoring across all attention heads, CompressKV first identifies Semantic Retrieval Heads (SRHs): heads that capture the initial and final prompt tokens as well as semantically important mid-context evidence. These heads then guide which KV pairs are retained. The framework also distributes the cache budget across layers according to pre-estimated layer-wise eviction errors. On LongBench question-answering tasks, CompressKV retains over 97% of full-cache performance while using only 3% of the KV cache. On Needle-in-a-Haystack, it reaches 90% accuracy with just 0.7% of KV storage, a better resource-performance trade-off than existing methods.

Key takeaway

For AI engineers deploying long-context LLMs on memory-constrained hardware, CompressKV offers a large efficiency gain. In the reported experiments it retains over 97% of full-cache performance on LongBench question answering with 3% of the KV cache, and reaches 90% accuracy on Needle-in-a-Haystack with 0.7% of KV storage. Consider semantic-retrieval-guided compression when you need to cut inference memory and cost on question-answering or information-retrieval workloads with little loss of accuracy.
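To put the 3% figure in concrete terms, here is a rough back-of-the-envelope calculation of KV-cache size for a GQA model. The model shape (32 layers, 8 KV heads, head dimension 128, fp16, 131,072-token context) is a hypothetical configuration chosen for illustration, not one taken from the article, and the 3% budget is treated simply as the fraction of cache entries kept.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the KV cache for one sequence, in bytes.

    The leading factor 2 accounts for storing both keys and values.
    With GQA, only the (smaller) number of KV heads is cached.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, for illustration only.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)
compressed = int(full * 0.03)  # keep 3% of the cache entries

print(f"full cache: {full / 2**30:.2f} GiB")        # full cache: 16.00 GiB
print(f"3% budget:  {compressed / 2**30:.2f} GiB")  # 3% budget:  0.48 GiB
```

The figures are per sequence, so the saving scales with batch size.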

Key insights

CompressKV improves long-context LLM inference efficiency by selectively compressing KV caches using Semantic Retrieval Heads and layer-wise budget allocation.

Method

CompressKV identifies Semantic Retrieval Heads (SRHs) to select critical tokens (initial, final, semantic mid-context) for KV-pair retention. It then allocates cache budgets across layers based on offline estimates of eviction error.
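The two steps can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the SRH indices are assumed to be already known, token scores are taken as the mean SRH attention over a trailing window of prompt queries, and layer budgets are split in proportion to the estimated eviction error. The function names `select_tokens` and `allocate_budgets`, the `window` parameter, and the proportional rule are all hypothetical.

```python
import numpy as np


def select_tokens(attn, srh_heads, budget, window=8):
    """Choose which key positions to keep in one layer's KV cache.

    attn:      (num_heads, q_len, k_len) attention weights over the prompt.
    srh_heads: indices of heads treated as Semantic Retrieval Heads (assumed given).
    budget:    number of KV positions this layer may keep.
    window:    trailing prompt queries used to score keys (an assumption).
    """
    k_len = attn.shape[-1]
    if budget >= k_len:
        return np.arange(k_len)
    window = max(1, min(window, attn.shape[1]))
    # Score each key by the attention it receives from the SRH heads only,
    # averaged over those heads and over the trailing window of queries.
    scores = attn[srh_heads][:, -window:, :].mean(axis=(0, 1))
    top = np.argsort(scores)[::-1][:budget]
    return np.sort(top)


def allocate_budgets(layer_errors, total_budget, min_per_layer=1):
    """Split a total KV budget across layers in proportion to eviction error.

    layer_errors: offline estimate of each layer's eviction error.
    The proportional rule here is an illustrative assumption.
    """
    errors = np.asarray(layer_errors, dtype=float)
    n = len(errors)
    spare = total_budget - min_per_layer * n
    if spare < 0:
        raise ValueError("total_budget is too small for min_per_layer")
    total_err = errors.sum()
    weights = errors / total_err if total_err > 0 else np.full(n, 1.0 / n)
    extra = np.floor(weights * spare).astype(int)
    # Slots lost to flooring go to the layers with the largest errors.
    leftover = int(spare - extra.sum())
    extra[np.argsort(-errors)[:leftover]] += 1
    return extra + min_per_layer


# Toy example: head 1 is the SRH; head 0 fixates on a position that is ignored.
attn = np.zeros((2, 4, 6))
attn[0, :, 2] = 1.0
attn[1, :, 0], attn[1, :, 4], attn[1, :, 5] = 0.5, 0.4, 0.1
kept = select_tokens(attn, srh_heads=[1], budget=2, window=2)
budgets = allocate_budgets([1.0, 3.0], total_budget=10)
```

In the toy example, position 2 receives all of the non-SRH head's attention but is not kept, because only the SRH head contributes to the scores. Layers with a larger estimated eviction error receive a larger share of the budget.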

Best for: NLP Engineer, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.