Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective
Summary
A new study introduces CapKV, a capacity-aware KV cache eviction method for large language models (LLMs) that addresses the memory bottleneck in long-context generation. Unlike existing heuristic policies, CapKV is grounded in the Information Bottleneck principle: it derives a closed-form mutual information objective under a linear-Gaussian surrogate of attention. This objective characterizes the effective information capacity of a retained KV cache subset and shows that various prior eviction strategies approximate a single capacity-maximization principle. CapKV targets information preservation directly, approximating the log-determinant objective with statistical leverage scores in place of heuristic selection. Experiments on Qwen3-4B, Qwen3-8B, Qwen3-14B, Llama3.1-8B, and Mistral-7B, using long-context benchmarks such as LongBench and AIME25, show that CapKV consistently outperforms prior methods and achieves a better trade-off between memory efficiency and generation fidelity.
Key takeaway
For NLP engineers and research scientists optimizing LLM inference for long contexts, CapKV offers a principled approach to KV cache eviction. Because it explicitly maximizes information capacity rather than relying on heuristics, it is reported to outperform prior eviction methods across the models and benchmarks tested. Consider integrating CapKV into your inference pipelines, especially for applications that need robust long-context reasoning or summarization, to get a better balance between memory usage and model fidelity.
Key insights
Existing KV cache eviction policies can be viewed as approximations of one objective: maximizing the information capacity of the retained cache, which in turn improves long-context LLM inference.
Principles
- Maximize mutual information between future queries and retained KV cache.
- Information capacity correlates with downstream task performance.
- Balance query relevance with representational diversity.
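To make the first principle concrete, here is the standard mutual information of a linear-Gaussian channel. It illustrates the kind of closed form the summary describes, not the paper's exact objective. All symbols are assumptions introduced for this sketch: future queries $q \sim \mathcal{N}(0, \Sigma_q)$, retained keys stacked as the rows of $K_S$, and observation noise of variance $\sigma^2$.

```latex
% Assumed surrogate: z = K_S q + \varepsilon, \varepsilon \sim \mathcal{N}(0, \sigma^2 I)
\begin{aligned}
I(q; z)
  &= \tfrac{1}{2} \log\det\!\left( I + \sigma^{-2} K_S \Sigma_q K_S^{\top} \right) \\
  &= \tfrac{1}{2} \log\det\!\left( I + \sigma^{-2} \Sigma_q^{1/2} K_S^{\top} K_S \Sigma_q^{1/2} \right)
\end{aligned}
```

The second line follows from Sylvester's determinant identity. In this form, $\Sigma_q$ carries query relevance, while the log-determinant rewards retained keys that span different directions, which is the relevance and diversity balance listed above.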
Method
CapKV uses a log-determinant approximation with statistical leverage scores to greedily select KV pairs, optimizing information capacity. It incorporates query-dependent weights based on historical query statistics.
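The greedy selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it greedily maximizes log det(I + Σ (w_i / noise) k_i k_iᵀ) over the retained keys, scoring each candidate with a ridge-style leverage score. The function name `greedy_logdet_select` and the `weights` and `noise` parameters are hypothetical, and values are ignored for brevity.

```python
import numpy as np

def greedy_logdet_select(keys, budget, weights=None, noise=1.0):
    """Greedily pick `budget` rows of `keys` to maximize
    log det(I + sum_i (w_i / noise) * k_i k_i^T).

    Hypothetical sketch: `weights` stands in for query-dependent
    weights; `noise` plays the role of a noise variance.
    """
    n, d = keys.shape
    if weights is None:
        weights = np.ones(n)
    # Fold weights and noise into the rows: x_i = sqrt(w_i / noise) * k_i.
    scaled = keys * np.sqrt(weights / noise)[:, None]
    a_inv = np.eye(d)                  # inverse of A = I + sum of chosen x x^T
    available = np.ones(n, dtype=bool)
    selected, total_gain = [], 0.0
    for _ in range(min(budget, n)):
        # Leverage-style score x_i^T A^{-1} x_i for every candidate.
        scores = np.einsum("ij,ij->i", scaled @ a_inv, scaled)
        scores[~available] = -np.inf
        best = int(np.argmax(scores))
        # Adding x raises the log-determinant by log(1 + x^T A^{-1} x).
        total_gain += np.log1p(scores[best])
        # Sherman-Morrison rank-one update of A^{-1}.
        u = a_inv @ scaled[best]
        a_inv -= np.outer(u, u) / (1.0 + scores[best])
        selected.append(best)
        available[best] = False
    return selected, total_gain

# Two identical keys and one orthogonal key: the duplicate is skipped.
keys = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(greedy_logdet_select(keys, budget=2)[0])  # [0, 2]
```

Once a key is selected, any near-duplicate of it has a sharply reduced score, so the procedure favors diverse keys without a separate diversity term.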
In practice
- Use CapKV for improved memory efficiency in long-context LLM inference.
- Tune the parameter τ to balance query relevance and cache diversity.
- Prioritize methods that preserve joint key-value information capacity.
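The summary does not say how τ enters CapKV. As one plausible reading, offered purely as an assumption, the sketch below treats τ as an exponent on per-key attention mass averaged over past queries: τ = 0 yields uniform weights (selection driven by diversity alone), and larger τ concentrates weight on keys that past queries attended to. The function `query_weights` and its arguments are hypothetical.

```python
import numpy as np

def query_weights(keys, past_queries, tau):
    """Hypothetical query-dependent weights, one per cached key.

    Assumption (not from the paper): relevance is the softmax attention
    mass a key received, averaged over past queries, raised to `tau`.
    """
    d = keys.shape[1]
    logits = past_queries @ keys.T / np.sqrt(d)      # (num_queries, num_keys)
    logits -= logits.max(axis=1, keepdims=True)      # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    relevance = attn.mean(axis=0)                    # average attention per key
    weights = relevance ** tau
    return weights * len(weights) / weights.sum()    # rescale to mean 1

keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
past_queries = np.array([[2.0, 0.0], [3.0, 0.0]])
for tau in (0.0, 0.5, 1.0):
    print(tau, query_weights(keys, past_queries, tau))
```

Sweeping τ on a held-out long-context task and watching how the weight distribution sharpens is one way to locate a setting that keeps relevant keys without collapsing the cache onto a few directions.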
Topics
- KV Cache Eviction
- Information Bottleneck Principle
- Large Language Model Inference
- CapKV
- Long-Context Generation
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.