Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective

Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

A new study introduces CapKV, a capacity-aware KV cache eviction method for large language models (LLMs) that targets the memory bottleneck in long-context generation. Unlike existing heuristic policies, CapKV is grounded in the Information Bottleneck principle: it derives a closed-form mutual information objective under a linear-Gaussian surrogate of attention. This objective characterizes the effective information capacity of a retained KV cache subset and unifies various prior eviction strategies as approximations of a single capacity-maximization principle. CapKV targets information preservation directly, approximating the log-determinant objective with statistical leverage scores instead of selecting entries heuristically. Experiments on Qwen3-4B, Qwen3-8B, Qwen3-14B, Llama3.1-8B, and Mistral-7B across long-context benchmarks such as LongBench and AIME25 show that CapKV consistently outperforms prior methods, achieving a better trade-off between memory efficiency and generation fidelity.
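To make the "closed-form mutual information objective" concrete, the block below writes out the textbook mutual information of a generic linear-Gaussian channel. It is a sketch of the kind of log-determinant expression the summary refers to, not the paper's exact objective, which this summary does not reproduce. The symbols are assumptions introduced here: K_S is the matrix whose rows are the retained keys, x is a standard Gaussian signal, and sigma^2 is the noise variance.

```latex
% Generic linear-Gaussian channel (assumed notation, not taken from the paper):
%   y = K_S x + \varepsilon, \quad x \sim \mathcal{N}(0, I_d), \quad
%   \varepsilon \sim \mathcal{N}(0, \sigma^2 I_{|S|})
\begin{align}
I(x; y)
  &= \tfrac{1}{2} \log\det\!\left( I_{|S|} + \sigma^{-2} K_S K_S^{\top} \right) \\
  &= \tfrac{1}{2} \log\det\!\left( I_d + \sigma^{-2} K_S^{\top} K_S \right)
\end{align}
% Adding one more key k to S changes the second form by a rank-one term:
\begin{equation}
\log\det\!\left( M + k k^{\top} \right) - \log\det M
  = \log\!\left( 1 + k^{\top} M^{-1} k \right),
\qquad M = \sigma^{2} I_d + K_S^{\top} K_S
\end{equation}
```

The quadratic form k^T M^{-1} k in the last line has the shape of a ridge leverage score, which gives one plausible reading of why leverage scores can stand in for the log-determinant.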

Key takeaway

For NLP engineers and research scientists optimizing LLM inference for long contexts, CapKV offers a principled approach to KV cache eviction. By explicitly maximizing information capacity rather than relying on heuristics, CapKV is reported to outperform prior eviction methods across the tested models and benchmarks. Consider evaluating CapKV in your inference pipelines, especially for applications that require robust long-context reasoning or summarization, to achieve a better balance between memory usage and model fidelity.

Key insights

KV cache eviction can be unified as maximizing information capacity, improving LLM long-context inference.

Method

CapKV greedily selects the KV pairs to retain, using statistical leverage scores to approximate a log-determinant objective that measures information capacity. The selection also incorporates query-dependent weights derived from historical query statistics.
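The selection idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the CapKV reference implementation: the function names `greedy_logdet_select` and `query_weights`, the `ridge` parameter, and the choice of averaged softmax attention as the query-dependent weight are all hypothetical. The sketch greedily maximizes log det(I + X_S^T X_S / ridge) over rows of a weighted key matrix, scoring each candidate with a ridge-leverage-style quadratic form.

```python
import numpy as np


def query_weights(queries, keys):
    """Average softmax attention each key receives from past queries.

    Assumed stand-in for "historical query statistics"; hypothetical helper.
    """
    d = keys.shape[1]
    logits = queries @ keys.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn.mean(axis=0)


def greedy_logdet_select(keys, budget, ridge=1e-2, weights=None):
    """Greedily pick `budget` row indices of `keys` to maximize
    log det(I + X_S^T X_S / ridge), with X = sqrt(weights)[:, None] * keys.

    Hypothetical helper for illustration; not the paper's algorithm.
    """
    n, d = keys.shape
    x = keys if weights is None else keys * np.sqrt(weights)[:, None]
    m_inv = np.eye(d) / ridge  # inverse of M = ridge * I
    available = np.ones(n, dtype=bool)
    selected = []
    for _ in range(min(budget, n)):
        # score_i = x_i^T M^{-1} x_i; the exact log-det gain of adding
        # row i is log(1 + score_i), so argmax of the score is greedy-optimal.
        scores = np.einsum("ij,ij->i", x @ m_inv, x)
        scores[~available] = -np.inf
        i = int(np.argmax(scores))
        selected.append(i)
        available[i] = False
        # Sherman-Morrison update of M^{-1} after M <- M + x_i x_i^T.
        u = m_inv @ x[i]
        m_inv -= np.outer(u, u) / (1.0 + x[i] @ u)
    return selected


# Toy usage with random data (shapes are arbitrary assumptions).
rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 8))
queries = rng.standard_normal((16, 8))
kept = greedy_logdet_select(keys, budget=16, weights=query_weights(queries, keys))
```

Because each step updates the inverse with a rank-one correction, a key that is nearly a duplicate of an already retained key receives a small score on later steps, which is the redundancy-avoiding behavior a capacity-style objective is meant to provide. In a real cache the corresponding value vectors would be kept or evicted together with the selected keys.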

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.