Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective

2026-04-30 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

A new study introduces CapKV, a capacity-aware KV cache eviction method for large language models (LLMs) that addresses the memory bottleneck in long-context generation. Unlike existing heuristic policies, CapKV is grounded in the Information Bottleneck principle: the authors derive a closed-form mutual information objective under a linear-Gaussian surrogate of attention. This objective characterizes the effective information capacity of a retained KV cache subset and recasts various prior eviction strategies as approximations of a single capacity-maximization principle. CapKV targets information preservation directly through a log-determinant approximation based on statistical leverage scores, replacing heuristic selection with a theoretically motivated criterion. Experiments on Qwen3-4B, Qwen3-8B, Qwen3-14B, Llama3.1-8B, and Mistral-7B across long-context benchmarks such as LongBench and AIME25 show CapKV consistently outperforming prior methods, with a better trade-off between memory efficiency and generation fidelity.

Key takeaway

For NLP engineers and research scientists optimizing LLM inference for long contexts, CapKV offers a principled approach to KV cache eviction. By explicitly maximizing information capacity rather than relying on heuristics, CapKV outperformed prior eviction methods in the reported experiments across the models and benchmarks tested. Consider evaluating CapKV in your inference pipelines, especially for applications that depend on long-context reasoning or summarization, to get a better balance between memory usage and model fidelity.

Key insights

KV cache eviction can be unified as maximizing information capacity, improving LLM long-context inference.

Principles

Maximize mutual information between future queries and retained KV cache.
Information capacity correlates with downstream task performance.
Balance query relevance with representational diversity.
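
To make the first principle concrete, here is the standard mutual information result for a linear-Gaussian channel. It is a sketch under stated assumptions, not the paper's exact objective: the symbols K_S (the matrix of retained keys), Σ_q (a covariance for future queries), and σ² (a noise variance) are illustrative choices, and the paper's surrogate may be parameterized differently.

```latex
% Assumed surrogate: future query q ~ N(0, \Sigma_q), observed through the
% retained keys K_S \in \mathbb{R}^{m \times d} with Gaussian noise.
\[
  z = K_S\, q + \varepsilon, \qquad
  q \sim \mathcal{N}(0, \Sigma_q), \qquad
  \varepsilon \sim \mathcal{N}(0, \sigma^2 I_m)
\]
% Mutual information of a linear-Gaussian channel:
\[
  I(q; z) = \tfrac{1}{2} \log\det\!\left( I_m + \sigma^{-2} K_S \Sigma_q K_S^{\top} \right)
\]
% Equivalent d x d form, by Sylvester's determinant identity:
\[
  I(q; z) = \tfrac{1}{2} \log\det\!\left( I_d + \sigma^{-2}\, \Sigma_q^{1/2} K_S^{\top} K_S\, \Sigma_q^{1/2} \right)
\]
```

In this form, Σ_q carries query relevance while the log-determinant rewards keys that span different directions, which is one way to read the balance between relevance and diversity listed above.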

Method

CapKV uses a log-determinant approximation with statistical leverage scores to greedily select KV pairs, optimizing information capacity. It incorporates query-dependent weights based on historical query statistics.
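
The selection step described above can be sketched in NumPy as follows. This is a minimal illustration of greedy log-determinant maximization and ridge leverage scores in general, not CapKV's actual implementation. The function names, the `weights` argument (a stand-in for query-dependent weights), and the `lam` and `sigma2` parameters are all hypothetical.

```python
import numpy as np

def ridge_leverage_scores(keys, lam=1e-3):
    """Ridge leverage score l_i = k_i^T (K^T K + lam*I)^{-1} k_i for each row k_i."""
    d = keys.shape[1]
    gram = keys.T @ keys + lam * np.eye(d)
    solved = np.linalg.solve(gram, keys.T)          # shape (d, n)
    return np.einsum("nd,dn->n", keys, solved)

def greedy_logdet_select(keys, budget, weights=None, sigma2=1.0):
    """Greedily pick rows that maximize log det(I + X_S^T X_S / sigma2).

    `weights` is a hypothetical per-token relevance weight (for example,
    derived from past query statistics); each row is scaled by its square root.
    Returns the selected indices and the achieved log-determinant.
    """
    n, d = keys.shape
    if weights is None:
        weights = np.ones(n)
    x = keys * np.sqrt(weights)[:, None]
    m_inv = np.eye(d)                 # inverse of M = I + sum_{i in S} x_i x_i^T / sigma2
    remaining = np.ones(n, dtype=bool)
    selected, total = [], 0.0
    for _ in range(min(budget, n)):
        # Marginal gain of adding row i: log(1 + x_i^T M^{-1} x_i / sigma2).
        quad = np.einsum("nd,nd->n", x @ m_inv, x) / sigma2
        gains = np.log1p(quad)
        gains[~remaining] = -np.inf
        best = int(np.argmax(gains))
        selected.append(best)
        remaining[best] = False
        total += gains[best]
        # Sherman-Morrison rank-one update of M^{-1}.
        u = m_inv @ x[best]
        m_inv -= np.outer(u, u) / (sigma2 + x[best] @ u)
    return selected, total
```

Calling `greedy_logdet_select(keys, budget)` on an `(n, d)` array of keys returns the indices to retain. Each marginal gain equals the leverage of a candidate row with respect to the rows already kept, which is why leverage scores are a natural proxy for a log-determinant objective.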

In practice

Use CapKV for improved memory efficiency in long-context LLM inference.
Tune the parameter τ to balance query relevance and cache diversity.
Prioritize methods that preserve joint key-value information capacity.

Topics

KV Cache Eviction
Information Bottleneck Principle
Large Language Model Inference
CapKV
Long-Context Generation

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.