PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression
Summary
PolyKV is a layer-wise KV cache optimization framework for reducing the memory cost of long-context large language model inference. Current approaches typically apply one compression policy and one cache budget to every transformer layer, which ignores the different roles layers play during prefill and decoding. PolyKV instead routes each layer to a KV compression policy chosen from layer-level signals and assigns non-uniform per-layer budgets within a fixed total budget, so existing methods can be composed heterogeneously. Experiments on LLaMA-3.1-8B and Qwen3-8B support the approach. With a 512-token average KV budget, PolyKV recovered 54.5% and 25.7% of the LongBench performance gap between the strongest single-policy baseline and FullKV on the two models, respectively. Across a budget sweep from 128 to 1024 tokens, it consistently improved performance by 1.7%-6.4% and recovered 40.0%-54.5% of the gap to FullKV.
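The "gap recovered" figures can be read as the share of the distance between a compressed baseline and FullKV that a method closes. The sketch below assumes that reading; the function name `gap_recovered` and the scores passed to it are illustrative and are not taken from the paper.

```python
def gap_recovered(method: float, baseline: float, full: float) -> float:
    """Fraction of the baseline-to-FullKV score gap closed by `method`.

    Assumed definition: (method - baseline) / (full - baseline).
    """
    gap = full - baseline
    if gap <= 0:
        raise ValueError("FullKV score must exceed the baseline score")
    return (method - baseline) / gap


# Invented scores for illustration only; they are not results from the paper.
example = gap_recovered(method=45.0, baseline=40.0, full=50.0)
print(f"{example:.1%}")  # 50.0%
```

Under this definition, a method that matches the baseline recovers 0% of the gap and a method that matches FullKV recovers 100%.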
Key takeaway
Machine learning engineers optimizing long-context LLM inference should reconsider uniform KV cache compression. PolyKV shows that heterogeneous, layer-wise policies combined with non-uniform budget allocation improve task performance within the same total cache budget, recovering 25.7%-54.5% of the gap to FullKV at a 512-token average budget. Tailoring compression to individual transformer layers can therefore give better results than a single global policy.
Key insights
PolyKV optimizes KV cache compression by applying heterogeneous, layer-wise policies and non-uniform budget allocation, significantly improving long-context LLM performance.
Principles
- Different transformer layers require distinct KV cache strategies.
- Uniform KV cache policies are suboptimal for long-context LLMs.
- Layer-level signals can guide KV cache policy selection.
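As a rough illustration of signal-guided policy selection, the sketch below routes each layer with a single threshold rule. Everything in it is an assumption made for the example: the summary does not say which signals PolyKV uses, so `attention_entropy`, the threshold value, and the policy names `score_based_eviction` and `recent_window` are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class LayerSignal:
    layer: int
    attention_entropy: float  # hypothetical signal; higher means more diffuse attention


def route_layers(signals, entropy_threshold=2.0):
    """Map each layer index to a policy name using one threshold rule.

    The rule and both policy names are placeholders, not PolyKV's router.
    """
    routes = {}
    for s in signals:
        if s.attention_entropy >= entropy_threshold:
            routes[s.layer] = "score_based_eviction"
        else:
            routes[s.layer] = "recent_window"
    return routes


# Made-up signal values for three layers.
signals = [LayerSignal(0, 3.1), LayerSignal(1, 1.2), LayerSignal(2, 2.4)]
routes = route_layers(signals)
print(routes)
# {0: 'score_based_eviction', 1: 'recent_window', 2: 'score_based_eviction'}
```

A real router could combine several signals or be learned; the point here is only that the decision is made per layer rather than once for the whole model.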
Method
PolyKV routes each transformer layer to a suitable KV compression policy using layer-level signals. It then assigns non-uniform cache budgets under a fixed total budget, enabling heterogeneous method compositions.
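One way to picture non-uniform budgets under a fixed total is a proportional split with a per-layer floor. The sketch below is a minimal example under stated assumptions, not the paper's allocation rule: the per-layer `weights`, the `min_budget` floor, and the function name `allocate_budgets` are all hypothetical. The total is fixed at the average budget times the number of layers, matching the "512-token average KV budget" framing.

```python
from fractions import Fraction
import math


def allocate_budgets(weights, avg_budget, min_budget=32):
    """Split avg_budget * len(weights) cache tokens across layers.

    Each layer gets `min_budget` tokens plus a share of the remainder
    proportional to its weight. Budgets are integers that sum exactly
    to the total.
    """
    n = len(weights)
    total = avg_budget * n
    if min_budget * n > total:
        raise ValueError("min_budget is too large for the total budget")
    exact = [Fraction(w) for w in weights]  # exact arithmetic avoids rounding drift
    weight_sum = sum(exact)
    if any(w < 0 for w in exact) or weight_sum <= 0:
        raise ValueError("weights must be non-negative with a positive sum")

    spare = total - min_budget * n
    shares = [spare * w / weight_sum for w in exact]
    budgets = [min_budget + math.floor(s) for s in shares]

    # Largest-remainder rounding: leftover tokens go to the largest fractional parts.
    leftover = total - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - math.floor(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Made-up weights for a four-layer toy model with a 512-token average budget.
budgets = allocate_budgets([1.0, 3.0, 2.0, 2.0], avg_budget=512)
print(budgets, sum(budgets))  # [272, 752, 512, 512] 2048
```

The floor keeps every layer from being starved, and the exact-sum property means the heterogeneous allocation uses the same memory as a uniform 512-token budget per layer.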
In practice
- Improve LongBench performance on LLaMA-3.1-8B.
- Enhance Qwen3-8B long-context inference efficiency.
- Recover significant FullKV performance gaps.
Topics
- KV Cache Compression
- Large Language Models
- LLM Inference Optimization
- Transformer Layers
- LLaMA-3.1-8B
- Qwen3-8B
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.