KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy
Summary
KVCapsule is a novel KV cache compression framework designed to enhance the efficiency of Vision-Language Models (VLMs) during autoregressive decoding. VLMs, which extend Large Language Models (LLMs) with multimodal reasoning, face significant memory overhead due to long token sequences and dense feature representations from image inputs. KVCapsule addresses this by analyzing vision token behavior, revealing sequential redundancy, dynamic attention patterns, and asymmetric redundancy between keys and values. The framework keeps the pretrained VLM backbone frozen and integrates lightweight compression and reconstruction components. It employs selective retention and MLP-based reconstruction for keys to preserve attention geometry, and sequence-level PCA for values to maintain semantic content. Evaluated across multiple VLMs and benchmarks, KVCapsule achieves up to 2x improvement in Tokens Per Second (TPS) and a 2.4x reduction in KV cache memory at a 60% compression ratio, with negligible accuracy degradation.
Key takeaway
For research scientists and computer vision engineers developing or deploying Vision-Language Models, KVCapsule offers a way to scale VLM inference under memory constraints. Its asymmetric, reconstructable KV cache compression delivers substantial memory savings and throughput improvements (up to 2x TPS and a 2.4x reduction in KV cache memory at a 60% compression ratio) with negligible accuracy degradation, especially in long-context and multi-image scenarios. Consider KVCapsule when you need more resource-efficient and accessible multimodal reasoning.
Key insights
KVCapsule efficiently compresses VLM KV caches by leveraging asymmetric redundancy and dynamic attention patterns in vision tokens.
Principles
- Vision token importance shifts dynamically during VLM decoding.
- Visual keys and values exhibit distinct redundancy patterns.
- Layer-wise compression tolerance varies across VLM depths.
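One way to probe the second principle on your own model is to compare how many principal components the cached keys and the cached values each need to explain most of their variance. The sketch below is a hypothetical diagnostic, not the paper's analysis: the function name `components_for_variance`, the 95% threshold, and the synthetic matrices standing in for real key and value caches are all assumptions made for illustration.

```python
import numpy as np


def components_for_variance(matrix, threshold=0.95):
    """Count the principal components needed to explain `threshold` of the variance.

    `matrix` has shape (num_tokens, head_dim). A small count means the
    vectors are highly redundant along the token sequence.
    """
    centered = matrix - matrix.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    energy = singular_values ** 2
    total = energy.sum()
    if total == 0:
        return 0  # constant matrix: nothing to explain
    cumulative = np.cumsum(energy) / total
    # First index whose cumulative share reaches the threshold, as a count.
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(singular_values))


# Synthetic stand-ins (assumption): replace with real per-head caches.
rng = np.random.default_rng(0)
fake_values = rng.standard_normal((256, 4)) @ rng.standard_normal((4, 64))
fake_keys = rng.standard_normal((256, 64))
print("values:", components_for_variance(fake_values))
print("keys:  ", components_for_variance(fake_keys))
```

Running the same function on keys and values from the same layer and head gives a rough, per-layer picture of where the two caches differ in redundancy.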
Method
KVCapsule uses a hybrid approach: MLP-based reconstruction for keys to preserve attention geometry and sequence-level PCA for values to retain semantic content, all within a backbone-frozen VLM framework.
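The value side of this method can be illustrated with a minimal NumPy sketch. It assumes that "sequence-level PCA" means fitting the principal directions on the value vectors of a single sequence and storing only the low-dimensional coefficients plus the shared basis. The function names, the tensor shapes, and the choice of `rank` are illustrative assumptions, not the paper's implementation, and the MLP-based key reconstruction is not covered here.

```python
import numpy as np


def pca_compress_values(values, rank):
    """Project a (num_tokens, head_dim) value matrix onto its top `rank` directions.

    Returns per-token coefficients, the shared basis, and the mean vector.
    """
    if not 1 <= rank <= min(values.shape):
        raise ValueError("rank must be between 1 and min(num_tokens, head_dim)")
    mean = values.mean(axis=0, keepdims=True)       # (1, head_dim)
    centered = values - mean
    # Rows of vt are the principal directions in feature space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                               # (rank, head_dim)
    coeffs = centered @ basis.T                     # (num_tokens, rank)
    return coeffs, basis, mean


def pca_reconstruct_values(coeffs, basis, mean):
    """Approximate the original value matrix from its compressed form."""
    return coeffs @ basis + mean


# Synthetic stand-in for one head's vision-token values (assumption).
rng = np.random.default_rng(0)
values = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 64))
coeffs, basis, mean = pca_compress_values(values, rank=8)
approx = pca_reconstruct_values(coeffs, basis, mean)
stored = coeffs.size + basis.size + mean.size
print("stored floats:", stored, "of", values.size)
print("max abs error:", np.abs(approx - values).max())
```

In this sketch the cache holds `coeffs`, `basis`, and `mean` instead of the full value matrix, so the saving grows with sequence length because the basis cost is paid once per sequence.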
In practice
- Integrate KVCapsule for up to 2x higher Tokens Per Second (TPS) in VLM inference.
- Reduce VLM KV cache memory by 2.4x at a 60% compression ratio.
- Apply pyramid compression schedules to match each layer's compression tolerance.
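A pyramid schedule can be sketched as a per-layer compression ratio that changes linearly with depth while keeping a fixed average. The summary does not specify the schedule's shape or direction, so everything below is an assumption: the function name `pyramid_schedule`, the `spread` parameter, the reading of "compression ratio" as the fraction of the cache removed, and the choice to compress deeper layers more.

```python
def pyramid_schedule(num_layers, mean_ratio, spread):
    """Return one compression ratio per layer, rising linearly with depth.

    The ratios run from `mean_ratio - spread` at the first layer to
    `mean_ratio + spread` at the last, so their average is `mean_ratio`.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    low, high = mean_ratio - spread, mean_ratio + spread
    if low < 0.0 or high > 1.0:
        raise ValueError("ratios must stay within [0, 1]")
    if num_layers == 1:
        return [mean_ratio]
    step = (high - low) / (num_layers - 1)
    return [low + i * step for i in range(num_layers)]


# Hypothetical 32-layer model with a 60% average compression ratio.
for layer, ratio in enumerate(pyramid_schedule(32, mean_ratio=0.6, spread=0.2)):
    print(f"layer {layer:2d}: compress {ratio:.2f}")
```

If measurements show that early layers tolerate more compression than late ones, reversing the returned list gives the opposite slope with the same average.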
Topics
- KV Cache Compression
- Vision-Language Models
- Asymmetric Compression
- MLP Reconstruction
- PCA Compression
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.