STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models
Summary
STaR-KV (Spatio-Temporal Adaptive Re-weighting) is a training-free KV cache compression framework for GUI Vision-Language Models (VLMs). It targets a deployment bottleneck: the KV cache grows linearly with the number of interaction steps, to the point that UI-TARS-1.5-7B consumes 76 GB of GPU memory on just five screenshots. Existing compression methods collapse visual-token importance into a single saliency map and apply a fixed top-B cutoff. STaR-KV drops both assumptions and calibrates token importance along three axes: subspace-aware scoring based on online spatial mutual information, a temporal stability discount that down-weights redundant cache entries, and an entropy-derived temperature that adaptively reshapes the score distribution. Across four GUI benchmarks it achieves the strongest average accuracy among the compared state-of-the-art methods, including GUIKV and SnapKV. It adds no compression-stage FLOPs overhead (-0.07%) and reduces peak GPU memory by nearly 40% at a 20% KV-cache budget.
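To make the linear growth concrete, here is a minimal sketch of the standard KV cache size calculation: two tensors (keys and values) per layer, each holding one vector per KV head per cached token. The function name and the configuration values in the usage lines are placeholders chosen for illustration; they are not taken from the article and are not verified against UI-TARS-1.5-7B, and the sketch does not attempt to reproduce the 76 GB figure.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes fp16/bf16 storage (an assumption).
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Illustrative, hypothetical configuration (not from the article).
config = dict(num_layers=28, num_kv_heads=4, head_dim=128)
tokens_per_screenshot = 10_000  # hypothetical visual-token count per step

for step in range(1, 6):
    size = kv_cache_bytes(step * tokens_per_screenshot, **config)
    print(f"step {step}: {size / 2**30:.2f} GiB")
```

Because every term except num_tokens is fixed by the model, doubling the number of cached screenshots doubles the cache, which is why a per-step token budget matters for long GUI trajectories.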
Key takeaway
If you are a Machine Learning Engineer deploying GUI Vision-Language Models and are running into high GPU memory consumption from KV caches, consider STaR-KV. This training-free compression framework cuts peak GPU memory by nearly 40% at a 20% KV-cache budget while maintaining or improving accuracy over existing methods. Its subspace-aware scoring and temporal stability discount can be applied to reduce VLM memory use and enable deployment on mainstream 80 GB accelerators.
Key insights
KV cache compression for GUI VLMs benefits from spatio-temporal adaptive re-weighting, which moves beyond a single saliency map and a fixed top-B cutoff.
Principles
- Attention's spatial specialization is subspace-level and layer-migratory.
- KV cache score distributions drift over the course of an interaction trajectory.
- Suppress redundant cache entries via temporal stability discounts.
Method
STaR-KV combines three components: subspace-aware scoring via online spatial mutual information, a temporal stability discount for redundant cache entries, and an entropy-derived temperature that adaptively reshapes the KV cache score distribution.
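The following is a rough sketch of how three such re-weighting steps could be chained, not the paper's actual formulas, which this summary does not give. Everything here is an assumption made for illustration: the function names, the use of precomputed per-head weights as a stand-in for the mutual-information scoring, the linear discount with strength lam, the direction of the entropy-to-temperature mapping (flatter distributions get a lower temperature and are sharpened), and the final top-k selection at a fixed budget.

```python
import numpy as np


def reweight_scores(head_scores, head_weights, stability,
                    lam=0.5, t_min=0.5, t_max=2.0):
    """Combine per-head token scores into one reshaped distribution.

    head_scores:  [H, N] positive importance scores per attention head.
    head_weights: [H] per-head weights (hypothetical stand-in for the
                  subspace-aware, mutual-information-based scoring).
    stability:    [N] values in [0, 1]; 1 means the token is unchanged
                  across recent steps (assumed definition).
    """
    # 1) Subspace-aware aggregation: weighted sum over heads.
    w = head_weights / head_weights.sum()
    scores = w @ head_scores                      # shape [N]

    # 2) Temporal stability discount: down-weight stable (redundant) tokens.
    scores = scores * (1.0 - lam * np.clip(stability, 0.0, 1.0))

    # 3) Entropy-derived temperature (assumed mapping, requires N > 1).
    p = scores / scores.sum()
    norm_entropy = -(p * np.log(p + 1e-12)).sum() / np.log(p.size)
    temperature = t_max - (t_max - t_min) * norm_entropy
    reshaped = p ** (1.0 / temperature)
    return reshaped / reshaped.sum()


def select_tokens(scores, budget=0.2):
    """Return sorted indices of the tokens kept under a cache budget."""
    keep = max(1, int(round(budget * scores.size)))
    return np.sort(np.argsort(scores)[-keep:])


# Usage with random placeholder data: 8 heads, 50 cached tokens.
rng = np.random.default_rng(0)
scores = reweight_scores(rng.random((8, 50)) + 1e-3,
                         rng.random(8) + 1e-3,
                         rng.random(50))
kept = select_tokens(scores, budget=0.2)
print(len(kept), "of", scores.size, "tokens kept")
```

The point of the sketch is the ordering: aggregation happens per head before any discount, and the temperature is computed from the distribution at hand rather than fixed in advance, so the same budget can behave differently on peaked and flat score distributions.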
In practice
- Cut peak GPU memory by nearly 40% for GUI VLMs at a 20% KV-cache budget.
- Achieve the strongest average accuracy among compared methods across four GUI benchmarks.
- Enable GUI VLM deployment on 80 GB accelerators.
Topics
- KV Cache Compression
- GUI Vision-Language Models
- GPU Memory Optimization
- Spatio-Temporal Re-weighting
- Attention Mechanisms
- UI-TARS-1.5-7B
Best for: MLOps Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.