STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

2026-06-01 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

STaR-KV (Spatio-Temporal Adaptive Re-weighting) is a training-free KV cache compression framework for GUI Vision-Language Models (VLMs). It targets a deployment bottleneck: the KV cache grows linearly with the number of interaction steps, and UI-TARS-1.5-7B already consumes 76 GB of GPU memory on just five screenshots. Existing compression methods collapse visual-token importance into a single saliency map and apply a fixed top-B cutoff. STaR-KV drops both assumptions and calibrates token importance along three axes: subspace-aware scoring based on online spatial mutual information, a temporal stability discount for redundant cache entries, and an entropy-derived temperature that adaptively reshapes the score distribution. Across four GUI benchmarks it achieves the strongest average accuracy among the compared state-of-the-art methods, which include GUIKV and SnapKV. It adds no compression-stage FLOPs overhead (-0.07%) and reduces peak GPU memory by nearly 40% at a 20% KV-cache budget.
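The linear growth mentioned above follows from the standard KV cache size formula for a decoder-only transformer. The sketch below is a back-of-the-envelope illustration only: the function names and the configuration values are hypothetical placeholders, not taken from the source or from UI-TARS-1.5-7B's published configuration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache in bytes for one sequence.

    The leading factor 2 accounts for storing one key and one value
    vector per token, per KV head, per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


def tokens_kept(num_tokens, budget_ratio):
    """Number of cache entries retained under a given KV-cache budget."""
    return max(1, int(round(num_tokens * budget_ratio)))


# Hypothetical model shape, chosen only to make the arithmetic concrete.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

full = kv_cache_bytes(10_000, **config)
doubled = kv_cache_bytes(20_000, **config)
compressed = kv_cache_bytes(tokens_kept(10_000, 0.2), **config)

print(doubled / full)      # cache size scales linearly with token count
print(compressed / full)   # fraction of the cache kept at a 20% budget
```

Peak GPU memory also includes model weights and activations, so a 20% cache budget does not translate into an 80% reduction in peak memory. This is consistent with the nearly 40% figure reported above.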

Key takeaway

Machine learning engineers deploying GUI Vision-Language Models who are constrained by the GPU memory that KV caches consume should consider STaR-KV. This training-free compression framework cuts peak GPU memory by nearly 40% at a 20% KV-cache budget while maintaining or improving accuracy over existing methods. Its subspace-aware scoring and temporal stability discount can help keep GUI VLM deployments within the memory of mainstream 80 GB accelerators.

Key insights

KV cache compression for GUI VLMs benefits from spatio-temporal adaptive re-weighting, which replaces a single fixed saliency map and a fixed top-B cutoff.

Principles

Attention's spatial specialization is subspace-level and layer-migratory.
KV cache score distributions drift over the course of an interaction trajectory.
Suppress redundant cache entries via temporal stability discounts.

Method

STaR-KV employs subspace-aware scoring via online spatial mutual information, a temporal stability discount, and an entropy-derived temperature to adaptively reshape KV cache score distributions.
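To make the three axes concrete, here is a minimal toy sketch in Python with NumPy. The summary does not give STaR-KV's formulas, so everything in it is an assumption made for illustration: the function name `reweight_and_select`, the externally supplied `head_weights` (standing in for the mutual-information-based subspace scores), the `stability` vector, the linear discount, and the use of normalized entropy as a per-head temperature. It is not the authors' implementation.

```python
import numpy as np


def reweight_and_select(head_scores, head_weights, stability,
                        budget_ratio=0.2, discount=0.5):
    """Toy token selection; every formula here is an illustrative assumption."""
    head_scores = np.asarray(head_scores, dtype=np.float64)    # (H, T), non-negative
    head_weights = np.asarray(head_weights, dtype=np.float64)  # (H,), non-negative
    stability = np.asarray(stability, dtype=np.float64)        # (T,), values in [0, 1]
    num_heads, num_tokens = head_scores.shape
    eps = 1e-12

    # Turn each head's scores into a distribution over cached tokens.
    probs = head_scores / (head_scores.sum(axis=1, keepdims=True) + eps)

    # Assumed temperature: each head's entropy normalized by log(T), clipped
    # away from zero so the exponent below stays finite.
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    temperature = np.clip(entropy / max(np.log(num_tokens), eps), 0.05, 1.0)

    # Reshape each head's distribution: p ** (1 / temperature), renormalized.
    # Near-uniform heads are left almost unchanged; peaked heads are sharpened.
    reshaped = probs ** (1.0 / temperature[:, None])
    reshaped /= reshaped.sum(axis=1, keepdims=True) + eps

    # Weighted aggregation across heads instead of one unweighted saliency map.
    weights = head_weights / head_weights.sum()
    combined = weights @ reshaped                               # (T,)

    # Assumed temporal discount: down-weight entries flagged as stable/redundant.
    combined = combined * (1.0 - discount * stability)

    # Keep the highest-scoring entries within the cache budget.
    k = max(1, int(round(budget_ratio * num_tokens)))
    keep = np.sort(np.argsort(-combined, kind="stable")[:k])
    return keep, combined


scores = [[10.0, 1.0, 1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0, 1.0, 10.0]]
keep, combined = reweight_and_select(scores, [1.0, 1.0], np.zeros(5), budget_ratio=0.4)
print(keep)
```

In this toy setup the temperature is applied per head before aggregation, because a single monotone reshaping applied after aggregation would leave the token ranking unchanged.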

In practice

Cut peak GPU memory by nearly 40% for GUI VLMs at a 20% KV-cache budget.
Achieve the strongest average accuracy among the compared methods across four GUI benchmarks.
Enable GUI VLM deployment on mainstream 80 GB accelerators.

Topics

KV Cache Compression
GUI Vision-Language Models
GPU Memory Optimization
Spatio-Temporal Re-weighting
Attention Mechanisms
UI-TARS-1.5-7B

Code references

kawhiiiileo/STaR-KV

Best for: MLOps Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.