CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
Summary
CoRDS (Coreset-based Representative and Diverse Selection) is a novel, training-free, and query-agnostic method for compressing the Key-Value (KV) cache in large Vision-Language Models (VLMs) for streaming video understanding. It redefines KV-cache compression as a coreset selection problem, aiming to retain a small, representative subset of the accumulated visual history rather than relying on local token-wise heuristics. CoRDS introduces a bicriteria objective that balances coverage in both key and value spaces, preserving retrieval structure and output-relevant information. Additionally, it incorporates an orthogonality-driven diversity criterion to prevent redundant selections, favoring candidates that contribute new directions. Evaluated across four open-source VLMs (Qwen2-VL-7B, Qwen2.5-VL-3B/7B, LLaVA-NeXT-Video-7B) and five long-video and streaming-video benchmarks (EgoSchema, MLVU, VideoMME, OVO-Bench, StreamingBench), CoRDS consistently outperforms existing heuristic streaming compression baselines under fixed cache budgets, often matching or exceeding uncompressed full-KV performance, particularly at aggressive compression ratios.
Key takeaway
For AI engineers and research scientists developing or deploying VLMs for streaming video applications, CoRDS is a training-free option for KV-cache compression. Its coreset-based selection keeps a more representative and less redundant memory, which in the reported benchmarks improves accuracy under tight cache budgets. Consider evaluating CoRDS if you need to reduce VRAM and energy consumption in streaming VLM deployments, particularly for long-duration video processing.
Key insights
Coreset selection offers a principled approach to KV-cache compression for streaming video understanding, outperforming heuristic methods.
Principles
- Retain a globally representative memory, not just locally salient tokens.
- Balance key and value space coverage for effective VLM attention.
- Promote diversity to avoid redundant token retention.
Method
CoRDS formulates KV-cache compression as a coreset selection problem, using a bicriteria objective for joint KV representation and a D²-style farthest-first selection with an orthogonal anti-redundancy term.
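The selection idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `select_coreset`, the parameters `alpha` (weight on key-space distance) and `lam` (weight on the novelty term), the mean-based seeding, and the exact scoring rule are all assumptions made for illustration. It uses a deterministic farthest-first pick instead of sampling, and a Gram-Schmidt residual as a stand-in for the orthogonal anti-redundancy term.

```python
import numpy as np


def select_coreset(keys, values, budget, alpha=0.3, lam=0.5):
    """Illustrative greedy selection of `budget` (>= 1) token indices.

    keys, values: arrays of shape (n_tokens, dim).
    alpha: hypothetical weight on key-space distance; (1 - alpha) goes to values.
    lam: hypothetical weight on the orthogonal novelty term.
    """
    n = keys.shape[0]
    if budget >= n:
        return np.arange(n)

    # Normalise rows so key and value distances are on a comparable scale.
    k = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    v = values / (np.linalg.norm(values, axis=1, keepdims=True) + 1e-8)
    z = np.concatenate([k, v], axis=1)
    residual = z.copy()  # part of each token not yet spanned by the selection

    def joint_dist(i):
        # Weighted squared distance of every token to token i in both spaces.
        dk = np.sum((k - k[i]) ** 2, axis=1)
        dv = np.sum((v - v[i]) ** 2, axis=1)
        return alpha * dk + (1.0 - alpha) * dv

    def deflate(i):
        # Gram-Schmidt step: remove the direction contributed by token i.
        q = residual[i].copy()
        norm = np.linalg.norm(q)
        if norm > 1e-8:
            q /= norm
            residual[:] -= np.outer(residual @ q, q)

    # Assumption: seed with the token closest to the mean joint feature.
    seed = int(np.argmin(np.sum((z - z.mean(axis=0)) ** 2, axis=1)))
    selected = [seed]
    min_dist = joint_dist(seed)
    deflate(seed)

    while len(selected) < budget:
        novelty = np.sum(residual ** 2, axis=1)
        # Farthest-first score, boosted for tokens that add a new direction.
        score = min_dist * (1.0 + lam * novelty)
        score[selected] = -np.inf
        nxt = int(np.argmax(score))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, joint_dist(nxt))
        deflate(nxt)

    return np.sort(np.array(selected))
```

Calling `select_coreset(keys, values, budget=64)` on per-layer key and value matrices would return the indices to keep; the remaining cache entries could then be dropped. Tokens that duplicate an already selected token receive a score of zero in this sketch, so the budget is spent on tokens that are far from the current selection.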
In practice
- Apply CoRDS to the bottom 25% of decoder layers for the best balance.
- Utilize cross-layer cascade for efficiency in long videos.
- Prioritize value reconstruction (low alpha) for better attention output.
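As a rough illustration of the first and third recommendations, the sketch below computes which layers would be compressed and records a low `alpha`. The names `CompressionConfig` and `layers_to_compress` are hypothetical, the value `alpha=0.3` is only a placeholder for "low", and the sketch assumes that "bottom" refers to the decoder layers closest to the input (lowest indices). The cross-layer cascade is not shown.

```python
from dataclasses import dataclass


@dataclass
class CompressionConfig:
    # Hypothetical settings; the values mirror the recommendations only loosely.
    layer_fraction: float = 0.25  # share of decoder layers to compress
    alpha: float = 0.3            # placeholder for a "low" key-vs-value weight


def layers_to_compress(num_layers, config):
    """Return indices of the layers to compress.

    Assumption: "bottom" means the layers nearest the input (index 0 upward).
    """
    count = max(1, int(num_layers * config.layer_fraction))
    return list(range(count))


config = CompressionConfig()
# Example with a hypothetical 28-layer decoder.
compressed = layers_to_compress(28, config)
```

With these assumptions, a 28-layer decoder would have its first seven layers compressed and the remaining layers left untouched.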
Topics
- CoRDS
- Streaming Video Understanding
- KV-Cache Compression
- Coreset Selection
- Vision-Language Models
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.