CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

CoRDS (Coreset-based Representative and Diverse Selection) is a training-free, query-agnostic method for compressing the key-value (KV) cache of large Vision-Language Models (VLMs) in streaming video understanding. It recasts KV-cache compression as a coreset selection problem: retain a small subset that represents the accumulated visual history, rather than rank tokens with local, token-wise heuristics. A bicriteria objective balances coverage in both key space and value space, so the retained cache preserves retrieval structure as well as output-relevant information. An orthogonality-driven diversity criterion discourages redundant selections by favoring candidates that contribute new directions.

Across four open-source VLMs (Qwen2-VL-7B, Qwen2.5-VL-3B, Qwen2.5-VL-7B, LLaVA-NeXT-Video-7B) and five long-video and streaming-video benchmarks (EgoSchema, MLVU, VideoMME, OVO-Bench, StreamingBench), CoRDS consistently outperforms existing heuristic streaming compression baselines under fixed cache budgets. It often matches or exceeds uncompressed full-KV performance, particularly at aggressive compression ratios.

Key takeaway

For AI engineers and research scientists developing or deploying VLMs for streaming video, CoRDS is a training-free alternative to heuristic KV-cache compression. Its coreset-based selection keeps a more representative and less redundant memory, which improves accuracy and efficiency, especially under tight memory budgets. Consider integrating CoRDS to improve performance and reduce VRAM and energy consumption in streaming VLM deployments, particularly for long-duration video processing.
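To make the memory-budget argument concrete, the following minimal sketch estimates KV-cache size as a function of retained tokens. The function name `kv_cache_bytes` and all model dimensions used with it are illustrative assumptions, not figures from the paper or from any specific model card.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV-cache size in bytes.

    Each retained token stores one key and one value vector per layer
    and per KV head, hence the leading factor of 2.
    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption).
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def gib(num_bytes):
    """Convert bytes to GiB for readability."""
    return num_bytes / (1024 ** 3)


if __name__ == "__main__":
    # Hypothetical decoder shape, chosen only for illustration.
    shape = dict(num_layers=28, num_kv_heads=4, head_dim=128)
    full = kv_cache_bytes(100_000, **shape)    # uncompressed visual history
    budget = kv_cache_bytes(6_000, **shape)    # fixed cache budget
    print(f"full: {gib(full):.2f} GiB, budgeted: {gib(budget):.2f} GiB")
```

Because the cost is linear in the number of retained tokens, a fixed cache budget caps memory regardless of how long the stream runs, which is why selection quality at a given budget matters.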

Key insights

Coreset selection offers a principled approach to KV-cache compression for streaming video understanding, outperforming heuristic methods.

Method

CoRDS formulates KV-cache compression as a coreset selection problem. It uses a bicriteria objective that jointly covers key and value space, and a D²-style farthest-first selection with an orthogonality-based anti-redundancy term.
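The selection idea described above can be sketched in pure Python as follows. This is an illustrative approximation, not the authors' implementation: the function `select_coreset`, the mixing weight `alpha`, the diversity weight `beta`, the centroid-based seed, and the exact way the orthogonality term enters the score are all assumptions made for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def orthogonal_residual(v, basis):
    """Component of v orthogonal to an orthonormal basis (Gram-Schmidt step)."""
    r = list(v)
    for b in basis:
        c = dot(r, b)
        r = [x - c * y for x, y in zip(r, b)]
    return r


def select_coreset(keys, values, budget, alpha=0.5, beta=1.0):
    """Greedy farthest-first selection of `budget` token indices.

    alpha mixes squared distances in key and value space (bicriteria term);
    beta weights the orthogonal-novelty bonus. Both are hypothetical knobs.
    """
    n = len(keys)
    budget = min(budget, n)
    if budget <= 0:
        return []

    # Assumed seed: the token whose key is closest to the key centroid.
    dim = len(keys[0])
    centroid = [sum(k[d] for k in keys) / n for d in range(dim)]
    first = min(range(n), key=lambda i: sq_dist(keys[i], centroid))

    selected, chosen, basis = [first], {first}, []
    # Squared distance from every token to its nearest selected token.
    d_key = [sq_dist(k, keys[first]) for k in keys]
    d_val = [sq_dist(v, values[first]) for v in values]

    def extend_basis(vec):
        r = orthogonal_residual(vec, basis)
        norm = math.sqrt(dot(r, r))
        if norm > 1e-8:  # skip directions already spanned
            basis.append([x / norm for x in r])

    extend_basis(keys[first])

    while len(selected) < budget:
        best, best_score = None, -1.0
        for i in range(n):
            if i in chosen:
                continue
            # D^2-style coverage gap, combined over key and value space.
            gap = alpha * d_key[i] + (1.0 - alpha) * d_val[i]
            # Fraction of the key's energy outside the span of selected keys.
            r = orthogonal_residual(keys[i], basis)
            norm_sq = dot(keys[i], keys[i])
            novelty = dot(r, r) / norm_sq if norm_sq > 0 else 0.0
            score = gap * (1.0 + beta * novelty)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        chosen.add(best)
        for i in range(n):
            d_key[i] = min(d_key[i], sq_dist(keys[i], keys[best]))
            d_val[i] = min(d_val[i], sq_dist(values[i], values[best]))
        extend_basis(keys[best])

    # Return indices in temporal order so the compressed cache stays ordered.
    return sorted(selected)
```

In this sketch a candidate that duplicates an already selected token has zero coverage gap and is therefore picked last, while a candidate pointing in a new direction receives a higher score. The loop is quadratic in the number of tokens and is meant for reading, not for production use.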

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.