CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

2024-04-30 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

CoRDS (Coreset-based Representative and Diverse Selection) is a training-free, query-agnostic method for compressing the Key-Value (KV) cache of large Vision-Language Models (VLMs) during streaming video understanding. It recasts KV-cache compression as a coreset selection problem: the goal is to retain a small subset that represents the accumulated visual history, rather than to rely on local token-wise heuristics. A bicriteria objective balances coverage in both key and value spaces, preserving retrieval structure and output-relevant information. An orthogonality-driven diversity criterion discourages redundant selections by favoring candidates that contribute new directions. CoRDS is evaluated on four open-source VLMs (Qwen2-VL-7B, Qwen2.5-VL-3B/7B, LLaVA-NeXT-Video-7B) and five long-video and streaming-video benchmarks (EgoSchema, MLVU, VideoMME, OVO-Bench, StreamingBench). Under fixed cache budgets it consistently outperforms existing heuristic streaming compression baselines and often matches or exceeds uncompressed full-KV performance, particularly at aggressive compression ratios.

Key takeaway

For AI engineers and research scientists building or deploying VLMs for streaming video, CoRDS is a KV-cache compression method worth evaluating. Its coreset-based selection keeps a more representative, less redundant memory than heuristic baselines, which improves accuracy and efficiency, especially under tight memory budgets. Consider integrating it to reduce VRAM and energy consumption in streaming VLM deployments, particularly for long-duration video.

Key insights

Coreset selection offers a principled approach to KV-cache compression for streaming video understanding, outperforming heuristic methods.

Principles

Retain a globally representative memory, not just locally salient tokens.
Balance key and value space coverage for effective VLM attention.
Promote diversity to avoid redundant token retention.

Method

CoRDS formulates KV-cache compression as a coreset selection problem. It combines a bicriteria objective for joint key-value representation with D²-style farthest-first selection and an orthogonal anti-redundancy term.
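
The selection loop described above can be sketched in NumPy. This is an illustrative sketch only, not the authors' implementation: the function name select_coreset, the weights alpha and beta, the unit normalization, the seeding rule, and the exact form of the score are all assumptions made for this example. Following the "low alpha" advice in this summary, alpha is assumed to weight key-space distance and (1 - alpha) value-space distance.

```python
import numpy as np

def select_coreset(keys, values, budget, alpha=0.3, beta=0.5):
    """Greedy farthest-first token selection (hypothetical sketch).

    keys: (n, d_k) array, values: (n, d_v) array.
    Returns the sorted indices of the tokens to keep.
    """
    keys = np.asarray(keys, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    n = keys.shape[0]
    if budget >= n:
        return np.arange(n)

    # Assumption: unit-normalise so key and value distances share a scale.
    k = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-12)
    v = values / (np.linalg.norm(values, axis=1, keepdims=True) + 1e-12)

    def joint_d2(i):
        # Bicriteria squared distance from every token to token i.
        dk = ((k - k[i]) ** 2).sum(axis=1)
        dv = ((v - v[i]) ** 2).sum(axis=1)
        return alpha * dk + (1.0 - alpha) * dv

    # residual[j] is the part of token j lying outside span(selected tokens).
    residual = np.concatenate([k, v], axis=1)
    min_d2 = np.full(n, np.inf)      # distance to the nearest selected token
    chosen = np.zeros(n, dtype=bool)
    selected = []

    # Assumption: seed with the token closest to the mean joint feature.
    centred = residual - residual.mean(axis=0)
    nxt = int(np.argmin((centred ** 2).sum(axis=1)))

    for _ in range(budget):
        selected.append(nxt)
        chosen[nxt] = True
        min_d2 = np.minimum(min_d2, joint_d2(nxt))

        # Gram-Schmidt step: remove the newly covered direction everywhere.
        r = residual[nxt]
        r_norm = np.linalg.norm(r)
        if r_norm > 1e-8:
            q = r / r_norm
            residual = residual - np.outer(residual @ q, q)

        # Coverage gain plus a bonus for directions not yet represented.
        score = min_d2 + beta * (residual ** 2).sum(axis=1)
        score[chosen] = -np.inf
        nxt = int(np.argmax(score))

    return np.sort(np.array(selected, dtype=int))
```

Calling select_coreset(keys, values, budget=B) on the keys and values of one attention head returns the indices of B tokens to retain. The first term of the score is the farthest-first coverage criterion; the second term rewards candidates whose features are not already spanned by the retained set, which is one simple way to express an orthogonality-driven diversity preference.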

In practice

Apply CoRDS to the bottom 25% of decoder layers for the best balance.
Use a cross-layer cascade for efficiency on long videos.
Prioritize value reconstruction (low alpha) for better attention output.
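
As a rough illustration of the first and third settings, the helper below picks which decoder layers to compress and records a low alpha. The names layers_to_compress and cords_settings are hypothetical, the 32-layer model is a made-up example, the alpha value is a placeholder, and "bottom" is assumed here to mean the lowest-indexed layers (those closest to the input); check the referenced repository for the actual convention.

```python
def layers_to_compress(num_layers, fraction=0.25):
    """Return indices of the lowest `fraction` of decoder layers.

    Assumption: "bottom" means the layers with the smallest indices.
    """
    count = max(1, int(num_layers * fraction))
    return list(range(count))

# Hypothetical settings for a 32-layer decoder.
cords_settings = {
    "layers": layers_to_compress(32),  # layers 0 through 7
    "alpha": 0.3,                      # placeholder: low alpha favours value reconstruction
}
```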

Topics

CoRDS
Streaming Video Understanding
KV-Cache Compression
Coreset Selection
Vision-Language Models

Code references

ailarmhz/CoRDS

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.