Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
Summary
Researchers from KAIST, Oracle, UC San Diego, and ETRI have developed Sink-Token-aware Pruning (SToP), a training-free method that improves fine-grained video understanding in Video Large Language Models (Video LLMs) by addressing "sink tokens." Existing visual token pruning methods cut computational cost by discarding up to 90% of visual tokens, but they degrade significantly on tasks that require precise visual grounding, such as hallucination evaluation. The degradation is often caused by sink tokens: semantically uninformative tokens that attract excessive attention. SToP introduces a sink score that quantifies a token's tendency to behave as a sink and integrates this score into both spatial and temporal pruning modules (STSP and STTP). Applied to state-of-the-art pruning methods such as VisionZip, FastVid, and HoliTom, SToP consistently improves performance across diverse benchmarks, including hallucination, open-ended generation, compositional reasoning, and MCQA, even at a 10% token retention ratio. It also adapts across LLM backbones such as LLaVA-OneVision-7B and Qwen2.5-VL.
Key takeaway
AI engineers optimizing Video LLMs for fine-grained tasks such as hallucination detection or compositional reasoning should integrate Sink-Token-aware Pruning (SToP) into their existing visual token pruning pipelines. Doing so mitigates performance degradation, especially under aggressive token budgets (e.g., 10% retention), because semantically rich visual cues are preserved in preference to uninformative "sink tokens." The result is improved model accuracy and more efficient inference with fewer frames.
Key insights
Sink tokens, semantically uninformative but high-attention visual tokens, hinder fine-grained video understanding in Video LLMs.
Principles
- Explicitly suppressing sink tokens improves fine-grained video understanding.
- Temporal pruning implicitly suppresses sink tokens.
- Attention-based pruning benefits from sink-aware adjustments.
Method
SToP quantifies sink token tendency with a "sink score" and integrates it into spatial and temporal pruning modules (STSP, STTP) to penalize sink-prone tokens, ensuring retention of semantically rich visual cues.
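A minimal sketch of how a sink score might be folded into attention-based token selection. This summary does not give the sink-score formula or the exact penalty used by STSP, so both are assumptions here: the sink score is taken as a precomputed per-token value in [0, 1], and the function name `sink_aware_topk`, the multiplicative penalty, and the `penalty` parameter are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np

def sink_aware_topk(attention, sink_score, keep_ratio=0.1, penalty=1.0):
    """Select token indices to keep after penalizing sink-prone tokens.

    attention  : per-token importance from an attention-based pruner, shape (N,)
    sink_score : per-token sink tendency in [0, 1], shape (N,); assumed to be
                 computed elsewhere (its formula is not part of this sketch)
    keep_ratio : fraction of tokens to retain (0.1 = 10% retention)
    penalty    : hypothetical exponent; 0 disables the sink penalty
    """
    attention = np.asarray(attention, dtype=float)
    sink_score = np.clip(np.asarray(sink_score, dtype=float), 0.0, 1.0)
    if attention.shape != sink_score.shape:
        raise ValueError("attention and sink_score must have the same shape")

    k = max(1, int(round(keep_ratio * attention.size)))
    # Assumed penalty form: scale importance down as sink tendency rises.
    adjusted = attention * (1.0 - sink_score) ** penalty
    keep = np.argsort(-adjusted, kind="stable")[:k]
    return np.sort(keep)  # restore original token order


# Token 0 has the highest attention but is flagged as a likely sink.
attention = [0.9, 0.2, 0.5, 0.1, 0.4, 0.05, 0.3, 0.15, 0.25, 0.35]
sink_score = [0.95, 0, 0, 0, 0, 0, 0, 0, 0, 0]

plain = sink_aware_topk(attention, sink_score, keep_ratio=0.2, penalty=0.0)
aware = sink_aware_topk(attention, sink_score, keep_ratio=0.2, penalty=1.0)
print(plain, aware)
```

With the penalty disabled, the high-attention sink token takes one of the two retained slots; with it enabled, that slot goes to the next most important non-sink token.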
In practice
- Apply SToP to existing pruning methods for improved video LLM performance.
- Prioritize fine-grained tasks for rigorous pruning method evaluation.
- Consider temporal pruning for implicit sink token suppression.
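For readers unfamiliar with the temporal pruning mentioned in the last point, here is a generic sketch. It is not the STTP module, whose details this summary does not describe. It drops a token when it is nearly identical to the token at the same position in the previous frame. The function name `temporal_prune_mask` and the cosine-similarity threshold are assumptions made for illustration, and the sketch makes no claim about why sink tokens are suppressed by such a step.

```python
import numpy as np

def temporal_prune_mask(frames, threshold=0.9):
    """Return a boolean keep-mask of shape (T, N) for visual tokens.

    frames    : token features, shape (T frames, N tokens per frame, D dims)
    threshold : hypothetical cosine-similarity cutoff; a token is dropped when
                it is at least this similar to the same position one frame earlier
    """
    frames = np.asarray(frames, dtype=float)
    if frames.ndim != 3:
        raise ValueError("frames must have shape (T, N, D)")

    norms = np.linalg.norm(frames, axis=-1, keepdims=True)
    unit = frames / np.maximum(norms, 1e-12)  # avoid division by zero

    # Cosine similarity between each token and its predecessor in time.
    sim = np.sum(unit[1:] * unit[:-1], axis=-1)  # shape (T-1, N)

    keep = np.ones(frames.shape[:2], dtype=bool)  # first frame is always kept
    keep[1:] = sim < threshold
    return keep


# Token 0 is static across frames; token 1 changes every frame.
frames = np.array([
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],
])
mask = temporal_prune_mask(frames)
print(mask)
```

The static token is kept only in the first frame, while the changing token is kept in all three, so four of the six tokens survive.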
Topics
- Video LLMs
- Visual Token Pruning
- Sink Tokens
- Fine-Grained Video Understanding
- Sink-Token-aware Pruning
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.