Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

Researchers from KAIST, Oracle, UC San Diego, and ETRI propose Sink-Token-aware Pruning (SToP), a training-free method that improves fine-grained video understanding in Video Large Language Models (Video LLMs) by accounting for "sink tokens": visual tokens that are semantically uninformative yet attract excessive attention. Existing visual token pruning methods cut computational cost by discarding up to 90% of visual tokens, but they degrade significantly on tasks that require precise visual grounding, such as hallucination evaluation, often because of these sink tokens. SToP introduces a sink score that quantifies each token's tendency to behave as a sink and integrates it into both a spatial and a temporal pruning module (STSP and STTP). Applied to state-of-the-art pruning methods such as VisionZip, FastVid, and HoliTom, SToP consistently improves performance on hallucination, open-ended generation, compositional reasoning, and multiple-choice question answering (MCQA) benchmarks, even at a 10% token retention ratio, and it carries over to different base models, including LLaVA-OneVision-7B and Qwen2.5-VL.

Key takeaway

AI engineers optimizing Video LLMs for fine-grained tasks such as hallucination detection or compositional reasoning should consider integrating Sink-Token-aware Pruning (SToP) into their existing visual token pruning pipelines. According to the reported results, doing so mitigates the performance degradation seen under aggressive token budgets (e.g., 10% retention) by preserving semantically rich visual cues over uninformative "sink tokens," which improves accuracy and enables more efficient inference with fewer frames.

Key insights

Sink tokens are visual tokens that carry little semantic information yet attract high attention, and they hinder fine-grained video understanding in Video LLMs.

Method

SToP quantifies each token's tendency to act as a sink with a "sink score" and integrates that score into a spatial and a temporal pruning module (STSP and STTP). The score penalizes sink-prone tokens during selection so that semantically rich visual cues are retained.
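The idea of penalizing sink-prone tokens during selection can be illustrated with a minimal sketch. This summary does not give the paper's sink score formula, so the proxy below is purely an assumption: a spatial position is treated as sink-like when it lands among the most-attended tokens in most frames. The names `sink_scores`, `prune_frame`, `top_k`, and `penalty` are hypothetical and do not come from SToP, STSP, or STTP.

```python
from typing import List


def sink_scores(attn: List[List[float]], top_k: int) -> List[float]:
    """Hypothetical sink-score proxy (NOT the paper's definition).

    attn[f][p] is the attention received by spatial position p in frame f.
    Returns, per position, the fraction of frames in which that position
    is among the top_k most-attended tokens of the frame.
    """
    num_frames = len(attn)
    num_pos = len(attn[0])
    counts = [0] * num_pos
    for frame in attn:
        ranked = sorted(range(num_pos), key=lambda i: frame[i], reverse=True)
        for i in ranked[:top_k]:
            counts[i] += 1
    return [c / num_frames for c in counts]


def prune_frame(frame_attn: List[float], sink: List[float],
                keep: int, penalty: float = 1.0) -> List[int]:
    """Keep `keep` token indices, ranked by attention scaled down by sink score.

    penalty=0.0 reduces to plain attention-based top-k selection;
    penalty=1.0 zeroes the score of a position whose sink score is 1.0.
    """
    adjusted = [a * (1.0 - penalty * s) for a, s in zip(frame_attn, sink)]
    ranked = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    # Toy data: position 0 dominates attention in every frame (sink-like),
    # while the informative position moves from frame to frame.
    attn = [
        [0.70, 0.20, 0.05, 0.05],
        [0.70, 0.05, 0.20, 0.05],
        [0.70, 0.05, 0.05, 0.20],
    ]
    sink = sink_scores(attn, top_k=1)
    print(prune_frame(attn[0], sink, keep=1, penalty=0.0))  # [0]
    print(prune_frame(attn[0], sink, keep=1))               # [1]
```

In this toy setting, plain attention-based selection keeps only the recurring high-attention position, while the penalized ranking keeps the token that varies with frame content. How SToP actually computes and applies its score should be taken from the original paper.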

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.