Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
Summary
Spatio-Temporal Token Scoring (STTS) is a lightweight module that improves the computational efficiency of video-based vision-language models (VLMs). It targets temporal redundancy in video by pruning vision tokens across both the Vision Transformer (ViT) and the Large Language Model (LLM), without text conditioning or token merging. STTS is compatible with end-to-end training: it learns to score tokens temporally via an auxiliary loss and spatially through downstream gradients from the LLM, and an efficient packing algorithm supports the pruning. Pruning 50% of vision tokens throughout the architecture yields a 62% improvement in training and inference efficiency with only a 0.7% average performance drop across 13 video QA tasks. Efficiency gains grow as more frames are sampled, and test-time scaling for long-video QA yields 0.5-1% performance gains over the baseline.
Key takeaway
For AI engineers and research scientists developing video-based VLMs, STTS offers a substantial efficiency boost: a 62% improvement in training and inference efficiency at the cost of a 0.7% average performance drop. Consider integrating STTS to reduce resource usage, especially for long videos or large numbers of sampled frames, where the efficiency gains are largest. This makes VLM deployments more scalable and cost-effective.
Key insights
STTS unifies spatio-temporal vision token pruning across ViT and LLM for efficient video VLM processing.
Principles
- Temporal redundancy is a key target for video VLM efficiency.
- Unified pruning across ViT and LLM improves VLM efficiency.
- Auxiliary loss can guide temporal token scoring.
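To make the idea of temporal redundancy concrete, here is a minimal sketch of one way to measure it. This summary does not specify the auxiliary loss STTS uses, so the signal below (one minus the cosine similarity between a token and the token at the same position in the previous frame) is an assumption chosen for illustration, and the names `cosine` and `temporal_novelty` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def temporal_novelty(frames: List[List[Sequence[float]]]) -> List[List[float]]:
    """Hypothetical redundancy signal, NOT the loss described in the paper.

    frames[f][t] is the feature vector of token t in frame f.
    A token that closely repeats the co-located token of the previous
    frame gets novelty near 0; a token that changed gets novelty near 1.
    Tokens in the first frame have no predecessor and get novelty 1.0.
    """
    novelty = []
    for f, frame in enumerate(frames):
        if f == 0:
            novelty.append([1.0] * len(frame))
        else:
            prev = frames[f - 1]
            novelty.append([1.0 - cosine(tok, prev[t]) for t, tok in enumerate(frame)])
    return novelty
```

A signal like this could serve as a target for a learned temporal scorer, with low-novelty tokens being natural candidates for pruning.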
Method
STTS prunes vision tokens by learning spatio-temporal scores via an auxiliary loss and LLM gradients, aided by an efficient packing algorithm, achieving 50% token reduction.
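The pruning and packing step can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: it assumes each token already has a scalar score, keeps the top-scoring fraction across all frames, and packs the survivors into one flat sequence with cumulative per-frame offsets. The function name `prune_and_pack` and the offset layout are hypothetical.

```python
from typing import List, Tuple


def prune_and_pack(
    scores: List[List[float]],
    keep_ratio: float = 0.5,
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Keep the highest-scoring tokens across all frames and pack them.

    scores[f][t] is an assumed importance score for token t of frame f.
    Returns (kept, cu_seqlens):
      kept       : flat list of (frame, token) indices in original order
      cu_seqlens : cumulative offsets; frame f occupies
                   kept[cu_seqlens[f]:cu_seqlens[f + 1]]
    """
    flat = [(s, f, t) for f, row in enumerate(scores) for t, s in enumerate(row)]
    n_keep = max(1, int(len(flat) * keep_ratio))
    # Descending score; ties are broken by original position for determinism.
    top = sorted(flat, key=lambda x: (-x[0], x[1], x[2]))[:n_keep]
    # Restore original (frame, token) order so positions stay monotonic.
    kept = sorted((f, t) for _, f, t in top)
    cu_seqlens = [0]
    for f in range(len(scores)):
        count = sum(1 for kept_frame, _ in kept if kept_frame == f)
        cu_seqlens.append(cu_seqlens[-1] + count)
    return kept, cu_seqlens
```

Because frames may keep different numbers of tokens, a flat packed layout with offsets avoids padding every frame to the same length, which is one plausible reason a packing step matters for realizing the efficiency gain.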
In practice
- Prune 50% of vision tokens for a 62% efficiency gain at a 0.7% average performance drop.
- Apply test-time scaling on long-video QA for a 0.5-1% performance gain over the baseline.
- Integrate STTS for end-to-end VLM training.
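One way to see why gains could grow with the number of sampled frames is a back-of-the-envelope cost model. The sketch below uses the standard rough estimate for one transformer layer's forward pass (about 24·n·d² FLOPs for projections and a 4x MLP, plus 4·n²·d for attention) and compares full versus pruned token counts. It is a simplified model, not the paper's measurement and not a derivation of the 62% figure; the hidden size and tokens-per-frame values are hypothetical.

```python
def layer_flops(n_tokens: int, d_model: int) -> int:
    """Rough forward FLOPs for one transformer layer.

    Linear part (QKV/output projections + 4x MLP): ~24 * n * d^2
    Attention part (QK^T and attention @ V):       ~4 * n^2 * d
    """
    return 24 * n_tokens * d_model**2 + 4 * n_tokens**2 * d_model


def relative_saving(
    n_frames: int,
    tokens_per_frame: int = 196,  # hypothetical value for illustration
    d_model: int = 1024,          # hypothetical value for illustration
    keep_ratio: float = 0.5,
) -> float:
    """Fraction of per-layer FLOPs saved by keeping `keep_ratio` of tokens."""
    n_full = n_frames * tokens_per_frame
    n_kept = int(n_full * keep_ratio)
    return 1.0 - layer_flops(n_kept, d_model) / layer_flops(n_full, d_model)


if __name__ == "__main__":
    for frames in (8, 16, 32, 64):
        print(frames, round(relative_saving(frames), 3))
```

Under this model, halving the tokens halves the linear term but quarters the quadratic attention term, so the relative saving rises toward 75% as the sequence gets longer.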
Topics
- Token Pruning
- Vision-Language Models
- Video Processing
- Computational Efficiency
- Spatio-Temporal Scoring
Best for: AI Engineer, AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.