Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

2026-03-18 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, quick

Summary

Spatio-Temporal Token Scoring (STTS) is a lightweight module that improves the computational efficiency of video-based vision-language models (VLMs). It targets temporal redundancy in video tasks by pruning vision tokens across both the Vision Transformer (ViT) and the Large Language Model (LLM), without requiring text conditioning or token merging. The module is compatible with end-to-end training: it learns to score tokens temporally through an auxiliary loss and spatially through gradients from the downstream LLM, supported by an efficient packing algorithm. Pruning 50% of vision tokens throughout the architecture improves training and inference efficiency by 62%, with only a 0.7% average performance drop across 13 video QA tasks. The efficiency gains grow as more frames are sampled, and test-time scaling for long-video QA yields performance gains of 0.5-1% over the baseline.

Key takeaway

For AI engineers and research scientists developing video-based VLMs, STTS offers a substantial efficiency gain: a reported 62% improvement in training and inference efficiency for a 0.7% average performance drop. Consider integrating it to reduce resource usage, especially when working with long videos or many sampled frames per video, where the gains are largest. This makes VLM deployments more scalable and cost-effective.

Key insights

STTS unifies spatio-temporal vision token pruning across ViT and LLM for efficient video VLM processing.

Principles

Temporal redundancy is a key target for video VLM efficiency.
Unified pruning across ViT and LLM improves VLM efficiency.
Auxiliary loss can guide temporal token scoring.
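
To make "temporal redundancy" concrete, here is a minimal sketch of one way to quantify it: the cosine similarity between a patch and the same patch in the previous frame. This is an illustrative proxy only. The summary does not say how the STTS auxiliary loss is defined, and the function names `cosine` and `temporal_redundancy` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def temporal_redundancy(frames):
    """frames[t][p] is the feature vector of patch p in frame t.

    Returns r with r[t][p] = similarity of patch p in frame t to the same
    patch in frame t-1. The first frame has no predecessor, so it gets 0.0.
    This is an assumed proxy, not the loss used by STTS.
    """
    redundancy = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        redundancy.append([cosine(a, b) for a, b in zip(prev, cur)])
    return redundancy
```

A patch whose similarity to its predecessor is close to 1.0 carries little new information, which is the kind of token a temporal scorer would plausibly rank low.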

Method

STTS prunes vision tokens by learning spatio-temporal scores via an auxiliary loss and LLM gradients, aided by an efficient packing algorithm, achieving 50% token reduction.
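
The prune-then-pack step can be sketched as follows, assuming the scores have already been produced by some learned scorer. The sketch keeps the top-scoring half of each video's tokens in their original order, then concatenates the survivors into one flat sequence with cumulative offsets. The function name `prune_and_pack`, the `keep_ratio` parameter, and the offset-based layout are assumptions for illustration; the summary does not describe the actual STTS packing algorithm.

```python
def prune_and_pack(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens per video and pack them contiguously.

    tokens: list of per-video token lists (variable length).
    scores: list of per-video score lists with the same shape as tokens.
    Returns (packed, offsets), where video i occupies
    packed[offsets[i]:offsets[i + 1]].
    """
    packed = []
    offsets = [0]
    for video_tokens, video_scores in zip(tokens, scores):
        # Keep at least one token so no video becomes empty.
        k = max(1, int(len(video_tokens) * keep_ratio))
        # Indices of the k highest scores.
        top = sorted(
            range(len(video_tokens)),
            key=lambda i: video_scores[i],
            reverse=True,
        )[:k]
        # Restore original (spatio-temporal) order among the kept tokens.
        for i in sorted(top):
            packed.append(video_tokens[i])
        offsets.append(len(packed))
    return packed, offsets


tokens = [["a", "b", "c", "d"], ["e", "f"]]
scores = [[0.1, 0.9, 0.8, 0.2], [0.3, 0.7]]
packed, offsets = prune_and_pack(tokens, scores)
```

Packing into one flat sequence avoids padding every video to the longest surviving length, which is the usual motivation for offset-based layouts in variable-length attention kernels.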

In practice

Prune 50% of vision tokens for a 62% efficiency gain at a 0.7% average performance drop.
Apply test-time scaling to long-video QA for 0.5-1% gains over the baseline.
Integrate STTS for end-to-end VLM training.
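
A toy cost model suggests why the savings from pruning grow with the number of sampled frames. It counts multiply-accumulates for one standard transformer layer: a term linear in sequence length (projections and a 4x MLP) plus a term quadratic in sequence length (attention). The constants, `d_model=1024`, and `tokens_per_frame=196` are assumptions for illustration. This is not the measurement behind the reported 62% figure.

```python
def layer_cost(n_tokens, d_model=1024):
    """Rough multiply-accumulate count for one transformer layer.

    linear:    Q, K, V, output projections (4 * n * d^2) + 4x MLP (8 * n * d^2)
    attention: QK^T and attention-weighted V (2 * n^2 * d)
    """
    linear = 12 * n_tokens * d_model * d_model
    attention = 2 * n_tokens * n_tokens * d_model
    return linear + attention


def pruning_speedup(n_frames, tokens_per_frame=196, keep_ratio=0.5):
    """Ratio of unpruned to pruned layer cost under the toy model above."""
    full = n_frames * tokens_per_frame
    kept = int(full * keep_ratio)
    return layer_cost(full) / layer_cost(kept)


for n_frames in (8, 64, 256):
    print(n_frames, round(pruning_speedup(n_frames), 2))
```

With half the tokens kept, the ratio starts near 2 for short sequences, where the linear term dominates, and approaches 4 for long ones, where the quadratic attention term dominates.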

Topics

Token Pruning
Vision-Language Models
Video Processing
Computational Efficiency
Spatio-Temporal Scoring

Best for: AI Engineer, AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.