Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
Summary
SPpruner is a subject-centric, progressive visual token reduction method for Vision-Language Models (VLMs) that targets the high inference cost of long visual token sequences. Where prior methods often discard critical contextual information, SPpruner emulates the "Focus-then-Context" mechanism of human visual perception. A Focus Identification Module captures a broad range of salient visual subjects, and a Context-Aware Structural Scanning Module aggregates contextual cues to restore global relational dependencies. The method is plug-and-play: it achieves up to a 2.53x speedup on Qwen2.5-VL while retaining only 22.2% of visual tokens, and a 67% FLOPs reduction on LLaVA with a 0.6% accuracy drop. On Qwen2.5-VL-7B, it reduces prefill latency by 53% and FLOPs by 64%, yielding a 2.15x speedup.
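As a rough consistency check on the Qwen2.5-VL-7B figures, and assuming the reported speedup is the ratio of original to reduced prefill latency (the summary does not state this explicitly), a 53% latency reduction works out as:

```latex
\text{speedup} = \frac{T_{\text{orig}}}{T_{\text{reduced}}} = \frac{1}{1 - 0.53} \approx 2.13
```

This is close to the reported 2.15x; the small gap would be consistent with the 53% figure being rounded.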
Key takeaway
For machine learning engineers deploying Vision-Language Models in resource-constrained environments, SPpruner is a plug-and-play way to accelerate inference. It reports substantial speedups and FLOPs reductions, such as 2.53x faster inference on Qwen2.5-VL, with minimal accuracy degradation, and it requires no model retraining or fine-tuning. The approach is designed to preserve comprehensive visual understanding even under aggressive token reduction.
Key insights
SPpruner reduces VLM computational cost by mimicking human "Focus-then-Context" perception to retain salient subjects and their structural context.
Principles
- Visual token reduction should prioritize subject perception over statistical filtering.
- Combine intrinsic visual saliency with semantic relevance for comprehensive subject identification.
- Aggregate contextual cues to restore global relational dependencies among subjects.
Method
SPpruner employs a Focus Identification Module for salient subject selection and a Context-Aware Structural Scanning Module with Structure-Responsive Sampling to aggregate contextual cues.
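The focus step can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the saliency measure (distance from the mean token), the relevance measure (cosine similarity with a pooled text embedding), the mixing weight `alpha`, and the function name `select_focus_tokens` are all assumptions chosen for illustration. Only the idea of combining the two scores and keeping a fraction of tokens comes from the summary.

```python
import numpy as np

def select_focus_tokens(visual, text, keep_ratio=0.222, alpha=0.5):
    """Return sorted indices of visual tokens kept as 'focus' tokens.

    visual: (N, D) array of visual token embeddings.
    text:   (D,) pooled text/query embedding.
    alpha:  hypothetical mixing weight between the two scores.
    """
    # Assumed intrinsic saliency: distance of each token from the mean token.
    saliency = np.linalg.norm(visual - visual.mean(axis=0, keepdims=True), axis=1)

    # Assumed semantic relevance: cosine similarity with the text embedding.
    v_norm = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + 1e-8)
    t_norm = text / (np.linalg.norm(text) + 1e-8)
    relevance = v_norm @ t_norm

    def minmax(x):
        # Rescale to [0, 1] so the two scores are comparable.
        return (x - x.min()) / (x.max() - x.min() + 1e-8)

    score = alpha * minmax(saliency) + (1 - alpha) * minmax(relevance)

    # Keep the top-k tokens by combined score, preserving spatial order.
    k = max(1, int(round(keep_ratio * len(visual))))
    top = np.argsort(score)[-k:]
    return np.sort(top)
```

With 100 visual tokens and the default `keep_ratio` of 0.222, this keeps 22 tokens. Sorting the kept indices preserves the original token order, which a positional encoding would typically rely on.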
In practice
- Implement subject-centric token reduction for VLM inference acceleration.
- Score visual tokens using both intrinsic saliency and semantic relevance.
- Dynamically adjust token sampling stride based on local structural divergence.
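The last point above, adjusting the sampling stride to local structure, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's Structure-Responsive Sampling: divergence is taken here as one minus the cosine similarity between neighbouring tokens in scan order, and the two-level stride rule, the `threshold`, and the function name are hypothetical.

```python
import numpy as np

def structure_responsive_sample(tokens, min_stride=1, max_stride=4, threshold=0.5):
    """Scan tokens in order and return the list of sampled indices.

    tokens: (N, D) array of context tokens in scan order.
    A small stride is used where neighbouring tokens differ strongly,
    a large stride where the region is locally uniform.
    """
    if min_stride < 1 or max_stride < min_stride:
        raise ValueError("require 1 <= min_stride <= max_stride")
    n = len(tokens)
    if n == 0:
        return []

    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    # Assumed local divergence: 1 - cosine similarity of token i and token i+1.
    divergence = 1.0 - np.sum(normed[:-1] * normed[1:], axis=1)

    kept = []
    i = 0
    while i < n:
        kept.append(i)
        # The last token has no right-hand neighbour; treat it as uniform.
        d = divergence[i] if i < n - 1 else 0.0
        stride = min_stride if d > threshold else max_stride
        i += stride
    return kept
```

On a run of identical tokens the sketch samples every fourth token, while on a sequence that alternates between two orthogonal tokens it keeps every one. A smoother mapping from divergence to stride would be a natural refinement.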
Topics
- Vision-Language Models
- Token Reduction
- Computational Efficiency
- Visual Perception
- Deep Learning Inference
- Qwen2.5-VL
- LLaVA
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.