Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

Source: cs.CV updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

SPpruner is a subject-centric, progressive visual token reduction method for Vision-Language Models (VLMs). It targets the high inference cost of long visual token sequences. Where prior methods often discard critical contextual information, SPpruner emulates the "Focus-then-Context" mechanism of human visual perception: a Focus Identification Module captures a broad spectrum of salient visual subjects, and a Context-Aware Structural Scanning Module aggregates contextual cues to restore global relational dependencies. The method is plug-and-play. Reported results include up to a 2.53x speedup on Qwen2.5-VL with only 22.2% of visual tokens retained, and a 67% FLOPs reduction on LLaVA with a 0.6% accuracy drop. On Qwen2.5-VL-7B it cuts prefill latency by 53% and FLOPs by 64%, for a 2.15x speedup.
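To make the quoted figures easier to relate to one another, here is a small arithmetic sketch. It assumes the speedup and the latency reduction refer to the same measurement, which this summary does not state explicitly; the function names and the 1000-token count are hypothetical and used only for illustration.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Speedup factor implied by a fractional latency reduction.

    A reduction of r leaves (1 - r) of the original time, so the
    speedup is 1 / (1 - r).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def kept_tokens(n_visual: int, keep_ratio: float) -> int:
    """Number of visual tokens that survive pruning at a given ratio."""
    return round(n_visual * keep_ratio)


# A 53% prefill latency reduction, as quoted for Qwen2.5-VL-7B.
print(f"{speedup_from_latency_reduction(0.53):.2f}x")

# Hypothetical image encoded as 1000 visual tokens, 22.2% retained.
print(kept_tokens(1000, 0.222))
```

A rounded 53% reduction works out to roughly 2.13x, which is consistent with the quoted 2.15x if the percentage was rounded down from about 53.5%.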

Key takeaway

For Machine Learning Engineers deploying Vision-Language Models in resource-constrained environments, SPpruner is a plug-and-play way to accelerate inference. Consider integrating it to cut inference time and FLOPs: the reported gains include a speedup of up to 2.53x on Qwen2.5-VL with minimal accuracy degradation. The approach is designed to preserve comprehensive visual understanding under aggressive token reduction, and it requires no model retraining or fine-tuning.

Key insights

SPpruner reduces VLM computational cost by mimicking human "Focus-then-Context" perception to retain salient subjects and their structural context.

Method

SPpruner first applies a Focus Identification Module to select salient subject tokens, then a Context-Aware Structural Scanning Module that uses Structure-Responsive Sampling to aggregate contextual cues.
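The two-stage idea can be illustrated with a minimal sketch. This is not SPpruner's implementation: the summary does not specify how saliency is scored or how Structure-Responsive Sampling works, so the code assumes a precomputed per-token saliency score and substitutes a plain uniform-stride scan over the remaining tokens for the context stage. The function name, `keep_ratio`, and `focus_share` are hypothetical.

```python
def focus_then_context(scores, keep_ratio=0.222, focus_share=0.7):
    """Return sorted indices of visual tokens to keep.

    scores      : per-token saliency in raster order (assumed given)
    keep_ratio  : fraction of tokens to retain overall
    focus_share : share of the budget spent on top-saliency tokens
    """
    n = len(scores)
    budget = max(1, round(n * keep_ratio))
    n_focus = max(1, min(budget, round(budget * focus_share)))

    # Focus stage: keep the highest-saliency tokens.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = set(order[:n_focus])

    # Context stage (stand-in): sample the remaining tokens at a
    # uniform stride in raster order so the whole image stays covered.
    n_context = budget - n_focus
    rest = [i for i in range(n) if i not in kept]
    if n_context > 0 and rest:
        stride = len(rest) / n_context
        for k in range(n_context):
            kept.add(rest[int(k * stride)])

    return sorted(kept)


# Toy 6x6 token grid with a salient 2x2 subject in the middle.
scores = [0.1] * 36
for idx in (14, 15, 20, 21):
    scores[idx] = 1.0

print(focus_then_context(scores, keep_ratio=0.25, focus_share=0.45))
```

In the toy run, the four subject tokens are kept by the focus stage and the rest of the budget is spread across the grid, which is the intuition behind retaining context rather than only the most salient region.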

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.