Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

SPpruner is a subject-centric, progressive visual token reduction method for Vision-Language Models (VLMs) that targets the high inference cost of long visual token sequences. Where prior methods often discard critical contextual information, SPpruner follows the "Focus-then-Context" pattern of human visual perception. A Focus Identification Module captures a broad range of salient visual subjects, and a Context-Aware Structural Scanning Module aggregates contextual cues to restore global relational dependencies. The method is plug-and-play. Reported results: up to a 2.53x speedup on Qwen2.5-VL with only 22.2% of visual tokens retained; a 67% FLOPs reduction on LLaVA with a 0.6% accuracy drop; and, on Qwen2.5-VL-7B, 53% lower prefill latency and 64% fewer FLOPs, for a 2.15x speedup.
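
As a sanity check on the figures above, a latency reduction and a speedup factor are two views of the same ratio. The short sketch below converts between them. It assumes the 2.15x figure refers to prefill latency, which the summary does not state explicitly, and the function names are illustrative only.

```python
def speedup_from_latency_reduction(reduction):
    """Speedup factor implied by a fractional latency reduction in [0, 1)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def latency_reduction_from_speedup(speedup):
    """Fractional latency reduction implied by a speedup factor (> 0)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# A 53% latency reduction implies roughly a 2.13x speedup.
implied_speedup = speedup_from_latency_reduction(0.53)

# A 2.15x speedup implies roughly a 53.5% latency reduction.
implied_reduction = latency_reduction_from_speedup(2.15)
```

Under that assumption the two Qwen2.5-VL-7B numbers agree to within rounding.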

Key takeaway

For machine learning engineers deploying Vision-Language Models in resource-constrained environments, SPpruner is a plug-and-play way to accelerate inference: up to 2.53x faster on Qwen2.5-VL, with substantial FLOPs reductions and minimal accuracy degradation, and no model retraining or fine-tuning. Because it retains salient subjects together with their surrounding context, it is designed to preserve visual understanding even under aggressive token reduction.

Key insights

SPpruner reduces VLM computational cost by mimicking human "Focus-then-Context" perception to retain salient subjects and their structural context.

Principles

Visual token reduction should prioritize subject perception over statistical filtering.
Combine intrinsic visual saliency with semantic relevance for comprehensive subject identification.
Aggregate contextual cues to restore global relational dependencies among subjects.
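
The second principle can be illustrated with a minimal sketch. It assumes saliency is already available as one number per token (the summary does not say how SPpruner computes it), treats semantic relevance as cosine similarity between each visual token and a text query embedding, and blends the two with a hypothetical weight `alpha`. None of these names or choices come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def focus_scores(saliency, tokens, query, alpha=0.5):
    """Blend per-token saliency with relevance to the text query.

    alpha is a hypothetical mixing weight, not a parameter from the paper.
    """
    relevance = [cosine(t, query) for t in tokens]
    s, r = min_max(saliency), min_max(relevance)
    return [alpha * si + (1 - alpha) * ri for si, ri in zip(s, r)]


def top_k(scores, k):
    """Indices of the k highest scores, returned in original token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy example: four visual tokens with 2-d features.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
saliency = [0.1, 0.9, 0.5, 0.2]
query = [1.0, 0.0]
scores = focus_scores(saliency, tokens, query)
focus = top_k(scores, 2)
```

In the toy example, token 0 matches the query best but has the lowest saliency, so the blended score favors tokens 1 and 2 instead. This shows why neither signal alone decides the outcome.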

Method

SPpruner employs a Focus Identification Module for salient subject selection and a Context-Aware Structural Scanning Module with Structure-Responsive Sampling to aggregate contextual cues.
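
One way the two stages might compose is sketched below. It assumes per-token focus scores are already computed, splits the retained-token budget with a hypothetical `focus_fraction`, and uses a uniform stride as a simple stand-in for Structure-Responsive Sampling. It is an illustration of the focus-then-context ordering, not the authors' implementation.

```python
def focus_then_context(scores, keep_ratio, focus_fraction=0.5):
    """Pick token indices in two passes: salient focus tokens, then context.

    scores: one focus score per visual token, in raster order.
    keep_ratio: fraction of tokens to retain overall (e.g. 0.222).
    focus_fraction: hypothetical share of the budget given to focus tokens.
    """
    n = len(scores)
    budget = max(1, round(n * keep_ratio))
    n_focus = max(1, round(budget * focus_fraction))
    n_context = budget - n_focus

    # Pass 1 (focus): the highest-scoring tokens.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    focus = set(ranked[:n_focus])

    # Pass 2 (context): scan the remaining tokens in raster order and
    # sample them at a uniform stride so coverage spans the whole image.
    rest = [i for i in range(n) if i not in focus]
    context = []
    if n_context > 0 and rest:
        stride = len(rest) / n_context
        count = min(n_context, len(rest))
        context = [rest[int(j * stride)] for j in range(count)]

    # Return indices in original order so positions stay consistent.
    return sorted(focus | set(context))


# Toy example: 18 tokens, two of which score highly.
scores = [0.0] * 18
scores[5], scores[6] = 1.0, 0.9
kept = focus_then_context(scores, keep_ratio=0.222)  # [0, 5, 6, 10]
```

Returning the kept indices in their original order is a deliberate choice here: it lets the downstream model keep using the original position information for the surviving tokens.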

In practice

Implement subject-centric token reduction for VLM inference acceleration.
Score visual tokens using both intrinsic saliency and semantic relevance.
Dynamically adjust token sampling stride based on local structural divergence.
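
The last item above can be sketched as follows. The summary does not define "local structural divergence", so this example approximates it as 1 minus the cosine similarity between a token and its successor in scan order. That definition and the `min_stride` and `max_stride` bounds are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def adaptive_stride_sample(tokens, min_stride=1, max_stride=4):
    """Walk tokens in scan order, taking longer steps through uniform regions."""
    kept, i, n = [], 0, len(tokens)
    while i < n:
        kept.append(i)
        if i + 1 >= n:
            break
        # Assumed divergence measure: 1 - cosine similarity, clamped to [0, 1].
        divergence = 1.0 - cosine(tokens[i], tokens[i + 1])
        d = min(max(divergence, 0.0), 1.0)
        # High divergence gives a small stride; low divergence a large one.
        stride = round(max_stride - d * (max_stride - min_stride))
        i += max(min_stride, stride)
    return kept


# Toy example: a flat region of 8 identical tokens, then 4 alternating ones.
tokens = [[1.0, 0.0]] * 8 + [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
kept = adaptive_stride_sample(tokens)  # [0, 4, 8, 9, 10, 11]
```

The flat region is sampled every fourth token while the rapidly changing region is kept densely. Because this sketch only compares adjacent tokens, a large stride can step over a boundary without seeing it; a practical version would likely look at a wider neighborhood.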

Topics

Vision-Language Models
Token Reduction
Computational Efficiency
Visual Perception
Deep Learning Inference
Qwen2.5-VL
LLaVA

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.