Moving Beyond Diversity: Visual Token Pruning as Subspace Reconstruction for Efficient VLMs

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

SPARE is a subspace reconstruction method that makes Vision Language Models (VLMs) more efficient by cutting the computational overhead of their many visual tokens. Existing diversity-maximization strategies rely on cosine similarity, which discards token magnitude. SPARE instead reformulates token pruning as a column subset selection problem: it explicitly minimizes reconstruction error by iteratively selecting the tokens with the largest projection residuals, so selection depends on more than angular diversity. The method also identifies an "anti-relevance" phenomenon, in which tokens with lower image-text relevance scores preserve contextual information better than high-relevance ones, and builds it into the selection criterion. The authors report leading performance across several VLMs and benchmarks, with significant improvements on compositional multi-skill reasoning tasks. Applied to LLaVA, SPARE removes up to 94% of visual tokens while maintaining 95% of baseline performance, without any additional training.
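For orientation, the column subset selection objective mentioned above can be written in its textbook form. The notation here (token matrix X with one visual token per column, index set S, budget k) is an assumption made for illustration; the paper's exact objective, including how the anti-relevance term enters, may differ.

```latex
% Textbook column subset selection; notation is assumed, not taken from the paper.
% X = [x_1, ..., x_n] in R^{d x n} holds one visual token per column.
\[
  S^\star = \arg\min_{S \subseteq \{1,\dots,n\},\; |S| = k}
  \bigl\| X - X_S X_S^{+} X \bigr\|_F^2 ,
\]
% X_S: the columns of X indexed by S;  X_S^{+}: its Moore-Penrose pseudoinverse.
% Projection residual of token i against the current selection S:
\[
  r_i(S) = \bigl\| \bigl(I - X_S X_S^{+}\bigr) x_i \bigr\|_2 ,
  \qquad
  r_i(S) \;\text{scales by } |c| \text{ when } x_i \mapsto c\,x_i .
\]
```

Cosine similarity is unchanged when a token is rescaled, whereas the residual grows with the token's norm. This is the sense in which a reconstruction criterion keeps the magnitude information that a purely angular diversity measure drops.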

Key takeaway

For machine learning engineers optimizing Vision Language Models for deployment or complex reasoning tasks, SPARE is a training-free option worth evaluating. On LLaVA it removes up to 94% of visual tokens while retaining 95% of baseline performance, with particular benefit on compositional multi-skill reasoning and no additional training cost.

Key insights

SPARE prunes VLM visual tokens by minimizing reconstruction error and leveraging anti-relevance for context preservation.

Principles

Method

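SPARE treats pruning as column subset selection. At each step it selects the visual token with the largest projection residual relative to the tokens already kept, which greedily reduces reconstruction error. An anti-relevance term in the selection criterion further prioritizes tokens with lower image-text relevance, which the authors find better preserve context.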
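A minimal NumPy sketch of residual-driven greedy selection follows. It is a generic pivoted Gram-Schmidt heuristic for column subset selection, not the authors' implementation. The function names, the `relevance` input, the `alpha` exponent, and the multiplicative `(1 - relevance)` weighting are all hypothetical choices made for illustration.

```python
import numpy as np


def greedy_residual_select(tokens, k, relevance=None, alpha=1.0, eps=1e-12):
    """Pick k token indices by largest weighted projection residual.

    tokens:    (n, d) array, one visual token per row.
    relevance: optional (n,) image-text relevance scores (hypothetical input).
    alpha:     hypothetical exponent controlling the anti-relevance strength.
    """
    X = np.asarray(tokens, dtype=np.float64)
    n = X.shape[0]
    residual = X.copy()

    if relevance is None:
        weight = np.ones(n)
    else:
        r = np.asarray(relevance, dtype=np.float64)
        span = r.max() - r.min()
        r01 = (r - r.min()) / span if span > 0 else np.zeros(n)
        # Assumed weighting: lower relevance gives a larger weight.
        weight = (1.0 - r01) ** alpha + eps

    chosen = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(min(k, n)):
        # Squared norm of each token's component outside the selected span.
        norms_sq = np.einsum("ij,ij->i", residual, residual)
        score = norms_sq * weight
        score[chosen] = -np.inf
        j = int(np.argmax(score))
        chosen[j] = True
        selected.append(j)
        norm = np.sqrt(norms_sq[j])
        if norm > eps:
            # Remove the newly covered direction from every residual.
            q = residual[j] / norm
            residual -= np.outer(residual @ q, q)
    return selected


def reconstruction_error(tokens, selected):
    """Squared Frobenius error of rebuilding all tokens from the selected ones."""
    X = np.asarray(tokens, dtype=np.float64)
    S = X[selected]  # (k, d)
    coeffs, *_ = np.linalg.lstsq(S.T, X.T, rcond=None)  # (k, n)
    return float(np.linalg.norm(X - coeffs.T @ S) ** 2)
```

For the four tokens `[[10, 0], [2, 0], [0, 1], [0, 0.5]]` with `k=2`, the sketch returns `[0, 2]`: the large-magnitude token is chosen first, and its collinear neighbor is then skipped because its residual is zero. Passing `relevance=[1, 0, 0, 0]` down-weights the first token and the result becomes `[1, 2]`.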

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.