VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

VisPCO is a framework for optimizing visual token pruning configurations in vision-language models (VLMs). It targets the quadratic growth in computation caused by the many visual tokens that high-resolution image and video inputs produce. Existing pruning methods often rely on predefined configurations, with no guarantee that the resulting compute-performance trade-off is optimal. VisPCO formulates configuration selection as a Pareto optimization problem, uses continuous relaxation and straight-through estimators to enable gradient-based search, and solves the problem with the Augmented Lagrangian method. Experiments across 8 visual benchmarks show that VisPCO accurately approximates the empirical Pareto frontier obtained by grid search and generalizes well across pruning methods and VLM architectures. The framework also uses learnable kernel functions to analyze layer-wise pruning patterns. That analysis indicates that multi-step progressive pruning better captures VLMs' hierarchical compression structure and yields better accuracy-efficiency trade-offs than single-layer methods.
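To make the "empirical Pareto frontier derived from grid search" concrete, here is a minimal sketch of how such a frontier could be extracted from a set of evaluated configurations. The function name `pareto_frontier` and the (relative cost, accuracy) numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from typing import List, Tuple

def pareto_frontier(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the non-dominated (cost, accuracy) points, sorted by cost.

    A point is dominated when another point is at least as cheap and at
    least as accurate, and strictly better in one of the two.
    """
    # Sort by cost ascending; for equal cost, put the most accurate first.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    frontier = []
    best_acc = float("-inf")
    for cost, acc in ordered:
        # Keep a point only if it beats every cheaper point on accuracy.
        if acc > best_acc:
            frontier.append((cost, acc))
            best_acc = acc
    return frontier

# Hypothetical grid-search results: (relative compute cost, accuracy).
grid = [(0.2, 0.61), (0.3, 0.60), (0.4, 0.70), (0.4, 0.66),
        (0.6, 0.72), (0.8, 0.71), (1.0, 0.74)]
frontier = pareto_frontier(grid)
print(frontier)
```

A grid search of this kind has to evaluate every configuration before filtering, which is the cost a gradient-based search aims to avoid.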

Key takeaway

For AI engineers and research scientists working with vision-language models, VisPCO offers a systematic way to choose visual token pruning configurations. Its Pareto configuration optimization automatically identifies configurations with better accuracy-efficiency trade-offs, replacing manual, predefined settings. This can reduce computational cost for high-resolution inputs while maintaining performance, particularly when multi-step progressive pruning is used.

Key insights

VisPCO optimizes visual token pruning in VLMs by formulating it as a Pareto configuration problem for optimal efficiency.

Method

VisPCO uses continuous relaxation, straight-through estimators, and the Augmented Lagrangian method for gradient-based search to identify optimal pruning configurations.
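The three ingredients named above can be sketched on a toy problem. This is an illustrative sketch under stated assumptions, not the paper's implementation: the surrogate accuracy (a weighted sum of square roots of per-layer keep ratios), the layer weights, the 10-level ratio grid, and all hyperparameters are hypothetical. Gradients are written by hand; in an autograd framework the straight-through step is usually expressed with a stop-gradient instead.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def quantize(r: float, levels: int) -> float:
    """Snap a keep ratio to the discrete grid {1/levels, ..., 1}."""
    return min(levels, max(1, round(r * levels))) / levels

def evaluate(theta, weights, budget, levels):
    r = [sigmoid(t) for t in theta]            # continuous relaxation
    r_q = [quantize(x, levels) for x in r]     # discrete config used in the forward pass
    # Hypothetical surrogate: loss = -(weighted accuracy proxy).
    loss = -sum(w * math.sqrt(q) for w, q in zip(weights, r_q))
    g = sum(r_q) / len(r_q) - budget           # budget constraint, feasible when g <= 0
    return r, r_q, loss, g

def search(weights, budget, levels=10, rho=5.0, outer=30, inner=50, lr=0.5):
    n = len(weights)
    theta = [0.0] * n                          # unconstrained parameters behind each ratio
    lam = 0.0                                  # Lagrange multiplier for the budget
    best = None                                # best feasible (loss, config) seen so far
    for _ in range(outer):
        for _ in range(inner):
            r, r_q, loss, g = evaluate(theta, weights, budget, levels)
            if g <= 1e-9 and (best is None or loss < best[0]):
                best = (loss, list(r_q))
            # Gradient of the augmented Lagrangian for an inequality constraint:
            # grad(loss) + max(0, lam + rho * g) * grad(g).
            mu = max(0.0, lam + rho * g)
            for i in range(n):
                d_loss = -weights[i] / (2.0 * math.sqrt(r_q[i]))
                d_g = 1.0 / n
                # Straight-through estimator: treat d r_q / d r as 1,
                # then chain through the sigmoid derivative r * (1 - r).
                theta[i] -= lr * (d_loss + mu * d_g) * r[i] * (1.0 - r[i])
        # Multiplier update after each inner minimisation.
        _, _, _, g = evaluate(theta, weights, budget, levels)
        lam = max(0.0, lam + rho * g)
    return best, lam

# Hypothetical per-layer importance weights and a 50% average keep-ratio budget.
weights = [1.0, 2.0, 3.0, 4.0]
(best_loss, best_cfg), lam = search(weights, budget=0.5)
print(best_cfg, best_loss, lam)
```

Tracking the best feasible configuration is a convenience added here because the quantized forward pass can make the final iterate oscillate around the budget; it is not claimed to be part of VisPCO.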

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.