TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference
Summary
TOPS is a visual token pruning module that targets the computational overhead Multimodal Large Language Models (MLLMs) incur from processing large numbers of visual tokens. Existing pruning methods often fall short: they either retain redundant tokens or ignore the instruction when deciding what to keep. This research re-conceptualizes visual token pruning from first principles, formulating it as the construction of Token Optimal Preservation Sets through a top-down information-theoretic analysis. The analysis yields three core principles for effective token selection: Task Relevance, Information Coverage, and Semantic Diversity. The module is training-free and model-agnostic, and it is reported to outperform prior methods across 7 MLLM backbones and 14 benchmarks. Notably, on LLaVA-NeXT, TOPS removes 77.8% of visual tokens while retaining 100.0% and 100.6% of performance on the 7B and 13B models, respectively. These results suggest potential for hallucination mitigation and lightweight MLLM design.
Key takeaway
For Machine Learning Engineers optimizing MLLM inference efficiency, TOPS offers a principled, training-free way to sharply reduce the number of visual tokens. On LLaVA-NeXT 7B and 13B, it removes 77.8% of visual tokens while retaining 100.0% and 100.6% of performance, respectively. Consider integrating TOPS to improve efficiency; the results also point to possible hallucination mitigation and more lightweight MLLM designs.
Key insights
TOPS formulates visual token pruning from first principles, constructing Token Optimal Preservation Sets for efficient MLLM inference.
Principles
- Task Relevance: keep the tokens that matter for the given instruction.
- Information Coverage: keep enough tokens to represent the image's overall content.
- Semantic Diversity: avoid keeping tokens that duplicate one another.
Method
TOPS is a training-free, model-agnostic pruning module that applies a top-down information-theoretic analysis to construct Token Optimal Preservation Sets based on three fundamental principles.
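The summary does not give the actual TOPS objective or algorithm, so the following is only an illustrative sketch of how the three principles could be combined in a greedy selection. The function names, the weights, the cosine-similarity scoring, and the toy vectors are all assumptions made for illustration; this is not the paper's method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_tokens(tokens, query, keep, w_rel=1.0, w_cov=1.0, w_div=1.0):
    """Greedily pick `keep` token indices (hypothetical scoring, not TOPS itself).

    Each candidate is scored by:
      relevance  - similarity to the instruction embedding `query`
      coverage   - how much it improves the best match of every token
                   to the kept set
      redundancy - its highest similarity to an already kept token
    """
    n = len(tokens)
    keep = min(keep, n)
    sim = [[cosine(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]
    relevance = [cosine(t, query) for t in tokens]
    covered = [0.0] * n  # best similarity of each token to the kept set so far
    selected = []
    chosen = set()

    while len(selected) < keep:
        best_idx, best_score = None, -math.inf
        for i in range(n):
            if i in chosen:
                continue
            cov_gain = sum(max(sim[i][j] - covered[j], 0.0) for j in range(n)) / n
            redundancy = max([0.0] + [sim[i][s] for s in selected])
            score = w_rel * relevance[i] + w_cov * cov_gain - w_div * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        chosen.add(best_idx)
        covered = [max(covered[j], sim[best_idx][j]) for j in range(n)]

    return sorted(selected)


# Toy example: tokens 0 and 1 are near-duplicates, token 3 opposes the query.
toy_tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.0]
kept = select_tokens(toy_tokens, toy_query, keep=2)
print(kept)
```

In this toy run only one of the two near-duplicate tokens is kept, and the second slot goes to token 2 because it covers content the first pick does not. A real pruning module would operate on the model's own visual and text features rather than hand-written vectors.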
In practice
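- Apply TOPS to different MLLM backbones without retraining; it was evaluated on 7.
- Reduce visual tokens by 77.8% on LLaVA-NeXT 7B and 13B without measured performance loss.
- Explore pruning as a possible way to mitigate MLLM hallucination.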
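To make the 77.8% figure concrete, the short calculation below converts a pruning ratio into a retained-token budget. The input count of 2,880 visual tokens is an assumption chosen for illustration, and `tokens_to_keep` is a hypothetical helper, not part of TOPS.

```python
def tokens_to_keep(total_visual_tokens, prune_ratio):
    """Number of visual tokens left after removing `prune_ratio` of them."""
    return round(total_visual_tokens * (1.0 - prune_ratio))


total = 2880          # assumed visual token count, for illustration only
budget = tokens_to_keep(total, 0.778)
print(budget, "of", total, "visual tokens retained")
```

Under that assumption, roughly one visual token in five reaches the language model.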
Topics
- Multimodal LLMs
- Visual Token Pruning
- MLLM Inference Efficiency
- Token Optimal Preservation Sets
- LLaVA-NeXT
- Hallucination Mitigation
Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.