TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference

2026-06-25 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

TOPS is a visual token pruning module that targets the computational overhead Multimodal Large Language Models (MLLMs) incur from processing large numbers of visual tokens. Existing pruning methods often fall short: they either retain redundant tokens or are instruction-agnostic. This research rebuilds visual token pruning from first principles, framing it as the construction of Token Optimal Preservation Sets through a top-down information-theoretic analysis. That analysis yields three core principles for token selection: Task Relevance, Information Coverage, and Semantic Diversity. The module is training-free and model-agnostic, and it is reported to deliver superior performance across 7 MLLM backbones and 14 benchmarks. On LLaVA-NeXT, TOPS removes 77.8% of visual tokens while preserving 100.0% and 100.6% of the original performance on the 7B and 13B models, respectively. The results also suggest potential for hallucination mitigation and lightweight MLLM design.

Key takeaway

For Machine Learning Engineers optimizing MLLM inference efficiency, TOPS offers a principled, training-free way to cut visual tokens sharply. On LLaVA-NeXT, it removes 77.8% of visual tokens while preserving 100.0% and 100.6% of performance on the 7B and 13B models, respectively. The results also point to possible benefits for hallucination mitigation and more lightweight MLLM designs.

Key insights

TOPS formulates visual token pruning from first principles, constructing Token Optimal Preservation Sets for efficient MLLM inference.

Principles

Task Relevance keeps the tokens that matter to the instruction.
Information Coverage preserves the information carried by the full set of visual tokens.
Semantic Diversity avoids retaining redundant tokens.

Method

TOPS is a training-free, model-agnostic pruning module that applies a top-down information-theoretic analysis to construct Token Optimal Preservation Sets based on three fundamental principles.
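
To make the three principles concrete, here is a minimal sketch of a greedy selector that trades off relevance, coverage, and diversity. It is an illustration only, not the TOPS algorithm: the summary does not give the paper's objective, so the scoring formula, the cosine-similarity proxy, the weights, and every name below (`select_tokens`, `w_rel`, `w_cov`, `w_div`) are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tokens(tokens, query, keep, w_rel=1.0, w_cov=1.0, w_div=1.0):
    """Greedily pick `keep` token indices (hypothetical scoring, not the paper's).

    tokens: list of visual token embeddings (lists of floats)
    query:  embedding standing in for the instruction
    """
    n = len(tokens)
    keep = min(keep, n)
    # Task Relevance proxy: similarity of each token to the instruction.
    relevance = [cosine(t, query) for t in tokens]
    # Pairwise token similarity, reused for coverage and diversity.
    sim = [[cosine(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]
    covered = [0.0] * n  # best similarity of each token to the kept set so far
    selected, remaining = [], set(range(n))
    while len(selected) < keep:
        best, best_score = None, float("-inf")
        for i in sorted(remaining):
            # Information Coverage proxy: how much adding i improves the
            # best-match similarity of all tokens to the kept set.
            cov_gain = sum(max(sim[i][j] - covered[j], 0.0) for j in range(n)) / n
            # Semantic Diversity proxy: penalize similarity to already kept tokens.
            redundancy = max((sim[i][s] for s in selected), default=0.0)
            score = w_rel * relevance[i] + w_cov * cov_gain - w_div * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
        covered = [max(covered[j], sim[best][j]) for j in range(n)]
    return sorted(selected)
```

With the toy input `tokens = [[1, 0], [1, 0], [0, 1]]`, `query = [1, 0]`, and `keep = 2`, this sketch returns `[0, 2]`: the duplicate of the first token is skipped in favor of the token that adds new information. Setting `w_cov=0` and `w_div=0` reduces it to plain top-k relevance, which returns `[0, 1]` and keeps the duplicate.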

In practice

Apply TOPS to various MLLM backbones.
On LLaVA-NeXT, remove 77.8% of visual tokens without performance loss on the 7B and 13B models.
Explore pruning as a possible route to mitigating MLLM hallucination.
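
As a quick sanity check on what a 77.8% pruning ratio means for a token budget, the arithmetic can be sketched as below. The helper name `kept_tokens` and the token count of 1000 are hypothetical; the summary does not state how many visual tokens each backbone produces per image.

```python
def kept_tokens(total_visual_tokens, prune_ratio):
    """Number of visual tokens left after removing `prune_ratio` of them."""
    removed = round(total_visual_tokens * prune_ratio)
    return total_visual_tokens - removed


# Hypothetical example: 1000 visual tokens at the reported 77.8% ratio.
budget = kept_tokens(1000, 0.778)
```

Under that assumption, `budget` is 222, so the language model attends over roughly a fifth of the original visual tokens.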

Topics

Multimodal LLMs
Visual Token Pruning
MLLM Inference Efficiency
Token Optimal Preservation Sets
LLaVA-NeXT
Hallucination Mitigation

Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.