OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning
Summary
OccamToken is a training-free framework that makes Vision-Language Model (VLM) inference more efficient by cutting the compute and memory cost of long visual token sequences during the prefill stage. Traditional pruning ranks tokens by absolute score and keeps a fixed budget; the ranking is distorted by attention sinks, and a fixed budget is unreliable across inputs. OccamToken replaces this with register-anchored relative evidence testing, which checks whether a visual token offers information beyond a stable, register-based reference that naturally absorbs low-information attention patterns. It applies both image-adaptive redundancy pruning and query-adaptive relevance pruning, using dynamic thresholds derived from register attention. Evaluated on LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off. On LLaVA-NeXT, for instance, it reduces 2,880 visual tokens to approximately 40 (a retention rate of about 1.4%) while maintaining over 93% of the original accuracy.
Key takeaway
For machine learning engineers optimizing Vision-Language Model inference, OccamToken offers a way to reduce compute and memory costs. If long visual token sequences make your prefill stage expensive, this training-free framework can compress them substantially: on LLaVA-NeXT it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy. This enables more efficient VLM deployment without retraining the model.
Key insights
OccamToken prunes VLM visual tokens without training and without a fixed budget by testing each token against a register-anchored reference, which improves inference efficiency.
Principles
- Absolute token ranking is brittle for VLM pruning.
- Register tokens provide a stable reference for judging how much information a visual token adds.
- Dynamic thresholds enable adaptive pruning.
Method
OccamToken performs register-anchored relative evidence testing, dynamically deriving thresholds from register attention for image-adaptive redundancy pruning and query-adaptive relevance pruning.
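A minimal sketch of how a register-derived threshold could drive both pruning stages. This summary does not give the paper's actual test statistic, so the mean-attention score, the `margin` parameter, and the names `relative_evidence_mask` and `two_stage_prune` are assumptions made for illustration, not OccamToken's implementation.

```python
import numpy as np

def relative_evidence_mask(visual_attn, register_attn, margin=1.0):
    """Return a boolean mask of visual tokens that beat the register reference.

    visual_attn:   (num_sources, num_visual) attention placed on visual tokens
    register_attn: (num_sources, num_registers) attention placed on registers
    margin:        hypothetical scale factor on the threshold (an assumption)
    """
    # Assumed score: mean attention each visual token receives.
    token_score = visual_attn.mean(axis=0)
    # Assumed threshold: mean attention absorbed by the register tokens.
    threshold = margin * register_attn.mean()
    return token_score > threshold

def two_stage_prune(image_attn, image_reg_attn, query_attn, query_reg_attn):
    """Keep tokens that pass both the image-side and the query-side test."""
    keep = relative_evidence_mask(image_attn, image_reg_attn)    # redundancy stage
    keep &= relative_evidence_mask(query_attn, query_reg_attn)   # relevance stage
    return np.flatnonzero(keep)

# Toy inputs: four visual tokens, one register, one attention source per stage.
image_attn = np.array([[0.05, 0.30, 0.02, 0.25]])
image_reg_attn = np.array([[0.10]])
query_attn = np.array([[0.01, 0.40, 0.03, 0.05]])
query_reg_attn = np.array([[0.20]])

kept = two_stage_prune(image_attn, image_reg_attn, query_attn, query_reg_attn)
print(kept)  # [1]
```

Because the threshold moves with the register attention of each input, the number of surviving tokens varies per image and query instead of being fixed in advance.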
In practice
- Apply OccamToken to LLaVA-NeXT for token reduction.
- Retain about 1.4% of visual tokens while keeping over 93% of the original accuracy.
- Improve VLM inference efficiency without retraining.
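The retention figure follows directly from the token counts reported for LLaVA-NeXT. The short calculation below uses 40 as the kept-token count, although the summary gives it only as an approximation; the variable names are illustrative.

```python
original_tokens = 2880  # visual tokens before pruning (from the summary)
kept_tokens = 40        # approximate count after pruning (from the summary)

retention = kept_tokens / original_tokens
reduction_factor = original_tokens / kept_tokens

print(f"{retention:.1%}")   # 1.4%
print(reduction_factor)     # 72.0
```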
Topics
- Vision-Language Models
- Token Pruning
- Inference Efficiency
- Attention Sinks
- LLaVA-NeXT
- Register Tokens
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.