OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

OccamToken is a training-free framework that makes Vision-Language Model (VLM) inference more efficient by cutting the computational and memory cost of long visual token sequences during the prefill stage. It replaces absolute token ranking, which is distorted by attention sinks and depends on unreliable fixed budgets, with register-anchored relative evidence testing. This test asks whether a visual token offers information beyond a stable, register-based reference, which naturally absorbs low-information attention patterns. OccamToken applies both image-adaptive redundancy pruning and query-adaptive relevance pruning, using dynamic thresholds derived from register attention. Evaluated on LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off. For instance, on LLaVA-NeXT it reduces 2,880 visual tokens to approximately 40 (a retention rate of about 1.4%) while maintaining over 93% of the original accuracy.

Key takeaway

For machine learning engineers optimizing VLM inference, OccamToken reduces the computational and memory cost of the prefill stage. If long visual token sequences make prefill expensive, this training-free framework can compress them substantially: on LLaVA-NeXT it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy. This enables more efficient VLM deployment without retraining the model.
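The compression figures quoted above can be checked with simple arithmetic. The short Python sketch below only restates the reported token counts; the helper name `retention_rate` is hypothetical and introduced here for illustration.

```python
def retention_rate(kept_tokens: int, total_tokens: int) -> float:
    """Fraction of visual tokens that survive pruning."""
    return kept_tokens / total_tokens


# Token counts reported for LLaVA-NeXT in the summary above.
total_tokens = 2880
kept_tokens = 40  # "approximately 40" in the source, so the results are approximate

print(f"retention: {retention_rate(kept_tokens, total_tokens):.1%}")  # retention: 1.4%
print(f"compression: {total_tokens / kept_tokens:.0f}x")              # compression: 72x
```

Since the source gives the kept count as approximate, the roughly 72x reduction in visual tokens should be read as an estimate as well.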

Key insights

OccamToken uses register-anchored relative evidence testing to prune VLM visual tokens without training and without a fixed token budget, improving inference efficiency.

Principles

Method

OccamToken performs register-anchored relative evidence testing, dynamically deriving thresholds from register attention for image-adaptive redundancy pruning and query-adaptive relevance pruning.
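One way to picture the idea is the minimal NumPy sketch below. It is not the paper's implementation: the summary does not say how thresholds are computed or how the two pruning tests are combined, so the mean-of-register-attention threshold, the logical AND between the tests, the function names, and the toy attention values are all assumptions made for illustration.

```python
import numpy as np


def relative_evidence_mask(visual_attn, register_attn):
    """Return True for visual tokens that beat the register reference.

    ASSUMPTION: the threshold is the mean attention received by the
    register tokens. The source only says thresholds are derived from
    register attention, not how.
    """
    visual_attn = np.asarray(visual_attn, dtype=float)
    register_attn = np.asarray(register_attn, dtype=float)
    threshold = register_attn.mean()  # recomputed per input, so no fixed budget
    return visual_attn > threshold


def prune_visual_tokens(tokens, image_attn, image_reg_attn,
                        query_attn, query_reg_attn):
    """Hypothetical two-test pruning: keep a token only if it passes both.

    ASSUMPTION: combining the redundancy and relevance tests with a
    logical AND is an illustrative choice, not the paper's stated rule.
    """
    tokens = np.asarray(tokens)
    keep = (relative_evidence_mask(image_attn, image_reg_attn)
            & relative_evidence_mask(query_attn, query_reg_attn))
    return tokens[keep], keep


# Toy input: 5 visual tokens with 4-dim features, 2 register tokens.
tokens = np.arange(20, dtype=float).reshape(5, 4)
image_attn = [0.02, 0.30, 0.01, 0.25, 0.20]  # image-side attention per visual token
image_reg_attn = [0.05, 0.07]                # register reference, mean 0.06
query_attn = [0.10, 0.40, 0.02, 0.03, 0.35]  # query-side attention per visual token
query_reg_attn = [0.04, 0.06]                # register reference, mean 0.05

kept, keep = prune_visual_tokens(tokens, image_attn, image_reg_attn,
                                 query_attn, query_reg_attn)
print(keep.tolist())  # [False, True, False, False, True]
print(kept.shape)     # (2, 4)
```

Because the threshold comes from the register attention of each input rather than from a preset top-k, the number of kept tokens varies with the image and the query, which is the sense in which such a scheme is budget-adaptive.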

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.