OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

2026-05-28 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

OccamToken is a training-free framework that makes Vision-Language Model (VLM) inference more efficient by cutting the compute and memory cost of long visual token sequences during the prefill stage. Instead of ranking tokens by absolute score, which attention sinks distort and which relies on an unreliable fixed budget, it uses register-anchored relative evidence testing: the method checks whether a visual token offers information beyond a stable, register-based reference that naturally absorbs low-information attention patterns. OccamToken applies both image-adaptive redundancy pruning and query-adaptive relevance pruning, with dynamic thresholds derived from register attention. Evaluated on LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, it consistently improves the accuracy-efficiency trade-off. On LLaVA-NeXT, for instance, it reduces 2,880 visual tokens to approximately 40 (a retention rate of about 1.4%) while maintaining over 93% of the original accuracy.

Key takeaway

For machine learning engineers optimizing VLM inference, OccamToken offers a way to cut compute and memory costs without retraining the model. If long visual token sequences make your prefill stage expensive, this training-free framework can compress them aggressively: on LLaVA-NeXT it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy.

Key insights

OccamToken prunes VLM visual tokens with register-anchored relative evidence testing, a training-free, budget-adaptive approach that improves inference efficiency.

Principles

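Absolute token ranking is brittle for VLM pruning: attention sinks distort the scores and fixed budgets are unreliable.
Register tokens provide a stable reference for judging whether a visual token adds information.
Dynamic thresholds derived from register attention let pruning adapt to each image and query.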
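
The contrast between a fixed budget and a reference-derived threshold can be illustrated with a toy example. This is a minimal sketch, not the OccamToken implementation: the scores are made up, `reference` is a hypothetical stand-in for a register-derived attention level, and the helper names are illustrative only.

```python
def top_k_indices(scores, k):
    """Absolute ranking: keep the k highest-scoring tokens, whatever the image."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def above_reference(scores, reference):
    """Relative test: keep every token that scores above the reference level."""
    return [i for i, s in enumerate(scores) if s > reference]


# Made-up per-token attention scores for two images (assumption, not real data).
sparse_image = [0.50, 0.04, 0.03, 0.02, 0.01]  # one token stands out
busy_image = [0.30, 0.20, 0.18, 0.17, 0.15]    # every token carries signal

reference = 0.10  # hypothetical register-derived level

fixed_sparse = top_k_indices(sparse_image, k=2)
fixed_busy = top_k_indices(busy_image, k=2)
adaptive_sparse = above_reference(sparse_image, reference)
adaptive_busy = above_reference(busy_image, reference)

# The fixed budget keeps the same count for both images;
# the threshold keeps a count that follows the content.
print(len(fixed_sparse), len(fixed_busy))
print(len(adaptive_sparse), len(adaptive_busy))
```

With a fixed budget of two, the sparse image keeps a token it does not need and the busy image loses three it does; the threshold version keeps one and five respectively.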

Method

OccamToken performs register-anchored relative evidence testing, dynamically deriving thresholds from register attention for image-adaptive redundancy pruning and query-adaptive relevance pruning.
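
One way to picture the two-stage test is the sketch below. It is an illustration under stated assumptions, not the paper's algorithm: this summary does not give the exact scoring or threshold formula, so the sketch assumes each visual token has a scalar attention score, takes the mean register attention as the threshold, and applies the query-adaptive stage to the survivors of the image-adaptive stage. The function names and the `margin` parameter are hypothetical.

```python
from statistics import mean


def register_threshold(register_scores, margin=1.0):
    """Derive a threshold from register attention (assumed form: margin * mean)."""
    if not register_scores:
        raise ValueError("at least one register score is required")
    return margin * mean(register_scores)


def passes_test(indices, scores, threshold):
    """Keep the candidate indices whose score exceeds the threshold."""
    return [i for i in indices if scores[i] > threshold]


def two_stage_prune(image_scores, image_registers, query_scores, query_registers):
    """Redundancy pruning (image side), then relevance pruning (query side)."""
    candidates = list(range(len(image_scores)))
    # Stage 1: drop tokens that add nothing beyond the image-side register reference.
    kept = passes_test(candidates, image_scores, register_threshold(image_registers))
    # Stage 2: of the survivors, keep those the query attends to more than the registers.
    return passes_test(kept, query_scores, register_threshold(query_registers))


# Made-up scores for five visual tokens and two register tokens (assumption).
kept = two_stage_prune(
    image_scores=[0.02, 0.30, 0.05, 0.25, 0.01],
    image_registers=[0.10, 0.08],
    query_scores=[0.01, 0.40, 0.02, 0.03, 0.01],
    query_registers=[0.05, 0.05],
)
print(kept)
```

No token count is fixed in advance here: how many tokens survive depends on how their scores compare with the register reference, which is the sense in which the budget adapts.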

In practice

Apply OccamToken to LLaVA-NeXT to reduce 2,880 visual tokens to approximately 40.
Retain about 1.4% of visual tokens while keeping over 93% of the original accuracy.
Improve VLM inference efficiency without retraining.
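
The retention figure follows from simple arithmetic. The snippet below is only a worked check using the LLaVA-NeXT numbers from the summary, treating the approximate count of 40 as exact; `retention_rate` is an illustrative helper, not part of any library.

```python
def retention_rate(kept_tokens, total_tokens):
    """Fraction of visual tokens that survive pruning."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return kept_tokens / total_tokens


rate = retention_rate(40, 2880)
reduction_factor = 2880 / 40

print(f"retention: {rate:.1%}")               # retention: 1.4%
print(f"reduction: {reduction_factor:.0f}x")  # reduction: 72x
```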

Topics

Vision-Language Models
Token Pruning
Inference Efficiency
Attention Sinks
LLaVA-NeXT
Register Tokens

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.