Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models
Summary
Reroute is a training-free plug-in that lowers the inference cost of vision-language models (VLMs) by changing how visual tokens are handled during decoding. Conventional rank-and-remove methods permanently discard visual tokens; Reroute instead uses a recoverable routing strategy. Tokens ranked as less important at one decoder stage bypass that stage's processing and then re-enter the candidate pool for later routing decisions, because token relevance can change across decoder depth. Reroute reuses existing attention-score ranking rules and stage-wise schedules, and it preserves the theoretical TFLOPs and KV-cache budget of the pruning method it augments. Evaluated with FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute significantly improves grounding performance under aggressive token reduction while maintaining general VQA accuracy. The work was published on 2026-06-10.
Key takeaway
Machine learning engineers optimizing vision-language model inference should reconsider irreversible visual token pruning. Recoverable routing methods such as Reroute offer another way to manage KV-cache memory and attention computation: tokens that look unimportant at one decoder stage can still re-enter processing at a later stage. This makes aggressive token reduction possible without sacrificing grounding performance, especially for sensitive queries.
Key insights
Visual token reduction in VLMs should favor recoverable routing over irreversible pruning, because a token's importance changes across decoder depth.
Principles
- Visual token importance varies across VLM decoder layers.
- Irreversible token removal can degrade grounding performance.
- Recoverable routing enhances VLM grounding with aggressive token reduction.
Method
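At each stage, Reroute sends the selected visual tokens through the decoder blocks and lets the remaining tokens bypass them. The bypassed tokens are not discarded: they re-enter the candidate pool at subsequent stages. Selection reuses the ranking rule of the underlying pruning method.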
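The difference between rank-and-remove and recoverable routing can be illustrated with a toy selection loop. This is a minimal sketch, not the authors' implementation: the function names `rank_and_remove` and `recoverable_route`, the per-stage score arrays, and the keep schedule are all hypothetical, and the sketch assumes an importance score is available for every visual token at every stage. It models only which tokens are selected, not the decoder computation or the KV-cache.

```python
import numpy as np

def rank_and_remove(scores_per_stage, keep_per_stage):
    """Baseline: a token dropped at one stage can never be selected again."""
    candidates = np.arange(scores_per_stage[0].shape[0])
    selected = []
    for scores, k in zip(scores_per_stage, keep_per_stage):
        k = min(k, candidates.size)
        # Rank only the surviving candidates, highest score first.
        order = np.argsort(-scores[candidates], kind="stable")
        candidates = np.sort(candidates[order[:k]])
        selected.append(candidates)
    return selected

def recoverable_route(scores_per_stage, keep_per_stage):
    """Recoverable routing: bypassed tokens stay in the candidate pool."""
    n = scores_per_stage[0].shape[0]
    selected = []
    for scores, k in zip(scores_per_stage, keep_per_stage):
        k = min(k, n)
        # Rank the full pool at every stage, so a deferred token can return.
        order = np.argsort(-scores, kind="stable")
        selected.append(np.sort(order[:k]))
    return selected

# Hypothetical scores for 4 visual tokens at 2 stages (assumed, not measured).
scores = [
    np.array([0.9, 0.8, 0.1, 0.2]),  # stage 0: tokens 0 and 1 rank highest
    np.array([0.1, 0.7, 0.9, 0.2]),  # stage 1: token 2 becomes most relevant
]
schedule = [2, 1]  # tokens processed per stage, identical for both strategies

removed = rank_and_remove(scores, schedule)
rerouted = recoverable_route(scores, schedule)
print([s.tolist() for s in removed])
print([s.tolist() for s in rerouted])
```

Both strategies process the same number of tokens at each stage, which mirrors the claim that the budget of the underlying pruning method is preserved. In this toy case, token 2 is dropped at stage 0 and becomes the top-ranked token at stage 1; only the recoverable variant can select it.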
In practice
- Integrate Reroute with existing VLM pruning techniques like FastV or PDrop.
- Apply Reroute to LLaVA-1.5 or Qwen-based VLMs for better grounding.
Topics
- Vision-Language Models
- Visual Token Routing
- Inference Optimization
- KV-Cache
- Model Pruning
- Grounding Performance
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.