Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

2026-06-10 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

"Reroute" is a training-free plug-in that targets the high inference cost of vision-language models (VLMs) by changing how visual tokens are handled. Conventional rank-and-remove methods permanently discard visual tokens; Reroute instead uses a recoverable routing strategy. Tokens judged less important at one decoder stage bypass processing at that stage and then re-enter the candidate pool for subsequent routing decisions, on the premise that token relevance can change across decoder depth. Reroute reuses existing attention-score ranking rules and stage-wise schedules, and preserves the theoretical TFLOPs and KV-cache budget of the pruning methods it augments. Evaluated with FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute is reported to significantly improve grounding performance under aggressive token reduction while maintaining general VQA accuracy. The work was published on 2026-06-10.

Key takeaway

Machine learning engineers optimizing vision-language model inference should reconsider irreversible visual token pruning. Recoverable routing solutions such as Reroute manage KV-cache memory and attention computation within the same budget, while letting potentially relevant tokens re-enter processing at later decoder stages. This makes aggressive token reduction possible without sacrificing grounding performance, especially on grounding-sensitive queries.

Key insights

Because a visual token's importance changes across decoder depth, visual token reduction in VLMs should favor recoverable routing over irreversible pruning.

Principles

Visual token importance varies across VLM decoder layers.
Irreversible token removal can degrade grounding performance.
Recoverable routing improves VLM grounding under aggressive token reduction.

Method

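At each stage, Reroute sends the selected visual tokens through the decoder blocks and lets the remaining tokens bypass them. Bypassed tokens are deferred rather than discarded: they re-enter the candidate pool for the next stage's routing decision. Selection reuses the host method's existing ranking rule.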
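The difference between rank-and-remove and recoverable routing can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (top_k, rank_and_remove, recoverable_routing), the toy scores, and the keep schedule are all hypothetical, and the sketch simply assumes a score is available for every visual token at every stage, which this summary does not detail.

```python
def top_k(scores, candidates, k):
    """Return the k candidate indices with the highest scores, in index order."""
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def rank_and_remove(stage_scores, keep_counts):
    """Irreversible baseline: a token dropped at one stage never comes back."""
    pool = list(range(len(stage_scores[0])))
    active_per_stage = []
    for scores, k in zip(stage_scores, keep_counts):
        pool = top_k(scores, pool, k)  # the candidate pool shrinks permanently
        active_per_stage.append(pool)
    return active_per_stage


def recoverable_routing(stage_scores, keep_counts):
    """Recoverable variant: deferred tokens skip the stage but stay candidates."""
    pool = list(range(len(stage_scores[0])))  # assumption: the pool never shrinks
    active_per_stage = []
    for scores, k in zip(stage_scores, keep_counts):
        # Same ranking rule and same per-stage keep count as the baseline;
        # only the set of tokens eligible for selection differs.
        active_per_stage.append(top_k(scores, pool, k))
    return active_per_stage


# Hypothetical importance scores for six visual tokens at two stages.
# Token 1 looks unimportant at stage 1 but becomes the top token at stage 2.
stage_scores = [
    [0.9, 0.1, 0.8, 0.2, 0.7, 0.3],
    [0.2, 0.9, 0.8, 0.1, 0.3, 0.7],
]
keep_counts = [3, 2]  # hypothetical stage-wise schedule

pruned = rank_and_remove(stage_scores, keep_counts)
routed = recoverable_routing(stage_scores, keep_counts)
print(pruned)  # [[0, 2, 4], [2, 4]]
print(routed)  # [[0, 2, 4], [1, 2]]
```

Both strategies process the same number of tokens at each stage (3, then 2), which mirrors the claim that the compute and KV-cache budget is unchanged. Only the recoverable variant can pick up token 1 at the second stage, after it was deferred at the first.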

In practice

Integrate Reroute with existing VLM pruning techniques like FastV or PDrop.
Apply Reroute to LLaVA-1.5 or Qwen-based VLMs for better grounding.

Topics

Vision-Language Models
Visual Token Routing
Inference Optimization
KV-Cache
Model Pruning
Grounding Performance

Code references

elmma/mllm-reroute

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.