Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation
Summary
Dual-Path Vision Token Routing (DPVR), and specifically its Late-Layer Fusion instantiation (DPVR-LF), targets an inefficiency in multimodal large language models (MLLMs): they apply the same depth of computation to image and language tokens. An analysis of LLaVA-1.5 found that vision tokens saturate early. Text-to-image attention falls from 0.68 at layer 0 to 0.07 by layer 4 and stabilizes near 0.04 after layer 18, while text tokens continue to benefit from deep processing. DPVR-LF routes vision tokens at their saturation point into a one-layer trainable side branch, performs a thirteen-layer text-only forward pass, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, this approach preserves competitive multimodal performance on standard benchmarks while significantly reducing visual computation in the deep Transformer stack. The findings challenge the assumption that vision tokens require deep language model layers and suggest that a single late fusion layer is sufficient for strong perceptual competence.
Key takeaway
AI architects designing or optimizing multimodal large language models should reconsider the conventional symmetric Transformer backbone for visual processing. Because vision tokens saturate early, a late-layer fusion strategy such as DPVR-LF, which routes visual information to a shallow side branch and re-fuses it at the final layer, can significantly reduce computational overhead. Strong perceptual competence is maintained with approximately 3% trainable parameters, improving efficiency while keeping benchmark performance competitive.
Key insights
Vision tokens in MLLMs saturate early, making deep symmetric processing inefficient; late-layer fusion is sufficient.
Principles
- MLLMs exhibit modality-asymmetric information density.
- Vision tokens saturate by the middle Transformer layers (in LLaVA-1.5, text-to-image attention stabilizes near 0.04 after layer 18).
- Deep visual computation can be redundant.
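The saturation principle rests on a per-layer measurement of how much attention text tokens place on image tokens. The following is a minimal sketch of such a measurement, assuming row-normalized attention matrices that have already been averaged over heads. The function names `text_to_image_attention` and `saturation_layer`, and the threshold rule for locating saturation, are illustrative assumptions and not the procedure used in the original article.

```python
from typing import Sequence


def text_to_image_attention(
    attn: Sequence[Sequence[float]],
    image_idx: Sequence[int],
    text_idx: Sequence[int],
) -> float:
    """Mean share of attention that text queries place on image keys.

    attn[q][k] is the attention weight from query token q to key token k;
    each row is assumed to sum to 1 (softmax output, averaged over heads).
    """
    if not text_idx:
        raise ValueError("need at least one text query token")
    image_keys = sorted(set(image_idx))
    total = 0.0
    for q in text_idx:
        total += sum(attn[q][k] for k in image_keys)
    return total / len(text_idx)


def saturation_layer(per_layer: Sequence[float], threshold: float) -> int:
    """First layer from which the share stays at or below `threshold`.

    Returns -1 if the share never settles below the threshold.
    This stopping rule is an assumption made for illustration.
    """
    for layer in range(len(per_layer)):
        if all(value <= threshold for value in per_layer[layer:]):
            return layer
    return -1
```

In use, `text_to_image_attention` would be evaluated once per layer on the attention maps of a model such as LLaVA-1.5, and `saturation_layer` applied to the resulting list to pick a candidate routing point.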
Method
DPVR-LF routes saturated vision tokens to a one-layer side branch, runs a 13-layer text-only forward pass, then re-fuses the two streams at the final layer.
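The routing schedule described above can be sketched structurally. This is a toy illustration, not the authors' implementation: the layers are hypothetical stand-ins that only log how many tokens they process, and the 32-layer depth and routing index of 18 are assumptions chosen so that 18 shared layers, 13 text-only layers, and 1 fusion layer add up. The source states only the one-layer side branch, the 13-layer text-only pass, and final-layer fusion. Fusion is modeled here as simple concatenation, which is also an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[float]
Layer = Callable[[Tokens], Tokens]


def make_logging_layer(name: str, log: List[Tuple[str, int]]) -> Layer:
    """Hypothetical stand-in for a Transformer block.

    Records how many tokens it processed and applies a placeholder update.
    """
    def layer(tokens: Tokens) -> Tokens:
        log.append((name, len(tokens)))
        return [t + 1.0 for t in tokens]
    return layer


def dpvr_lf_forward(
    vision: Sequence[float],
    text: Sequence[float],
    layers: Sequence[Layer],
    side_branch: Layer,
    route_at: int,
) -> Tokens:
    """Dual-path forward pass with late-layer fusion (structural sketch)."""
    if not 0 < route_at < len(layers):
        raise ValueError("route_at must leave room for a final fusion layer")
    n_vis = len(vision)

    # Stage 1: vision and text share the layers before the routing point.
    joint = list(vision) + list(text)
    for layer in layers[:route_at]:
        joint = layer(joint)
    vis, txt = joint[:n_vis], joint[n_vis:]

    # Stage 2: vision tokens leave the main stack for a one-layer side branch.
    vis = side_branch(vis)

    # Stage 3: the remaining layers, except the last, see text tokens only.
    for layer in layers[route_at:-1]:
        txt = layer(txt)

    # Stage 4: both streams are re-fused (concatenated) at the final layer.
    return layers[-1](vis + txt)
```

With 32 logging layers and `route_at=18`, the log shows 18 layers processing all tokens, 13 layers processing text tokens only, and one final layer processing the re-fused sequence.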
In practice
- Reduce visual computation in deep Transformer stacks.
- Achieve competitive MLLM performance with ~3% trainable parameters.
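A back-of-the-envelope count can make the first point concrete. The sketch below counts how many layer evaluations each vision token receives under a symmetric backbone versus a routed schedule. It is a rough proxy only: it ignores the quadratic cost of attention, and the 32-layer depth, routing index of 18, and 576 vision tokens are assumptions for illustration, not figures reported in the summary. The function name `vision_token_layer_visits` is hypothetical.

```python
from typing import Tuple


def vision_token_layer_visits(
    n_layers: int, route_at: int, n_vision: int
) -> Tuple[int, int]:
    """Count vision token-layer evaluations for both schedules.

    Baseline: every vision token passes through every layer.
    Routed: vision tokens pass through the shared layers before the
    routing point, one side-branch layer, and the final fusion layer.
    """
    if not 0 < route_at < n_layers:
        raise ValueError("route_at must lie strictly inside the stack")
    baseline = n_layers * n_vision
    routed = (route_at + 1 + 1) * n_vision
    return baseline, routed


# Assumed configuration (not from the source): 32 layers, routing at 18,
# 576 vision tokens.
baseline, routed = vision_token_layer_visits(32, 18, 576)
print(f"baseline={baseline}, routed={routed}, ratio={routed / baseline:.3f}")
```

Under these assumptions, each vision token is evaluated by 20 layers instead of 32, and within the deep part of the stack by 2 layers instead of 14.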
Topics
- Multimodal LLMs
- Vision Tokens
- Transformer Architecture
- Late-Layer Fusion
- LLaVA-1.5
- Computational Efficiency
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.