Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

2026-04-09 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

Multimodal Mixture-of-Experts (MoE) models perform strongly on vision-language tasks yet exhibit a "Seeing but Not Thinking" phenomenon: they perceive the image but fail reasoning problems that they solve when the same problem is given as text. Analysis confirms that cross-modal semantic sharing exists in MoE architectures, which rules out semantic alignment failure as the sole cause. Instead, the issue stems from a layer-wise separation of visual experts and domain experts: image inputs cause significant routing divergence in the middle layers, where domain experts are concentrated. This leads to the Routing Distraction hypothesis: for visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. A routing-guided intervention that enhances domain expert activation consistently improved performance across three multimodal MoE models and six benchmarks, with gains of up to 3.17% on complex visual reasoning tasks. The intervention also showed that domain expert identification targets cognitive functions, allowing transfer across tasks.

Key takeaway

Computer vision engineers developing multimodal MoE models should inspect their model's routing behavior, especially in the middle layers, to check that reasoning experts are adequately activated for visual inputs. In the reported experiments, a routing-guided intervention that addresses "Routing Distraction" improved performance on complex visual reasoning tasks by up to 3.17%, and domain expert identification transferred across tasks.

Key insights

Multimodal MoE models struggle with visual reasoning because of routing distraction: for image inputs, the router fails to adequately activate task-relevant experts.

Principles

Cross-modal semantic sharing exists in MoE.
Visual and domain experts separate layer-wise.
Routing divergence impacts visual reasoning.
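
One way to make "routing divergence" concrete is to compare, layer by layer, how often each expert is selected for the text version and the image version of the same problem. The sketch below is illustrative only: the summary does not say which divergence measure the authors use, so Jensen-Shannon divergence is an assumption, and the function names and toy routing traces are hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.
    Returns 0.0 for identical distributions and 1.0 for disjoint ones."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def expert_usage(routes, num_experts):
    """Convert per-token lists of selected expert ids into a usage distribution.
    Assumes at least one token was routed."""
    counts = [0] * num_experts
    for token_experts in routes:
        for expert in token_experts:
            counts[expert] += 1
    total = sum(counts)
    return [c / total for c in counts]

def layerwise_divergence(text_routes, image_routes, num_experts):
    """One divergence value per layer; inputs are indexed [layer][token][k]."""
    return [
        js_divergence(expert_usage(t, num_experts), expert_usage(i, num_experts))
        for t, i in zip(text_routes, image_routes)
    ]

# Toy traces (hypothetical): 2 layers, 2 tokens, top-2 routing over 4 experts.
text_routes = [[[0, 1], [0, 1]], [[0, 1], [0, 1]]]
image_routes = [[[0, 1], [1, 0]], [[2, 3], [2, 3]]]
divergence = layerwise_divergence(text_routes, image_routes, num_experts=4)
print(divergence)
```

In this toy case the first layer routes both modalities to the same experts, while the second layer sends image tokens to a disjoint set of experts, so its divergence is at the maximum. In a real model, a peak in the middle layers would match the pattern described above.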

Method

A routing-guided intervention method enhances domain expert activation to mitigate "Seeing but Not Thinking" in multimodal MoE models, improving visual reasoning performance.
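
As a rough illustration of what enhancing domain expert activation could look like, the sketch below adds a fixed bonus to the router logits of pre-identified domain experts before top-k selection. This is an assumed mechanism for illustration, not the paper's confirmed procedure: the `route` function, the `boost` parameter, and the example logits are all hypothetical.

```python
import math

def route(logits, k, domain_experts=(), boost=0.0):
    """Top-k routing with an optional logit bonus for domain experts.

    Returns a dict mapping each selected expert id to its gate weight,
    where weights are a softmax over the selected experts only.
    """
    # Assumed intervention: raise the logits of the domain experts.
    adjusted = [
        logit + (boost if idx in domain_experts else 0.0)
        for idx, logit in enumerate(logits)
    ]
    # Pick the k experts with the highest adjusted logits.
    top = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(adjusted[i] for i in top)
    exps = {i: math.exp(adjusted[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: exps[i] / norm for i in top}

# Hypothetical router logits for one image token over 4 experts.
logits = [2.0, 1.5, 1.0, 0.5]
baseline = route(logits, k=2)
boosted = route(logits, k=2, domain_experts={2}, boost=1.2)
print(sorted(baseline), sorted(boosted))
```

Without the bonus, expert 2 is not selected for this token; with it, expert 2 displaces expert 1. How large the bonus should be, and at which layers it is applied, would need tuning against a validation set.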

In practice

Enhance domain expert activation for visual inputs.
Identify cognitive functions via domain expert analysis.
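
Before any expert can be boosted, the domain experts have to be located. The summary does not describe the identification procedure, so the following is only one plausible sketch: rank experts at a given layer by how much more often they fire on domain prompts (for example, math problems) than on general prompts. The function names and routing traces are hypothetical.

```python
from collections import Counter

def activation_rate(routes, num_experts):
    """Fraction of tokens on which each expert is selected.
    `routes` is a non-empty list of per-token selected expert ids."""
    counts = Counter(expert for token in routes for expert in token)
    return [counts[e] / len(routes) for e in range(num_experts)]

def find_domain_experts(domain_routes, general_routes, num_experts, top_n):
    """Return the top_n experts with the largest activation-rate gap
    between domain prompts and general prompts (assumed criterion)."""
    domain = activation_rate(domain_routes, num_experts)
    general = activation_rate(general_routes, num_experts)
    gap = [domain[e] - general[e] for e in range(num_experts)]
    return sorted(range(num_experts), key=lambda e: gap[e], reverse=True)[:top_n]

# Toy traces (hypothetical) for one layer: 3 tokens each, top-2 over 4 experts.
domain_routes = [[2, 3], [2, 0], [2, 3]]
general_routes = [[0, 1], [0, 1], [1, 3]]
experts = find_domain_experts(domain_routes, general_routes, num_experts=4, top_n=2)
print(experts)
```

If the identified experts reflect a cognitive function rather than a single task, the same set should remain useful when reused on a different benchmark, which is the transfer property noted above.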

Topics

Multimodal Mixture-of-Experts
Vision-Language Tasks
Routing Distraction
Domain Experts
Visual Reasoning

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.