Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads
Summary
A study on Multimodal Large Language Models (MLLMs) identifies "Context-aware Retrieval (CoRe) heads," a specialized subset of attention heads that extract query-relevant visual features from complex contexts. Using a token-level metric called Retrieval Attention Mass (RAM), the researchers found functional sparsity: CoRe heads precisely localize relevant entities, while most other heads distribute attention broadly. Causal interventions showed that ablating just the top 5% of CoRe heads significantly degrades multimodal reasoning performance, whereas ablating lower-ranked heads has minimal impact. Acceleration experiments further showed that exploiting this localized sparsity can yield substantial inference speedups, with figures of 1.8x to 2.1x reported, across models such as Qwen3-VL (4B, 8B, 32B), LLaVA-OneVision, and InternVL3.5, while maintaining task performance on benchmarks including RefCOCOg, VidSTG, MMDocIR, MMLongBench, and MLVU.
Key takeaway
For Machine Learning Engineers optimizing MLLM deployment, CoRe heads offer a concrete optimization target. Consider CoRe head-guided hybrid attention strategies, which the study reports can deliver inference speedups of potentially 1.8x to 2.1x while maintaining task performance. The approach prunes non-essential attention computation, especially for long visual contexts, which improves efficiency and scalability in multimodal applications.
Key insights
Multimodal LLMs exhibit functional sparsity, relying on a sparse subset of "CoRe heads" for efficient, precise cross-modal visual information retrieval.
Principles
- MLLMs exhibit functional sparsity in cross-modal retrieval.
- CoRe heads concentrate in middle-to-late layers as models scale.
- Multimodal reasoning critically depends on CoRe heads.
Method
CoRe heads are identified with Retrieval Attention Mass (RAM), a token-level metric that quantifies how much attention flows from query tokens to ground-truth visual regions. The multimodal input is first partitioned, text-to-vision attention maps are then extracted, and RAM is computed from those maps.
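The RAM computation described above can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: the summary does not give the exact formula, so the normalization used here (the share of a head's text-to-vision attention that lands on the ground-truth region, averaged over query tokens) is an assumption, as are the function name `retrieval_attention_mass` and its index-list arguments.

```python
import numpy as np

def retrieval_attention_mass(attn, query_idx, visual_idx, target_idx, eps=1e-12):
    """Per-head RAM for one layer (hypothetical helper, assumed normalization).

    attn:       (num_heads, seq_len, seq_len) attention weights, rows sum to 1
    query_idx:  positions of the query (text) tokens
    visual_idx: positions of all visual tokens
    target_idx: positions of the ground-truth visual tokens (subset of visual_idx)
    """
    attn = np.asarray(attn, dtype=np.float64)
    rows = attn[:, query_idx, :]                         # (heads, queries, seq_len)
    visual_mass = rows[:, :, visual_idx].sum(axis=-1)    # attention on any visual token
    target_mass = rows[:, :, target_idx].sum(axis=-1)    # attention on the ground truth
    # Assumed definition: fraction of text-to-vision mass on the target region,
    # averaged over query tokens.
    return (target_mass / (visual_mass + eps)).mean(axis=-1)

# Toy input: 4 visual tokens (0-3), 2 query tokens (4-5), ground truth = token 1.
num_heads, seq_len = 2, 6
visual_idx, query_idx, target_idx = [0, 1, 2, 3], [4, 5], [1]
attn = np.full((num_heads, seq_len, seq_len), 1.0 / seq_len)  # head 1: uniform attention
attn[0] = 0.0
attn[0, :, 1] = 1.0                                           # head 0: attends only to token 1
scores = retrieval_attention_mass(attn, query_idx, visual_idx, target_idx)
print(scores)
```

In this toy case the focused head scores close to 1 and the uniform head scores 0.25, which mirrors the contrast between CoRe heads and broadly attending heads.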
In practice
- Optimize MLLMs by selectively preserving CoRe heads.
- Accelerate inference using CoRe head-guided hybrid attention.
- Focus interpretability studies on CoRe head mechanisms.
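Selectively preserving or ablating CoRe heads presupposes a ranking of heads. The following is a minimal sketch, assuming per-head RAM scores have already been collected in a `(num_layers, num_heads)` array. The helper name `top_core_heads` is hypothetical, and the 5% default mirrors the ablation threshold mentioned in the summary rather than a recommended setting.

```python
import numpy as np

def top_core_heads(ram, fraction=0.05):
    """Boolean mask marking the highest-RAM heads (hypothetical helper).

    ram:      (num_layers, num_heads) array of per-head RAM scores
    fraction: share of all heads to select, at least one head is always kept
    """
    ram = np.asarray(ram, dtype=np.float64)
    k = max(1, int(round(fraction * ram.size)))
    order = np.argsort(ram, axis=None)[::-1]   # flat indices, highest score first
    mask = np.zeros(ram.size, dtype=bool)
    mask[order[:k]] = True
    return mask.reshape(ram.shape)

# Toy scores for 4 layers x 10 heads; values increase with the flat index.
ram = np.arange(40).reshape(4, 10) / 40.0
mask = top_core_heads(ram, fraction=0.05)
print(np.argwhere(mask))   # (layer, head) pairs of the selected heads
```

A mask like this could drive either experiment: zeroing the selected heads for an ablation, or keeping full attention only on them in a hybrid scheme. How the remaining heads are handled is not specified in this summary.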
Topics
- Multimodal LLMs
- Mechanistic Interpretability
- Attention Heads
- Functional Sparsity
- Inference Acceleration
- Cross-modal Retrieval
- Qwen3-VL
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.