Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Interpretability, Multimodal AI · Depth: Expert, extended

Summary

A study of Multimodal Large Language Models (MLLMs) identifies "Context-aware Retrieval (CoRe) heads," a specialized subset of attention heads that extract query-relevant visual features from complex contexts. Using a token-level metric called Retrieval Attention Mass (RAM), the researchers found functional sparsity: CoRe heads precisely localize the relevant entities, while most other heads spread their attention broadly. Causal interventions showed that ablating just the top 5% of CoRe heads significantly degrades multimodal reasoning performance, whereas ablating lower-ranked heads has minimal impact. Acceleration experiments further showed that exploiting this localized sparsity yields inference speedups of 1.8x to 2.1x across Qwen3-VL (4B, 8B, 32B), LLaVA-OneVision, and InternVL3.5, while maintaining robust task performance on RefCOCOg, VidSTG, MMDocIR, MMLongBench, and MLVU.
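The ablation described above can be illustrated with a minimal sketch. It assumes each head already has a RAM score and that "ablating" means zeroing a head's output vector; the summary does not say which ablation variant the authors used, and the function names, scores, and head outputs below are hypothetical.

```python
def select_top_heads(ram_scores, fraction=0.05):
    """Return the highest-scoring `fraction` of heads (at least one)."""
    ranked = sorted(ram_scores, key=ram_scores.get, reverse=True)
    count = max(1, round(fraction * len(ranked)))
    return set(ranked[:count])


def ablate(head_outputs, heads_to_ablate):
    """Zero the output vectors of the chosen heads; copy the others unchanged.

    Zero-ablation is an assumption made for this sketch.
    """
    return {
        head: [0.0] * len(vec) if head in heads_to_ablate else list(vec)
        for head, vec in head_outputs.items()
    }


# Hypothetical RAM scores for 5 layers x 8 heads, keyed by (layer, head).
ram_scores = {
    (layer, head): 0.01 * (layer * 8 + head)
    for layer in range(5)
    for head in range(8)
}
top = select_top_heads(ram_scores)  # 5% of 40 heads -> 2 heads

# Hypothetical per-head output vectors.
head_outputs = {head: [1.0, 1.0] for head in ram_scores}
ablated = ablate(head_outputs, top)
```

Comparing task accuracy with `top` ablated against accuracy with an equally sized set of low-ranked heads ablated is the kind of contrast the summary reports.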

Key takeaway

For Machine Learning Engineers optimizing MLLM deployment, CoRe heads mark where attention computation matters most. Consider CoRe head-guided hybrid attention strategies: the study reports inference speedups of 1.8x to 2.1x while maintaining robust task performance. The approach prunes non-essential attention computation, especially over long visual contexts, which improves efficiency and scalability in multimodal applications. Focus optimization effort on preserving these critical heads.

Key insights

Multimodal LLMs exhibit functional sparsity, relying on a sparse subset of "CoRe heads" for efficient, precise cross-modal visual information retrieval.

Principles

MLLMs exhibit functional sparsity in cross-modal retrieval.
CoRe heads concentrate in middle-to-late layers as models scale.
Multimodal reasoning critically depends on CoRe heads.

Method

CoRe heads are identified using Retrieval Attention Mass (RAM), a token-level metric quantifying attention from query tokens to ground-truth visual regions, after partitioning multimodal input and extracting text-to-vision attention maps.
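A minimal sketch of how a RAM-style score could be computed, under stated assumptions: the summary only says that RAM quantifies attention from query tokens to ground-truth visual regions, so the normalization used here (target mass divided by total text-to-vision mass, averaged over query tokens) is a guess rather than the paper's definition. The function names and the toy attention maps are hypothetical.

```python
def retrieval_attention_mass(attn, query_positions, visual_positions, target_positions):
    """Share of text-to-vision attention that lands on the ground-truth region.

    attn[q][k] is one head's attention weight from position q to position k.
    The ratio form is an assumption of this sketch.
    """
    targets = set(target_positions)
    ratios = []
    for q in query_positions:
        vision_mass = sum(attn[q][k] for k in visual_positions)
        if vision_mass == 0.0:
            continue  # this query token puts no weight on the image
        target_mass = sum(attn[q][k] for k in visual_positions if k in targets)
        ratios.append(target_mass / vision_mass)
    return sum(ratios) / len(ratios) if ratios else 0.0


def rank_heads_by_ram(attn_by_head, query_positions, visual_positions, target_positions):
    """Return head keys sorted from highest to lowest score, plus the scores."""
    scores = {
        head: retrieval_attention_mass(
            attn, query_positions, visual_positions, target_positions
        )
        for head, attn in attn_by_head.items()
    }
    return sorted(scores, key=scores.get, reverse=True), scores


# Toy partition: positions 0-3 are visual tokens, position 4 is the query token,
# and visual position 2 is the ground-truth region.
visual, query, target = [0, 1, 2, 3], [4], [2]
pad = [0.0] * 5
focused = [pad] * 4 + [[0.05, 0.05, 0.80, 0.05, 0.05]]
diffuse = [pad] * 4 + [[0.20, 0.20, 0.20, 0.20, 0.20]]

ranking, scores = rank_heads_by_ram(
    {(12, 3): focused, (2, 7): diffuse}, query, visual, target
)
```

In this toy case the focused head scores far higher than the diffuse one, which mirrors the contrast between CoRe heads and broadly attending heads.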

In practice

Optimize MLLMs by selectively preserving CoRe heads.
Accelerate inference using CoRe head-guided hybrid attention.
Direct interpretability studies toward CoRe head mechanisms.
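One possible reading of CoRe head-guided hybrid attention is sketched below. The summary does not describe the actual scheme, so everything here is an assumption: CoRe heads attend over all keys, their attention selects a small set of visual tokens, and the remaining heads attend only to text plus that selected set. The function names, head labels, and attention values are hypothetical, and the cost ratio is a crude count of query-key pairs, not a prediction of the reported speedups.

```python
def select_visual_tokens(core_attn_rows, visual_positions, keep):
    """Pick the `keep` visual positions with the most summed CoRe-head attention."""
    mass = {k: sum(row[k] for row in core_attn_rows) for k in visual_positions}
    return sorted(sorted(mass, key=mass.get, reverse=True)[:keep])


def hybrid_key_plan(all_heads, core_heads, visual_positions, text_positions, kept_visual):
    """CoRe heads see every key; other heads see text plus the kept visual tokens."""
    full = sorted(set(visual_positions) | set(text_positions))
    reduced = sorted(set(text_positions) | set(kept_visual))
    return {head: full if head in core_heads else reduced for head in all_heads}


def relative_cost(plan, full_length):
    """Fraction of query-key pairs computed relative to full attention in every head."""
    return sum(len(keys) for keys in plan.values()) / (len(plan) * full_length)


# Toy setup: 8 visual tokens (0-7), 2 text tokens (8-9), 4 heads, 1 CoRe head.
visual_positions = list(range(8))
text_positions = [8, 9]
core_attn_rows = [[0.02, 0.02, 0.40, 0.30, 0.02, 0.02, 0.02, 0.02, 0.09, 0.09]]

kept = select_visual_tokens(core_attn_rows, visual_positions, keep=2)
plan = hybrid_key_plan(
    ["h0", "h1", "h2", "h3"], {"h0"}, visual_positions, text_positions, kept
)
cost = relative_cost(plan, full_length=10)
```

The design choice is to spend full attention only where retrieval happens and let the other heads work on a pruned context; how aggressively to prune (`keep`) would need to be tuned against task accuracy.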

Topics

Multimodal LLMs
Mechanistic Interpretability
Attention Heads
Functional Sparsity
Inference Acceleration
Cross-modal Retrieval
Qwen3-VL

Code references

aaxiyao/CoRe_Head

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.