Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, AI Interpretability, Multimodal AI) · Depth: Expert, extended

Summary

A study of Multimodal Large Language Models (MLLMs) identifies "Context-aware Retrieval (CoRe) heads," a specialized subset of attention heads that extract query-relevant visual features from complex contexts. Using a token-level metric called Retrieval Attention Mass (RAM), the researchers found functional sparsity: CoRe heads precisely localize relevant entities, while most other heads distribute attention broadly. In causal interventions, ablating just the top 5% of CoRe heads significantly degraded multimodal reasoning performance, whereas ablating lower-ranked heads had minimal impact. Acceleration experiments further showed that exploiting this localized sparsity yields inference speedups of roughly 1.8x to 2.1x across models including Qwen3-VL (4B, 8B, 32B), LLaVA-OneVision, and InternVL3.5, while maintaining robust task performance on benchmarks including RefCOCOg, VidSTG, MMDocIR, MMLongBench, and MLVU.
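The ablation comparison can be illustrated with a minimal sketch. It assumes the heads have already been ranked by some score (highest first) and that "ablating" a head means zeroing its output vector; the summary does not say which ablation the authors used, and the names `top_fraction` and `ablate_heads` are hypothetical.

```python
def top_fraction(ranked_heads, fraction=0.05):
    """Return the set of heads in the top `fraction` of a ranked list (at least one)."""
    k = max(1, round(len(ranked_heads) * fraction))
    return set(ranked_heads[:k])


def ablate_heads(head_outputs, ablated):
    """Zero the output of every ablated head (assumption: zero-ablation).

    head_outputs maps (layer, head) -> list of floats for one token position.
    """
    return {
        head: [0.0] * len(vec) if head in ablated else list(vec)
        for head, vec in head_outputs.items()
    }


# Toy model with 4 layers x 10 heads, already ranked highest-score first.
ranked = [(layer, head) for layer in range(4) for head in range(10)]
top = top_fraction(ranked, 0.05)           # 5% of 40 heads -> 2 heads
outputs = {head: [1.0, 2.0] for head in ranked}
ablated_outputs = ablate_heads(outputs, top)
```

Running the same evaluation once with the top-ranked set ablated and once with an equally sized lower-ranked set is the kind of comparison the summary describes.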

Key takeaway

For machine learning engineers optimizing MLLM deployment, CoRe heads offer a concrete optimization target. Consider a CoRe head-guided hybrid attention strategy, which the study reports can speed up inference by roughly 1.8x to 2.1x while maintaining robust task performance. The approach prunes non-essential attention computation, which matters most for long visual contexts, and so improves efficiency and scalability in multimodal applications. Concentrate optimization effort on preserving these critical heads and cutting cost elsewhere.
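As a rough illustration of what a head-wise hybrid pattern could look like, the sketch below gives CoRe heads full causal attention and restricts every other head to a sliding window of recent tokens. The sliding window is an assumption chosen for illustration; the summary does not specify which cheaper pattern the authors pair with CoRe heads, and `allowed_keys` is a hypothetical helper.

```python
def allowed_keys(head, query_pos, core_heads, window):
    """Key positions a head may attend to under causal attention.

    CoRe heads see the whole prefix; other heads see only the last
    `window` positions (assumption: sliding-window fallback).
    """
    if head in core_heads:
        return range(0, query_pos + 1)
    return range(max(0, query_pos - window + 1), query_pos + 1)


def score_count(heads, seq_len, core_heads, window):
    """Total number of attention scores computed across heads and positions."""
    return sum(
        len(allowed_keys(head, pos, core_heads, window))
        for head in heads
        for pos in range(seq_len)
    )


heads = [(0, h) for h in range(4)]
core = {(0, 1)}                       # hypothetical CoRe head
full_keys = allowed_keys((0, 1), 10, core, window=4)
local_keys = allowed_keys((0, 2), 10, core, window=4)
dense_cost = score_count(heads, 16, set(heads), window=4)
hybrid_cost = score_count(heads, 16, core, window=4)
```

Comparing `hybrid_cost` with `dense_cost` only counts attention scores; it says nothing about wall-clock speedup, which depends on kernels and hardware.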

Key insights

Multimodal LLMs exhibit functional sparsity, relying on a sparse subset of "CoRe heads" for efficient, precise cross-modal visual information retrieval.

Method

CoRe heads are identified with Retrieval Attention Mass (RAM), a token-level metric that quantifies how much attention query tokens place on ground-truth visual regions. The multimodal input is first partitioned, text-to-vision attention maps are extracted for each head, and RAM is then computed from those maps.
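A minimal sketch of a RAM-style score follows. It assumes RAM is the share of a head's text-to-vision attention that lands on ground-truth visual tokens, averaged over query tokens; the exact normalization in the paper may differ, and the function names and toy attention maps are hypothetical.

```python
def retrieval_attention_mass(attn, query_idx, visual_idx, target_idx):
    """RAM-style score for one head.

    attn[q][k] is the attention weight from token q to token k.
    Assumption: normalize by the attention mass placed on all visual tokens.
    """
    target = set(target_idx)
    scores = []
    for q in query_idx:
        visual_mass = sum(attn[q][k] for k in visual_idx)
        if visual_mass == 0:
            continue  # this query token ignores the visual context entirely
        target_mass = sum(attn[q][k] for k in visual_idx if k in target)
        scores.append(target_mass / visual_mass)
    return sum(scores) / len(scores) if scores else 0.0


def rank_heads(ram_by_head):
    """Sort (layer, head) keys by RAM, highest first."""
    return sorted(ram_by_head, key=ram_by_head.get, reverse=True)


# Toy input: tokens 0-3 are visual, tokens 4-5 are the text query,
# and visual token 1 is the ground-truth region.
visual_idx, query_idx, target_idx = [0, 1, 2, 3], [4, 5], [1]

focused = [[0.0] * 6 for _ in range(6)]
focused[4] = [0.05, 0.80, 0.05, 0.05, 0.05, 0.00]
focused[5] = [0.05, 0.80, 0.05, 0.05, 0.00, 0.05]

diffuse = [[0.0] * 6 for _ in range(6)]
diffuse[4] = [0.2, 0.2, 0.2, 0.2, 0.2, 0.0]
diffuse[5] = [0.2, 0.2, 0.2, 0.2, 0.0, 0.2]

ram = {
    (0, 0): retrieval_attention_mass(focused, query_idx, visual_idx, target_idx),
    (0, 1): retrieval_attention_mass(diffuse, query_idx, visual_idx, target_idx),
}
ranking = rank_heads(ram)
```

In this toy case the focused head scores well above the diffuse one, which spreads its attention evenly over the four visual tokens, so it ranks first as a CoRe head candidate.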

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.