Attention Alignment Between Humans and Vision-Language Models
Summary
A study compared spatial attention maps from six vision-language models (VLMs) against human fixation heatmaps across 200 images and two tasks (general description and social captioning). The models included a 2x2 factorial of CNN/ViT encoders with LSTM/Transformer decoders, plus Molmo 7B-D and Qwen3.5 9B. Findings indicate that decoder architecture significantly shaped alignment, with LSTM decoders increasing alignment by 40-50 percentage points (80-87% vs. 40-59% of human noise ceiling). Encoder choice contributed a secondary 5-20 point advantage, making CNN-LSTM the most aligned model overall (85-87%). However, LSTM-decoder attention was diffuse and less task-differentiated, while ViT-Transformer showed sharper concentration despite weaker alignment. A hemispatial-neglect simulation confirmed greater impact on LSTM decoders. Exploratory TRIBE-simulated neural responses suggested CNN-Transformer attention maps better predicted synthetic brain activity, particularly in early visual cortex, despite lower fixation alignment, highlighting a trade-off between behavioral and neural predictability.
Key takeaway
For AI scientists and machine learning engineers selecting vision-language models, decoder architecture matters most for human-like visual attention alignment. LSTM decoders give the highest fixation alignment (e.g., CNN-LSTM at 85-87% of the human noise ceiling), but their attention is more diffuse and less task-differentiated. If the goal is predicting synthetic neural activity, CNN-Transformer models may be more effective despite their lower fixation alignment. Match the VLM architecture to the specific behavioral or neural prediction objective.
Key insights
Decoder architecture significantly impacts vision-language model attention alignment with human fixations, often more than encoder choice.
Principles
- Decoder choice dominates VLM attention alignment.
- Alignment can trade off with spatial concentration.
- Fixation alignment and neural relevance can dissociate.
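The second principle depends on some measure of how concentrated an attention map is. The summary does not say which measure the study used, so the sketch below is only one plausible option: the Shannon entropy of the normalized map, where lower entropy means sharper concentration. The function name `attention_entropy` is hypothetical and is not taken from the original article.

```python
import numpy as np

def attention_entropy(attn_map):
    """Shannon entropy (in bits) of a non-negative 2-D attention map.

    Assumption: concentration is summarized by entropy of the map after
    normalizing it to sum to 1. Lower values mean more concentrated attention.
    """
    a = np.asarray(attn_map, dtype=float)
    if np.any(a < 0):
        raise ValueError("attention values must be non-negative")
    total = a.sum()
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    p = (a / total).ravel()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log2(p)).sum())

# A uniform 4x4 map spreads attention over 16 cells: log2(16) = 4 bits.
diffuse = np.ones((4, 4))
# A map with all mass on one cell is maximally concentrated.
focused = np.zeros((4, 4))
focused[1, 2] = 1.0
```

Under this measure, a diffuse LSTM-decoder map would score closer to the uniform case and a sharply peaked ViT-Transformer map closer to the single-cell case, regardless of how well either matches human fixations.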
Method
Spatial attention maps from six vision-language models were compared against human fixation heatmaps on 200 images across two tasks. Hemispatial-neglect simulations and TRIBE-simulated neural response analyses supplemented the comparison.
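To make "percent of human noise ceiling" concrete, here is a minimal sketch of how such a score could be computed. It assumes Pearson correlation as the similarity measure and takes the noise ceiling as a number estimated elsewhere (for example, from agreement between groups of human viewers). The summary does not state the study's actual metric, and the names `pearson` and `percent_of_ceiling` are hypothetical.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two equally shaped 2-D maps."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("maps must have the same number of elements")
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    # A constant map has no variance; report zero correlation in that case.
    return float(a @ b / denom) if denom > 0 else 0.0

def percent_of_ceiling(model_map, human_map, noise_ceiling):
    """Model-human correlation expressed as a percentage of the noise ceiling.

    Assumption: noise_ceiling is a positive correlation value that bounds
    how well any model could be expected to match the human data.
    """
    if noise_ceiling <= 0:
        raise ValueError("noise_ceiling must be positive")
    return 100.0 * pearson(model_map, human_map) / noise_ceiling

# Synthetic stand-ins for one image (not data from the study).
rng = np.random.default_rng(0)
human_heatmap = rng.random((14, 14))
model_attention = 0.7 * human_heatmap + 0.3 * rng.random((14, 14))
score = percent_of_ceiling(model_attention, human_heatmap, noise_ceiling=0.9)
```

Averaging this score over images and tasks would give a per-model figure comparable in form to the percentages quoted in the summary, under the stated assumptions.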
In practice
- Prioritize decoder architecture for human-like attention.
- Consider CNN-LSTM for high fixation alignment.
- Evaluate models beyond fixation alignment for neural relevance.
Topics
- Vision-Language Models
- Attention Mechanisms
- Human Fixation
- Decoder Architecture
- Encoder Architecture
- Neural Relevance
- Computer Vision
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.