Attention Alignment Between Humans and Vision-Language Models

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

A study compared spatial attention maps from six vision-language models (VLMs) against human fixation heatmaps on 200 images under two tasks (general description and social captioning). The models comprised a 2x2 factorial of CNN/ViT encoders with LSTM/Transformer decoders, plus Molmo 7B-D and Qwen3.5 9B. Decoder architecture significantly shaped alignment: LSTM decoders increased it by 40-50 percentage points (80-87% vs. 40-59% of the human noise ceiling). Encoder choice contributed a secondary 5-20 point advantage, making CNN-LSTM the most aligned model overall (85-87%). However, LSTM-decoder attention was diffuse and less task-differentiated, while ViT-Transformer attention was more sharply concentrated despite weaker alignment. A hemispatial-neglect simulation confirmed a greater impact on LSTM decoders. In an exploratory analysis of TRIBE-simulated neural responses, CNN-Transformer attention maps better predicted synthetic brain activity, particularly in early visual cortex, despite lower fixation alignment, which points to a trade-off between behavioral and neural predictability.
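The contrast between "diffuse" and "sharply concentrated" attention can be made concrete with a spread measure. The summary does not say how the study quantified concentration, so the sketch below is only one plausible choice: the Shannon entropy of the attention map, normalized by its maximum. The function name `normalized_entropy` is hypothetical.

```python
import numpy as np

def normalized_entropy(attn):
    """Spread of an attention map in [0, 1].

    Assumption: entropy is used here as an illustrative diffuseness
    measure; it is not confirmed to be the study's own metric.
    Values near 1 mean attention is spread evenly over all cells,
    values near 0 mean it is concentrated on very few cells.
    """
    p = np.asarray(attn, dtype=float).ravel()
    total = p.sum()
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    p = p / total                      # treat the map as a probability distribution
    nz = p[p > 0]                      # 0 * log(0) is taken as 0
    h = -(nz * np.log(nz)).sum()       # Shannon entropy in nats
    return float(h / np.log(p.size))   # divide by the maximum possible entropy
```

Under this measure, a diffuse LSTM-decoder map would score closer to 1 and a concentrated ViT-Transformer map closer to 0, independently of how well either matches human fixations.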

Key takeaway

For AI scientists and machine learning engineers selecting vision-language models, decoder architecture is the main lever for human-like visual attention alignment. LSTM decoders give the best fixation alignment (e.g., CNN-LSTM at 85-87% of the human noise ceiling), but their attention is diffuse and less task-differentiated. If the goal is predicting synthetic neural activity, CNN-Transformer models may be more effective despite their lower fixation alignment. Match the architecture to the specific behavioral or neural prediction objective.

Key insights

Decoder architecture significantly impacts vision-language model attention alignment with human fixations, often more than encoder choice.

Method

Six vision-language models' spatial attention maps were compared against human fixation heatmaps on 200 images across two tasks, supplemented by hemispatial-neglect and TRIBE-simulated neural response analyses.
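Alignment reported as a percentage of the human noise ceiling can be sketched as follows. The summary does not specify the similarity metric or how the ceiling was estimated, so this is a minimal sketch under stated assumptions: Pearson correlation between maps, and a ceiling taken as the mean split-half correlation between heatmaps from random halves of the human subjects. All function names are hypothetical.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two maps of identical shape."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return float(a @ b / denom) if denom > 0 else 0.0

def noise_ceiling(subject_maps, n_splits=100, seed=0):
    """Mean split-half consistency of human heatmaps (assumed estimator).

    subject_maps: array of shape (n_subjects, H, W), one fixation
    heatmap per subject for a single image.
    """
    rng = np.random.default_rng(seed)
    n = len(subject_maps)
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        half_a = subject_maps[perm[: n // 2]].mean(axis=0)
        half_b = subject_maps[perm[n // 2:]].mean(axis=0)
        scores.append(pearson(half_a, half_b))
    return float(np.mean(scores))

def alignment_pct(model_map, subject_maps):
    """Model-to-human correlation as a percentage of the noise ceiling."""
    human_map = subject_maps.mean(axis=0)
    return 100.0 * pearson(model_map, human_map) / noise_ceiling(subject_maps)
```

Normalizing by the ceiling puts models on a common scale: a score of 100% would mean the model agrees with the pooled human heatmap as well as one group of humans agrees with another, which is how figures such as 85-87% are read here.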

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.