Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments
Summary
A study investigated how closely the attention of large vision-language models (VLMs) aligns with human visual attention in safety-relevant environments. Researchers collected eye-tracking data from ten participants viewing 33 scene images with varying risk levels and generated human gaze heatmaps from it. In parallel, GPT-4o was prompted via the OpenAI Vision API to produce spatial attention predictions, which were converted into saliency maps. Spatial alignment was evaluated using Pearson correlation (r = 0.515 ± 0.117), Normalised Scanpath Saliency (NSS = 0.988 ± 0.323), Kullback-Leibler divergence (KL = 1.766 ± 0.844), and AUC-Judd (0.806 ± 0.076). In a cross-model comparison with Gemini Pro, Gemini Flash, and Claude, every model surpassed the AUC-Judd chance baseline of 0.5 and achieved a positive NSS score. Gemini Pro exhibited the strongest spatial localization, while GPT-4o provided the closest distributional match to human attention. These findings suggest that VLMs can approximate human attentional patterns without being trained on eye-tracking data.
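The summary does not say how the gaze heatmaps or the VLM saliency maps were built. A common approach, shown here only as an illustrative sketch, is to place a Gaussian at each attended point and normalise the result. The function name `fixation_heatmap`, the grid size, and the `sigma` value are assumptions, not details from the study.

```python
import math

def fixation_heatmap(points, width, height, sigma=1.5):
    """Hypothetical helper: turn (x, y) points into a heatmap that sums to 1.

    Each point contributes an isotropic Gaussian; sigma (in pixels) is an
    assumed smoothing width, not a value taken from the study.
    """
    grid = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for y in range(height):
            for x in range(width):
                dist_sq = (x - px) ** 2 + (y - py) ** 2
                grid[y][x] += math.exp(-dist_sq / (2 * sigma ** 2))
    total = sum(sum(row) for row in grid)
    # Normalise so the map can be treated as a probability distribution.
    return [[value / total for value in row] for row in grid]

# Toy example: two attended points on a 6x4 grid.
heatmap = fixation_heatmap([(1, 1), (4, 2)], width=6, height=4)
```

The same routine could be applied to human fixations and to point predictions returned by a model, which puts both maps on a comparable footing before any metric is computed.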
Key takeaway
For AI scientists evaluating vision-language models for human-centric applications, this research indicates that VLMs can approximate human visual attention in safety-relevant scenes without costly eye-tracking data. Consider Gemini Pro when spatial localization matters most, and GPT-4o when a closer distributional match to human attention is the priority. This makes VLMs a scalable tool for studying attentional patterns in critical environments and for informing design and risk assessment.
Key insights
VLMs can approximate human visual attention in safety-relevant scenes without being trained on eye-tracking data.
Principles
- VLMs identify the scene regions humans attend to at above-chance levels.
- VLM attention prediction requires no eye-tracking training data.
- Different VLMs lead on different metrics: Gemini Pro on spatial localization, GPT-4o on distributional match.
Method
Collect human gaze heatmaps and VLM saliency maps. Evaluate spatial alignment using Pearson correlation, NSS, KL divergence, and AUC-Judd metrics.
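The four metrics can be sketched in a few lines. The version below follows the definitions commonly used in saliency benchmarks and is not the study's own code. It uses plain Python lists to stay dependency-free, and the function names, the binary fixation map, and the choice of the human map as the KL reference distribution are all assumptions.

```python
import math

def flatten(grid):
    return [float(v) for row in grid for v in row]

def pearson_cc(pred, human):
    """Pearson correlation between two maps of equal shape."""
    p, h = flatten(pred), flatten(human)
    mp, mh = sum(p) / len(p), sum(h) / len(h)
    cov = sum((a - mp) * (b - mh) for a, b in zip(p, h))
    var_p = sum((a - mp) ** 2 for a in p)
    var_h = sum((b - mh) ** 2 for b in h)
    return cov / math.sqrt(var_p * var_h)

def nss(pred, fixations):
    """Mean z-scored prediction at fixated pixels (fixations: binary map)."""
    p, f = flatten(pred), flatten(fixations)
    mean = sum(p) / len(p)
    std = math.sqrt(sum((a - mean) ** 2 for a in p) / len(p))
    at_fix = [(a - mean) / std for a, fx in zip(p, f) if fx > 0]
    return sum(at_fix) / len(at_fix)

def kl_divergence(pred, human, eps=1e-12):
    """KL(human || pred) after normalising both maps to sum to 1 (assumed direction)."""
    p, h = flatten(pred), flatten(human)
    sp, sh = sum(p), sum(h)
    return sum((b / sh) * math.log(eps + (b / sh) / (a / sp + eps))
               for a, b in zip(p, h))

def auc_judd(pred, fixations):
    """Area under the ROC curve, thresholding at each fixated pixel's value."""
    p, f = flatten(pred), flatten(fixations)
    pos = [a for a, fx in zip(p, f) if fx > 0]
    neg = [a for a, fx in zip(p, f) if fx <= 0]
    tpr, fpr = [0.0], [0.0]
    for t in sorted(set(pos), reverse=True):
        tpr.append(sum(a >= t for a in pos) / len(pos))
        fpr.append(sum(a >= t for a in neg) / len(neg))
    tpr.append(1.0)
    fpr.append(1.0)
    # Trapezoidal integration over the ROC points.
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Toy 2x2 maps, purely illustrative (not data from the study).
human_density = [[0.7, 0.1], [0.1, 0.1]]
fixations = [[1, 0], [0, 0]]
vlm_map = [[0.8, 0.1], [0.05, 0.05]]
scores = {
    "cc": pearson_cc(vlm_map, human_density),
    "nss": nss(vlm_map, fixations),
    "kl": kl_divergence(vlm_map, human_density),
    "auc_judd": auc_judd(vlm_map, fixations),
}
print(scores)
```

Higher is better for correlation, NSS, and AUC-Judd, while lower is better for KL divergence, which is why a model can lead on localization yet trail on distributional match. The sketch omits guards for constant maps (zero variance) and for images without fixations.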
In practice
- Use VLMs to approximate human attention patterns at scale.
- Evaluate VLM attention with several complementary metrics rather than one.
- Apply VLMs to the analysis of safety-relevant scenes.
Topics
- Vision-Language Models
- Human Gaze
- Attention Mechanisms
- Safety-Relevant Environments
- Spatial Alignment
- GPT-4o
- Gemini Pro
Best for: Computer Vision Engineer, AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.