Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments

2026-06-13 · Source: Computer Vision and Pattern Recognition · Field: Science & Research — Social Sciences & Behavioral Studies, Artificial Intelligence & Machine Learning, Research Methodology & Innovation · Depth: Expert, quick

Summary

A study investigated how closely the attention of large vision-language models (VLMs) aligns with human visual attention in safety-relevant environments. Researchers collected eye-tracking data from ten participants viewing 33 scene images with varying risk levels and generated human gaze heatmaps from it. In parallel, GPT-4o was prompted via the OpenAI Vision API to produce spatial attention predictions, which were converted into saliency maps. Spatial alignment was evaluated using Pearson correlation (r = 0.515 ± 0.117), Normalised Scanpath Saliency (NSS = 0.988 ± 0.323), Kullback-Leibler divergence (KL = 1.766 ± 0.844, lower is better), and AUC-Judd (0.806 ± 0.076). In a cross-model comparison with Gemini Pro, Gemini Flash, and Claude, all models exceeded the AUC-Judd chance baseline of 0.5 and achieved positive NSS scores. Gemini Pro exhibited the strongest spatial localization, while GPT-4o provided the closest distributional match to human attention. These findings suggest that VLMs can approximate human attentional patterns without being trained specifically on eye-tracking data.
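
As a rough illustration of how point-like attention data can become a map, the sketch below places one Gaussian on each attended location and normalises the result. It is an assumption made for illustration only: the summary does not say how the study built its heatmaps or what format the VLM predictions took, and the function name, the (x, y, weight) point format, and the sigma value are all hypothetical.

```python
import numpy as np


def points_to_saliency(points, height, width, sigma=25.0):
    """Turn (x, y, weight) points into a saliency map that sums to 1.

    Hypothetical helper: the point format and sigma are illustrative
    assumptions, not details taken from the study.
    """
    # Pixel coordinate grids: ys holds row indices, xs holds column indices.
    ys, xs = np.mgrid[0:height, 0:width]
    saliency = np.zeros((height, width), dtype=np.float64)
    for x, y, weight in points:
        # Add one isotropic Gaussian centred on the attended location.
        squared_dist = (xs - x) ** 2 + (ys - y) ** 2
        saliency += weight * np.exp(-squared_dist / (2.0 * sigma ** 2))
    total = saliency.sum()
    # Normalise to a probability distribution; leave an all-zero map as-is.
    return saliency / total if total > 0 else saliency


# Example: two attended locations on a 120 x 160 image.
example_map = points_to_saliency([(40, 30, 1.0), (120, 90, 0.5)], 120, 160)
```

The same routine could serve for fixation lists from an eye tracker and for coordinates returned by a model, which keeps both maps on a common footing before they are compared.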

Key takeaway

For AI scientists evaluating vision-language models for human-centric applications, this research indicates that VLMs can approximate human visual attention in safety-relevant scenes without being trained on costly eye-tracking data. Consider Gemini Pro when spatial localization matters most and GPT-4o when a closer distributional match to human attention is the priority. This makes VLMs a scalable tool for studying attentional patterns in critical environments and for informing design and risk assessment.

Key insights

VLMs can approximate human visual attention in safety-relevant scenes without being trained on eye-tracking data.

Principles

VLMs identify the scene regions that humans attend to at above-chance levels.
VLMs predict attention without being trained on eye-tracking data.
Different VLMs lead on different metrics: Gemini Pro on spatial localization, GPT-4o on distributional match.

Method

Collect human gaze heatmaps and VLM saliency maps. Evaluate spatial alignment using Pearson correlation, NSS, KL divergence, and AUC-Judd metrics.
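
The four metrics named above can be sketched in a few lines of NumPy. This is a minimal sketch using common textbook definitions, not the study's own evaluation code: the function names, the epsilon constant, and the choice of KL direction (human map as the reference distribution) are assumptions.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero and log(0)


def pearson_cc(pred, human):
    """Pearson correlation between two maps of the same shape."""
    return float(np.corrcoef(np.ravel(pred), np.ravel(human))[0, 1])


def nss(pred, fixations):
    """Mean z-scored prediction value at fixated pixels (boolean mask)."""
    pred = np.asarray(pred, dtype=np.float64)
    z = (pred - pred.mean()) / (pred.std() + EPS)
    return float(z[np.asarray(fixations, dtype=bool)].mean())


def kl_divergence(pred, human):
    """KL(human || pred) after scaling both non-negative maps to sum to 1."""
    p = np.asarray(pred, dtype=np.float64)
    q = np.asarray(human, dtype=np.float64)
    p = p / (p.sum() + EPS)
    q = q / (q.sum() + EPS)
    return float(np.sum(q * np.log(EPS + q / (p + EPS))))


def auc_judd(pred, fixations):
    """Area under the ROC curve, thresholding at each fixated pixel's value."""
    pred = np.asarray(pred, dtype=np.float64)
    fix = np.asarray(fixations, dtype=bool)
    fixated_values = pred[fix]
    n_fix, n_pix = fixated_values.size, pred.size
    tp, fp = [0.0], [0.0]
    for t in np.sort(fixated_values)[::-1]:
        above = np.sum(pred >= t)           # all pixels at or above threshold
        hits = np.sum(fixated_values >= t)  # fixated pixels among them
        tp.append(hits / n_fix)
        fp.append((above - hits) / (n_pix - n_fix))
    tp.append(1.0)
    fp.append(1.0)
    tp, fp = np.array(tp), np.array(fp)
    # Trapezoidal area under the (fp, tp) curve.
    return float(np.sum(np.diff(fp) * (tp[1:] + tp[:-1]) / 2.0))


# Toy example: a 4 x 4 prediction with fixations on its two highest pixels.
toy_pred = np.arange(16, dtype=np.float64).reshape(4, 4)
toy_fix = toy_pred >= 14
toy_scores = {
    "cc": pearson_cc(toy_pred, toy_pred),
    "nss": nss(toy_pred, toy_fix),
    "kl": kl_divergence(toy_pred, toy_pred),
    "auc_judd": auc_judd(toy_pred, toy_fix),
}
```

Pearson correlation and KL divergence compare two continuous maps, while NSS and AUC-Judd score a prediction against discrete fixation locations. Higher is better for every metric except KL divergence, where lower values indicate a closer match.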

In practice

Use VLMs to approximate human attention patterns at scale.
Evaluate VLM attention with multiple complementary metrics.
Apply VLMs to the analysis of safety-relevant scenes.

Topics

Vision-Language Models
Human Gaze
Attention Mechanisms
Safety-Relevant Environments
Spatial Alignment
GPT-4o
Gemini Pro

Best for: Computer Vision Engineer, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.