Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments

Source: Computer Vision and Pattern Recognition · Field: Science & Research — Social Sciences & Behavioral Studies, Artificial Intelligence & Machine Learning, Research Methodology & Innovation · Depth: Expert, quick

Summary

A study investigated the alignment between human visual attention and large vision-language model (VLM) attention in safety-relevant environments. Researchers collected eye-tracking data from ten participants viewing 33 scene images with varying risk levels, generating human gaze heatmaps. Concurrently, GPT-4o was prompted via the OpenAI Vision API to produce spatial attention predictions, converted into saliency maps. Spatial alignment was evaluated using Pearson correlation (r = 0.515 ± 0.117), Normalised Scanpath Saliency (NSS = 0.988 ± 0.323), Kullback-Leibler divergence (KL = 1.766 ± 0.844), and AUC-Judd (0.806 ± 0.076). A cross-model comparison with Gemini Pro, Gemini Flash, and Claude revealed all models surpassed the AUC-Judd chance baseline of 0.5 and achieved positive NSS scores. Gemini Pro exhibited the strongest spatial localization, while GPT-4o provided the closest distributional match to human attention. These findings suggest VLMs can approximate human attentional patterns without specific eye-tracking training data.

Key takeaway

For AI scientists evaluating vision-language models for human-centric applications, this research indicates that VLMs can approximate human visual attention in safety-relevant scenes without being trained on eye-tracking data, although the agreement is moderate rather than exact (the reported Pearson correlation is r = 0.515, with an AUC-Judd of 0.806). Consider Gemini Pro when spatial localization matters most and GPT-4o when a closer distributional match to human attention is the priority. This makes VLMs a potentially scalable tool for studying attentional patterns in safety-relevant environments and for informing design and risk assessment.

Key insights

VLMs can approximate human visual attention in safety-relevant scenes without being trained on eye-tracking data.

Method

Collect human gaze heatmaps and VLM saliency maps. Evaluate spatial alignment using Pearson correlation, NSS, KL divergence, and AUC-Judd metrics.
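
The four metrics named above can be sketched in a few lines of NumPy. This is a minimal illustration, not the study's evaluation code. It assumes the model saliency map and the human gaze heatmap are non-negative 2D arrays of the same shape, and that a binary fixation map is available for NSS and AUC-Judd. The function names (pearson_cc, nss, kl_divergence, auc_judd) are hypothetical, and the KL direction (human map as the reference distribution) follows common saliency-benchmark practice rather than anything the summary states.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)


def pearson_cc(sal, gaze):
    """Pearson correlation between a saliency map and a gaze heatmap."""
    s = (sal - sal.mean()) / (sal.std() + EPS)
    g = (gaze - gaze.mean()) / (gaze.std() + EPS)
    return float((s * g).mean())


def nss(sal, fixations):
    """Normalised Scanpath Saliency: mean z-scored saliency at fixated pixels."""
    s = (sal - sal.mean()) / (sal.std() + EPS)
    return float(s[np.asarray(fixations, dtype=bool)].mean())


def kl_divergence(sal, gaze):
    """KL(human || model); assumes the human map is the reference distribution."""
    p = gaze / (gaze.sum() + EPS)  # human gaze as a probability distribution
    q = sal / (sal.sum() + EPS)    # model saliency as a probability distribution
    return float(np.sum(p * np.log(EPS + p / (q + EPS))))


def auc_judd(sal, fixations):
    """AUC-Judd: ROC area using the saliency values at fixations as thresholds."""
    s = sal.ravel()
    f = np.asarray(fixations, dtype=bool).ravel()
    n_fix, n_pix = int(f.sum()), s.size
    thresholds = np.sort(s[f])[::-1]  # highest fixated saliency first
    tp, fp = [0.0], [0.0]
    for k, t in enumerate(thresholds, start=1):
        above = np.count_nonzero(s >= t)
        tp.append(k / n_fix)                         # fraction of fixations recovered
        fp.append((above - k) / (n_pix - n_fix))     # fraction of non-fixated pixels flagged
    tp.append(1.0)
    fp.append(1.0)
    tp, fp = np.array(tp), np.array(fp)
    # Trapezoidal area under the ROC curve
    return float(np.sum(np.diff(fp) * (tp[:-1] + tp[1:]) / 2.0))
```

Higher Pearson, NSS, and AUC-Judd values indicate closer agreement, while a lower KL divergence indicates a closer distributional match. An AUC-Judd of 0.5 corresponds to chance. Benchmark implementations often add a small random jitter to the saliency map before computing AUC-Judd to break ties; that step is omitted here for simplicity.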

Best for: Computer Vision Engineer, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.