What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP Metrics

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Human-Computer Interaction) · Depth: Expert, quick

Summary

This work introduces a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Existing scanpath comparison methods focus mainly on spatial and temporal alignment; the framework additionally evaluates whether the attended image regions are semantically equivalent. Each fixation is encoded under a controlled visual context, using either a patch-based or a marker-based strategy, to generate a concise textual description. These descriptions are aggregated into scanpath-level representations, and semantic similarity is computed with embedding-based and lexical NLP metrics. In comparative experiments on free-viewing eye-tracking data against established spatial measures such as MultiMatch and dynamic time warping (DTW), semantic similarity captures variance that is partially independent of geometric alignment, revealing cases of strong content agreement despite spatial differences. The work also examines how contextual encoding affects description fidelity and metric stability, and suggests that multimodal foundation models can provide interpretable, content-aware extensions to traditional scanpath analysis.
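
For contrast with the semantic approach, a purely geometric comparison of two scanpaths can be sketched with DTW over fixation coordinates. This is a minimal illustration and not the paper's implementation: the function name `dtw_distance`, the Euclidean point cost, and the sample coordinates are all assumptions made for the example.

```python
import math

def dtw_distance(path_a, path_b):
    """Dynamic time warping distance between two sequences of (x, y) fixations.

    Assumption: Euclidean distance as the per-step cost, no windowing.
    """
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning the first i and first j fixations
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(path_a[i - 1], path_b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance in path_a only
                cost[i][j - 1],      # advance in path_b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]

# Hypothetical scanpaths: same objects viewed, but at different positions.
viewer_1 = [(10, 10), (50, 150)]
viewer_2 = [(150, 50), (180, 180)]
spatial_gap = dtw_distance(viewer_1, viewer_2)
```

A measure like this reports a large distance whenever fixations land far apart, even if both viewers looked at the same objects, which is the case the semantic metrics are meant to surface.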

Key takeaway

For eye-tracking researchers and NLP engineers analyzing visual attention, this framework offers a way to interpret gaze data by what was looked at, not only by where the fixations fell. Consider adding semantic similarity metrics to uncover content agreement between scanpaths, especially where spatial alignment is low. This can give a fuller picture of visual cognition and user intent in human-computer interaction and visual search tasks.

Key insights

VLMs and NLP metrics can quantify semantic similarity in eye-tracking scanpaths, complementing traditional spatial analysis.

Method

Encode fixations using patch-based or marker-based visual contexts, generate textual descriptions, aggregate into scanpath representations, then compute semantic similarity via NLP embedding or lexical metrics.
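
The steps above can be sketched end to end as follows. This is a toy illustration under stated assumptions, not the authors' pipeline: `describe_fixation` is a hypothetical stand-in for a VLM call (it looks up a hand-written region label instead of prompting a model on a patch or marker image), `jaccard` stands in for a lexical metric, and `cosine_bow` uses bag-of-words counts in place of a learned text embedding.

```python
import math
import re
from collections import Counter

def describe_fixation(fixation, region_labels):
    """Hypothetical stand-in for a VLM: return the label of the region hit."""
    x, y = fixation
    for (x0, y0, x1, y1), label in region_labels:
        if x0 <= x < x1 and y0 <= y < y1:
            return label
    return "background"

def scanpath_text(fixations, region_labels):
    """Aggregate per-fixation descriptions, in viewing order, into one text."""
    return ". ".join(describe_fixation(f, region_labels) for f in fixations)

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def jaccard(text_a, text_b):
    """Lexical similarity: overlap of the two token sets."""
    set_a, set_b = set(tokens(text_a)), set(tokens(text_b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def cosine_bow(text_a, text_b):
    """Cosine similarity over token counts (assumed proxy for an embedding)."""
    count_a, count_b = Counter(tokens(text_a)), Counter(tokens(text_b))
    dot = sum(count_a[t] * count_b[t] for t in count_a)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical image layout: (x0, y0, x1, y1) boxes with hand-written labels.
regions = [
    ((0, 0, 100, 100), "a red car"),
    ((100, 0, 200, 100), "a red car"),
    ((0, 100, 200, 200), "a dog on grass"),
]

text_a = scanpath_text([(10, 10), (50, 150)], regions)     # car, then dog
text_b = scanpath_text([(150, 50), (180, 180)], regions)   # car, then dog
text_c = scanpath_text([(10, 10), (20, 20)], regions)      # car only

same_content = jaccard(text_a, text_b)   # 1.0: identical descriptions
partial = jaccard(text_a, text_c)        # 0.5: shares only the car tokens
```

Scanpaths `a` and `b` never fixate the same coordinates, yet their descriptions match exactly, which is the kind of content agreement that spatial measures miss. In a real setting the label lookup would be replaced by a VLM prompted with a patch or marker context, and the bag-of-words cosine by a sentence-embedding model.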

Best for: NLP Engineer, AI Scientist, Research Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.