ST-DiffEye: Diffusion-based Continuous Gaze Generation via Joint Scanpath-Trajectory Modeling

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision) · Depth: Expert, quick

Summary

ST-DiffEye is a diffusion-based framework for continuous human gaze generation: it models the gaze patterns a viewer produces while observing a visual stimulus. It treats gaze variability as an intrinsic property of viewing behavior rather than as noise. Unlike existing models, which supervise on either continuous eye-tracking trajectories or discrete scanpaths in isolation, ST-DiffEye models both complementary modalities jointly. The two are coupled by concatenation as an additional raw input channel, which requires minimal architectural overhead. The framework also introduces a principled evaluation method based on the Continuous Ranked Probability Score (CRPS), which generalizes existing sequence similarity metrics so that accuracy and diversity are assessed together. The reported experiments show state-of-the-art performance on task-driven visual search (target-present and target-absent) and on free-viewing benchmarks.
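A sample-based, CRPS-style score can be sketched as follows. For a set of generated gaze sequences and one observed sequence, it takes the mean distance from the samples to the observation and subtracts half the mean pairwise distance among the samples, so lower is better and a model that collapses to one output gets no credit for diversity. The function names and the choice of `sequence_distance` (mean pointwise Euclidean distance over equal-length sequences) are assumptions for illustration; this summary does not say which sequence metric the paper plugs in.

```python
import math


def sequence_distance(a, b):
    """Mean pointwise Euclidean distance between two equal-length 2D sequences.

    Hypothetical stand-in for whatever sequence similarity metric is used.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def sample_crps(samples, observed, distance=sequence_distance):
    """Sample estimate of E[d(X, y)] - 0.5 * E[d(X, X')].

    samples:  list of generated sequences, each a list of (x, y) points
    observed: one recorded sequence of the same length
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    # Accuracy term: how far the generated sequences are from the observation.
    accuracy = sum(distance(s, observed) for s in samples) / n
    # Diversity term: mean distance over all ordered pairs of samples.
    spread = sum(distance(s, t) for s in samples for t in samples) / (n * n)
    return accuracy - 0.5 * spread


# Toy usage: two one-point "sequences" generated around a single observation.
generated = [[(0.0, 0.0)], [(2.0, 0.0)]]
recorded = [(1.0, 0.0)]
score = sample_crps(generated, recorded)
```

Passing a different `distance` callable swaps in another sequence metric without changing the scoring logic.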

Key takeaway

For computer vision engineers and AI scientists building realistic models of human behavior, ST-DiffEye shows that jointly modeling continuous eye-tracking trajectories and discrete scanpaths improves both the accuracy and the diversity of generated gaze. Consider integrating multi-modal gaze data into your generative frameworks and adopting distribution-aware metrics such as CRPS for evaluation, especially when intrinsic variability matters. This approach can make synthetic gaze data more realistic for training or simulation.

Key insights

ST-DiffEye jointly models gaze trajectories and scanpaths via diffusion to generate diverse, accurate human gaze patterns.

Principles

Method

ST-DiffEye couples gaze trajectories and scanpaths by concatenating them as an additional raw input channel within a diffusion framework. This widens the model's input and output channels but adds no significant architectural overhead.
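The channel coupling described above can be illustrated with a minimal sketch. It assumes the scanpath is expanded to one fixation position per trajectory sample and appended to the (x, y) trajectory as two extra channels, and that noising follows the standard DDPM forward step x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The function names, the fixation format `(x, y, n_samples)`, and the alignment scheme are all hypothetical; the summary does not describe how the paper aligns or orders the two modalities.

```python
import math
import random


def rasterize_scanpath(fixations, length):
    """Repeat each fixation (x, y, n_samples) for its duration (assumed format)."""
    steps = []
    for x, y, n_samples in fixations:
        steps.extend([(x, y)] * n_samples)
    if not steps:
        raise ValueError("scanpath has no fixations")
    steps = steps[:length]
    # Pad with the last fixation if the scanpath is shorter than the trajectory.
    steps += [steps[-1]] * (length - len(steps))
    return steps


def concat_channels(trajectory, scanpath_steps):
    """Stack the two modalities per timestep: [traj_x, traj_y, fix_x, fix_y]."""
    if len(trajectory) != len(scanpath_steps):
        raise ValueError("modalities must be aligned to the same length")
    return [[tx, ty, sx, sy]
            for (tx, ty), (sx, sy) in zip(trajectory, scanpath_steps)]


def forward_noise(x0, alpha_bar, rng):
    """Standard DDPM forward step applied to every channel jointly."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row]
            for row in x0]


# Toy usage: four trajectory samples and two fixations of two samples each.
trajectory = [(0.10, 0.20), (0.12, 0.21), (0.40, 0.50), (0.41, 0.52)]
fixations = [(0.11, 0.20, 2), (0.40, 0.51, 2)]
x0 = concat_channels(trajectory, rasterize_scanpath(fixations, len(trajectory)))
xt = forward_noise(x0, alpha_bar=0.5, rng=random.Random(0))
```

In this sketch a denoising network would take and return four channels per timestep instead of two, which is the only change the joint representation imposes on it.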

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.