Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

Researchers studied decoding pedestrian crossing intentions from short egocentric video clips, formulating the task as a closed-ended visual question answering (VQA) problem for vision language models (VLMs). Initial zero-shot benchmarking of three VLM families showed moderate gains over random guessing but limited higher-level traffic reasoning. With parameter-efficient fine-tuning, the models substantially outperformed their zero-shot counterparts and achieved a 9% accuracy improvement over a specialized transformer-based baseline. Adding contextual cues such as ego motion, vehicle motion, and eye gaze improved predictive performance further. The best configuration, a fine-tuned Qwen3-VL-2B model guided by eye gaze and ego motion, set a new benchmark for egocentric pedestrian intent decoding with a 14.5% accuracy improvement over the transformer baseline.

Key takeaway

Computer vision engineers developing advanced driver-assistance systems or pedestrian safety features should consider fine-tuning vision language models for egocentric intent prediction. Integrating contextual cues such as ego motion and eye gaze can significantly boost accuracy: in the study, a fine-tuned Qwen3-VL-2B model using these cues achieved a 14.5% accuracy improvement over a specialized transformer-based baseline, setting a new benchmark for egocentric pedestrian intent decoding.

Key insights

Vision language models fine-tuned with ego-motion and eye-gaze cues significantly improve pedestrian crossing intention prediction.

Method

Formulate intent prediction as closed-ended VQA. Benchmark zero-shot VLMs, then apply parameter-efficient fine-tuning. Integrate contextual cues like ego motion, vehicle motion, and eye gaze for performance gains.
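
The closed-ended VQA formulation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the researchers' actual setup: the prompt wording, the yes/no answer set, the `ClipContext` fields, and the helper names are all hypothetical, and the VLM call itself is left out. The sketch only shows how contextual cues could be folded into a fixed-choice question and how free-form model output could be mapped back onto the closed answer set for scoring.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence

# Assumed closed answer set; the source does not state the exact options.
ANSWER_OPTIONS = ("yes", "no")


@dataclass
class ClipContext:
    """Hypothetical textual descriptions of the contextual cues for one clip."""
    ego_motion: Optional[str] = None
    vehicle_motion: Optional[str] = None
    eye_gaze: Optional[str] = None


def build_prompt(ctx: ClipContext) -> str:
    """Compose a closed-ended question, adding only the cues that are present."""
    lines = ["You are shown a short egocentric video clip."]
    if ctx.ego_motion:
        lines.append(f"Ego motion: {ctx.ego_motion}.")
    if ctx.vehicle_motion:
        lines.append(f"Vehicle motion: {ctx.vehicle_motion}.")
    if ctx.eye_gaze:
        lines.append(f"Eye gaze: {ctx.eye_gaze}.")
    lines.append("Question: Does the pedestrian intend to cross the road?")
    lines.append("Answer with exactly one word: yes or no.")
    return "\n".join(lines)


def parse_answer(model_output: str) -> Optional[str]:
    """Map free-form model text to the closed set; None if ambiguous or missing."""
    tokens = re.findall(r"[a-z]+", model_output.lower())
    found = {tok for tok in tokens if tok in ANSWER_OPTIONS}
    return found.pop() if len(found) == 1 else None


def accuracy(predictions: Sequence[Optional[str]], labels: Sequence[str]) -> float:
    """Fraction of clips answered correctly; unparseable answers count as wrong."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)
```

Restricting the output to a fixed answer set keeps evaluation a plain accuracy computation, which makes zero-shot and fine-tuned variants, with or without the extra cues, directly comparable.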

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.