Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models
Summary
Researchers studied decoding pedestrian crossing intentions from short egocentric video clips, formulating the task as a closed-ended visual question answering (VQA) problem for vision language models (VLMs). Zero-shot benchmarking of three VLM families showed moderate gains over random guessing but limited higher-level traffic reasoning. With parameter-efficient fine-tuning, the models substantially outperformed their zero-shot counterparts and achieved a 9% accuracy improvement over a specialized transformer-based baseline. Adding contextual cues such as ego motion, vehicle motion, and eye gaze improved predictive performance further. In particular, the fine-tuned Qwen3-VL-2B model, guided by eye gaze and ego motion, established a new benchmark for egocentric pedestrian intent decoding with a 14.5% accuracy improvement over the transformer baseline.
Key takeaway
Computer vision engineers developing advanced driver-assistance systems or pedestrian safety features should consider fine-tuning vision language models for egocentric intent prediction. Integrating contextual cues such as ego motion and eye gaze can significantly boost accuracy: a fine-tuned Qwen3-VL-2B model guided by these cues achieved a 14.5% accuracy improvement over the transformer baseline, setting a new benchmark for egocentric pedestrian intent decoding.
Key insights
Vision language models, fine-tuned and guided by ego motion and eye gaze cues, significantly improve pedestrian crossing intention prediction.
Principles
- Egocentric vision offers unique traffic safety insights.
- Closed-ended VQA is an effective framing for intent prediction.
- Contextual cues enhance VLM performance.
Method
Formulate intent prediction as closed-ended VQA. Benchmark zero-shot VLMs, then apply parameter-efficient fine-tuning. Integrate contextual cues like ego motion, vehicle motion, and eye gaze for performance gains.
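As a rough illustration of the closed-ended VQA framing, the sketch below builds a text prompt that folds in optional contextual cues and restricts the answer to two choices. The summary does not give the paper's actual prompt, so the wording, the `ClipContext` fields, and the `build_prompt` and `parse_answer` helpers are all hypothetical; the video frames would be passed to the VLM separately.

```python
from dataclasses import dataclass
from typing import Optional

# Closed answer set: the model may only pick one of these (assumed labels).
ANSWER_CHOICES = ("yes", "no")


@dataclass
class ClipContext:
    """Optional contextual cues for one clip. Field names are hypothetical."""
    ego_motion: Optional[str] = None      # e.g. "walking forward slowly"
    vehicle_motion: Optional[str] = None  # e.g. "car approaching from the left"
    eye_gaze: Optional[str] = None        # e.g. "fixated on the crosswalk"


def build_prompt(context: ClipContext) -> str:
    """Compose a closed-ended question, adding only the cues that are present."""
    lines = ["You are given a short egocentric video clip of a street scene."]
    if context.ego_motion:
        lines.append(f"Ego motion: {context.ego_motion}.")
    if context.vehicle_motion:
        lines.append(f"Vehicle motion: {context.vehicle_motion}.")
    if context.eye_gaze:
        lines.append(f"Eye gaze: {context.eye_gaze}.")
    lines.append("Question: Does the pedestrian intend to cross the street?")
    lines.append("Answer with exactly one word: yes or no.")
    return "\n".join(lines)


def parse_answer(raw: str) -> Optional[str]:
    """Map free-form model output onto the closed answer set, or None."""
    tokens = raw.strip().lower().replace(".", " ").replace(",", " ").split()
    if tokens and tokens[0] in ANSWER_CHOICES:
        return tokens[0]
    return None


if __name__ == "__main__":
    ctx = ClipContext(ego_motion="walking forward slowly",
                      eye_gaze="fixated on the crosswalk")
    print(build_prompt(ctx))
    print(parse_answer("Yes."))
```

Keeping the answer set closed makes scoring a simple string comparison, and omitting absent cues lets the same template serve both the plain and the cue-guided settings.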
In practice
- Fine-tune VLMs for specific egocentric intent tasks.
- Integrate ego motion and eye gaze as VLM context.
- Evaluate Qwen3-VL-2B for egocentric vision applications.
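When evaluating a model on this kind of task, a minimal scoring routine might look like the sketch below. It assumes binary "yes"/"no" labels, counts unparseable model outputs (represented as `None`) as wrong, and compares against a majority-class baseline; none of these choices come from the paper, whose evaluation protocol is not described in this summary, and the sample data is made up purely for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def accuracy(predictions: Sequence[Optional[str]], labels: Sequence[str]) -> float:
    """Fraction of clips answered correctly; None (unparseable) counts as wrong."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)


def majority_baseline(labels: Sequence[str]) -> float:
    """Accuracy of always predicting the most frequent label."""
    if not labels:
        raise ValueError("cannot score an empty evaluation set")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


if __name__ == "__main__":
    # Illustrative toy data, not results from the paper.
    labels = ["yes", "no", "yes", "yes"]
    predictions = ["yes", "no", "no", None]
    print(f"model accuracy:    {accuracy(predictions, labels):.2f}")
    print(f"majority baseline: {majority_baseline(labels):.2f}")
```

Reporting a trivial baseline alongside model accuracy helps show whether a gain reflects actual reasoning about the scene or just class imbalance in the evaluation set.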
Topics
- Egocentric Vision
- Pedestrian Intention Prediction
- Vision Language Models
- Visual Question Answering
- Parameter-Efficient Fine-Tuning
- Traffic Safety
- Qwen3-VL-2B
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.