Where Will They Go? Modelling Multimodal Pedestrian Manoeuvres from Ego-centric Videos

2026-06-17 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

MMPM is a mode-aware framework for multimodal pedestrian trajectory prediction from ego-centric camera videos. Existing stochastic predictors often sample sub-optimal "mixed-mode" trajectories; MMPM addresses this by splitting the future distribution into semantically meaningful modes defined by pedestrian crossing behavior. The framework comprises two modules: a behavior-aware Pedestrian Interaction Module (PIM) that captures pedestrian-vehicle and pedestrian-environment interactions using gaze, head, and hand gestures, and a CVAE-based Mode-aware Trajectory Predictor (MTP). MTP models future trajectories separately for two modes: crossing and not crossing the road. A query-based decoder keeps each prediction consistent with its mode. Experiments on the PIE and JAAD datasets show that MMPM outperforms current baselines. Because the MTP module is model-agnostic, it can be integrated into frameworks such as BiTrap-NP and SGNet-ED to improve their performance. A new data-driven validation protocol also shows improved frame-wise displacement errors.

Key takeaway

If you are a computer vision engineer developing autonomous driving systems, note that current pedestrian prediction models often yield implausible "mixed-mode" trajectories. Consider adopting a mode-aware framework like MMPM, which explicitly models distinct behaviors such as crossing or not crossing the road. This approach, which incorporates cues like gaze and hand gestures, is reported to improve prediction accuracy and plausibility. Integrating its model-agnostic MTP module into an existing system may improve that system's trajectory predictions.

Key insights

Separately modeling pedestrian crossing and non-crossing behaviors improves multimodal trajectory prediction accuracy.

Principles

Pedestrian intention drives multimodal trajectory distributions.
Incorporating gaze, head, and hand gestures enhances interaction modeling.
Model-agnostic prediction modules can upgrade existing frameworks.

Method

MMPM uses PIM for interaction capture (gaze, head, hand gestures) and a CVAE-based MTP to model crossing/non-crossing modes, with a query-based decoder enforcing consistency.
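
To make the mode-aware idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes one Gaussian latent prior per mode and splits the sample budget between modes by a predicted crossing probability, so every sampled trajectory belongs to exactly one mode. The names `sample_mode_aware` and `toy_decoder`, the prior parameters, and the constant-velocity decoder are all hypothetical stand-ins for learned components.

```python
import numpy as np

MODES = ("crossing", "non_crossing")


def toy_decoder(context, z, mode, horizon):
    """Hypothetical stand-in for a learned decoder (constant velocity).

    Only the crossing mode receives a lateral (x) velocity component.
    """
    vx = (2.0 + z[0]) if mode == "crossing" else 0.0
    vy = 0.5 + 0.1 * z[1]
    steps = np.arange(1, horizon + 1)[:, None]       # shape (horizon, 1)
    return context[-1] + steps * np.array([vx, vy])  # shape (horizon, 2)


def sample_mode_aware(context, p_cross, priors, decoder, k=20, horizon=10, rng=None):
    """Draw k future trajectories, each tied to exactly one mode."""
    rng = np.random.default_rng() if rng is None else rng
    # Split the sample budget between modes by the predicted crossing probability.
    n_cross = int(round(k * p_cross))
    counts = {"crossing": n_cross, "non_crossing": k - n_cross}
    trajectories, labels = [], []
    for mode in MODES:
        mu, sigma = priors[mode]
        for _ in range(counts[mode]):
            # Sample from this mode's own latent prior; latents never mix modes.
            z = mu + sigma * rng.standard_normal(mu.shape)
            trajectories.append(decoder(context, z, mode, horizon))
            labels.append(mode)
    return np.stack(trajectories), labels


# Illustrative inputs: three observed (x, y) positions and made-up priors.
observed = np.array([[0.0, 0.0], [0.0, 0.5], [0.0, 1.0]])
priors = {
    "crossing": (np.zeros(2), 0.5 * np.ones(2)),
    "non_crossing": (np.zeros(2), 0.2 * np.ones(2)),
}
trajs, labels = sample_mode_aware(
    observed, p_cross=0.3, priors=priors, decoder=toy_decoder,
    rng=np.random.default_rng(0),
)
```

Sampling from a single shared prior and decoding without a mode label is what allows a draw to land between behaviors; keeping the priors and the decoder conditioning separate per mode is one simple way to rule that out.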

In practice

Integrate MTP into existing frameworks like BiTrap-NP or SGNet-ED.
Utilize behavior cues (gaze, gestures) for robust interaction modeling.
Apply the data-driven validation protocol for trajectory evaluation.
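
The summary does not describe the data-driven validation protocol itself, so the sketch below only illustrates the underlying quantity: the frame-wise displacement error between sampled and ground-truth trajectories, plus the common best-of-K average and final displacement errors. The function names and the best-of-K selection rule are assumptions for illustration, not the paper's protocol.

```python
import numpy as np


def framewise_displacement(pred, gt):
    """Euclidean error per sample and per frame.

    pred: (K, T, 2) sampled trajectories; gt: (T, 2) ground truth.
    Returns an array of shape (K, T).
    """
    return np.linalg.norm(pred - gt[None], axis=-1)


def best_of_k(pred, gt):
    """Pick the sample with the lowest mean error and report its metrics."""
    err = framewise_displacement(pred, gt)
    best = int(np.argmin(err.mean(axis=1)))
    return {
        "best_index": best,
        "ade": float(err[best].mean()),  # average displacement error
        "fde": float(err[best, -1]),     # final-frame displacement error
        "per_frame": err[best],          # frame-wise errors of the best sample
    }


# Tiny worked example: two samples over three frames.
gt = np.zeros((3, 2))
pred = np.stack([
    np.tile([3.0, 4.0], (3, 1)),  # always 5 units away
    np.tile([0.0, 1.0], (3, 1)),  # always 1 unit away
])
result = best_of_k(pred, gt)
```

Here the second sample is selected, since its error is 1 at every frame against 5 for the first.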

Topics

Pedestrian Trajectory Prediction
Multimodal Prediction
Ego-centric Vision
Behavior Modeling
CVAE
Autonomous Driving

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.