EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision) · Depth: Expert, quick

Summary

EgoExoMem is introduced as the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. It comprises 2.6K high-quality multiple-choice questions (MCQs) spanning eight temporal, spatial, and cross-view question-answering types. For dual-view retrieval, the authors propose E^2-Select, a training-free frame selection method for synchronized ego-exo videos. E^2-Select combines relevance-based budget allocation with per-view k-DPP sampling to address view asymmetry and maintain cross-view temporal consistency. Experiments indicate that egocentric and exocentric views offer complementary memory cues. Existing multimodal large language models (MLLMs) achieve only 55.3% accuracy on the benchmark, while E^2-Select reaches 58.2%, a new state of the art over other frame-selection and RAG-based memory baselines. The analysis also reveals systematic view-preference conflicts between how a question is framed and where its answer is grounded, which underscores the difficulty of cross-view memory reasoning.

Key takeaway

For research scientists developing embodied intelligence or video understanding systems, EgoExoMem shows that robust spatial-temporal reasoning benefits from combining egocentric and exocentric perspectives. Consider adopting dual-view approaches and frame selection methods such as E^2-Select. Current MLLMs still struggle with cross-view memory, which leaves substantial room for improvement in model architectures.

Key insights

EgoExoMem is a new benchmark for cross-view memory reasoning using synchronized egocentric and exocentric videos.

Method

E^2-Select uses relevance-based budget allocation and per-view k-DPP sampling for training-free frame selection in synchronized ego-exo videos.
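The two ingredients named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary gives no formulas, so the relevance score (exponentiated cosine similarity), the proportional budget rule, and the greedy determinant maximization used in place of exact k-DPP sampling are all assumptions made for illustration, as are the function names `relevance`, `allocate_budget`, `greedy_dpp`, and `select_frames`. The sketch also does not model the cross-view temporal consistency that E^2-Select is said to maintain.

```python
import numpy as np


def relevance(frame_feats, query_feat):
    """Positive per-frame relevance: exp of cosine similarity (assumed scoring)."""
    f = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-12)
    q = query_feat / (np.linalg.norm(query_feat) + 1e-12)
    return np.exp(f @ q)


def allocate_budget(rel_ego, rel_exo, total, min_per_view=1):
    """Split `total` frames between views by mean relevance (assumed rule).

    Assumes total >= 2 * min_per_view.
    """
    s_ego, s_exo = float(np.mean(rel_ego)), float(np.mean(rel_exo))
    share = s_ego / (s_ego + s_exo) if (s_ego + s_exo) > 0 else 0.5
    k_ego = int(round(share * total))
    # Keep at least `min_per_view` frames for each view.
    k_ego = min(max(k_ego, min_per_view), total - min_per_view)
    return k_ego, total - k_ego


def greedy_dpp(features, rel, k):
    """Greedy determinant maximization, a stand-in for exact k-DPP sampling.

    Uses the quality-diversity kernel L = diag(rel) @ S @ diag(rel),
    where S holds cosine similarities between frames.
    """
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    S = f @ f.T
    L = rel[:, None] * S * rel[None, :]
    L = L + 1e-6 * np.eye(len(L))  # jitter keeps sub-determinants positive
    selected = []
    for _ in range(min(k, len(L))):
        best, best_val = None, -np.inf
        for i in range(len(L)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best, best_val = i, logdet
        if best is None:
            break
        selected.append(best)
    return sorted(selected)  # frame indices in temporal order


def select_frames(ego_feats, exo_feats, query_feat, total):
    """Allocate the budget across views, then pick diverse relevant frames per view."""
    rel_ego = relevance(ego_feats, query_feat)
    rel_exo = relevance(exo_feats, query_feat)
    k_ego, k_exo = allocate_budget(rel_ego, rel_exo, total)
    return greedy_dpp(ego_feats, rel_ego, k_ego), greedy_dpp(exo_feats, rel_exo, k_exo)
```

Given per-frame embeddings for each view and a question embedding from any shared encoder, `select_frames(ego, exo, q, total=8)` returns two index lists whose lengths sum to 8. Because the videos are synchronized, both lists index the same timeline, so the selected frames can be interleaved by timestamp before being passed to an MLLM.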

Topics

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.