EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos
Summary
EgoExoMem is presented as the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. It comprises 2.6K high-quality multiple-choice questions (MCQs) spanning eight temporal, spatial, and cross-view question-answering types. To support dual-view retrieval, the authors propose E^2-Select, a training-free frame selection method designed for synchronized ego-exo videos. E^2-Select combines relevance-based budget allocation with per-view k-DPP sampling to address view asymmetry and maintain cross-view temporal consistency. Experimental results indicate that egocentric and exocentric views offer complementary memory cues. Existing multimodal large language models (MLLMs) reach only 55.3% accuracy on the benchmark, while E^2-Select sets a new state of the art at 58.2%, outperforming other frame-selection and RAG-based memory baselines. The analysis also highlights systematic view-preference conflicts between question framing and answer grounding, which underscores the complexity of cross-view memory reasoning.
Key takeaway
For research scientists developing embodied intelligence or video understanding systems, EgoExoMem shows that robust spatial-temporal reasoning requires integrating both egocentric and exocentric perspectives. Consider adopting dual-view inputs and frame selection methods such as E^2-Select to improve performance: current MLLMs still struggle with cross-view memory, which leaves significant room for advancement in model architectures.
Key insights
EgoExoMem is a new benchmark for cross-view memory reasoning using synchronized egocentric and exocentric videos.
Principles
- Ego and exo views provide complementary memory cues.
- View-preference conflicts challenge cross-view reasoning.
Method
E^2-Select uses relevance-based budget allocation and per-view k-DPP sampling for training-free frame selection in synchronized ego-exo videos.
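The two stages named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`allocate_budget`, `greedy_kdpp`, `select_frames`), the proportional budget split, and the quality-times-similarity kernel are all assumptions made for illustration. Exact k-DPP sampling is also replaced here by a greedy log-determinant selection, a common deterministic stand-in, and the cross-view temporal consistency step is omitted because the summary does not describe it. Inputs are assumed to be per-frame feature vectors (non-zero) and positive query-relevance scores for each view.

```python
import numpy as np

def allocate_budget(rel_ego, rel_exo, total_k, min_per_view=1):
    """Split total_k frames between views in proportion to summed relevance.

    Assumption: a simple proportional rule with a per-view floor.
    """
    mass_ego, mass_exo = float(np.sum(rel_ego)), float(np.sum(rel_exo))
    share_ego = mass_ego / (mass_ego + mass_exo)
    k_ego = int(round(share_ego * total_k))
    # Keep at least min_per_view frames for each view.
    k_ego = min(max(k_ego, min_per_view), total_k - min_per_view)
    return k_ego, total_k - k_ego

def greedy_kdpp(features, relevance, k):
    """Pick k relevant, mutually diverse frames by greedy log-det maximization.

    This approximates the mode of a k-DPP; it does not draw a random sample.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T  # cosine similarity between frames
    # DPP kernel: relevance acts as per-frame quality, similarity as repulsion.
    kernel = relevance[:, None] * sim * relevance[None, :]
    kernel = kernel + 1e-6 * np.eye(len(relevance))  # jitter for stability
    selected = []
    for _ in range(min(k, len(relevance))):
        best, best_logdet = None, -np.inf
        for i in range(len(relevance)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return sorted(selected)  # frame indices in temporal order

def select_frames(feat_ego, rel_ego, feat_exo, rel_exo, total_k):
    """Allocate the frame budget across views, then select within each view."""
    k_ego, k_exo = allocate_budget(rel_ego, rel_exo, total_k)
    return greedy_kdpp(feat_ego, rel_ego, k_ego), greedy_kdpp(feat_exo, rel_exo, k_exo)
```

In this sketch the view with the larger total relevance receives more of the frame budget, which is one plausible way to handle view asymmetry, while the determinant term discourages picking near-duplicate frames within a view.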
In practice
- Use E^2-Select for dual-view video retrieval.
- Integrate ego and exo views for comprehensive reasoning.
Topics
- EgoExoMem Benchmark
- Cross-View Memory Reasoning
- Egocentric-Exocentric Videos
- E^2-Select Frame Selection
- Multimodal Large Language Models
Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.