LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video
Summary
The research introduces LongSpace-Bench, a new room-tour video benchmark designed to evaluate long-horizon spatial memory in Multimodal Large Language Models (MLLMs). This benchmark comprises 445 real-world room-tour videos, totaling approximately 159 hours, and features 4,073 question-answer pairs across tasks like scene perception, spatial relations, and spatial memory. To address the identified limitations in MLLMs, the authors propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace processes long videos as sequential chunks, integrates 3D structural cues into early decoder layers, and builds layer-aware memory for question-guided retrieval. Experiments confirm that LongSpace significantly enhances long-video spatial understanding, highlighting explicit spatial memory as crucial for long-horizon video MLLMs.
Key takeaway
If you are an AI scientist or machine learning engineer developing MLLMs for autonomous systems or embodied AI, prioritize explicit long-horizon spatial memory. Integrating 3D structural cues and hierarchical, query-guided memory mechanisms, as demonstrated by LongSpace, lets your models retain and retrieve critical spatial evidence over extended video observations, moving beyond short-term context limits and improving reasoning in complex, dynamic environments.
Key insights
Long-horizon spatial memory in MLLMs requires integrating 3D structural cues and hierarchical, question-guided memory retrieval.
Principles
- Spatial evidence exhibits structural persistence.
- Geometry-enhanced models improve spatial representations.
- Structured memory is vital for long-term scene information.
Method
LongSpace processes videos in chunks, fuses 3D spatial tokens into decoder layers, and constructs hierarchical KV memory with role-conditioned evidence selection and budget-constrained compression for retrieval.
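The chunk-then-retrieve flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each frame is already a plain feature vector, summarizes every chunk by mean pooling as a simple stand-in for the hierarchical KV memory, and ranks chunks by cosine similarity to a question embedding. The names split_into_chunks, summarize_chunk, build_memory, and retrieve are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def split_into_chunks(frames: Sequence[Vector], chunk_size: int) -> List[Sequence[Vector]]:
    """Split a frame sequence into consecutive chunks of at most chunk_size frames."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def summarize_chunk(chunk: Sequence[Vector]) -> List[float]:
    """Mean-pool frame features into one key vector (assumed stand-in for KV memory)."""
    dim = len(chunk[0])
    return [sum(frame[d] for frame in chunk) / len(chunk) for d in range(dim)]


def build_memory(frames: Sequence[Vector], chunk_size: int) -> List[Tuple[int, List[float]]]:
    """Process the video chunk by chunk, storing one (chunk_id, key) entry per chunk."""
    chunks = split_into_chunks(frames, chunk_size)
    return [(chunk_id, summarize_chunk(chunk)) for chunk_id, chunk in enumerate(chunks)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(memory: List[Tuple[int, List[float]]], query: Vector, top_k: int) -> List[int]:
    """Return the ids of the top_k chunks whose keys best match the question embedding."""
    ranked = sorted(memory, key=lambda entry: cosine(entry[1], query), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:top_k]]


# Toy usage: five 2-D "frames", chunks of two frames, a question embedding of [0, 1].
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
memory = build_memory(frames, chunk_size=2)
best_chunks = retrieve(memory, query=[0.0, 1.0], top_k=2)
```

In a real system the keys would come from the model's decoder layers rather than from mean pooling. The sketch only shows how sequential chunking and question-guided retrieval fit together.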
In practice
- Use 3D geometry encoders for spatial cues.
- Implement layer-aware memory for video chunks.
- Prioritize memory entries by salience and recency.
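One way to read the last recommendation is as a scoring rule applied when memory exceeds a fixed budget. The following Python sketch makes that concrete under stated assumptions. The MemoryEntry type, the linear blend of salience and recency, the exponential half-life decay, and the default weights are all hypothetical choices for illustration, not details taken from LongSpace.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryEntry:
    chunk_id: int    # index of the source chunk in the video (higher = more recent)
    salience: float  # assumed importance score in [0, 1]


def priority(entry: MemoryEntry, newest_chunk_id: int,
             recency_weight: float = 0.5, half_life: float = 4.0) -> float:
    """Blend salience with a recency term that halves every half_life chunks."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    age = newest_chunk_id - entry.chunk_id
    recency = 0.5 ** (age / half_life)
    return (1.0 - recency_weight) * entry.salience + recency_weight * recency


def compress(entries: List[MemoryEntry], budget: int,
             recency_weight: float = 0.5, half_life: float = 4.0) -> List[MemoryEntry]:
    """Keep the budget highest-priority entries, returned in temporal order."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if not entries:
        return []
    newest = max(entry.chunk_id for entry in entries)
    ranked = sorted(
        entries,
        key=lambda entry: priority(entry, newest, recency_weight, half_life),
        reverse=True,
    )
    return sorted(ranked[:budget], key=lambda entry: entry.chunk_id)


# Toy usage: an old but salient entry and the newest entry survive a budget of two.
entries = [
    MemoryEntry(chunk_id=0, salience=0.9),
    MemoryEntry(chunk_id=1, salience=0.1),
    MemoryEntry(chunk_id=2, salience=0.2),
    MemoryEntry(chunk_id=8, salience=0.3),
]
kept = compress(entries, budget=2)
kept_ids = [entry.chunk_id for entry in kept]
```

Raising recency_weight favors the latest chunks, while lowering it preserves salient evidence from early in the video. Which balance works best is an empirical question this sketch does not answer.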
Topics
- Multimodal Large Language Models
- Spatial Memory
- Video Understanding
- 3D Geometry
- Autonomous Driving
- Robotic Navigation
- LongSpace-Bench
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.