LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, long

Summary

The paper introduces LongSpace-Bench, a room-tour video benchmark for evaluating long-horizon spatial memory in Multimodal Large Language Models (MLLMs). The benchmark comprises 445 real-world room-tour videos totaling approximately 159 hours, with 4,073 question-answer pairs spanning tasks such as scene perception, spatial relations, and spatial memory. To address the limitations it identifies in MLLMs, the authors propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace processes a long video as a sequence of chunks, integrates 3D structural cues into the early decoder layers, and builds a layer-aware memory that supports question-guided retrieval. The authors report that LongSpace improves long-video spatial understanding, which they present as evidence that explicit spatial memory is crucial for long-horizon video MLLMs.

Key takeaway

AI scientists and machine learning engineers developing MLLMs for autonomous systems or embodied AI should prioritize explicit long-horizon spatial memory. LongSpace demonstrates one way to do this: integrate 3D structural cues and pair them with hierarchical, question-guided memory. This lets a model retain and retrieve critical spatial evidence over extended video observations instead of relying on short-term context alone, which improves reasoning in complex, dynamic environments.

Key insights

Long-horizon spatial memory in MLLMs requires integrating 3D structural cues and hierarchical, question-guided memory retrieval.

Principles

Method

LongSpace processes videos in chunks, fuses 3D spatial tokens into decoder layers, and constructs hierarchical KV memory with role-conditioned evidence selection and budget-constrained compression for retrieval.
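
The chunk-then-retrieve flow described above can be illustrated with a toy sketch. This is not the authors' implementation: the names `MemoryEntry`, `split_into_chunks`, and `retrieve` are hypothetical, the 2-d vectors stand in for real features, mean pooling stands in for whatever compression LongSpace uses, and cosine similarity stands in for its question-guided scoring. 3D token fusion and role-conditioned evidence selection are left out entirely.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    chunk_id: int   # which video chunk produced this entry
    layer: int      # hypothetical decoder layer index the entry belongs to
    key: list       # toy key vector matched against the question
    value: str      # toy payload standing in for cached values


def split_into_chunks(frames, chunk_size):
    """Split a frame sequence into consecutive, non-overlapping chunks."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(memory, query, layer, budget):
    """Return at most `budget` entries from one layer, best match first."""
    candidates = [m for m in memory if m.layer == layer]
    ranked = sorted(candidates, key=lambda m: cosine(m.key, query), reverse=True)
    return ranked[:budget]


# Toy data: six "frames", each already reduced to a 2-d feature vector.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7], [0.6, 0.8]]
chunks = split_into_chunks(frames, chunk_size=2)

memory = []
for chunk_id, chunk in enumerate(chunks):
    # Assumed compression step: one mean-pooled key per chunk.
    mean_key = [sum(col) / len(chunk) for col in zip(*chunk)]
    # Store a separate entry per layer so retrieval can be layer-aware.
    for layer in (0, 1):
        memory.append(MemoryEntry(chunk_id, layer, mean_key, f"chunk{chunk_id}/layer{layer}"))

query = [0.0, 1.0]  # toy question embedding
selected = retrieve(memory, query, layer=0, budget=2)
selected_chunks = [m.chunk_id for m in selected]
print(selected_chunks)
```

The `budget` argument caps how many entries reach the model per layer, which is the point of the sketch: memory grows with video length, while the retrieved evidence stays bounded.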

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.