LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

2025-11-18 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, long

Summary

The research introduces LongSpace-Bench, a new room-tour video benchmark designed to evaluate long-horizon spatial memory in Multimodal Large Language Models (MLLMs). The benchmark comprises 445 real-world room-tour videos, totaling approximately 159 hours, and features 4,073 question-answer pairs across tasks such as scene perception, spatial relations, and spatial memory. To address the limitations the benchmark identifies in MLLMs, the authors propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace processes long videos as sequential chunks, integrates 3D structural cues into early decoder layers, and builds layer-aware memory for question-guided retrieval. Experiments show that LongSpace significantly enhances long-video spatial understanding, highlighting explicit spatial memory as crucial for long-horizon video MLLMs.

Key takeaway

AI scientists and machine learning engineers developing MLLMs for autonomous systems or embodied AI should prioritize explicit long-horizon spatial memory. LongSpace demonstrates one way to do this: integrate 3D structural cues and pair them with hierarchical, query-guided memory mechanisms. This lets models retain and retrieve critical spatial evidence over extended video observations, moving beyond the limits of short-term context and improving reasoning in complex, dynamic environments.

Key insights

Long-horizon spatial memory in MLLMs requires integrating 3D structural cues and hierarchical, question-guided memory retrieval.

Principles

Spatial evidence exhibits structural persistence.
Geometry-enhanced models improve spatial representations.
Structured memory is vital for long-term scene information.

Method

LongSpace processes videos as sequential chunks, fuses 3D spatial tokens into early decoder layers, and constructs a hierarchical KV memory. For retrieval, it applies role-conditioned evidence selection and budget-constrained compression.
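
The chunk-then-retrieve flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: `chunk_frames`, `LayerMemory`, and the cosine scoring are hypothetical stand-ins, the keys are toy vectors rather than real decoder KV states, and 3D token fusion is omitted entirely. It only shows the general shape of writing one entry per chunk into a per-layer store and retrieving the best matches for a question vector.

```python
from dataclasses import dataclass, field
from math import sqrt


def chunk_frames(frames, chunk_size):
    """Split a frame sequence into consecutive, non-overlapping chunks."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class LayerMemory:
    """Hypothetical per-layer store: layer index -> list of (chunk_id, key vector)."""
    entries: dict = field(default_factory=dict)

    def write(self, layer, chunk_id, key):
        self.entries.setdefault(layer, []).append((chunk_id, key))

    def retrieve(self, layer, query, top_k):
        """Return the chunk ids whose keys best match the query at this layer."""
        scored = [(cosine(key, query), chunk_id)
                  for chunk_id, key in self.entries.get(layer, [])]
        # Highest similarity first; ties broken by earlier chunk id.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [chunk_id for _, chunk_id in scored[:top_k]]


# Usage: three chunks, one toy key each, stored at layer 0.
chunks = chunk_frames(list(range(10)), chunk_size=4)
memory = LayerMemory()
toy_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # assumed, not real features
for chunk_id, key in enumerate(toy_keys):
    memory.write(layer=0, chunk_id=chunk_id, key=key)

question_vector = [0.0, 1.0]  # assumed embedding of the question
relevant_chunks = memory.retrieve(layer=0, query=question_vector, top_k=2)
```

Keeping a separate store per layer is what "layer-aware" is taken to mean here; how LongSpace actually conditions selection on token roles is not specified in this summary.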

In practice

Use 3D geometry encoders for spatial cues.
Implement layer-aware memory for video chunks.
Prioritize memory entries by salience and recency.
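
One way to read the last point is as a ranking under a fixed memory budget. The sketch below is an assumption-laden illustration, not the method from the paper: `MemoryEntry`, `compress`, the `salience` field, and the exponential `decay` factor are all hypothetical, and the priority formula (salience multiplied by `decay ** age`) is just one simple way to combine the two signals.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    chunk_id: int    # index of the video chunk the entry came from
    salience: float  # hypothetical importance score, higher is more important


def compress(entries, current_chunk, budget, decay=0.9):
    """Keep at most `budget` entries, ranked by salience * decay ** age."""
    if budget <= 0:
        return []

    def priority(entry):
        age = current_chunk - entry.chunk_id  # chunks elapsed since the entry was written
        return entry.salience * decay ** age

    kept = sorted(entries, key=priority, reverse=True)[:budget]
    # Restore temporal order so downstream code sees entries chronologically.
    return sorted(kept, key=lambda entry: entry.chunk_id)


# Usage: four entries, room for two.
entries = [
    MemoryEntry(chunk_id=0, salience=0.9),
    MemoryEntry(chunk_id=1, salience=0.2),
    MemoryEntry(chunk_id=2, salience=0.5),
    MemoryEntry(chunk_id=3, salience=0.3),
]
kept = compress(entries, current_chunk=3, budget=2)
kept_ids = [entry.chunk_id for entry in kept]
```

With these toy numbers, the old but highly salient entry from chunk 0 outranks the most recent entry, which shows why recency alone is a poor eviction rule for spatial evidence that stays valid over time.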

Topics

Multimodal Large Language Models
Spatial Memory
Video Understanding
3D Geometry
Autonomous Driving
Robotic Navigation
LongSpace-Bench

Code references

ShiqiangLang/LongSpace

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.