MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism
Summary
MemDreamer is a framework that improves the ability of Vision-Language Models (VLMs) to understand hours-long videos by decoupling perception from reasoning. Submitted on June 5, 2026, this plug-and-play system addresses token explosion and attention dilution by streaming the video incrementally to construct a Hierarchical Graph Memory. The memory has a top-down, three-tier architecture for semantic abstraction, anchored by a foundational graph that captures spatiotemporal and causal relationships. During inference, MemDreamer uses an agentic, tool-augmented retrieval mechanism that navigates the memory hierarchy and its logical edges through an Observation-Reason-Action loop. The framework achieves state-of-the-art results on four mainstream benchmarks, narrowing the performance gap with human experts to 3.7 points. It also constrains the reasoning context window to only 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain.
Key takeaway
Machine Learning Engineers developing Vision-Language Models for long video analysis should consider architectures that decouple perception from reasoning. MemDreamer's approach, which combines hierarchical graph memory with agentic retrieval, delivers a 12.5-point absolute accuracy gain while constraining the reasoning context window to 2% of full-context ingestion. This suggests that prioritizing agentic capabilities and structured memory systems can substantially improve VLM performance on hours-long content, narrowing the gap with human experts to 3.7 points.
Key insights
MemDreamer decouples perception and reasoning for long video understanding using hierarchical graph memory and agentic retrieval, achieving state-of-the-art results on four mainstream benchmarks.
Principles
- Decouple perception and reasoning for long video tasks.
- Hierarchical graph memory aids semantic abstraction.
- Agentic capability scales multimodal comprehension.
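To make the second principle concrete, here is a minimal sketch of what a hierarchical graph memory could look like. It is not MemDreamer's implementation: the class names, the tier numbering (0 for the foundational graph, 2 for the most abstract level), the relation labels "before" and "caused_by", and the toy clip summaries are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    node_id: str
    tier: int          # assumed numbering: 0 = foundational graph, 2 = most abstract
    summary: str
    span: tuple        # (start_seconds, end_seconds)
    children: list = field(default_factory=list)


class HierarchicalGraphMemory:
    """Hypothetical container: tiered summaries plus typed edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = []            # (src_id, relation, dst_id)
        self._last_clip = None

    def add_clip(self, node_id, summary, span):
        # Incremental step: add a tier-0 node and a temporal edge
        # from the previously streamed clip.
        self.nodes[node_id] = MemoryNode(node_id, 0, summary, span)
        if self._last_clip is not None:
            self.edges.append((self._last_clip, "before", node_id))
        self._last_clip = node_id

    def link(self, src, relation, dst):
        # Add a typed logical edge, e.g. a causal relation.
        self.edges.append((src, relation, dst))

    def abstract(self, node_id, summary, child_ids, tier):
        # Create a higher-tier node whose time span covers its children.
        spans = [self.nodes[c].span for c in child_ids]
        span = (min(s for s, _ in spans), max(e for _, e in spans))
        self.nodes[node_id] = MemoryNode(node_id, tier, summary, span, list(child_ids))

    def neighbors(self, node_id, relation=None):
        # Follow outgoing edges, optionally filtered by relation type.
        return [d for s, r, d in self.edges
                if s == node_id and (relation is None or r == relation)]


memory = HierarchicalGraphMemory()
memory.add_clip("clip1", "a cat jumps onto the table", (0.0, 10.0))
memory.add_clip("clip2", "a glass falls and breaks", (10.0, 20.0))
memory.link("clip2", "caused_by", "clip1")
memory.abstract("seg1", "a cat knocks over a glass", ["clip1", "clip2"], tier=1)
memory.abstract("video", "an afternoon in a kitchen", ["seg1"], tier=2)
```

In this sketch the upper tiers hold only short summaries and child pointers, so a reasoner can start from the coarse level and descend only where a question requires detail.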
Method
MemDreamer incrementally builds a Hierarchical Graph Memory from video streams. Inference uses agentic tool-augmented retrieval, navigating memory hierarchies and logical edges via an Observation-Reason-Action loop.
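The retrieval loop described above can be sketched roughly as follows. This is an illustrative toy, not the paper's code: the tool names `expand` and `follow`, the dictionary-based memory, the sample summaries, and the scripted `toy_policy` (standing in for a VLM's reasoning step) are all assumptions.

```python
# Toy memory with hypothetical content: hierarchy links plus one causal edge.
CHILDREN = {"video": ["seg1", "seg2"], "seg1": ["clip1", "clip2"], "seg2": ["clip3"]}
EDGES = {("clip3", "caused_by"): ["clip1"]}
SUMMARIES = {
    "clip1": "a cat jumps onto the table",
    "clip2": "a person reads a book",
    "clip3": "a glass falls and breaks",
}


def expand(node_id):
    """Tool: descend one level in the memory hierarchy."""
    return CHILDREN.get(node_id, [])


def follow(node_id, relation):
    """Tool: traverse a typed logical edge from a node."""
    return EDGES.get((node_id, relation), [])


TOOLS = {"expand": expand, "follow": follow}


def run_loop(policy, start="video", max_steps=8):
    observation = [start]
    context = []                        # only the nodes actually retrieved
    for _ in range(max_steps):
        for node_id in observation:     # Observation: record what the last action returned
            if node_id not in context:
                context.append(node_id)
        action = policy(context)        # Reason: decide the next step
        if action[0] == "answer":
            return action[1], context
        tool, *args = action            # Action: call the chosen tool
        observation = TOOLS[tool](*args)
    return None, context


def toy_policy(context):
    # Scripted stand-in for a VLM answering "why did the glass break?".
    if "clip1" in context:
        return ("answer", SUMMARIES["clip1"])
    if "clip3" in context:
        return ("follow", "clip3", "caused_by")
    if "seg2" in context:
        return ("expand", "seg2")
    return ("expand", "video")


answer, visited = run_loop(toy_policy)
```

Note that `clip2` never enters the context in this run: the loop reaches the answer by descending the hierarchy and following one causal edge, which is the intuition behind keeping the reasoning context small.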
In practice
- Achieve a 12.5-point absolute accuracy gain.
- Constrain the reasoning context window to 2% of full-context ingestion.
- Reach state-of-the-art results on four mainstream long video benchmarks.
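As a rough sense of scale for the 2% figure, the arithmetic below computes a reasoning budget from a full-context token count. The input numbers (a 2-hour video, 1 frame per second, 256 tokens per frame) are hypothetical and are not taken from the paper; only the 2% ratio comes from the summary above.

```python
def reasoning_budget(full_context_tokens: int, percent: int = 2) -> int:
    """Tokens available to the reasoner at the given percent of full context."""
    return full_context_tokens * percent // 100


# Hypothetical video: 2 hours at 1 frame per second, 256 visual tokens per frame.
frames = 2 * 60 * 60
full_tokens = frames * 256
budget = reasoning_budget(full_tokens)
print(full_tokens, budget)  # 1843200 36864
```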
Topics
- MemDreamer
- Long Video Understanding
- Vision-Language Models
- Hierarchical Graph Memory
- Agentic AI
- Computer Vision
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.