MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Source: cs.CL updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision · Depth: Expert, short

Summary

MemDreamer is a plug-and-play framework that improves the ability of Vision-Language Models (VLMs) to understand hours-long videos by decoupling perception from reasoning. Submitted on June 5, 2026, it addresses token explosion and attention dilution by streaming the video incrementally into a Hierarchical Graph Memory. The memory has a top-down, three-tier architecture for semantic abstraction, anchored by a foundational graph that captures spatiotemporal and causal relationships. At inference time, an agentic, tool-augmented retrieval mechanism navigates the memory hierarchy and its logical edges through an Observation-Reason-Action loop. The framework reports state-of-the-art results on four mainstream benchmarks and narrows the gap to human experts to 3.7 points. It also limits the reasoning context window to 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain.
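To make the idea of an incrementally built, three-tier memory concrete, here is a minimal Python sketch. It is not MemDreamer's implementation: the class and method names, the tier numbering, the fixed `group_size` roll-up rule, and the string-joining "summarizer" are all assumptions made for illustration. In a real system, a VLM would produce the clip captions and the higher-tier summaries.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    tier: int                      # 0 = base graph, 1 = middle tier, 2 = top tier (hypothetical)
    summary: str
    span: tuple[float, float]      # (start_seconds, end_seconds)
    children: list[str] = field(default_factory=list)


@dataclass
class HierarchicalGraphMemory:
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: list[tuple[str, str, str]] = field(default_factory=list)  # (src, relation, dst)
    group_size: int = 2            # assumed: children merged per parent node

    def tier_ids(self, tier):
        """Node ids of one tier, in insertion order."""
        return [i for i, n in self.nodes.items() if n.tier == tier]

    def _add(self, tier, summary, span, children=()):
        node_id = f"t{tier}-{len(self.tier_ids(tier))}"
        self.nodes[node_id] = Node(node_id, tier, summary, span, list(children))
        return node_id

    def link(self, src, relation, dst):
        """Record a typed edge, e.g. a temporal or causal relation."""
        self.edges.append((src, relation, dst))

    def _roll_up(self, tier):
        """Merge unparented nodes of the tier below into one new parent."""
        parented = {c for i in self.tier_ids(tier) for c in self.nodes[i].children}
        free = [i for i in self.tier_ids(tier - 1) if i not in parented]
        if len(free) >= self.group_size:
            kids = free[: self.group_size]
            # Placeholder abstraction: a real system would summarize with a model.
            summary = " | ".join(self.nodes[k].summary for k in kids)
            span = (self.nodes[kids[0]].span[0], self.nodes[kids[-1]].span[1])
            self._add(tier, summary, span, kids)

    def ingest_clip(self, caption, span):
        """Stream in one clip: add a base node, link it in time, update upper tiers."""
        previous = self.tier_ids(0)
        new_id = self._add(0, caption, span)
        if previous:
            self.link(previous[-1], "before", new_id)
        self._roll_up(1)
        self._roll_up(2)
        return new_id


memory = HierarchicalGraphMemory(group_size=2)
for i in range(4):
    memory.ingest_clip(f"clip {i}", (i * 10.0, (i + 1) * 10.0))
```

Because each clip is processed once and only its summaries persist, the memory grows with the number of stored nodes while the raw frames are never held in a reasoning context.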

Key takeaway

Machine Learning Engineers building Vision-Language Models for long video analysis should consider architectures that decouple perception from reasoning. MemDreamer's combination of hierarchical graph memory and agentic retrieval delivers a 12.5-point accuracy gain while cutting the reasoning context window to 2% of full-context ingestion. This suggests that prioritizing agentic capabilities and structured memory can substantially improve VLM performance on hours-long content, narrowing the gap to human experts to 3.7 points.

Key insights

MemDreamer decouples perception from reasoning for long video understanding, combining hierarchical graph memory with agentic retrieval to achieve state-of-the-art results on four mainstream benchmarks.

Method

MemDreamer incrementally builds a Hierarchical Graph Memory from video streams. Inference uses agentic tool-augmented retrieval, navigating memory hierarchies and logical edges via an Observation-Reason-Action loop.
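The retrieval loop described above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the paper's method: the `MEMORY` contents, the two tools (`expand` to descend the hierarchy, `follow` to traverse a logical edge), and the keyword-matching `toy_policy` that stands in for the VLM reasoner are all hypothetical.

```python
import re

# Toy memory: every node has a summary, child nodes, and typed edges.
# All contents are invented for illustration.
MEMORY = {
    "video":  {"summary": "cooking show", "children": ["scene1", "scene2"], "edges": {}},
    "scene1": {"summary": "chef preps vegetables", "children": ["clip1"], "edges": {}},
    "scene2": {"summary": "chef serves soup", "children": ["clip2"], "edges": {}},
    "clip1":  {"summary": "chef chops onions at 00:03", "children": [],
               "edges": {"causes": "clip2"}},
    "clip2":  {"summary": "onion soup is plated at 01:12", "children": [], "edges": {}},
}


def words(text):
    return set(re.findall(r"\w+", text.lower()))


def toy_policy(question, observation):
    """Reason step: choose the next action from what is currently visible.

    A real agent would prompt a VLM here; this stand-in uses word overlap.
    """
    q = words(question)
    best = max(observation, key=lambda n: len(q & words(observation[n]["summary"])))
    node = observation[best]
    if node["children"]:
        return ("expand", best)              # descend one tier
    if "lead" in q and "causes" in node["edges"]:
        return ("follow", best, "causes")    # traverse a causal edge
    return ("answer", node["summary"])


def answer_question(question, root="video", max_steps=8):
    frontier, trace = [root], []
    for _ in range(max_steps):
        observation = {n: MEMORY[n] for n in frontier}    # Observation
        action = toy_policy(question, observation)        # Reason
        trace.append(action)
        if action[0] == "answer":                         # Action
            return action[1], trace
        if action[0] == "expand":
            frontier = MEMORY[action[1]]["children"]
        else:
            frontier = [MEMORY[action[1]]["edges"][action[2]]]
    return None, trace                                    # step budget exhausted


answer, trace = answer_question("What did the chef chopping vegetables lead to?")
```

At each step the reasoner sees only the summaries of the current frontier, not the whole video, which is the intuition behind keeping the reasoning context small. The `max_steps` budget is an added safeguard against cycles in the edge graph.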

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.