MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision · Depth: Expert, short

Summary

MemDreamer is a plug-and-play framework that improves the ability of Vision-Language Models (VLMs) to understand hours-long videos by decoupling perception from reasoning. Submitted on June 5, 2026, it targets token explosion and attention dilution by streaming the video incrementally into a Hierarchical Graph Memory. The memory uses a top-down, three-tier architecture for semantic abstraction, anchored by a foundational graph that captures spatiotemporal and causal relationships. At inference time, an agentic, tool-augmented retrieval mechanism navigates the memory hierarchy and its logical edges through an Observation-Reason-Action loop. The framework reports state-of-the-art results on four mainstream benchmarks, narrowing the gap with human experts to 3.7 points. It also limits the reasoning context window to 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain.

Key takeaway

Machine learning engineers building Vision-Language Models for long video analysis should consider architectures that decouple perception from reasoning. MemDreamer's combination of hierarchical graph memory and agentic retrieval delivers a 12.5-point absolute accuracy gain while cutting the reasoning context window to 2% of full-context ingestion. The result suggests that structured memory and agentic capabilities can substantially improve VLM performance on hours-long content, narrowing the gap with human experts to 3.7 points.

Key insights

MemDreamer decouples perception and reasoning for long video understanding using hierarchical graph memory and agentic retrieval, achieving state-of-the-art results on four mainstream benchmarks.

Principles

Decouple perception and reasoning for long video tasks.
Hierarchical graph memory aids semantic abstraction.
Agentic capability scales multimodal comprehension.
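
To make the second principle concrete, here is a minimal Python sketch of a three-tier memory that grows as clips stream in. It is not the paper's implementation. The class and method names, the tier numbering (0 for the foundational graph, 1 and 2 for more abstract summaries), the "before" and "causes" edge labels, and the fixed-size roll-up rule are all assumptions made for illustration. The joined-caption summary stands in for whatever a perception model would write.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    tier: int                   # assumed numbering: 0 = foundational graph, 2 = most abstract
    text: str                   # caption or summary (would come from a perception model)
    span: tuple[float, float]   # (start_s, end_s) covered by this node
    children: list[str] = field(default_factory=list)


@dataclass
class HierarchicalGraphMemory:
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: list[tuple[str, str, str]] = field(default_factory=list)  # (src, relation, dst)
    group_size: int = 2         # hypothetical: how many nodes roll up into one parent

    def _add(self, tier, text, span, children=()):
        count = sum(n.tier == tier for n in self.nodes.values())
        node_id = f"t{tier}-{count}"
        self.nodes[node_id] = Node(node_id, tier, text, span, list(children))
        return node_id

    def tier_ids(self, tier):
        """Node ids of one tier, in insertion (temporal) order."""
        return [n.node_id for n in self.nodes.values() if n.tier == tier]

    def add_clip(self, caption, start_s, end_s):
        """Stream in one clip: add a tier-0 node, link it in time, roll up summaries."""
        previous = self.tier_ids(0)
        node_id = self._add(0, caption, (start_s, end_s))
        if previous:
            self.edges.append((previous[-1], "before", node_id))
        self._roll_up(0)
        return node_id

    def add_causal_edge(self, cause_id, effect_id):
        self.edges.append((cause_id, "causes", effect_id))

    def _roll_up(self, tier):
        """Once a group of tier-k nodes is complete, summarize it into tier k+1."""
        if tier >= 2:
            return
        ids = self.tier_ids(tier)
        if len(ids) % self.group_size != 0:
            return
        group = ids[-self.group_size:]
        text = " | ".join(self.nodes[i].text for i in group)  # placeholder summary
        span = (self.nodes[group[0]].span[0], self.nodes[group[-1]].span[1])
        self._add(tier + 1, text, span, group)
        self._roll_up(tier + 1)
```

With the default group size, streaming four clips through `add_clip` yields four tier-0 nodes, two tier-1 summaries, and one tier-2 summary spanning the whole input. Only the newest nodes are touched on each call, which is the point of building the memory incrementally.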

Method

MemDreamer incrementally builds a Hierarchical Graph Memory from video streams. Inference uses agentic tool-augmented retrieval, navigating memory hierarchies and logical edges via an Observation-Reason-Action loop.
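
The retrieval side can be sketched the same way. The Python example below walks a toy memory with an Observation-Reason-Action loop: it starts at the most abstract node, expands the most relevant child at each step, then follows a causal edge to answer. Everything in it is an assumption for illustration. The memory contents are made up, the tool names `expand` and `incoming` are hypothetical, and the keyword-overlap `reason` function is a stub standing in for the VLM reasoner, not the paper's policy.

```python
# Toy memory with three tiers; all contents are invented for illustration.
MEMORY = {
    "video":   {"text": "kitchen session: chops onions, wipes eyes, smoking pan",
                "children": ["scene-a", "scene-b"]},
    "scene-a": {"text": "prep: person chops onions and wipes eyes",
                "children": ["ev-1", "ev-2"]},
    "scene-b": {"text": "stove: pan starts smoking", "children": ["ev-3"]},
    "ev-1":    {"text": "person chops onions", "children": []},
    "ev-2":    {"text": "person wipes eyes", "children": []},
    "ev-3":    {"text": "pan starts smoking", "children": []},
}
EDGES = [("ev-1", "causes", "ev-2"), ("ev-2", "before", "ev-3")]  # (src, relation, dst)


def expand(node_id):
    """Tool: descend one level of the hierarchy."""
    return MEMORY[node_id]["children"]


def incoming(node_id, relation):
    """Tool: follow logical edges of one type that point at node_id."""
    return [src for src, rel, dst in EDGES if dst == node_id and rel == relation]


def score(text, terms):
    words = set(text.replace(":", " ").replace(",", " ").split())
    return len(terms & words)


def reason(observation, terms):
    """Stub policy in place of the VLM reasoner: pick the next tool call."""
    best = max(observation, key=lambda nid: score(MEMORY[nid]["text"], terms))
    if MEMORY[best]["children"]:
        return ("expand", best)
    return ("trace_cause", best)


def answer(terms, max_steps=5):
    observation = ["video"]                  # start from the most abstract tier
    trace = []
    for _ in range(max_steps):
        action, node_id = reason(observation, terms)   # Reason over the observation
        trace.append((action, node_id))
        if action == "expand":                         # Action: navigate the hierarchy
            observation = expand(node_id)              # Observation for the next step
        else:                                          # Action: navigate a logical edge
            causes = incoming(node_id, "causes")
            return [MEMORY[c]["text"] for c in causes], trace
    return [], trace
```

Calling `answer({"wipes", "eyes"})` descends from `video` to `scene-a` to `ev-2` and then follows the causal edge back to `ev-1`. The reasoner only ever sees the handful of node texts on its current frontier rather than the whole video, which is the intuition behind keeping the reasoning context small.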

In practice

Achieve a 12.5-point absolute accuracy gain in VLMs.
Reduce the reasoning context window to 2% of full-context ingestion.
Improve VLM performance on long video benchmarks.

Topics

MemDreamer
Long Video Understanding
Vision-Language Models
Hierarchical Graph Memory
Agentic AI
Computer Vision

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.