MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
Summary
MementoGUI is a plug-in agentic memory framework for Multimodal Large Language Model (MLLM)-based Graphical User Interface (GUI) agents working on long-horizon tasks. Existing agents struggle to maintain task state across many interface transitions and often fall back on raw history replay or text-only memory. MementoGUI introduces MementoCore, a learned controller that manages online memory selection, compression, and retrieval. Working memory selectively preserves task-relevant interface events as textual summaries plus Region-of-Interest (ROI)-level visual evidence, while episodic memory retrieves reusable past trajectories. The framework also includes a scalable data curation pipeline for training memory controllers and MementoGUI-Bench, a new benchmark for evaluating long-horizon GUI agent decision-making. In experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench, MementoGUI is reported to outperform baseline agents.
Key takeaway
For research scientists building GUI agents for complex, multi-step tasks, MementoGUI shows how a learned memory controller can address the limits of raw history replay and text-only memory. Consider adding agentic memory control to manage interaction history, so that critical visual and textual context is preserved without overwhelming the model. The reported results suggest this can improve task completion and agent reliability in long-horizon scenarios.
Key insights
MementoGUI improves long-horizon GUI agents by learning to manage multimodal memory selectively, which limits both information overload and the loss of task-relevant context.
Principles
- Treat long-horizon GUI control as an online memory-control problem.
- Modularize memory control into specialized operators.
Method
MementoGUI uses MementoCore for online memory selection, compression, and retrieval, preserving task-relevant visual and textual evidence in working memory and retrieving past trajectories from episodic memory.
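The select, compress, and retrieve loop described above can be pictured with a small sketch. This is not the MementoCore implementation: the learned controller is replaced here by a simple word-overlap heuristic, and every name (WorkingMemory, EpisodicMemory, MemoryEntry, overlap, the capacity and threshold parameters) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed ROI format: (x0, y0, x1, y1)


def overlap(a: str, b: str) -> float:
    """Jaccard word overlap; a toy stand-in for a learned relevance scorer."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class MemoryEntry:
    step: int
    summary: str                 # compressed textual summary of the event
    roi: Optional[Box] = None    # region of the screenshot worth keeping


@dataclass
class WorkingMemory:
    capacity: int = 4            # hypothetical budget on stored entries
    threshold: float = 0.1       # hypothetical relevance cut-off
    entries: List[MemoryEntry] = field(default_factory=list)

    def select(self, task: str, observation: str) -> bool:
        """Selection: keep only observations that look task-relevant."""
        return overlap(task, observation) >= self.threshold

    def compress(self, step: int, observation: str,
                 roi: Optional[Box]) -> MemoryEntry:
        """Compression: truncate the text and keep only the ROI box."""
        summary = " ".join(observation.split()[:12])
        return MemoryEntry(step, summary, roi)

    def update(self, task: str, step: int, observation: str,
               roi: Optional[Box] = None) -> bool:
        """Run selection then compression; evict the oldest entries."""
        if not self.select(task, observation):
            return False
        self.entries.append(self.compress(step, observation, roi))
        self.entries = self.entries[-self.capacity:]
        return True


@dataclass
class EpisodicMemory:
    trajectories: List[Tuple[str, List[str]]] = field(default_factory=list)

    def add(self, task: str, actions: List[str]) -> None:
        """Store a finished trajectory under its task description."""
        self.trajectories.append((task, actions))

    def retrieve(self, task: str, k: int = 1) -> List[Tuple[str, List[str]]]:
        """Retrieval: return the k stored trajectories closest to the task."""
        ranked = sorted(self.trajectories,
                        key=lambda t: overlap(task, t[0]), reverse=True)
        return ranked[:k]
```

In this sketch an irrelevant observation (for example a cookie banner) never enters working memory, and a new task can reuse the action list of the most similar past task. A learned controller would replace both the overlap score and the truncation rule.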
In practice
- Augment MLLM-based GUI agents with plug-in memory control.
- Utilize ROI-level visual evidence for future decisions.
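One way to read the two points above is as a thin wrapper that sits around an unmodified agent. The sketch below is an assumption about how such a plug-in could be wired, not the paper's interface: MemoryController, RecentNotes, Note, and MemoryAugmentedAgent are hypothetical names, the agent is modelled as a plain callable, and the controller simply keeps the most recent notes instead of learning what to store.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Protocol, Tuple

Box = Tuple[int, int, int, int]  # assumed ROI format: (x0, y0, x1, y1)


@dataclass
class Note:
    text: str                    # short textual record of one step
    roi: Optional[Box] = None    # optional ROI-level visual evidence


class MemoryController(Protocol):
    """Hypothetical plug-in interface: read before acting, write after."""

    def read(self, task: str) -> List[Note]: ...

    def write(self, task: str, observation: str, action: str,
              roi: Optional[Box] = None) -> None: ...


class RecentNotes:
    """Toy controller that keeps only the last `capacity` notes."""

    def __init__(self, capacity: int = 3) -> None:
        self.capacity = capacity
        self.notes: List[Note] = []

    def read(self, task: str) -> List[Note]:
        return list(self.notes)

    def write(self, task: str, observation: str, action: str,
              roi: Optional[Box] = None) -> None:
        self.notes.append(Note(f"{observation} -> {action}", roi))
        self.notes = self.notes[-self.capacity:]


# The base agent is any callable taking (task, observation, memory context).
Agent = Callable[[str, str, List[Note]], str]


class MemoryAugmentedAgent:
    """Wraps an agent so memory is injected without changing the agent."""

    def __init__(self, agent: Agent, memory: MemoryController) -> None:
        self.agent = agent
        self.memory = memory

    def step(self, task: str, observation: str,
             roi: Optional[Box] = None) -> str:
        context = self.memory.read(task)            # memory in
        action = self.agent(task, observation, context)
        self.memory.write(task, observation, action, roi)  # memory out
        return action
```

Because the wrapper depends only on the read and write methods, a different controller (including a learned one) could be swapped in without touching the agent code.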
Topics
- MementoGUI Framework
- Agentic Memory Control
- Long-Horizon GUI Agents
- Multimodal Memory
- MementoCore Controller
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.