MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision & Pattern Recognition) · Depth: Expert, quick

Summary

MementoGUI is a plug-in agentic memory framework for Multimodal Large Language Model (MLLM)-based Graphical User Interface (GUI) agents on long-horizon tasks. Existing agents struggle to maintain task state across many interface transitions because they typically rely on raw history replay or text-only memory. MementoGUI introduces MementoCore, a learned controller that manages online memory selection, compression, and retrieval. Working memory selectively preserves task-relevant interface events, including textual summaries and Region-of-Interest (ROI)-level visual evidence, while episodic memory retrieves reusable past trajectories. The framework also includes a scalable data curation pipeline for training memory controllers and MementoGUI-Bench, a new benchmark for evaluating long-horizon GUI agent decision-making. In experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench, MementoGUI is reported to outperform the baselines.

Key takeaway

For research scientists building GUI agents for complex, multi-step tasks, MementoGUI targets the memory limitations that degrade long-horizon performance. Consider adding agentic memory control to manage interaction history, so that critical visual and textual context is preserved without overwhelming the model. The reported results suggest this can improve task completion and agent reliability in long-horizon scenarios.

Key insights

MementoGUI improves long-horizon GUI agents by learning to selectively manage multimodal memory, preventing information overload and loss.

Method

MementoGUI uses MementoCore for online memory selection, compression, and retrieval, preserving task-relevant visual and textual evidence in working memory and retrieving past trajectories from episodic memory.
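
The select, compress, and retrieve loop described above can be illustrated with a small sketch. This is not the MementoCore implementation: the class names (`WorkingMemory`, `EpisodicMemory`, `MemoryEntry`), the capacity and threshold parameters, and especially the word-overlap `relevance` function are assumptions made for illustration. In the paper the controller is learned; here a simple heuristic stands in for it so the control flow is visible.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MemoryEntry:
    step: int
    summary: str  # compressed textual summary of the interface event
    roi: Optional[Tuple[int, int, int, int]] = None  # hypothetical ROI box (x0, y0, x1, y1)


def relevance(task: str, text: str) -> float:
    """Stand-in for a learned scorer: fraction of task words that appear in text."""
    task_words = set(task.lower().split())
    if not task_words:
        return 0.0
    return len(task_words & set(text.lower().split())) / len(task_words)


@dataclass
class WorkingMemory:
    task: str
    capacity: int = 3        # assumed budget on retained events
    threshold: float = 0.2   # assumed minimum relevance to keep an event
    entries: List[MemoryEntry] = field(default_factory=list)

    def observe(self, step: int, event_text: str,
                roi: Optional[Tuple[int, int, int, int]] = None) -> bool:
        """Selection + compression: keep only task-relevant events, in shortened form."""
        if relevance(self.task, event_text) < self.threshold:
            return False  # selection: drop events unrelated to the task
        summary = " ".join(event_text.split()[:12])  # compression: truncate to 12 words
        self.entries.append(MemoryEntry(step, summary, roi))
        if len(self.entries) > self.capacity:
            # Over budget: evict the entry least relevant to the task.
            worst = min(self.entries, key=lambda e: relevance(self.task, e.summary))
            self.entries.remove(worst)
        return True


@dataclass
class EpisodicMemory:
    # Each episode pairs a past task description with its action trajectory.
    episodes: List[Tuple[str, List[str]]] = field(default_factory=list)

    def add(self, task: str, actions: List[str]) -> None:
        self.episodes.append((task, actions))

    def retrieve(self, task: str, k: int = 1) -> List[Tuple[str, List[str]]]:
        """Retrieval: return the k past episodes whose task best matches the new one."""
        ranked = sorted(self.episodes, key=lambda ep: relevance(task, ep[0]), reverse=True)
        return ranked[:k]


if __name__ == "__main__":
    wm = WorkingMemory(task="book flight to paris", capacity=2)
    wm.observe(0, "cookie banner dismissed")
    wm.observe(1, "typed paris in destination field", roi=(10, 20, 200, 40))
    wm.observe(2, "selected flight to paris at 9am")
    wm.observe(3, "clicked book flight button")
    print([e.step for e in wm.entries])
```

In this toy run the cookie-banner event is never stored, and once the two-entry budget is exceeded the least task-relevant entry is evicted. A learned controller would replace both the threshold test and the eviction rule, and would decide when ROI crops are worth keeping alongside the text.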

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.