Physical Intelligence Team Unveils MEM for Robots: A Multi-Scale Memory System Giving Gemma 3-4B VLAs 15-Minute Context for Complex Tasks

2026-03-04 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, quick

Summary

Multi-Scale Embodied Memory (MEM) is a dual-track memory architecture that lets Vision-Language-Action (VLA) models, specifically π0.6 initialized from Gemma 3-4B, execute complex robotic tasks lasting up to 15 minutes. MEM splits memory into two components. A short-term video encoder uses space-time separable attention to process roughly one minute of dense visual history while keeping inference within a real-time budget of about 380 ms. A long-term, language-based memory lets a high-level policy maintain a compressed semantic summary of past events. The separable attention costs O(Kn^2+nK^2), compared with O(n^2K^2) for joint attention over the same tokens (K and n presumably being the number of frames and the number of patch tokens per frame). With memory, the robot can cope with partial observability and learn in context, for example improving door-opening success rates by 62% after initial failures, while matching the dexterity of current memoryless policies.

Key takeaway

For AI scientists building VLA models for complex robotic systems, MEM is an architectural option for extending task horizons and improving adaptability. The reported results include up to 15 minutes of task context and a 62% improvement in door-opening success after initial failures. Consider a dual-track memory of this kind if memoryless policies are limiting your robots on long or partially observable tasks.

Key insights

MEM provides VLA models with dual-track memory for long-horizon robotic tasks, enabling adaptation and real-time performance.

Principles

Factorize memory into short-term visual and long-term semantic tracks.
Maintain real-time inference with space-time separable attention.
Enable in-context adaptation through semantic memory.
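To make the first and third principles concrete, here is a minimal sketch of a dual-track memory container. It is not the MEM implementation: the class name DualTrackMemory, its methods, and the max_frames parameter are all hypothetical, and plain string concatenation stands in for the summary compression that the high-level policy would perform.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DualTrackMemory:
    """Hypothetical container for a short-term and a long-term memory track."""

    max_frames: int  # short-term window in frames (illustrative, not from the source)
    frames: deque = field(init=False)  # short-term track: recent raw observations
    summary: str = ""  # long-term track: text summary of past events

    def __post_init__(self) -> None:
        # Ring buffer: once full, appending drops the oldest frame.
        self.frames = deque(maxlen=self.max_frames)

    def observe(self, frame: object) -> None:
        """Append the newest camera frame to the short-term track."""
        self.frames.append(frame)

    def record_event(self, event: str) -> None:
        """Fold a finished step or a failure into the long-term summary.

        Assumption: a real system would rewrite and compress the summary
        with a learned policy; concatenation is a stand-in.
        """
        self.summary = f"{self.summary} {event}".strip()

    def policy_inputs(self) -> tuple[list, str]:
        """Return what a policy would condition on: recent frames and summary."""
        return list(self.frames), self.summary


memory = DualTrackMemory(max_frames=3)
for t in range(5):
    memory.observe(f"frame_{t}")
memory.record_event("Pulled the door handle; door did not open.")
memory.record_event("Pushed instead; door opened.")

recent, summary = memory.policy_inputs()
print(recent)   # only the last three frames survive
print(summary)  # both events, including the earlier failure
```

The point of the split is that old frames fall out of the short-term window while the note about the failed attempt remains in the text summary, which is the kind of information in-context adaptation would rely on.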

Method

MEM pairs a short-term video encoder, which applies space-time separable attention to dense visual history, with a long-term language-based memory that holds semantic summaries. The separable attention brings the encoder's attention cost down to O(Kn^2+nK^2).
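The following sketch illustrates the general idea of space-time separable attention and where the two terms of O(Kn^2+nK^2) come from. It is a toy under stated assumptions, not the MEM encoder: attention is single-head with identity projections (no learned weights), the temporal pass is bidirectional, and the function names attend, separable_attention, and score_counts are hypothetical. It also assumes K is the number of frames and n the number of patch tokens per frame, with 16 and 256 chosen purely for illustration.

```python
import math


def attend(tokens: list[list[float]]) -> list[list[float]]:
    """Softmax self-attention over a token list, with identity projections."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return out


def separable_attention(video: list[list[list[float]]]) -> list[list[list[float]]]:
    """video[k][i] is the feature vector of patch i in frame k."""
    num_frames, num_patches = len(video), len(video[0])
    # Spatial pass: patches attend only within their own frame (K * n^2 scores).
    spatial = [attend(frame) for frame in video]
    # Temporal pass: each patch position attends across frames (n * K^2 scores).
    out = [[[] for _ in range(num_patches)] for _ in range(num_frames)]
    for i in range(num_patches):
        track = attend([spatial[k][i] for k in range(num_frames)])
        for k in range(num_frames):
            out[k][i] = track[k]
    return out


def score_counts(num_frames: int, num_patches: int) -> tuple[int, int]:
    """Number of pairwise attention scores: (joint, separable)."""
    joint = (num_frames * num_patches) ** 2
    separable = num_frames * num_patches**2 + num_patches * num_frames**2
    return joint, separable


# Toy video: 4 frames, 3 patches per frame, 2 features per patch.
video = [[[float(k + i), float(k - i)] for i in range(3)] for k in range(4)]
out = separable_attention(video)
print(len(out), len(out[0]), len(out[0][0]))  # 4 3 2

joint, separable = score_counts(num_frames=16, num_patches=256)
print(joint, separable)  # 16777216 1114112
```

With these illustrative sizes, the separable form computes about 15 times fewer attention scores than joint attention over all 4,096 tokens, and the gap widens as either K or n grows.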

In practice

Integrate MEM with Gemma 3-4B VLA models.
Apply MEM for complex, multi-step robotic manipulation.
Utilize MEM for tasks requiring adaptation to failures.

Topics

Multi-Scale Embodied Memory
Vision-Language-Action Models
Robotics
Gemma 3-4B
Memory Architectures

Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.