Physical Intelligence Team Unveils MEM for Robots: A Multi-Scale Memory System Giving Gemma 3-4B VLAs 15-Minute Context for Complex Tasks
Summary
Multi-Scale Embodied Memory (MEM) is a dual-track memory architecture that lets Vision-Language-Action (VLA) models, specifically π0.6 initialized from Gemma 3-4B, execute complex robotic tasks lasting up to 15 minutes. The system splits memory into two components. A short-term video encoder uses space-time separable attention to process roughly one minute of dense visual history while keeping inference within a ~380ms real-time budget. A long-term, language-based memory lets a high-level policy maintain a compressed semantic summary of past events. The separable attention brings the encoder's attention cost down to O(Kn^2+nK^2). Together, the two tracks help the robot cope with partial observability and support in-context learning, for example improving door-opening success rates by 62% after initial failures, while matching the dexterity of current memoryless policies.
Key takeaway
For AI scientists developing VLA models for complex robotic systems, MEM offers an architectural approach to extending task horizons and improving adaptability. It gives a policy up to 15 minutes of context and raises success rates on tasks that require adapting after failure, such as a 62% improvement in door opening. Consider this dual-track memory design when memoryless policies limit long-horizon, real-world robot behavior.
Key insights
MEM provides VLA models with dual-track memory for long-horizon robotic tasks, enabling adaptation and real-time performance.
Principles
- Factorize memory into short-term visual and long-term semantic tracks.
- Maintain real-time inference with space-time separable attention.
- Enable in-context adaptation through semantic memory.
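The first principle can be pictured with a small data structure. This is a minimal sketch under stated assumptions, not the published implementation: the class name `DualTrackMemory`, its methods, and the window size are hypothetical, and appending events to a string stands in for the high-level policy rewriting its compressed summary.

```python
from collections import deque


class DualTrackMemory:
    """Illustrative two-track memory: recent frames plus a text summary."""

    def __init__(self, max_frames=4):
        # Short-term track: a bounded window of recent observations.
        # max_frames is a hypothetical value chosen for the example.
        self.recent_frames = deque(maxlen=max_frames)
        # Long-term track: a compact natural-language record of past events.
        self.summary = ""

    def observe(self, frame):
        # Oldest frames fall out automatically once the window is full.
        self.recent_frames.append(frame)

    def update_summary(self, event):
        # Stand-in for a learned summarizer: simply append the new event.
        self.summary = f"{self.summary}; {event}" if self.summary else event

    def context(self):
        # What a policy would condition on at the current step.
        return {"recent_frames": list(self.recent_frames), "summary": self.summary}


mem = DualTrackMemory(max_frames=2)
for frame in ["f0", "f1", "f2"]:
    mem.observe(frame)
mem.update_summary("opened drawer")
mem.update_summary("door pull failed")
ctx = mem.context()
```

The point of the split is that the short-term track stays bounded in size while the long-term track grows only as text, so the context a policy sees does not scale with the full length of the episode.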
Method
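MEM pairs a short-term video encoder, which applies space-time separable attention to dense visual history, with a long-term language-based memory that holds semantic summaries. The separable attention costs O(Kn^2+nK^2); the form of the two terms suggests K frames of n patch tokens each, compared with O(n^2K^2) for joint attention over all Kn tokens.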
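To make the complexity claim concrete, here is a minimal NumPy sketch of space-time separable attention. It assumes K is the number of frames and n the number of patch tokens per frame, which the summary does not state explicitly. The helper names are hypothetical, projections are omitted, and details such as causal masking over time are not modeled, so this illustrates the factorization rather than MEM's actual encoder.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Single-head self-attention with identity projections (illustrative).

    x has shape (batch, tokens, dim); attention runs over the tokens axis.
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x


def space_time_separable_attention(frames):
    """frames: (K, n, d) = K frames, n patch tokens per frame, d channels."""
    # Spatial pass: each frame attends over its own n patches (K * n^2 scores).
    spatial = self_attention(frames)
    # Temporal pass: each patch position attends over K frames (n * K^2 scores).
    temporal = self_attention(np.swapaxes(spatial, 0, 1))  # (n, K, d)
    return np.swapaxes(temporal, 0, 1)  # back to (K, n, d)


def attention_score_counts(K, n):
    """Number of pairwise attention scores: joint vs. separable."""
    joint = (K * n) ** 2
    separable = K * n**2 + n * K**2
    return joint, separable


out = space_time_separable_attention(np.ones((4, 6, 8)))
joint, separable = attention_score_counts(K=4, n=6)
```

For K=4 and n=6 the joint form computes 576 pairwise scores and the separable form 240; the gap widens as either the frame count or the patch count grows.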
In practice
- Integrate MEM with VLA models initialized from Gemma 3-4B, such as π0.6.
- Apply MEM for complex, multi-step robotic manipulation.
- Utilize MEM for tasks requiring adaptation to failures.
Topics
- Multi-Scale Embodied Memory
- Vision-Language-Action Models
- Robotics
- Gemma 3-4B
- Memory Architectures
Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.