$μ$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models
Summary
$μ$VLA is a family of OpenVLA-OFT variants built as a controlled study that isolates recurrence in vision-language-action (VLA) models. It augments a transformer with learnable memory tokens that are carried across timesteps and updated via self-attention. The model is trained end-to-end using truncated backpropagation through time (TBPTT), without auxiliary losses or architectural changes. The variants are parameterized by memory width m, TBPTT length K, and memory update rule. In partially observable robotic manipulation on MIKASA-Robo, average success rose from 0.42 to 0.84 on training tasks and reached 0.23 on held-out tasks, versus 0.07 for memoryless baselines. Under full observability, $μ$VLA maintained 96.2% success on LIBERO.
Key takeaway
Robotics engineers developing VLA models for partially observable manipulation should consider integrating minimal in-backbone recurrence. Learnable memory tokens trained with truncated backpropagation through time raised average success on MIKASA-Robo training tasks from 0.42 to 0.84 without a complex architectural overhaul. Tune memory width and TBPTT length for your specific tasks.
Key insights
Adding minimal recurrence to VLA models, studied in isolation from other changes, substantially improves performance in partially observable robotic manipulation.
Principles
- Recurrence can be isolated and effective in VLA models.
- Minimal in-backbone recurrence has a defined capability envelope.
- Partial observability necessitates memory beyond current observations.
Method
$μ$VLA augments a transformer with learnable memory tokens that are updated through self-attention, and it is trained end-to-end with truncated backpropagation through time (TBPTT).
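As a rough illustration of memory tokens that are carried across timesteps and updated through self-attention, the sketch below runs one attention pass over the concatenation of memory and observation tokens and keeps the outputs at the memory positions as the next memory. It is a toy under stated assumptions, not the paper's implementation: it uses plain Python, a single head, and identity query/key/value projections instead of learned weights, and the names `self_attention` and `step` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def step(memory, obs_tokens):
    """One timestep: attend over [memory; observation] jointly.

    Returns (new_memory, obs_features). The outputs at the memory
    positions become the memory passed to the next timestep.
    """
    m = len(memory)  # memory width
    mixed = self_attention(memory + obs_tokens)
    return mixed[:m], mixed[m:]

# Memory width m = 2, token dimension 3 (toy values).
memory = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]
episode = [
    [[1.0, 0.0, 0.0]],  # t = 0: a cue is visible
    [[0.0, 0.0, 1.0]],  # t = 1: the cue is no longer observed
]
for obs_tokens in episode:
    memory, features = step(memory, obs_tokens)
```

After the loop, `memory` still has two tokens of dimension 3, and its values depend on the cue seen at t = 0 even though that cue is absent from the final observation.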
In practice
- Integrate learnable memory tokens into transformer self-attention.
- Train recurrent VLAs with truncated backpropagation through time.
- Tune memory width m and TBPTT length K for VLA performance.
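To make the role of the TBPTT length K concrete, here is a minimal sketch on a scalar linear recurrence h_t = w * h_{t-1} + x_t, standing in for the recurrent memory. Gradients are backpropagated by hand inside each window of K steps; the state value is carried into the next window, but no gradient crosses the window boundary. Everything here is illustrative and assumed (the toy model, the squared-error loss, the learning rate, and the names `make_targets` and `tbptt_train`); in a deep-learning framework the same truncation is typically obtained by detaching the carried state from the computation graph.

```python
def make_targets(xs, w_true):
    """Roll out h_t = w_true * h_{t-1} + x_t to produce toy targets."""
    h, ys = 0.0, []
    for x in xs:
        h = w_true * h + x
        ys.append(h)
    return ys

def tbptt_train(xs, ys, K, w=0.0, lr=0.05, epochs=200):
    """Fit w with truncated backpropagation through time, window length K."""
    for _ in range(epochs):
        h = 0.0  # reset the recurrent state at the start of each episode
        for start in range(0, len(xs), K):
            # Forward pass over one window, storing states for the backward pass.
            hs = [h]  # hs[0] is the carried state, treated as a constant
            for x in xs[start:start + K]:
                hs.append(w * hs[-1] + x)
            # Backward pass restricted to this window.
            grad_w, g_h = 0.0, 0.0
            for i in range(len(hs) - 1, 0, -1):
                g_h += hs[i] - ys[start + i - 1]  # d(0.5 * err^2) / dh_i
                grad_w += g_h * hs[i - 1]         # contribution through w
                g_h *= w                          # propagate to h_{i-1}
            w -= lr * grad_w
            h = hs[-1]  # carry the value forward; the gradient stops here
    return w

xs = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
ys = make_targets(xs, w_true=0.5)
w_fit = tbptt_train(xs, ys, K=4)
print(round(w_fit, 3))  # should be close to the generating value 0.5
```

A larger K lets gradients reach further back in time at the cost of storing more intermediate states per update, which is the trade-off to examine when tuning K alongside the memory width m.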
Topics
- Vision-Language-Action Models
- Robotic Manipulation
- Partial Observability
- Recurrent Memory
- Transformer Networks
- Truncated Backpropagation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.