$μ$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

$μ$VLA is a family of OpenVLA-OFT variants built as a controlled study that isolates recurrence in vision-language-action (VLA) models. It augments a transformer with learnable memory tokens that are carried across timesteps and updated via self-attention. The model is trained end-to-end with truncated backpropagation through time (TBPTT), without auxiliary losses or architectural changes. The variants are parameterized by memory width m, TBPTT length K, and memory update rule. $μ$VLA significantly improves performance in partially observable robotic manipulation: on MIKASA-Robo, average success rose from 0.42 to 0.84 on training tasks and reached 0.23 on held-out tasks, versus 0.07 for memoryless baselines. Under full observability on LIBERO, it maintained 96.2% success.

Key takeaway

Robotics engineers developing VLA models for partially observable manipulation should consider adding minimal in-backbone recurrence. Learnable memory tokens trained with truncated backpropagation through time raised average success on MIKASA-Robo training tasks from 0.42 to 0.84 without complex architectural overhauls. Tune memory width and TBPTT length for your specific tasks.

Key insights

Minimal recurrence, studied in isolation, significantly improves VLA performance in partially observable robotic manipulation.

Principles

Method

The method augments a transformer with learnable memory tokens that are updated through self-attention, and trains the model end-to-end with truncated backpropagation through time (TBPTT).
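The recurrence described above can be illustrated with a small NumPy sketch. It is not the $μ$VLA implementation: the single attention head, the single layer, the residual update, and all names (`step`, `tbptt_windows`, `rollout`, the shapes `m`, `d`, `K`) are assumptions made for illustration. NumPy has no autograd, so the place where a training framework would cut the gradient is only marked with a comment.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over an (n_tokens, d) array."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def step(memory, obs_tokens, params):
    """One timestep: memory and observation tokens attend to each other.

    Returns the updated memory (first m rows) and the observation features.
    """
    m = memory.shape[0]
    seq = np.concatenate([memory, obs_tokens], axis=0)
    out = seq + self_attention(seq, *params)  # residual update (an assumption)
    return out[:m], out[m:]

def tbptt_windows(episode_len, k):
    """Split an episode into consecutive windows of at most k steps."""
    return [(s, min(s + k, episode_len)) for s in range(0, episode_len, k)]

def rollout(obs_seq, init_memory, params, k):
    """Carry memory across all timesteps, window by window."""
    memory, feats = init_memory, []
    for start, end in tbptt_windows(len(obs_seq), k):
        # With autograd, the memory would be detached here so that
        # gradients flow back through at most k steps (assumed placement).
        memory = memory.copy()
        for t in range(start, end):
            memory, f = step(memory, obs_seq[t], params)
            feats.append(f)
    return memory, feats

# Hypothetical sizes: memory width m, token dimension d, TBPTT length K.
rng = np.random.default_rng(0)
m, d, n_obs, T, K = 4, 8, 6, 10, 4
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)]
init_memory = rng.normal(size=(m, d))  # stands in for learnable memory tokens
obs_seq = rng.normal(size=(T, n_obs, d))
final_memory, feats = rollout(obs_seq, init_memory, params, K)
```

In this sketch the memory width `m` sets how many tokens are carried between steps, and `K` sets how far back a gradient could reach during training, which are the two knobs the summary names.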

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.