$μ$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

2026-06-10 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

$μ$VLA, a family of OpenVLA-OFT variants, is a controlled study that isolates the effect of recurrence in vision-language-action (VLA) models. It augments a transformer with learnable memory tokens that are carried across timesteps and updated via self-attention. The model is trained end-to-end using truncated backpropagation through time (TBPTT), without auxiliary losses or other architectural changes. $μ$VLA is parameterized by memory width m, TBPTT length K, and memory update rule, and it substantially improves performance in partially observable robotic manipulation. On MIKASA-Robo, it raises average success on training tasks from 0.42 to 0.84 and reaches 0.23 on held-out tasks, versus 0.07 for memoryless baselines. Under full observability on LIBERO, it maintains 96.2% success.

Key takeaway

Robotics engineers developing VLA models for partially observable manipulation should consider adding minimal in-backbone recurrence. Learnable memory tokens trained with truncated backpropagation through time raised average success on MIKASA-Robo training tasks from 0.42 to 0.84 without a complex architectural overhaul. Tune memory width m and TBPTT length K for your specific tasks.

Key insights

Adding minimal recurrence to a VLA model, studied in isolation from other changes, substantially improves performance in partially observable robotic manipulation.

Principles

Recurrence can be studied in isolation and is effective on its own in VLA models.
Minimal in-backbone recurrence has a defined capability envelope.
Partial observability necessitates memory beyond current observations.

Method

$μ$VLA augments a transformer with learnable memory tokens that are carried across timesteps and updated through self-attention. The model is trained end-to-end with truncated backpropagation through time (TBPTT).
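
The memory-token mechanism can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the single-head attention, the residual update rule, the names `step`, `rollout`, and `memory_init`, and all dimensions are hypothetical choices made for readability. In a real model the initial memory and the projection weights would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Single-head self-attention over an (n, d) token matrix.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ v

def step(memory, obs_tokens, params):
    # One timestep: memory and observation tokens attend jointly.
    # Assumed update rule: the memory rows of the output become the new memory.
    m = memory.shape[0]
    x = np.concatenate([memory, obs_tokens], axis=0)
    y = x + self_attention(x, *params)  # residual connection
    return y[m:], y[:m]                 # (features for an action head, new memory)

def rollout(observations, memory, params):
    # Carry the memory across timesteps; return the last features and memory.
    features = None
    for obs in observations:
        features, memory = step(memory, obs, params)
    return features, memory

rng = np.random.default_rng(0)
d, m, n_obs = 16, 4, 6                  # token dim, memory width m, obs tokens
params = tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
memory_init = rng.normal(size=(m, d))   # stands in for a learned initial memory

episode = [rng.normal(size=(n_obs, d)) for _ in range(3)]
features, memory = rollout(episode, memory_init, params)
print(features.shape, memory.shape)
```

Because the memory rows are fed back at every step, the features at the final timestep depend on earlier observations even though the final observation alone does not contain them. This is the property that matters under partial observability.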

In practice

Integrate learnable memory tokens into transformer self-attention.
Train recurrent VLAs with truncated backpropagation through time.
Tune memory width m and TBPTT length K for VLA performance.
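
To make the role of the TBPTT length K concrete, the sketch below computes truncated gradients by hand for a toy scalar recurrence h_t = w * h_{t-1} + x_t with a squared-error loss. The toy model and the function name `tbptt_grad` are illustrative assumptions and are not part of $μ$VLA. At each window boundary the carried state keeps its value but is treated as a constant, which is the same effect an autograd framework gives when the carried memory is detached between windows.

```python
def tbptt_grad(xs, ys, w, K, h0=0.0):
    """Per-window gradients of sum_t (h_t - y_t)^2 with respect to w,
    for h_t = w * h_{t-1} + x_t, truncated to windows of K steps."""
    grads = []
    h = h0
    for start in range(0, len(xs), K):
        dh_dw = 0.0  # truncation: carried state is a constant at window start
        g = 0.0
        for x, y in zip(xs[start:start + K], ys[start:start + K]):
            dh_dw = h + w * dh_dw       # d h_t / d w, using h_{t-1}
            h = w * h + x               # forward step
            g += 2.0 * (h - y) * dh_dw  # accumulate d loss_t / d w
        grads.append(g)
    return grads

xs, ys = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
print(tbptt_grad(xs, ys, w=0.5, K=3))  # full BPTT: [10.0]
print(tbptt_grad(xs, ys, w=0.5, K=1))  # one-step windows: [0.0, 3.0, 5.25]
```

With K equal to the episode length the result matches the full gradient (10.0). With K = 1 the windows sum to 8.25, because credit that flows through earlier states is dropped. Larger K recovers more long-range credit at a higher memory and compute cost, which is the trade-off to tune alongside memory width m.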

Topics

Vision-Language-Action Models
Robotic Manipulation
Partial Observability
Recurrent Memory
Transformer Networks
Truncated Backpropagation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.