LARA: Latent Action Representation Alignment for Vision-Language-Action Models

Source: cs.CV updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Latent Action Representation Alignment (LARA) is a framework designed to enhance Vision-Language-Action (VLA) models by addressing the scarcity of real-world robot action data. LARA jointly optimizes Latent Action Models (LAMs) and diffusion-based VLA models through explicit latent action representation alignment. The optimization is reciprocal: LAMs learn from accurate action trajectories, which reduces the influence of spurious visual changes, while VLA models are regularized by the LAM's forward dynamics, which mitigates hallucinated trajectories that are functionally ineffective. LARA can serve as a pre-training method, as a post-training enhancement module for existing VLA models, and as a refiner for LAMs. Reported average improvements are approximately 10% for full training, 5% for post-training enhancement, and 15% for LAM refinement, across three simulation benchmarks and one real-world robotic manipulation benchmark. The framework's code is publicly available.

Key takeaway

Machine Learning Engineers developing Vision-Language-Action (VLA) models, especially those facing limited robot data or action hallucination, should consider integrating the LARA framework. LARA's joint optimization of LAMs and VLA models improves performance by grounding latent actions and regularizing policies. It can be applied for full model pre-training, as a post-training enhancement module for existing VLA models such as GR00T-N1.6, or to refine Latent Action Models for better pseudo-label generation.

Key insights

LARA jointly optimizes Latent Action Models and Vision-Language-Action models via representation alignment for improved robot control.

Method

LARA combines flow matching, LAM reconstruction, and a cosine-similarity-based representation alignment loss between the LAM's continuous latent actions and the VLA model's intermediate DiT features.
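
To make the alignment term concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `1 - cosine` form of the loss, the weights `w_rec` and `w_align`, and the premise that latent actions and DiT features have already been projected to the same dimension are all hypothetical, since the summary only states that the alignment loss is based on cosine similarity.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)


def alignment_loss(latent_actions, dit_features):
    """Assumed form: 1 minus the mean cosine similarity over paired vectors.

    latent_actions: continuous latent actions from the LAM, one vector per step.
    dit_features:   intermediate DiT features from the VLA model, assumed to be
                    already projected to the same dimension as the latents.
    """
    if len(latent_actions) != len(dit_features):
        raise ValueError("expected one DiT feature vector per latent action")
    sims = [cosine_similarity(z, h) for z, h in zip(latent_actions, dit_features)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(flow_matching, lam_reconstruction, alignment,
               w_rec=1.0, w_align=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return flow_matching + w_rec * lam_reconstruction + w_align * alignment


# Toy example: the first pair points the same way, the second pair is opposed.
z = [[1.0, 0.0], [0.0, 2.0]]
h = [[2.0, 0.0], [0.0, -1.0]]
print(alignment_loss(z, h))  # mean similarity is 0, so the loss is 1.0
print(alignment_loss(z, z))  # identical directions, so the loss is ~0.0
```

Because cosine similarity ignores vector magnitude, a term of this kind only pulls the two representations toward the same direction, which is one reason a learned projection between the two feature spaces is commonly paired with it.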

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.