LARA: Latent Action Representation Alignment for Vision-Language-Action Models

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Latent Action Representation Alignment (LARA) is a framework that improves Vision-Language-Action (VLA) models by addressing the scarcity of real-world robot action data. It jointly optimizes a Latent Action Model (LAM) and a diffusion-based VLA model through explicit alignment of their latent action representations. The optimization is reciprocal: the LAM learns from accurate action trajectories, which reduces the influence of spurious visual changes, while the VLA model is regularized by the LAM's forward dynamics, which mitigates hallucinated trajectories that are functionally ineffective. LARA can serve as a pre-training method, as a post-training enhancement module for existing VLA models, and as a refiner for LAMs. Across three simulation benchmarks and one real-world robotic manipulation benchmark, it achieved average improvements of approximately 10% for full training, 5% for post-training enhancement, and 15% for LAM refinement. The framework's code is publicly available.

Key takeaway

Machine Learning Engineers developing Vision-Language-Action (VLA) models, especially those facing limited robot data or action hallucination, should consider integrating the LARA framework. Its joint optimization of LAM and VLA models improves performance by grounding latent actions and regularizing policies. LARA can be applied for full model pre-training, as a post-training enhancement module for existing VLA models such as GR00T-N1.6, or to refine Latent Action Models for better pseudo-label generation.

Key insights

LARA jointly optimizes Latent Action Models and Vision-Language-Action models via representation alignment for improved robot control.

Principles

Joint optimization grounds inverse visual dynamics to real actions.
Forward dynamics grounding reduces kinematically plausible but incorrect actions.
Optimal alignment depth is architecture-dependent.

Method

LARA combines flow-matching, LAM reconstruction, and a cosine similarity-based representation alignment loss between LAM's continuous latent actions and the VLA model's intermediate DiT features.
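
As a rough illustration of how the three terms might fit together, the sketch below computes a cosine-similarity alignment term and adds it to two precomputed scalar losses. It is not taken from the LARA code. The "1 - cosine similarity" form, the weights w_recon and w_align, and the assumption that the DiT features have already been projected to the latent action dimension are all hypothetical choices made for this example.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector, eps: float = 1e-8) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_a * norm_b, eps)


def alignment_loss(
    latent_actions: Sequence[Vector],
    projected_features: Sequence[Vector],
) -> float:
    """Mean of (1 - cosine similarity) over paired vectors.

    Assumption: each intermediate VLA feature has already been projected
    to the same dimension as its paired LAM latent action.
    """
    if len(latent_actions) != len(projected_features):
        raise ValueError("inputs must contain the same number of vectors")
    if not latent_actions:
        raise ValueError("inputs must not be empty")
    total = sum(
        1.0 - cosine_similarity(z, h)
        for z, h in zip(latent_actions, projected_features)
    )
    return total / len(latent_actions)


def total_loss(
    flow_matching: float,
    lam_reconstruction: float,
    alignment: float,
    w_recon: float = 1.0,  # hypothetical weight, not from the source
    w_align: float = 1.0,  # hypothetical weight, not from the source
) -> float:
    """Weighted sum of the three loss terms named in the Method section."""
    return flow_matching + w_recon * lam_reconstruction + w_align * alignment
```

Under this formulation the alignment term is 0 when every feature points in the same direction as its latent action and grows toward 2 as they point in opposite directions. Vector magnitude does not affect it, so only the direction of the two representations is being matched.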

In practice

Apply LARA for pre-training VLA models.
Enhance pre-trained VLAs post-training.
Refine LAMs for better pseudo-labels.

Topics

Vision-Language-Action Models
Latent Action Models
Representation Alignment
Robotic Manipulation
Diffusion Models
Data Efficiency

Code references

lmy1001/LARA

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.