Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos
Summary
A latent-action framework addresses the challenge of training generalist Vision-Language-Action (VLA) models from abundant, unlabeled egocentric human manipulation videos. Its Hybrid Disentangled VQ-VAE uses physical masks to decouple motion dynamics from environmental backgrounds, which makes it possible to build a cross-embodiment action codebook. Pre-training the VLM backbone on human videos with this codebook teaches the model representations of action intent. For adaptation to a specific robot embodiment, an intent-perception decoupling strategy has the VLM predict action intent while a separate frozen visual encoder supplies state-specific features, which reduces action hallucinations. Pre-trained exclusively on unlabeled human videos, the method achieves performance competitive with state-of-the-art VLA models trained on massive annotated datasets, while requiring only 50 trajectories for downstream adaptation.
Key takeaway
For robotics engineers developing generalist VLA models, this framework offers a way around robot-data scarcity: abundant unlabeled human egocentric videos can serve as pre-training data, reducing reliance on expensive, high-fidelity robotic datasets. Consider integrating latent-action frameworks and intent-perception decoupling into your VLA development pipeline to cut the number of robot-specific annotations needed for training and deployment.
Key insights
Unlabeled human videos can effectively train VLA models when motion is disentangled from the background and encoded as latent actions that capture action intent.
Principles
- Decoupling motion from background enables cross-embodiment action learning.
- Latent action priors from human videos generalize to robotic tasks.
- Intent-perception decoupling reduces action hallucinations in VLA models.
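The intent-perception split in the last principle can be pictured as a data flow: one branch supplies a discrete action intent, the other supplies features from a visual encoder that is never updated, and a small head combines the two. The sketch below is only an illustration under stated assumptions. The function names, the hand-written "encoder", the embedding table, and the linear head are all hypothetical stand-ins, not the paper's architecture.

```python
# Toy illustration of intent-perception decoupling (all names are hypothetical).

def embed_intent(intent_id, table):
    """Look up the embedding of a discrete intent token (assumed to come from the VLM)."""
    return table[intent_id]

def frozen_encoder(observation):
    """Stand-in for a frozen visual encoder: a fixed function with no trainable state.
    Here it just returns the mean and the value range of the observation."""
    mean = sum(observation) / len(observation)
    return [mean, max(observation) - min(observation)]

def action_head(x, weights, bias):
    """Plain linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def predict_action(intent_id, observation, table, weights, bias):
    """Concatenate the intent embedding with the state features, then map to an action."""
    x = embed_intent(intent_id, table) + frozen_encoder(observation)
    return action_head(x, weights, bias)

# Assumed toy values: 2 intents, a 3-value observation, a 3-dimensional action.
table = [[1.0, 0.0], [0.0, 1.0]]
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 1.0, 0.0],
           [0.0, 0.0, 0.0, 0.5]]
bias = [0.0, 0.0, 0.0]

action = predict_action(1, [0.0, 2.0, 4.0], table, weights, bias)
print(action)
```

Because the state features come from a separate, fixed pathway, only the intent prediction and the head would need to change during adaptation in a design like this; how the actual method trains each part is not specified in this summary.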
Method
Train a Hybrid Disentangled VQ-VAE on human videos to build a cross-embodiment action codebook. Pre-train the VLM backbone with this codebook, then adapt using an intent-perception decoupling strategy.
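To make the codebook idea concrete, here is a minimal sketch of the quantization step at the core of any VQ-VAE: a continuous motion feature is replaced by its nearest codebook entry, and the entry's index acts as the discrete latent action. Several things here are assumptions for illustration only: the raw frame difference stands in for a learned encoder, the binary mask stands in for the "physical masks" (whose exact form this summary does not describe), and the codebook is fixed by hand rather than learned.

```python
from math import dist

def masked_motion(frame_t, frame_t1, mask):
    """Frame difference kept only where mask == 1 (assumed: the moving hand/object region)."""
    return [(b - a) * m for a, b, m in zip(frame_t, frame_t1, mask)]

def quantize(z, codebook):
    """Return (index, code) of the codebook entry nearest to z in Euclidean distance."""
    idx = min(range(len(codebook)), key=lambda k: dist(z, codebook[k]))
    return idx, codebook[idx]

# Toy data: two flattened 4-"pixel" frames.
frame_t  = [0.0, 0.0, 0.5, 0.5]
frame_t1 = [1.0, 0.0, 0.5, 0.9]   # pixel 0: motion of interest; pixel 3: background change
mask     = [1, 1, 0, 0]            # hypothetical mask covering pixels 0 and 1

# Hand-written codebook; in a real VQ-VAE these vectors are learned.
codebook = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]

z = masked_motion(frame_t, frame_t1, mask)
idx, code = quantize(z, codebook)
print(idx, code)
```

The mask zeroes out the background change at pixel 3 before quantization, so the selected code depends only on the masked motion. The index `idx` is the kind of discrete token a VLM backbone could be trained to predict during pre-training.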
In practice
- Utilize unlabeled human egocentric videos for VLA pre-training.
- Adapt the pre-trained VLA model to a robot embodiment with only 50 trajectories.
Topics
- VLA Models
- Latent Action
- Cross-Embodiment Learning
- Egocentric Videos
- Robotics Training
- VQ-VAE
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.