Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
Summary
A new video augmentation framework efficiently converts simulated vision-language-action (VLA) videos into realistic training videos, addressing the visual domain gap and limited environmental diversity inherent in simulated data. This pipeline extracts structured conditions from simulation using video semantic segmentation and video captioning, rewrites captions to diversify environments, and then synthesizes realistic videos via a conditional video transfer model. To enable practical, large-scale augmentation, the framework incorporates a diffusion feature-reuse mechanism that accelerates generation by reusing video tokens across adjacent timesteps, alongside a coreset sampling strategy for identifying compact, non-redundant subsets under computational constraints. Experiments on Robotwin 2.0, LIBERO, LIBERO-Plus, and a real robotic platform show consistent improvements, including an 8% boost for RDT-1B on Robotwin 2.0 and a 5.1% increase for $\pi_0$ on LIBERO-Plus.
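The feature-reuse idea can be illustrated with a toy denoising loop. This is a minimal sketch, not the paper's implementation: the summary only says that video tokens are reused across adjacent timesteps, so the fixed `reuse_interval` schedule, the `expensive_block` callable, and the `0.1` update step are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np

def denoise_with_reuse(tokens, num_steps, expensive_block, reuse_interval=2):
    """Toy denoising loop that caches an expensive block's output.

    Assumption (not from the paper): the block is recomputed every
    `reuse_interval` steps and its cached output is reused in between.
    Returns the final tokens and the number of full recomputations.
    """
    if reuse_interval < 1:
        raise ValueError("reuse_interval must be >= 1")
    tokens = np.asarray(tokens, dtype=float)
    cached = None
    recomputed = 0
    for step in range(num_steps):
        if cached is None or step % reuse_interval == 0:
            cached = expensive_block(tokens, step)  # full computation
            recomputed += 1
        # Cheap update applied at every step, using fresh or cached features.
        tokens = tokens - 0.1 * cached
    return tokens, recomputed

# Hypothetical stand-in for a costly transformer block.
def toy_block(x, step):
    return x

final_tokens, calls = denoise_with_reuse(np.ones((4, 8)), 6, toy_block)
```

With `reuse_interval=2`, the costly block runs on only half of the steps. A real system would also need a rule for deciding when cached features have drifted too far to be reused, which this sketch leaves out.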
Key takeaway
For AI engineers developing VLA models, this framework offers a practical way to work around the limits of simulated data: it converts inexpensive simulated videos into realistic training videos, and uses diffusion feature reuse and coreset sampling to keep augmentation affordable at scale. The reported gains on Robotwin 2.0 and LIBERO-Plus suggest the pipeline is worth evaluating as a way to improve model generalization.
Key insights
An efficient video augmentation framework converts simulated VLA data into realistic training videos, preserving task semantics and action trajectories.
Principles
- Simulated data can be made realistic.
- Task semantics must be preserved.
- Efficiency is key for large-scale augmentation.
Method
Extract structured conditions via semantic segmentation and captioning, rewrite captions for diversity, then synthesize realistic videos using a conditional video transfer model with diffusion feature-reuse and coreset sampling.
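The coreset step can be sketched with greedy k-center selection, a common coreset heuristic. This is an assumption for illustration only: the summary does not say which selection rule the paper uses, and the function name `k_center_greedy`, the per-video feature vectors, and the `budget` parameter are all hypothetical.

```python
import numpy as np

def k_center_greedy(features, budget, seed=0):
    """Pick up to `budget` row indices that cover the feature space.

    Greedy k-center: start from a random item, then repeatedly add the
    item farthest from everything selected so far. Swapped in here as a
    generic coreset heuristic, not the paper's method.
    """
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    if budget <= 0 or n == 0:
        return []
    budget = min(budget, n)
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(n))]
    # Distance from every item to its nearest selected item so far.
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < budget:
        nxt = int(np.argmax(min_dist))  # farthest item from the current set
        selected.append(nxt)
        dist = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return selected

# Hypothetical per-video embeddings: two near-duplicate pairs and one outlier.
video_features = [[0, 0], [0, 0.1], [10, 10], [10, 10.1], [-10, 5]]
subset = k_center_greedy(video_features, budget=3)
```

With a budget of three, the selection keeps one video from each near-duplicate pair plus the outlier. That is the compact, non-redundant behavior the method aims for when the compute budget for video transfer is limited.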
In practice
- RDT-1B improves by 8% on Robotwin 2.0.
- $\pi_0$ improves by 5.1% on LIBERO-Plus.
Topics
- Vision-Language-Action Models
- Simulated Data Augmentation
- Conditional Video Transfer
- Diffusion Feature Reuse
- Coreset Sampling
Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.