Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

2026-05-04 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, quick

Summary

A new video augmentation framework efficiently converts simulated vision-language-action (VLA) videos into realistic training videos, addressing the visual domain gap and limited environmental diversity inherent in simulated data. This pipeline extracts structured conditions from simulation using video semantic segmentation and video captioning, rewrites captions to diversify environments, and then synthesizes realistic videos via a conditional video transfer model. To enable practical, large-scale augmentation, the framework incorporates a diffusion feature-reuse mechanism that accelerates generation by reusing video tokens across adjacent timesteps, alongside a coreset sampling strategy for identifying compact, non-redundant subsets under computational constraints. Experiments on Robotwin 2.0, LIBERO, LIBERO-Plus, and a real robotic platform show consistent improvements, including an 8% boost for RDT-1B on Robotwin 2.0 and a 5.1% increase for $\pi_0$ on LIBERO-Plus.
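The feature-reuse idea in the summary can be illustrated with a toy denoising loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the fixed refresh schedule (`refresh_every`), the function `run_with_feature_reuse`, and the scalar stand-ins `toy_backbone` and `toy_head` are all hypothetical. The digest states only that video tokens are reused across adjacent timesteps.

```python
from typing import Callable, Tuple


def run_with_feature_reuse(
    expensive_block: Callable[[float, int], float],
    cheap_head: Callable[[float, float], float],
    x0: float,
    num_steps: int,
    refresh_every: int,
) -> Tuple[float, int]:
    """Run a toy denoising loop, recomputing the expensive features only
    every `refresh_every` steps and reusing the cached value in between.

    Returns the final state and the number of expensive calls made.
    The fixed refresh schedule is an assumption made for illustration.
    """
    if refresh_every < 1:
        raise ValueError("refresh_every must be >= 1")
    x = x0
    cached = None
    expensive_calls = 0
    for t in range(num_steps):
        if cached is None or t % refresh_every == 0:
            # Refresh step: pay for the expensive computation.
            cached = expensive_block(x, t)
            expensive_calls += 1
        # Every step: cheap update that consumes the (possibly stale) features.
        x = cheap_head(x, cached)
    return x, expensive_calls


# Hypothetical scalar stand-ins for a diffusion backbone and its output head.
def toy_backbone(x: float, t: int) -> float:
    return 0.5 * x


def toy_head(x: float, features: float) -> float:
    return x - 0.1 * features


final_state, calls = run_with_feature_reuse(toy_backbone, toy_head, 1.0, 10, 3)
print(f"expensive calls: {calls} of 10 steps")
```

The trade-off shown here is the general one for caching schemes: fewer expensive calls in exchange for slightly stale features between refresh steps.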

Key takeaway

For AI engineers developing VLA models, this framework offers a practical way to narrow the visual domain gap of simulated data. It converts inexpensive simulated videos into realistic training videos while preserving task semantics and action trajectories, and it relies on diffusion feature reuse and coreset sampling to keep large-scale augmentation computationally feasible. Reported gains include 8% for RDT-1B on Robotwin 2.0 and 5.1% for $\pi_0$ on LIBERO-Plus.

Key insights

An efficient video augmentation framework converts simulated VLA data into realistic training videos, preserving task semantics and action trajectories.

Principles

Simulated videos can be transferred into realistic ones to narrow the visual domain gap.
Task semantics and action trajectories must be preserved during transfer.
Generation must be efficient for augmentation to scale.

Method

Extract structured conditions via semantic segmentation and captioning, rewrite captions for diversity, then synthesize realistic videos using a conditional video transfer model with diffusion feature-reuse and coreset sampling.
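One common way to pick a compact, non-redundant subset under a fixed budget is k-center greedy (farthest-point) selection. The digest does not say which coreset algorithm the authors use, so the sketch below swaps in k-center greedy as an illustrative stand-in. The function name `k_center_greedy` and the per-video embedding vectors are assumptions.

```python
import math
from typing import List, Sequence


def k_center_greedy(embeddings: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Select up to `budget` indices so that the chosen items are spread out.

    Each round adds the item farthest from everything already selected.
    This is a generic coreset heuristic, not necessarily the paper's method.
    """
    n = len(embeddings)
    if budget <= 0 or n == 0:
        return []
    budget = min(budget, n)

    selected = [0]  # arbitrary seed: start from the first item
    # Distance from every item to its nearest selected item so far.
    min_dist = [math.dist(e, embeddings[0]) for e in embeddings]

    while len(selected) < budget:
        farthest = max(range(n), key=lambda i: min_dist[i])
        if min_dist[farthest] == 0.0:
            # All remaining items duplicate a selected one; stop early.
            break
        selected.append(farthest)
        for i in range(n):
            d = math.dist(embeddings[i], embeddings[farthest])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Hypothetical 2-D embeddings: two near-duplicate pairs and one outlier.
video_embeddings = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0], [5.0, 5.0]]
chosen = k_center_greedy(video_embeddings, budget=3)
print(chosen)
```

With a budget of three, the selection takes one video from each near-duplicate pair plus the outlier, so generation cost is spent on videos that differ from one another.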

In practice

RDT-1B improves by 8% on Robotwin 2.0.
$\pi_0$ improves by 5.1% on LIBERO-Plus.

Topics

Vision-Language-Action Models
Simulated Data Augmentation
Conditional Video Transfer
Diffusion Feature Reuse
Coreset Sampling

Code references

nanfangxiansheng/Seeing-Realism-from-Simulation

Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.