Pose6DAug: Physically Plausible Multi-view Object Swapping for Robot Data Augmentation
Summary
Pose6DAug is a failure-driven data augmentation framework designed to improve the performance of vision-language-action (VLA) policies on novel, out-of-distribution objects. VLA policies often struggle with new objects, and each failure typically calls for expensive multi-view teleoperation data collection. Pose6DAug addresses this by turning a policy's successful episodes into targeted demonstrations for its failure modes, removing the need to collect new data. Successful episodes already contain physically valid action trajectories and calibrated multi-view observations, so the method swaps only the manipulated object and preserves the original trajectory, producing new, physically grounded demonstrations. Unlike 2D video editing, Pose6DAug operates directly in 3D: an explicit mesh driven by a temporally coherent 6D pose trajectory keeps renderings geometrically consistent across all camera views. Fine-tuning VLA policies on Pose6DAug-augmented data yields a 16.5% relative increase in success rate on novel objects without compromising in-distribution performance, which points to a practical route to scalable VLA generalization.
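Because the reported gain is relative rather than absolute, its size in percentage points depends on the baseline success rate, which this summary does not state. The sketch below shows the arithmetic only. The 40% baseline is a hypothetical value chosen for illustration, not a number from the paper, and `relative_increase` is an illustrative helper name.

```python
def relative_increase(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical baseline success rate, NOT taken from the paper.
baseline_rate = 0.40
# Apply the reported 16.5% relative gain to that hypothetical baseline.
improved_rate = baseline_rate * (1 + 0.165)

# Absolute gain in success rate (as a fraction, i.e. percentage points / 100).
absolute_gain = improved_rate - baseline_rate

print(f"relative gain: {relative_increase(baseline_rate, improved_rate):.3f}")
print(f"absolute gain: {absolute_gain:.3f}")
```

Under that assumed baseline, a 16.5% relative increase corresponds to roughly 6.6 percentage points. A different baseline would give a different absolute gain.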
Key takeaway
For Machine Learning Engineers developing robot manipulation policies, Pose6DAug offers a scalable way to improve generalization to novel objects. Consider integrating this 3D object swapping framework to augment your VLA training datasets, especially when new data collection is costly or slow. The reported gain is a 16.5% relative increase in success rate on out-of-distribution objects, with in-distribution performance preserved and less reliance on extensive teleoperation.
Key insights
Pose6DAug improves VLA policy generalization by swapping the manipulated object in successful robot episodes in a physically plausible, 3D-consistent way.
Principles
- Successful episodes encode valid action trajectories.
- 3D object swapping ensures multi-view consistency.
- Failure-driven augmentation targets specific weaknesses.
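The first two principles can be sketched as a data transformation: the action trajectory of a successful episode is reused unchanged, and only the per-camera frames are re-rendered with the new object. This is a minimal sketch, not the authors' implementation. `Episode`, `swap_object`, and the `rerender` callback are hypothetical names, and updating the language instruction is an assumption that the summary does not confirm.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Episode:
    instruction: str                  # language command given to the policy
    actions: Tuple[tuple, ...]        # per-step robot actions
    views: Dict[str, Tuple[str, ...]] # camera name -> per-step frames (placeholders here)


def swap_object(episode: Episode, old_name: str, new_name: str,
                rerender: Callable[[str, int, str], str]) -> Episode:
    """Return a new episode with re-rendered frames and the original actions."""
    new_views = {
        cam: tuple(rerender(cam, t, frame) for t, frame in enumerate(frames))
        for cam, frames in episode.views.items()
    }
    # Assumption: the instruction is rewritten to name the new object.
    new_instruction = episode.instruction.replace(old_name, new_name)
    # `actions` is deliberately left untouched: the trajectory is preserved.
    return replace(episode, instruction=new_instruction, views=new_views)


# Toy usage with string placeholders standing in for images.
source = Episode(
    instruction="pick up the mug",
    actions=((0.0, 0.1), (0.0, 0.2), (0.1, 0.2)),
    views={"front": ("f0", "f1", "f2"), "wrist": ("w0", "w1", "w2")},
)
augmented = swap_object(
    source, "mug", "stapler",
    rerender=lambda cam, t, frame: f"{frame}+stapler",
)
print(augmented.instruction)
print(augmented.actions == source.actions)
```

In a real pipeline the `rerender` callback would be the 3D rendering step. It is a string stub here so the sketch stays runnable.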
Method
Pose6DAug operates in 3D: it represents the target object as an explicit mesh driven by a temporally coherent 6D pose trajectory, so the renderings used for data augmentation stay geometrically consistent across camera views.
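To illustrate why a single posed mesh gives multi-view consistency, the sketch below moves a few mesh vertices along a 6D pose trajectory (rotation plus translation per timestep) and projects the same 3D points into every calibrated camera with a standard pinhole model. It is a geometry-only sketch under stated assumptions, not the authors' renderer. All function names, the camera parameters, and the toy vertices are hypothetical, and a real pipeline would also handle shading, occlusion, and compositing.

```python
import math


def rot_z(theta):
    """3x3 rotation matrix about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def transform(R, t, p):
    """Apply the rigid transform p -> R p + t to a 3D point."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def project(K, R_cam, t_cam, p_world):
    """Pinhole projection of a world point; K = (fx, fy, cx, cy)."""
    x, y, z = transform(R_cam, t_cam, p_world)  # world -> camera frame
    if z <= 0:
        return None  # point is behind the camera
    fx, fy, cx, cy = K
    return (fx * x / z + cx, fy * y / z + cy)


def project_trajectory(mesh_vertices, pose_trajectory, cameras):
    """Project the posed vertices into every camera at every timestep."""
    frames = []
    for R_obj, t_obj in pose_trajectory:
        # One set of world-space points per timestep, shared by all views.
        world = [transform(R_obj, t_obj, v) for v in mesh_vertices]
        frames.append({
            name: [project(K, R, t, p) for p in world]
            for name, (K, R, t) in cameras.items()
        })
    return frames


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (500.0, 500.0, 320.0, 240.0)  # hypothetical intrinsics

# Hypothetical calibrated cameras given as world-to-camera extrinsics.
cameras = {
    "front": (K, IDENTITY, [0.0, 0.0, 0.0]),
    "side": (K, IDENTITY, [-0.5, 0.0, 0.0]),
}
mesh_vertices = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]  # toy two-vertex "mesh"
pose_trajectory = [
    (rot_z(0.0), [0.0, 0.0, 2.0]),
    (rot_z(math.pi / 2), [0.0, 0.0, 2.0]),
]

frames = project_trajectory(mesh_vertices, pose_trajectory, cameras)
for t, frame in enumerate(frames):
    print(t, frame)
```

Because every view is computed from the same world-space points, the object cannot drift between cameras the way independently edited 2D frames can.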
In practice
- Augment VLA training data with novel objects.
- Improve robot manipulation success rates.
- Reduce the need for new teleoperation data.
Topics
- Robotics
- Data Augmentation
- Vision-Language-Action Policies
- 6D Pose Estimation
- Object Swapping
- Robot Manipulation
Best for: Computer Vision Engineer, Research Scientist, Robotics Engineer, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.