Pose6DAug: Physically Plausible Multi-view Object Swapping for Robot Data Augmentation

Source: Machine Learning · Field: Technology & Digital, Robotics & Autonomous Systems, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Pose6DAug is a failure-driven data augmentation framework that improves the performance of vision-language-action (VLA) policies on novel, out-of-distribution objects. VLA policies often struggle with unseen objects, and fixing each failure typically requires expensive multi-view teleoperation data collection. Pose6DAug instead converts a policy's existing successful episodes into targeted demonstrations for its failure modes, so no new teleoperation data needs to be collected. These episodes already contain physically valid action trajectories and calibrated multi-view observations. The method swaps only the manipulated object while preserving the original trajectory, which yields new, physically grounded demonstrations. Unlike 2D video editing, Pose6DAug operates directly in 3D: an explicit mesh driven by a temporally coherent 6D pose trajectory keeps renderings geometrically consistent across all camera views. Fine-tuning VLA policies on Pose6DAug-augmented data yields a 16.5% relative increase in success rate on novel objects without compromising in-distribution performance, suggesting a practical path to scalable VLA generalization.

Key takeaway

For machine learning engineers developing robot manipulation policies, Pose6DAug offers a scalable way to improve generalization to novel objects. Consider integrating this 3D object-swapping framework to augment your VLA training datasets, especially when new data collection is costly or slow. The reported gain is a 16.5% relative increase in success rate on out-of-distribution objects, with in-distribution performance preserved and less reliance on extensive teleoperation.

Key insights

Pose6DAug improves VLA policy generalization through physically plausible 3D object swapping in successful robot episodes.

Method

Pose6DAug operates in 3D, anchoring a target object with an explicit mesh driven by a temporally coherent 6D pose trajectory, ensuring geometrically consistent multi-view renderings for data augmentation.
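The geometric idea behind this consistency can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the pinhole camera model, and the toy values are all assumptions made for illustration. The point it shows is that one object pose per frame, shared by every calibrated camera, places the swapped mesh at mutually consistent pixel locations in all views.

```python
import numpy as np

def project_mesh(vertices, R_obj, t_obj, K, R_cam, t_cam):
    """Project object-frame mesh vertices into one camera's pixel coordinates.

    vertices:     (N, 3) mesh vertices in the object frame
    R_obj, t_obj: object-to-world pose for one frame (3x3 rotation, 3-vector)
    K:            (3, 3) pinhole intrinsics (assumed camera model)
    R_cam, t_cam: world-to-camera extrinsics from calibration
    """
    world = vertices @ R_obj.T + t_obj   # object frame -> world frame
    cam = world @ R_cam.T + t_cam        # world frame -> camera frame
    pix = cam @ K.T                      # apply intrinsics
    return pix[:, :2] / pix[:, 2:3]      # perspective divide

def render_views(vertices, pose_trajectory, cameras):
    """Return pixels[frame][view], reusing the same pose for every view."""
    return [
        [project_mesh(vertices, R, t, *cam) for cam in cameras]
        for (R, t) in pose_trajectory
    ]

# Toy setup (hypothetical values): a single vertex at the object origin,
# two cameras, and a two-frame pose trajectory moving away from the cameras.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
cameras = [
    (K, np.eye(3), np.zeros(3)),                 # reference camera
    (K, np.eye(3), np.array([0.5, 0.0, 0.0])),   # second camera, offset along x
]
vertices = np.zeros((1, 3))
pose_trajectory = [
    (np.eye(3), np.array([0.0, 0.0, 2.0])),
    (np.eye(3), np.array([0.0, 0.0, 4.0])),
]

pixels = render_views(vertices, pose_trajectory, cameras)
for frame_idx, views in enumerate(pixels):
    print(frame_idx, [v.round(2).tolist() for v in views])
```

A full pipeline would also have to rasterize the mesh and composite it into the original frames, which this sketch leaves out. It covers only the shared-pose projection step.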

Best for: Computer Vision Engineer, Research Scientist, Robotics Engineer, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.