UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception
Summary
UMI-3D is a multimodal extension of the Universal Manipulation Interface (UMI) designed for robust and scalable data collection in embodied manipulation. It integrates a lightweight, low-cost LiDAR sensor into the wrist-mounted interface, addressing the original UMI's limitations with monocular visual SLAM, which was vulnerable to occlusions, dynamic scenes, and tracking failures. UMI-3D employs LiDAR-centric SLAM for accurate metric-scale pose estimation, even under challenging conditions. The system features a hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework, aligning visual observations with LiDAR point clouds to produce consistent 3D representations of demonstrations. This enhancement significantly improves data quality and reliability, enabling the learning of tasks previously challenging or infeasible for vision-only UMI, such as large deformable object manipulation and articulated object operation. The entire hardware and software stack is open-sourced.
Key takeaway
For research scientists developing embodied manipulation policies, UMI-3D provides a more robust and scalable data collection system than vision-only UMI. Consider adopting LiDAR-centric SLAM where monocular visual SLAM breaks down (occlusions, dynamic scenes, tracking failures), so that data acquisition stays reliable in complex, real-world environments. The resulting demonstrations make it possible to learn policies for deformable or articulated objects, tasks that were previously challenging or infeasible.
Key insights
Integrating LiDAR into a wrist-mounted manipulation interface significantly enhances data quality and task reliability for embodied robot learning.
Principles
- Self-contained pose estimation is critical for portable deployment.
- Multimodal sensing requires strict spatiotemporal consistency.
- Localization accuracy must be stable across diverse conditions.
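As a rough illustration of the spatiotemporal consistency principle, the sketch below pairs camera frames with LiDAR scans by timestamp and maps LiDAR points into the camera frame with an extrinsic transform. It is not UMI-3D's calibration code: the function names, the sample timestamps, and the extrinsic `T_cam_lidar` are hypothetical, and nearest-timestamp matching is only one simple way to pair samples.

```python
import numpy as np

def nearest_indices(query_times, reference_times):
    """For each query time, return the index of the closest reference time.

    Assumes reference_times is sorted and has at least two entries.
    """
    reference_times = np.asarray(reference_times, dtype=float)
    query_times = np.asarray(query_times, dtype=float)
    idx = np.searchsorted(reference_times, query_times)
    idx = np.clip(idx, 1, len(reference_times) - 1)
    left = reference_times[idx - 1]
    right = reference_times[idx]
    # Pick whichever neighbour is closer in time.
    return np.where(query_times - left <= right - query_times, idx - 1, idx)

def lidar_to_camera(points_lidar, T_cam_lidar):
    """Map Nx3 LiDAR-frame points into the camera frame with a 4x4 extrinsic."""
    points_lidar = np.asarray(points_lidar, dtype=float)
    return points_lidar @ T_cam_lidar[:3, :3].T + T_cam_lidar[:3, 3]

# Hypothetical timestamps in seconds.
lidar_times = np.array([0.0, 0.1, 0.2])
camera_times = np.array([0.02, 0.09, 0.26])
matches = nearest_indices(camera_times, lidar_times)

# Hypothetical extrinsic: no rotation, camera origin offset 0.1 m along z.
T_cam_lidar = np.eye(4)
T_cam_lidar[:3, 3] = [0.0, 0.0, 0.1]
points_cam = lidar_to_camera([[1.0, 2.0, 3.0]], T_cam_lidar)
```

With hardware synchronization the timestamps share one clock, so the pairing step mostly absorbs differing sensor rates. The spatial step depends entirely on how well the extrinsic is calibrated.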
Method
UMI-3D uses a wrist-mounted LiDAR-visual sensor suite with hardware synchronization and a two-stage calibration. It employs LiDAR-inertial odometry based on an error-state iterated Kalman filter (ESIKF) for robust state estimation, and stores demonstrations in a Zarr-based replay buffer for diffusion policy training.
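To make the replay buffer idea concrete, here is a minimal in-memory sketch of one common layout: every episode is concatenated along the time axis, and a list of cumulative end indices marks the episode boundaries. This is an assumption for illustration, not a description of UMI-3D's actual storage schema. It uses plain NumPy arrays rather than Zarr, and the class name `EpisodeBuffer` and the keys `pose` and `gripper_width` are hypothetical.

```python
import numpy as np

class EpisodeBuffer:
    """Flat per-key arrays plus cumulative episode end indices (illustrative)."""

    def __init__(self):
        self.data = {}          # key -> array of shape (total_steps, ...)
        self.episode_ends = []  # cumulative step count after each episode

    def add_episode(self, episode):
        lengths = {len(value) for value in episode.values()}
        if len(lengths) != 1:
            raise ValueError("all keys in an episode must have the same length")
        if self.data and set(episode) != set(self.data):
            raise ValueError("episode keys must match the existing buffer keys")
        for key, value in episode.items():
            value = np.asarray(value)
            if key in self.data:
                self.data[key] = np.concatenate([self.data[key], value], axis=0)
            else:
                self.data[key] = value
        previous_end = self.episode_ends[-1] if self.episode_ends else 0
        self.episode_ends.append(previous_end + lengths.pop())

    def get_episode(self, index):
        start = 0 if index == 0 else self.episode_ends[index - 1]
        end = self.episode_ends[index]
        return {key: value[start:end] for key, value in self.data.items()}

buffer = EpisodeBuffer()
# Hypothetical keys: a 7-D pose (position + quaternion) and a scalar gripper width.
buffer.add_episode({"pose": np.zeros((3, 7)), "gripper_width": np.zeros((3, 1))})
buffer.add_episode({"pose": np.ones((2, 7)), "gripper_width": np.ones((2, 1))})
second = buffer.get_episode(1)
```

A flat layout like this keeps sampling fixed-length windows cheap, because an episode boundary is just an index comparison. A chunked on-disk format such as Zarr can hold the same arrays.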
In practice
- Use a wide-FoV fisheye camera (185°) for broad visual context.
- Employ ArUco markers for continuous gripper width tracking.
- Represent actions as relative SE(3) trajectories for robustness.
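The relative SE(3) action representation in the last item can be sketched as follows: each future end-effector pose is re-expressed in the frame of the current pose, so the action no longer depends on where the world origin sits. This is a minimal sketch with hypothetical helper names and example poses, not the UMI-3D implementation.

```python
import numpy as np

def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def invert_pose(T):
    """Invert a rigid transform using the rotation transpose."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

def to_relative(current_pose, future_poses):
    """Express each world-frame future pose in the frame of the current pose."""
    base_inv = invert_pose(current_pose)
    return np.stack([base_inv @ T for T in future_poses])

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Current pose: yawed 90 degrees, located at (1, 2, 0) in the world frame.
current = make_pose(rot_z(np.pi / 2), [1.0, 2.0, 0.0])
# Future pose: same orientation, one metre further along world y.
future = [make_pose(rot_z(np.pi / 2), [1.0, 3.0, 0.0])]
relative = to_relative(current, future)

# Moving both poses by the same world transform leaves the relative action unchanged.
world_shift = make_pose(rot_z(0.3), [5.0, -2.0, 1.0])
relative_shifted = to_relative(world_shift @ current, [world_shift @ future[0]])
```

In this example the world-frame step along y becomes a one-metre step along the gripper's local x axis with no rotation. The same relative action results after any global shift of the scene.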
Topics
- Universal Manipulation Interface
- LiDAR-centric SLAM
- Multimodal Sensing
- Embodied Manipulation
- Diffusion Policy
Best for: Research Scientist, Robotics Engineer, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.