UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Robotics & Autonomous Systems, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

UMI-3D is a multimodal extension of the Universal Manipulation Interface (UMI) designed for robust, scalable data collection in embodied manipulation. The original UMI relies on monocular visual SLAM, which is vulnerable to occlusions, dynamic scenes, and tracking failures. UMI-3D integrates a lightweight, low-cost LiDAR sensor into the wrist-mounted interface and uses LiDAR-centric SLAM to estimate pose at metric scale, including under those challenging conditions. A hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework align visual observations with LiDAR point clouds, producing consistent 3D representations of demonstrations. The resulting gains in data quality and reliability enable learning of tasks that were challenging or infeasible for vision-only UMI, such as manipulating large deformable objects and operating articulated objects. The entire hardware and software stack is open-sourced.
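To make the temporal half of "spatiotemporal calibration" concrete, here is a minimal sketch of how camera frames might be matched to a LiDAR-derived trajectory once a clock offset is known. It is an illustration under stated assumptions, not the UMI-3D implementation: the function names, the `time_offset` parameter, and the choice to clamp at the trajectory ends are all hypothetical, and only position is interpolated (orientation would need spherical interpolation).

```python
from bisect import bisect_left


def interpolate_position(pose_times, positions, query_time):
    """Linearly interpolate a 3D position at query_time.

    pose_times must be sorted in increasing order, with one position per
    timestamp. Queries outside the trajectory are clamped to the nearest
    endpoint (an assumption of this sketch).
    """
    if not pose_times or len(pose_times) != len(positions):
        raise ValueError("need one position per pose timestamp")
    if query_time <= pose_times[0]:
        return tuple(positions[0])
    if query_time >= pose_times[-1]:
        return tuple(positions[-1])
    hi = bisect_left(pose_times, query_time)  # first sample at or after the query
    lo = hi - 1
    t0, t1 = pose_times[lo], pose_times[hi]
    w = (query_time - t0) / (t1 - t0)
    return tuple(a + w * (b - a) for a, b in zip(positions[lo], positions[hi]))


def align_frames_to_trajectory(frame_times, pose_times, positions, time_offset=0.0):
    """Return one interpolated position per camera frame.

    time_offset (hypothetical) is the camera-to-LiDAR clock offset in
    seconds; it is added to each frame timestamp before the lookup.
    """
    return [
        interpolate_position(pose_times, positions, t + time_offset)
        for t in frame_times
    ]


if __name__ == "__main__":
    pose_times = [0.0, 1.0, 2.0]
    positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
    print(align_frames_to_trajectory([0.5, 1.5], pose_times, positions))
```

With hardware synchronization the residual offset should be small, so in practice a lookup like this mostly serves to resample poses at the camera frame rate.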

Key takeaway

For research scientists developing embodied manipulation policies, UMI-3D provides a robust, scalable data collection system. Consider adopting LiDAR-centric SLAM where visual SLAM breaks down (occlusions, dynamic scenes, tracking loss) to acquire data reliably in complex, real-world environments. This makes it possible to learn policies for deformable or articulated objects that were previously infeasible with vision-only UMI, with corresponding improvements in policy performance and generalization.

Key insights

Integrating LiDAR into a wrist-mounted manipulation interface significantly enhances data quality and task reliability for embodied robot learning.

Principles

Method

UMI-3D uses a wrist-mounted LiDAR-visual sensor suite with hardware synchronization and two-stage calibration. For robust state estimation it runs LiDAR-inertial odometry based on an error-state iterated Kalman filter (ESIKF), and it stores demonstrations in a Zarr-based replay buffer for diffusion policy training.
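The replay buffer layout can be illustrated with a small sketch. Replay buffers used for diffusion policy training commonly concatenate all episodes into flat per-key arrays and keep a list of cumulative episode end indices. The version below mimics that layout with plain Python lists rather than Zarr arrays, so it runs without dependencies. The class name and the keys `eef_pos` and `gripper_width` are hypothetical, and the summary does not specify UMI-3D's actual schema.

```python
class EpisodeBuffer:
    """In-memory stand-in for an episode-indexed replay buffer."""

    def __init__(self):
        self.data = {}          # key -> flat list of per-step values, all episodes
        self.episode_ends = []  # cumulative step count at the end of each episode

    @property
    def n_steps(self):
        return self.episode_ends[-1] if self.episode_ends else 0

    def add_episode(self, episode):
        """Append one episode given as {key: list of per-step values}."""
        lengths = {len(values) for values in episode.values()}
        if len(lengths) != 1:
            raise ValueError("all keys must have the same number of steps")
        (length,) = lengths
        if length == 0:
            raise ValueError("episode must contain at least one step")
        if self.data and set(episode) != set(self.data):
            raise ValueError("episode keys must match the keys already stored")
        new_end = self.n_steps + length
        for key, values in episode.items():
            self.data.setdefault(key, []).extend(values)
        self.episode_ends.append(new_end)

    def get_episode(self, index):
        """Slice one episode back out of the flat storage."""
        if not 0 <= index < len(self.episode_ends):
            raise IndexError("episode index out of range")
        start = self.episode_ends[index - 1] if index > 0 else 0
        end = self.episode_ends[index]
        return {key: values[start:end] for key, values in self.data.items()}


if __name__ == "__main__":
    buf = EpisodeBuffer()
    buf.add_episode({"eef_pos": [(0, 0, 0), (0, 0, 1)], "gripper_width": [0.08, 0.05]})
    buf.add_episode({"eef_pos": [(1, 0, 0)] * 3, "gripper_width": [0.02, 0.02, 0.08]})
    print(buf.episode_ends, buf.get_episode(1)["gripper_width"])
```

Keeping storage flat means a training sampler can draw fixed-length windows by index and use `episode_ends` only to avoid crossing episode boundaries. The same two-part structure maps naturally onto chunked on-disk arrays.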

In practice

Topics

Best for: Research Scientist, Robotics Engineer, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.