UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Robotics & Autonomous Systems, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

UMI-3D is a multimodal extension of the Universal Manipulation Interface (UMI) designed for robust and scalable data collection in embodied manipulation. It integrates a lightweight, low-cost LiDAR sensor into the wrist-mounted interface, addressing the original UMI's limitations with monocular visual SLAM, which was vulnerable to occlusions, dynamic scenes, and tracking failures. UMI-3D employs LiDAR-centric SLAM for accurate metric-scale pose estimation, even under challenging conditions. The system features a hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework, aligning visual observations with LiDAR point clouds to produce consistent 3D representations of demonstrations. This enhancement significantly improves data quality and reliability, enabling the learning of tasks previously challenging or infeasible for vision-only UMI, such as large deformable object manipulation and articulated object operation. The entire hardware and software stack is open-sourced.

Key takeaway

For research scientists developing embodied manipulation policies, UMI-3D provides a more robust and scalable data collection system than vision-only UMI. Consider adopting LiDAR-centric SLAM where monocular visual SLAM breaks down (occlusions, dynamic scenes, tracking failures), so that demonstrations can be collected reliably in complex, real-world environments. This makes it possible to learn policies for deformable or articulated objects that were challenging or infeasible with the original setup, which in turn can improve policy performance and generalization.

Key insights

Integrating LiDAR into a wrist-mounted manipulation interface significantly enhances data quality and task reliability for embodied robot learning.

Principles

Self-contained pose estimation is critical for portable deployment.
Multimodal sensing requires strict spatiotemporal consistency.
Localization accuracy must be stable across diverse conditions.
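
As a rough illustration of the temporal half of "spatiotemporal consistency", the sketch below resamples timestamped odometry positions at camera frame times. It is a minimal example under stated assumptions, not UMI-3D's pipeline: the function name `interpolate_position` and the data layout are hypothetical, the two streams are assumed to already share one clock (which hardware synchronization would provide), and only position is interpolated. Orientation would need a rotation-aware scheme such as slerp.

```python
from bisect import bisect_left


def interpolate_position(pose_times, positions, t):
    """Linearly interpolate a 3D position at time t (hypothetical helper).

    pose_times must be sorted ascending; positions[i] is the (x, y, z)
    sample taken at pose_times[i]. Returns None when t lies outside the
    covered interval, so the caller can drop that frame rather than
    extrapolate a pose.
    """
    if not pose_times or t < pose_times[0] or t > pose_times[-1]:
        return None
    i = bisect_left(pose_times, t)  # first index with pose_times[i] >= t
    if pose_times[i] == t:
        return tuple(positions[i])
    t0, t1 = pose_times[i - 1], pose_times[i]
    alpha = (t - t0) / (t1 - t0)  # fraction of the way from sample i-1 to i
    p0, p1 = positions[i - 1], positions[i]
    return tuple(a + alpha * (b - a) for a, b in zip(p0, p1))


if __name__ == "__main__":
    # Assumed example data: odometry at 2 Hz, one camera frame at t = 0.25 s.
    times = [0.0, 0.5, 1.0]
    pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 2.0, 0.0)]
    print(interpolate_position(times, pos, 0.25))
```

Returning None outside the covered interval is a deliberate choice here: a demonstration frame without a trustworthy pose is usually better discarded than paired with an extrapolated one.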

Method

UMI-3D uses a wrist-mounted LiDAR-visual sensor suite with hardware synchronization and a two-stage calibration procedure. It employs LiDAR-inertial odometry based on an error-state iterated Kalman filter (ESIKF) for robust state estimation, and stores demonstrations in a Zarr-based replay buffer for diffusion policy training.
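
To make the replay buffer idea concrete, here is a minimal in-memory sketch of a concatenated-episode layout, similar in spirit to the convention used in the public Diffusion Policy code (per-key arrays holding all steps back to back, plus an index of episode end positions). This summary does not describe UMI-3D's actual schema, so the class `EpisodeBuffer` and the keys `eef_pos` and `gripper_width` are hypothetical, and plain Python lists stand in for Zarr arrays.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeBuffer:
    """In-memory stand-in for a concatenated-episode store (hypothetical).

    Every key in `data` holds the steps of all episodes back to back;
    `episode_ends[i]` is the exclusive end index of episode i.
    """
    data: dict = field(default_factory=dict)
    episode_ends: list = field(default_factory=list)

    def add_episode(self, episode):
        lengths = {len(steps) for steps in episode.values()}
        if len(lengths) != 1:
            raise ValueError("every key must have the same number of steps")
        if self.data and set(episode) != set(self.data):
            raise ValueError("episode keys must match the existing buffer")
        for key, steps in episode.items():
            self.data.setdefault(key, []).extend(steps)
        start = self.episode_ends[-1] if self.episode_ends else 0
        self.episode_ends.append(start + lengths.pop())

    def get_episode(self, index):
        if not 0 <= index < len(self.episode_ends):
            raise IndexError("episode index out of range")
        start = self.episode_ends[index - 1] if index > 0 else 0
        end = self.episode_ends[index]
        return {key: steps[start:end] for key, steps in self.data.items()}


if __name__ == "__main__":
    buf = EpisodeBuffer()
    buf.add_episode({"eef_pos": [(0, 0, 0), (1, 0, 0)],
                     "gripper_width": [0.08, 0.07]})
    buf.add_episode({"eef_pos": [(2, 0, 0), (3, 0, 0), (4, 0, 0)],
                     "gripper_width": [0.05, 0.04, 0.03]})
    print(buf.episode_ends)
    print(buf.get_episode(1)["gripper_width"])
```

Storing episodes contiguously with an end-index array keeps variable-length demonstrations in flat arrays, which is convenient for chunked on-disk formats and for sampling fixed-length windows during training.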

In practice

Use a wide-FoV fisheye camera (185°) for broad visual context.
Employ ArUco markers for continuous gripper width tracking.
Represent actions as relative SE(3) trajectories for robustness.
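
The last recommendation can be sketched as follows: each absolute end-effector pose is re-expressed in the frame of a reference pose (for example, the current one) via T_rel = T_ref^{-1} T_abs. This is a minimal illustration, not the authors' code. The function names are hypothetical, poses are assumed to be 4x4 homogeneous matrices stored as nested lists, and the helper `pose_z` only builds test poses rotated about the z axis.

```python
import math


def invert_se3(T):
    """Invert a rigid transform [[R, t], [0, 1]] as [[R^T, -R^T t], [0, 1]]."""
    R_t = [[T[c][r] for c in range(3)] for r in range(3)]  # transpose of R
    t = [T[r][3] for r in range(3)]
    new_t = [-sum(R_t[r][c] * t[c] for c in range(3)) for r in range(3)]
    return [R_t[r] + [new_t[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(A, B):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def to_relative(reference, trajectory):
    """Express each absolute pose in the frame of `reference`."""
    ref_inv = invert_se3(reference)
    return [matmul4(ref_inv, T) for T in trajectory]


def pose_z(yaw, x, y, z):
    """Test helper: rotation by `yaw` radians about z, translation (x, y, z)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


if __name__ == "__main__":
    current = pose_z(math.pi / 2, 1.0, 2.0, 0.0)
    target = pose_z(math.pi / 2, 1.0, 3.0, 0.0)
    relative = to_relative(current, [target])[0]
    # Translation of the target as seen from the current gripper frame.
    print([round(relative[r][3], 6) for r in range(3)])
```

Because the relative pose depends only on how the target sits with respect to the reference, applying the same world-frame transform to both leaves it unchanged. That invariance to the choice of global origin is one reason relative trajectories tend to be more robust than absolute ones.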

Topics

Universal Manipulation Interface
LiDAR-centric SLAM
Multimodal Sensing
Embodied Manipulation
Diffusion Policy

Best for: Research Scientist, Robotics Engineer, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.