3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance
Summary
3D HAMSTER is a hierarchical framework designed to improve robot manipulation by bridging high-level planning and low-level control in Vision-Language-Action (VLA) models. Existing VLA models often use 2D end-effector trajectories as guidance for downstream policies. When these 2D waypoints are assigned arbitrary depths so that a 3D point cloud-based low-level policy can consume them, the resulting paths are geometrically distorted. 3D HAMSTER addresses this by augmenting a Vision-Language Model (VLM) with a dedicated depth encoder and a dense depth reconstruction objective, enabling it to directly predict metrically reliable 3D waypoint sequences. These 3D trajectories are then passed to a point cloud-based low-level policy. The framework consistently outperforms proprietary VLMs and 2D-guided baselines across 3D trajectory prediction, simulation, and real-world manipulation tasks, with significant gains under appearance-altering shifts and unseen language, spatial, and visual conditions.
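The distortion described above can be illustrated with standard pinhole back-projection. The sketch below is not taken from the paper: the camera intrinsics, pixel waypoints, and depth values are hypothetical numbers chosen only to show how lifting the same 2D waypoints with one arbitrary depth yields a different 3D path than lifting them with per-waypoint metric depth.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a depth in metres to a camera-frame 3D point (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def path_length(points):
    """Sum of straight-line distances between consecutive 3D points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

# Hypothetical intrinsics and waypoints (assumptions for illustration only).
FX = FY = 500.0
CX, CY = 320.0, 240.0
pixels = [(320.0, 240.0), (370.0, 240.0), (420.0, 240.0)]
true_depths = [1.0, 0.8, 0.6]   # the end effector moves toward the camera
ARBITRARY_DEPTH = 1.0           # a single depth assigned to every 2D waypoint

# Path lifted with per-waypoint metric depth.
path_true = [backproject(u, v, d, FX, FY, CX, CY)
             for (u, v), d in zip(pixels, true_depths)]
# Path lifted with the same arbitrary depth for all waypoints.
path_flat = [backproject(u, v, ARBITRARY_DEPTH, FX, FY, CX, CY)
             for u, v in pixels]

# Per-waypoint 3D error introduced by the arbitrary depth.
errors = [math.dist(p, q) for p, q in zip(path_true, path_flat)]

if __name__ == "__main__":
    print("per-waypoint error (m):", [round(e, 3) for e in errors])
    print("path length, metric depth (m):", round(path_length(path_true), 3))
    print("path length, arbitrary depth (m):", round(path_length(path_flat), 3))
```

In this toy setup both paths project to the same pixels, yet they diverge in 3D as soon as the true depth departs from the assigned one. That divergence is the kind of error a point cloud-based policy would inherit from depth-free 2D guidance.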
Key takeaway
For robotics engineers developing manipulation systems, 3D HAMSTER makes the case for moving from 2D to 3D trajectory guidance. Consider augmenting your Vision-Language Models with depth encoders so that they generate metrically reliable 3D waypoint sequences, which directly improves low-level policy performance. This approach reduces geometric distortion in robot movements and improves generalization and robustness, especially under varied visual and spatial conditions.
Key insights
3D HAMSTER improves robot manipulation by generating metrically reliable 3D trajectories directly from a VLM, overcoming 2D guidance limitations.
Principles
- 3D guidance prevents geometric distortion.
- Depth encoders enhance VLM spatial awareness.
- Hierarchical VLA models improve generalization.
Method
Augments a VLM with a depth encoder and a dense depth reconstruction objective so that it predicts 3D waypoint sequences, which are then integrated into a point cloud-based low-level policy.
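This summary does not say how the predicted waypoints are integrated into the low-level policy. As one hypothetical illustration (an assumption, not the authors' method), the waypoints could be appended to the scene point cloud with two extra feature channels: a flag marking waypoint rows and a normalized progress value that encodes their order. The function name `merge_trajectory_into_cloud` and the row layout are invented for this sketch.

```python
def merge_trajectory_into_cloud(scene_points, waypoints):
    """Combine scene points and 3D waypoints into rows of
    (x, y, z, is_waypoint, progress).

    Hypothetical layout for illustration only:
      is_waypoint: 0.0 for scene points, 1.0 for trajectory waypoints
      progress:    0.0 at the first waypoint, 1.0 at the last
    """
    rows = [(x, y, z, 0.0, 0.0) for x, y, z in scene_points]
    n = len(waypoints)
    for i, (x, y, z) in enumerate(waypoints):
        progress = i / (n - 1) if n > 1 else 0.0
        rows.append((x, y, z, 1.0, progress))
    return rows

# Toy inputs in a shared metric frame (values are made up).
scene = [(0.10, 0.00, 0.90), (0.20, 0.05, 0.95)]
plan = [(0.00, 0.00, 1.00), (0.08, 0.00, 0.80), (0.12, 0.00, 0.60)]
cloud = merge_trajectory_into_cloud(scene, plan)
```

Because the waypoints are metric 3D points, they can share a coordinate frame with the scene points, which is what makes a direct merge like this possible. Waypoints given only in 2D would first need a depth assigned to each one.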
In practice
- Integrate depth encoders into VLM planners.
- Use 3D trajectories for robot control.
- Test under appearance shifts for robustness.
Topics
- 3D HAMSTER
- Robot Manipulation
- Vision-Language Models
- Hierarchical Control
- 3D Trajectory Prediction
- Point Cloud Policies
Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.