3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance

2026-06-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

3D HAMSTER is a hierarchical framework for robot manipulation that bridges high-level planning and low-level control in Vision-Language-Action (VLA) models. Existing hierarchical VLA models often pass 2D end-effector trajectories to the downstream policy as guidance. When that low-level policy operates on 3D point clouds, the 2D waypoints have to be assigned arbitrary depths, which geometrically distorts the resulting path. 3D HAMSTER addresses this by augmenting a Vision-Language Model (VLM) with a dedicated depth encoder and a dense depth reconstruction objective, so that the VLM directly predicts metrically reliable 3D waypoint sequences. These 3D trajectories are then passed to a point cloud-based low-level policy. The framework is reported to consistently outperform proprietary VLMs and 2D-guided baselines across 3D trajectory prediction, simulation, and real-world manipulation tasks, with significant gains under appearance-altering shifts and unseen language, spatial, and visual conditions.
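The distortion described above can be illustrated with a standard pinhole back-projection. The sketch below is not taken from the paper: the camera intrinsics, the pixel track, and the depth values are all hypothetical, chosen only to show how lifting 2D waypoints with one arbitrary depth departs from the metrically correct path.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) at a metric depth into camera-frame XYZ (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def path_length(points):
    """Sum of Euclidean distances between consecutive 3D points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

# Hypothetical camera intrinsics (pixels) and a hypothetical 2D waypoint track.
FX = FY = 600.0
CX, CY = 320.0, 240.0
pixels = [(320.0, 240.0), (380.0, 240.0), (440.0, 240.0)]

# Hypothetical ground-truth depths (metres): the gripper moves away from the camera.
true_depths = [0.50, 0.75, 1.00]
# A single arbitrary depth assigned to every 2D waypoint.
guessed_depth = 0.50

true_path = [backproject(u, v, d, FX, FY, CX, CY)
             for (u, v), d in zip(pixels, true_depths)]
flat_path = [backproject(u, v, guessed_depth, FX, FY, CX, CY)
             for (u, v) in pixels]

# Per-waypoint 3D error between the lifted path and the true path.
errors = [math.dist(p, q) for p, q in zip(true_path, flat_path)]

print("per-waypoint error (m):", errors)
print("true path length (m):", path_length(true_path))
print("flat path length (m):", path_length(flat_path))
```

Both paths project to identical pixels, so a 2D metric cannot tell them apart, yet the lifted path is shorter and ends far from the true endpoint. This is the gap that predicting 3D waypoints directly is meant to close.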

Key takeaway

For robotics engineers developing manipulation systems, 3D HAMSTER makes the case for moving from 2D to 3D trajectory guidance. Consider augmenting your Vision-Language Models with depth encoders so that they generate metrically reliable 3D waypoint sequences that a low-level policy can consume directly. The reported benefits are better generalization and robustness, especially under varied visual and spatial conditions, and less geometric distortion in robot movements.

Key insights

3D HAMSTER improves robot manipulation by generating metrically reliable 3D trajectories directly from a VLM, overcoming 2D guidance limitations.

Principles

3D guidance prevents geometric distortion.
Depth encoders enhance VLM spatial awareness.
Hierarchical VLA models improve generalization.

Method

3D HAMSTER augments a VLM with a depth encoder and a dense depth reconstruction objective so that it predicts 3D waypoint sequences, which are then integrated into a point cloud-based low-level policy.
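This summary does not say how the 3D waypoints are fed into the point cloud-based policy. One plausible scheme, shown here purely as an assumption and not as the authors' method, is to append the waypoints to the scene point cloud with two extra channels: a flag marking trajectory points and a normalized progress value along the path. The function name `condition_point_cloud` and the channel layout are hypothetical.

```python
def condition_point_cloud(scene_points, waypoints):
    """Merge scene points and 3D waypoints into rows of
    (x, y, z, is_waypoint, progress).

    Assumed layout, not from the paper: scene points get flag 0.0 and
    progress 0.0; waypoints get flag 1.0 and progress running from 0.0
    at the first waypoint to 1.0 at the last.
    """
    rows = [(x, y, z, 0.0, 0.0) for (x, y, z) in scene_points]
    # Avoid division by zero when only one waypoint is given.
    last_index = max(len(waypoints) - 1, 1)
    for i, (x, y, z) in enumerate(waypoints):
        rows.append((x, y, z, 1.0, i / last_index))
    return rows

# Hypothetical inputs, in metres, in a shared camera or robot frame.
scene = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)]
plan = [(0.0, 0.0, 0.5), (0.0, 0.0, 0.6), (0.0, 0.0, 0.7)]

conditioned = condition_point_cloud(scene, plan)
for row in conditioned:
    print(row)
```

Because the waypoints are already metric 3D points, they can share a coordinate frame with the observed point cloud without any depth guess. That requirement holds whatever conditioning scheme is actually used.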

In practice

Integrate depth encoders into VLM planners.
Use 3D trajectories for robot control.
Test under appearance shifts for robustness.
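A robustness check along these lines could be sketched as follows. This is a minimal, assumed evaluation harness rather than the paper's protocol: `shift_appearance` applies a simple per-channel gain and bias to RGB pixels, and `mean_waypoint_error` is a plain average Euclidean distance between predicted and reference 3D waypoints. Both helper names are hypothetical, and the planner itself is left out.

```python
import math

def shift_appearance(rgb_pixels, gain=(1.0, 1.0, 1.0), bias=0):
    """Apply a per-channel gain and a shared bias to 8-bit RGB pixels.

    Values are rounded and clamped to [0, 255]. Only colour changes;
    scene geometry (depth) is left untouched.
    """
    return [
        tuple(min(255, max(0, round(c * g + bias))) for c, g in zip(px, gain))
        for px in rgb_pixels
    ]

def mean_waypoint_error(predicted, reference):
    """Average Euclidean distance (same units as the inputs) between
    corresponding 3D waypoints."""
    if len(predicted) != len(reference):
        raise ValueError("waypoint sequences must have equal length")
    return sum(math.dist(p, r) for p, r in zip(predicted, reference)) / len(reference)

# Hypothetical image fragment, brightened by a bias of 20.
shifted = shift_appearance([(100, 200, 250)], bias=20)

# Hypothetical planner outputs on the original and shifted images,
# compared against the same reference trajectory (metres).
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred_nominal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred_shifted = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

print("shifted pixels:", shifted)
print("nominal error:", mean_waypoint_error(pred_nominal, reference))
print("shifted error:", mean_waypoint_error(pred_shifted, reference))
```

Comparing the two error values on the same scenes gives a direct measure of how much an appearance change alone degrades the predicted trajectory.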

Topics

3D HAMSTER
Robot Manipulation
Vision-Language Models
Hierarchical Control
3D Trajectory Prediction
Point Cloud Policies

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.