Geometric Action Model for Robot Policy Learning

2026-06-15 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

The Geometric Action Model (GAM) is a language-conditioned manipulation policy that builds 3D geometric reasoning directly into robot policy learning. Current vision-language-action models (VLAs) and video world-action models (WAMs) operate primarily on 2D image frames or latent spaces; GAM instead repurposes a pretrained geometric foundation model (GFM) as a single substrate for perception, temporal prediction, and action decoding. The architecture splits the GFM in two. The shallow layers encode observations, and a causal future predictor forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted tokens then pass through the remaining GFM blocks, which produce both future geometry and actions. Because the design modifies the GFM only minimally, it preserves the model's geometric priors. The result is reported to be more accurate, more robust, faster, and lighter than existing foundation-model-scale baselines across simulation and real-robot manipulation benchmarks.

Key takeaway

Robotics engineers developing manipulation policies should consider adding explicit 3D geometric reasoning to overcome the limitations of 2D-based models. A GAM-style approach, which repurposes a pretrained geometric foundation model, can yield more accurate, robust, and efficient robot control for contact-rich tasks. Because it reuses existing geometric priors with minimal architectural changes, it may also shorten development and improve real-world performance.

Key insights

GAM integrates 3D geometric reasoning into robot policy learning by repurposing a pretrained geometric foundation model for perception, prediction, and action.

Principles

Repurpose GFMs for unified perception and action.
Explicit 3D geometry improves manipulation policies.
Minimal architectural changes preserve priors.

Method

GAM splits a pretrained GFM into two parts: shallow layers encode observations, a causal predictor forecasts future latent tokens from language, proprioception, and action history, and the remaining GFM blocks decode those tokens into future geometry and actions.
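
The data flow described here can be illustrated with a small toy sketch. It is not the authors' implementation: the "blocks" are simple scaling functions standing in for pretrained transformer layers, and every name (make_block, split_gfm, predict_future, gam_step), the split index, the averaging predictor, and both output heads are hypothetical choices made only to show the order of operations: encode with shallow blocks, predict future latents from past latents plus conditioning, then decode with the remaining blocks.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]
Tokens = List[Token]
Block = Callable[[Tokens], Tokens]


def make_block(scale: float) -> Block:
    # Hypothetical stand-in for one pretrained GFM block (a transformer layer in practice).
    def block(tokens: Tokens) -> Tokens:
        return [[scale * x for x in tok] for tok in tokens]
    return block


def run(blocks: Sequence[Block], tokens: Tokens) -> Tokens:
    # Apply a stack of blocks in order.
    for block in blocks:
        tokens = block(tokens)
    return tokens


def split_gfm(blocks: Sequence[Block], split_at: int) -> Tuple[List[Block], List[Block]]:
    # Shallow blocks act as the observation encoder; the remaining blocks act as the decoder.
    return list(blocks[:split_at]), list(blocks[split_at:])


def predict_future(latent_history: List[Tokens], condition: Token) -> Tokens:
    # Toy causal predictor (assumption): averages only the latents seen so far and
    # shifts them by one conditioning vector that stands in for language,
    # proprioception, and action history. Assumes condition width == token width.
    steps = len(latent_history)
    n_tokens = len(latent_history[0])
    dim = len(condition)
    return [
        [sum(step[i][d] for step in latent_history) / steps + condition[d]
         for d in range(dim)]
        for i in range(n_tokens)
    ]


def gam_step(blocks: Sequence[Block], split_at: int,
             observations: List[Tokens], condition: Token) -> Tuple[List[float], List[float]]:
    encoder, decoder = split_gfm(blocks, split_at)
    latent_history = [run(encoder, obs) for obs in observations]
    future_latents = predict_future(latent_history, condition)
    features = run(decoder, future_latents)
    # Toy heads (assumption): one scalar per token as "future geometry",
    # and the per-dimension token mean as the "action".
    geometry = [sum(tok) for tok in features]
    width = len(features[0])
    action = [sum(tok[d] for tok in features) / len(features) for d in range(width)]
    return geometry, action


blocks = [make_block(2.0), make_block(0.5), make_block(3.0)]
observations = [
    [[1.0, 0.0], [0.0, 1.0]],  # earlier frame: two tokens of width 2
    [[3.0, 0.0], [0.0, 3.0]],  # latest frame
]
geometry, action = gam_step(blocks, split_at=1, observations=observations,
                            condition=[0.5, 0.5])
print(geometry, action)
```

The point of the sketch is structural: the same pretrained stack is reused on both sides of the predictor, so the only newly introduced component is the predictor itself, which is how the design keeps changes to the GFM minimal.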

In practice

Apply GFMs for language-conditioned robot tasks.
Use split-layer architecture for temporal prediction.
Improve contact-rich manipulation accuracy.

Topics

Robot Policy Learning
Geometric Foundation Models
3D Geometric Reasoning
Manipulation Robotics
Vision-Language-Action Models
Temporal Prediction

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.