GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

Source: cs.CV updates on arXiv.org · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

GEM-4D is a geometry-grounded video world model. It targets a weakness of existing video world models: they produce visually plausible but geometrically inconsistent futures, which makes them unreliable for robot manipulation. During training, GEM-4D integrates dense 4D correspondence supervision, distilled from a pretrained geometry foundation model, directly into its video generative backbone. The model thereby captures appearance and geometric structure jointly within a single-stream architecture, at no additional inference cost. An inverse dynamics module then translates the geometrically consistent video rollouts into executable 6-DoF robot trajectories for direct deployment. GEM-4D achieves state-of-the-art video prediction and geometric consistency in both simulated and realistic environments, raising real-world manipulation success from 61% to 81% on Droid tasks and success on RLBench from 63% to 82%.

Key takeaway

For robotics engineers building manipulation systems: if your current video world model produces visually plausible but geometrically inconsistent futures, consider geometry-enhanced distillation. GEM-4D shows that this approach improves the physical grounding of generated video rollouts, raising real-world manipulation success from 61% to 81% on Droid tasks. Integrating a 4D geometry foundation model as a training-time teacher encourages correspondence-consistent scene evolution, which in turn makes action extraction and robot control more reliable.

Key insights

Distilling 4D geometry foundation model features into video world models ensures correspondence-consistent generation for robot manipulation.
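The idea of feature distillation can be illustrated with a small alignment loss. This is a minimal sketch under stated assumptions, not GEM-4D's actual objective: the token shapes, the linear projection `proj`, and the choice of a cosine alignment term are all hypothetical, picked only to show how a student backbone's features can be pulled toward a frozen geometry teacher's features.

```python
import numpy as np

def distillation_loss(student_feats, teacher_feats, proj):
    """Mean (1 - cosine similarity) between projected student and teacher tokens.

    student_feats: (N, Ds) hypothetical video-backbone tokens
    teacher_feats: (N, Dt) hypothetical geometry-teacher tokens (kept frozen)
    proj:          (Ds, Dt) hypothetical projection into the teacher's space
    """
    eps = 1e-8
    mapped = student_feats @ proj
    # Normalize both sets of tokens so only their direction is compared.
    mapped_n = mapped / (np.linalg.norm(mapped, axis=-1, keepdims=True) + eps)
    teacher_n = teacher_feats / (np.linalg.norm(teacher_feats, axis=-1, keepdims=True) + eps)
    cos = np.sum(mapped_n * teacher_n, axis=-1)
    return float(np.mean(1.0 - cos))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(16, 8))   # stand-in for teacher geometry features
student = rng.normal(size=(16, 4))   # stand-in for backbone features
proj = rng.normal(size=(4, 8))       # untrained projection

loss_random = distillation_loss(student, teacher, proj)
loss_aligned = distillation_loss(teacher, teacher, np.eye(8))
```

The loss is near zero when the projected student tokens point in the same direction as the teacher tokens. Because a term like this is only needed during training, the teacher can be dropped at inference, which is consistent with the summary's claim of no additional inference cost.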

Principles

Method

A dual flow-matching framework distills 4D geometry features into a video backbone via asymmetric conditioning, then an inverse dynamics module extracts executable 6-DoF trajectories from generated rollouts.
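For readers unfamiliar with flow matching, the basic training target can be sketched as follows. This is the generic linear-path formulation, offered as an assumption-laden illustration: it does not model the "dual" structure or the asymmetric conditioning described above, and the latent shapes and function names are hypothetical.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Return the interpolated sample x_t and its target velocity.

    x0: noise sample, x1: data sample (same shape), t: (B,) times in [0, 1].
    """
    # Reshape t so it broadcasts over all non-batch dimensions.
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight path from x0 to x1
    v_target = x1 - x0              # constant velocity along that path
    return x_t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((v_pred - v_target) ** 2))

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 3, 8, 8))   # stand-in for a batch of video latents
x0 = rng.normal(size=x1.shape)       # Gaussian noise
t = rng.uniform(size=4)

x_t, v_target = flow_matching_pair(x0, x1, t)
perfect = flow_matching_loss(v_target, v_target)
zero_pred = flow_matching_loss(np.zeros_like(v_target), v_target)
```

In a real system, `v_pred` would come from the video backbone evaluated at `(x_t, t)` and whatever conditioning the model uses; here it is replaced by fixed arrays so the sketch runs on its own.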

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.