GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

GEM-4D is a geometry-grounded video world model. It targets a known weakness of existing video world models: they generate futures that look plausible but are geometrically inconsistent, which makes them unreliable for robot manipulation. During training, GEM-4D injects dense 4D correspondence supervision, distilled from a pretrained geometry foundation model, directly into its video generative backbone. The model therefore captures appearance and geometric structure jointly in a single-stream architecture, at no additional inference cost. An inverse dynamics module then translates the geometrically consistent video rollouts into executable 6-DoF robot trajectories for direct deployment. GEM-4D reports state-of-the-art video prediction and geometric consistency in both simulated and realistic environments, and raises manipulation success rates from 61% to 81% on real-world Droid tasks and from 63% to 82% on RLBench.

Key takeaway

For robotics engineers building manipulation systems: if your current video world model produces visually plausible but geometrically inconsistent futures, consider geometry-enhanced distillation. GEM-4D's results indicate that this approach improves the physical grounding of generated video rollouts, with real-world manipulation success rising from 61% to 81%. Using a 4D geometry foundation model as a teacher encourages correspondence-consistent scene evolution, which in turn makes action extraction more reliable and robot control more robust.

Key insights

Distilling 4D geometry foundation model features into video world models ensures correspondence-consistent generation for robot manipulation.

Principles

Geometry supervision regularizes video backbones to encode correspondence-consistent structure.
Internal representations that encode depth, camera pose, and scene flow support correct correspondences across frames.
Asymmetric coupling allows geometry distillation during training with zero inference cost.
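
The zero-inference-cost principle can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `backbone`, the training-only projection head `W["proj"]`, the cosine alignment loss, and the weight `lam` are hypothetical stand-ins, not GEM-4D's actual architecture or objective. The point is only that the teacher features and the projection head appear in the training loss and nowhere in the inference path.

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone(x, W):
    """Toy stand-in for a video backbone (hypothetical, not GEM-4D's network)."""
    h = np.tanh(x @ W["enc"])   # hidden features, shape (N, D_hidden)
    out = h @ W["dec"]          # prediction consumed by the generative loss
    return h, out

def training_loss(x, target, teacher_feat, W, lam=0.5):
    """Generative loss plus an assumed cosine alignment to frozen teacher features."""
    h, out = backbone(x, W)
    gen = float(np.mean((out - target) ** 2))
    proj = h @ W["proj"]        # training-only head mapping into the teacher's space
    num = np.sum(proj * teacher_feat, axis=-1)
    den = np.linalg.norm(proj, axis=-1) * np.linalg.norm(teacher_feat, axis=-1) + 1e-8
    distill = float(np.mean(1.0 - num / den))
    return gen + lam * distill, gen, distill

def inference(x, W):
    """Inference never touches the teacher or W['proj']."""
    _, out = backbone(x, W)
    return out

# Small demo with random data (shapes are arbitrary).
W = {
    "enc": rng.normal(size=(4, 8)),
    "dec": rng.normal(size=(8, 4)),
    "proj": rng.normal(size=(8, 6)),
}
x = rng.normal(size=(5, 4))
target = rng.normal(size=(5, 4))
teacher_feat = rng.normal(size=(5, 6))
total, gen, distill = training_loss(x, target, teacher_feat, W)
prediction = inference(x, W)
```

Because `inference` only reads `W["enc"]` and `W["dec"]`, the projection head and the teacher can be dropped after training without changing the deployed model.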

Method

A dual flow-matching framework distills 4D geometry features into a video backbone via asymmetric conditioning, then an inverse dynamics module extracts executable 6-DoF trajectories from generated rollouts.
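
For readers unfamiliar with flow matching, the sketch below shows the standard linear-path formulation: a noisy sample is interpolated toward the data, and the network regresses the constant velocity between them. This is the generic technique, assumed here for illustration. The summary does not specify GEM-4D's exact paths, losses, or how its two flows are coupled, and the function names are hypothetical.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Linear-path flow matching: interpolated sample x_t and its target velocity."""
    # Broadcast the per-sample time t over all non-batch dimensions.
    t = np.reshape(t, (-1,) + (1,) * (x1.ndim - 1))
    x_t = (1.0 - t) * x0 + t * x1   # x0 is noise, x1 is data
    v_target = x1 - x0              # constant velocity along the straight path
    return x_t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x
```

In a trained model, `velocity_fn` would be the learned network; sampling a rollout then amounts to integrating that velocity field starting from noise.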

In practice

Integrate geometry foundation model representations as correspondence teachers.
Use a dual-criterion confidence-gated tracker for robust robot tracking.
Apply geometry–kinematics pose fallback for unstable pose estimates.
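
The last two items can be sketched together. The summary does not say which two criteria the tracker uses or how the fallback is computed, so the following is a guess at the general pattern rather than the paper's method: a per-frame estimate is accepted only if its confidence is high enough and it does not jump too far from the previous output; otherwise a kinematics-derived position is used instead. The function name, the thresholds `conf_min` and `max_jump`, and the choice of criteria are all hypothetical.

```python
import numpy as np

def fuse_with_fallback(est, conf, fallback, conf_min=0.6, max_jump=0.05):
    """Gate per-frame position estimates on two assumed criteria, else fall back.

    est:      (T, 3) positions from the geometry-based tracker
    conf:     (T,)   tracker confidence per frame
    fallback: (T, 3) positions from kinematics (e.g. forward kinematics)
    """
    est = np.asarray(est, dtype=float)
    fallback = np.asarray(fallback, dtype=float)
    out = np.empty_like(est)
    used_fallback = np.zeros(len(est), dtype=bool)
    for i in range(len(est)):
        confident = conf[i] >= conf_min
        # Criterion 2: bounded displacement from the previous fused output.
        smooth = i == 0 or np.linalg.norm(est[i] - out[i - 1]) <= max_jump
        if confident and smooth:
            out[i] = est[i]
        else:
            out[i] = fallback[i]
            used_fallback[i] = True
    return out, used_fallback

# Frame 2 jumps too far and frame 3 has low confidence, so both fall back.
est = [[0.0, 0, 0], [0.01, 0, 0], [0.5, 0, 0], [0.03, 0, 0]]
conf = [0.9, 0.9, 0.9, 0.2]
kin = [[0.0, 0, 0], [0.01, 0, 0], [0.02, 0, 0], [0.03, 0, 0]]
fused, used = fuse_with_fallback(est, conf, kin)
```

Requiring both criteria means a confident but implausible estimate is still rejected, which is the usual motivation for gating on more than a single score.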

Topics

Robot Manipulation
Video World Models
4D Geometry
Correspondence Learning
Inverse Dynamics
Embodied AI

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.