Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, so their representations fail to maintain geometric and spatial consistency across video frames. GeoVR is a framework that learns geometric representations from purely 2D video sequences, restructuring the MLLM's semantic latent space to support spatial intelligence. It distills geometric knowledge from pre-trained 3D foundation models through a multi-objective learning strategy with four geometric targets: inter-frame camera pose estimation, dense depth map regression, metric scale factor prediction, and multi-scale 3D feature distillation. Experiments on spatial reasoning benchmarks such as VSI-Bench show that GeoVR-2B achieves an average score of 69.1, outperforming Qwen3-VL-2B-Instruct (50.3) by 18.8 points and surpassing proprietary models such as GPT-5 (55.0) and other 3D-aware models such as Cambrian-S-7B (67.5), without additional inference overhead.

Key takeaway

For AI scientists and ML engineers building MLLMs for spatial reasoning, GeoVR shows that 3D awareness can be instilled from readily available 2D video data, without depending on scarce 3D datasets. Because the auxiliary geometric heads are used only during training, the method adds no inference overhead, which makes it practical for deployment. Consider adding multi-objective geometric learning to restructure your MLLM's latent space for stronger spatial reasoning.

Key insights

GeoVR restructures an MLLM's internal representations for 3D spatial intelligence using 2D video and distillation from 3D foundation models, without inference overhead.

Principles

MLLMs pre-trained on 2D data lack intrinsic 3D awareness.
Distilling 3D priors into MLLM latent space enhances spatial intelligence.
Multi-objective geometric learning from 2D video bypasses 3D data scarcity.

Method

GeoVR applies a multi-objective learning strategy during training. A frozen 3D foundation model serves as the teacher, providing pseudo-labels for camera poses, depth maps, and metric scale, along with multi-scale geometric features for alignment. The auxiliary heads are discarded at inference.
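
The multi-objective strategy can be illustrated with a minimal sketch that combines the four geometric targets into one training loss. This is not the authors' implementation: the summary names only the four targets, so the specific loss forms (L1 for pose and depth, log-ratio for metric scale, cosine distance for feature distillation), the LossWeights values, and all function and key names here are assumptions chosen for illustration. Plain Python lists stand in for tensors so the example runs without any deep learning library.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Hypothetical per-objective weights; the source does not give values."""
    pose: float = 1.0
    depth: float = 1.0
    scale: float = 1.0
    feature: float = 1.0


def l1(pred, target):
    """Mean absolute error between two equal-length sequences of floats."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + eps)


def geometric_loss(student, teacher, w=LossWeights()):
    """Weighted sum of four geometric objectives (assumed loss forms).

    `teacher` holds pseudo-labels from a frozen 3D model; `student` holds the
    predictions of the auxiliary heads, which exist only during training.
    """
    feature_terms = [
        cosine_distance(s, t)
        for s, t in zip(student["features"], teacher["features"])
    ]
    terms = {
        "pose": l1(student["pose"], teacher["pose"]),
        "depth": l1(student["depth"], teacher["depth"]),
        # Log-ratio: predicting 2x too large costs the same as 2x too small.
        "scale": abs(math.log(student["scale"] / teacher["scale"])),
        # Average the distillation term over the feature scales.
        "feature": sum(feature_terms) / len(feature_terms),
    }
    total = (
        w.pose * terms["pose"]
        + w.depth * terms["depth"]
        + w.scale * terms["scale"]
        + w.feature * terms["feature"]
    )
    return total, terms


# Toy pseudo-labels from the (hypothetical) frozen teacher.
teacher = {
    "pose": [0.0, 0.1, 0.0, 0.5, 0.0, 1.0],      # toy relative pose vector
    "depth": [1.0, 2.0, 3.0, 4.0],               # flattened toy depth map
    "scale": 2.0,                                # metric scale factor
    "features": [[1.0, 0.0], [0.0, 1.0, 0.0]],   # one vector per scale
}
# Student matches the teacher except for depth and scale.
student = dict(teacher, depth=[1.5, 2.5, 3.5, 4.5], scale=4.0)

total, terms = geometric_loss(student, teacher)
print(terms, total)
```

Because this loss is computed only from auxiliary-head outputs during training, dropping those heads afterward leaves the inference path of the base model unchanged, which is consistent with the no-overhead claim above.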

In practice

Enhance MLLMs for physical world reasoning.
Improve spatial understanding in dynamic video scenarios.
Develop 3D-aware models without explicit 3D datasets.

Topics

Multimodal Large Language Models
Geometric Representation Learning
3D Foundation Models
Spatial Reasoning
Video Understanding
Depth Estimation

Code references

WHB139426/GeoVR-MLLM

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.