Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models
Summary
Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, so their representations fail to stay geometrically and spatially consistent across video frames. GeoVR is a framework that learns geometric representations from purely 2D video sequences and builds spatial intelligence by restructuring the MLLM's semantic latent space. It distills geometric knowledge from pre-trained 3D foundation models through a multi-objective learning strategy with four geometric targets: inter-frame camera pose estimation, dense depth map regression, metric scale factor prediction, and multi-scale 3D feature distillation. On spatial reasoning benchmarks such as VSI-Bench, GeoVR-2B achieves a 69.1 average score, outperforming Qwen3-VL-2B-Instruct (50.3) by 18.8 points and surpassing both proprietary models such as GPT-5 (55.0) and other 3D-aware models such as Cambrian-S-7B (67.5), without additional inference overhead.
Key takeaway
For AI scientists and ML engineers building MLLMs for spatial reasoning, GeoVR shows that 3D awareness can be instilled using readily available 2D video data rather than scarce 3D datasets. Because the geometric supervision is applied only during training, the method adds no inference overhead, which makes it practical for deployment. Consider adding multi-objective geometric learning to your training recipe to restructure your MLLM's latent space for stronger spatial intelligence.
Key insights
GeoVR restructures an MLLM's internal representations for 3D spatial intelligence by training on 2D video and distilling from 3D foundation models, without adding inference overhead.
Principles
- MLLMs pre-trained on 2D data lack intrinsic 3D awareness.
- Distilling 3D priors into the MLLM's latent space enhances spatial intelligence.
- Multi-objective geometric learning from 2D video sidesteps the scarcity of 3D data.
Method
GeoVR uses a multi-objective learning strategy during training. A frozen 3D foundation model acts as the teacher, supplying pseudo-labels for camera poses, depth maps, and metric scale, along with multi-scale geometric features that serve as distillation targets. The auxiliary prediction heads are discarded at inference, so the deployed model incurs no extra cost.
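To make the idea of a multi-objective geometric loss concrete, here is a minimal sketch of how four auxiliary terms could be combined into one training signal. It is an illustration under stated assumptions, not the paper's implementation: the function names, the L1 and cosine-distance loss forms, the dictionary layout, and the equal weights are all hypothetical, since the summary does not specify them.

```python
import math


def l1(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def geometric_loss(student, teacher, weights):
    """Weighted sum of four auxiliary losses against teacher pseudo-labels.

    Assumed (hypothetical) layout of `student` and `teacher`:
      "pose":  flat list of inter-frame camera pose parameters
      "depth": flat list of per-pixel depth values
      "scale": single metric scale factor
      "feats": list of feature vectors, one per scale
    """
    terms = {
        "pose": l1(student["pose"], teacher["pose"]),
        "depth": l1(student["depth"], teacher["depth"]),
        "scale": abs(student["scale"] - teacher["scale"]),
        # Average the feature-alignment term over all scales.
        "feat": sum(
            cosine_distance(s, t)
            for s, t in zip(student["feats"], teacher["feats"])
        ) / len(teacher["feats"]),
    }
    total = sum(weights[name] * value for name, value in terms.items())
    return total, terms


# Toy values standing in for the frozen teacher's outputs (pseudo-labels)
# and the auxiliary heads' predictions on the same video clip.
teacher = {
    "pose": [0.0, 0.1, 0.0, 1.0, 0.0, 0.0],
    "depth": [1.0, 2.0, 3.0, 4.0],
    "scale": 1.5,
    "feats": [[1.0, 0.0], [0.0, 2.0]],
}
student = {
    "pose": [0.0, 0.3, 0.0, 1.0, 0.0, 0.0],
    "depth": [1.0, 2.0, 3.0, 5.0],
    "scale": 1.0,
    "feats": [[1.0, 0.0], [2.0, 0.0]],
}
weights = {"pose": 1.0, "depth": 1.0, "scale": 1.0, "feat": 1.0}

total, terms = geometric_loss(student, teacher, weights)
print(total, terms)
```

In a real training loop these terms would presumably be added to the model's usual language-modeling objective, and only the teacher's outputs would be treated as fixed. Because the loss touches the auxiliary heads only at training time, dropping those heads afterwards leaves the inference path unchanged.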
In practice
- Enhance MLLMs for physical world reasoning.
- Improve spatial understanding in dynamic video scenarios.
- Develop 3D-aware models without explicit 3D datasets.
Topics
- Multimodal Large Language Models
- Geometric Representation Learning
- 3D Foundation Models
- Spatial Reasoning
- Video Understanding
- Depth Estimation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.