Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, so their representations fail to maintain geometric and spatial consistency across video frames. GeoVR is a framework that learns geometric representations purely from 2D video sequences and restructures the MLLM's semantic latent space to support spatial intelligence. It distills geometric knowledge from pre-trained 3D foundation models through a multi-objective learning strategy with four geometric targets: inter-frame camera pose estimation, dense depth map regression, metric scale factor prediction, and multi-scale 3D feature distillation. Experiments on spatial reasoning benchmarks such as VSI-Bench show that GeoVR-2B achieves an average score of 69.1, outperforming Qwen3-VL-2B-Instruct (50.3) by 18.8 points and surpassing proprietary models such as GPT-5 (55.0) and other 3D-aware models such as Cambrian-S-7B (67.5), without additional inference overhead.
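One way to picture the four-target objective is as a weighted sum of per-target losses. The sketch below is illustrative only: the summary does not state the loss forms or weights, so the L1 losses, the log-space scale loss, the cosine-distance feature loss, and the names `geometric_loss`, `l1`, and `cosine_distance` are all assumptions rather than the paper's formulation.

```python
import math

def l1(pred, target):
    """Mean absolute error between two equal-length sequences of floats."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors (0 when aligned)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def geometric_loss(student, teacher, weights):
    """Weighted sum of four per-target losses. Every loss form is an assumption."""
    feat_pairs = list(zip(student["feats"], teacher["feats"]))
    terms = {
        # Inter-frame camera pose, flattened to a vector of parameters.
        "pose": l1(student["pose"], teacher["pose"]),
        # Dense depth, flattened to one value per pixel.
        "depth": l1(student["depth"], teacher["depth"]),
        # Log space, so predicting 2x too large or 2x too small costs the same.
        "scale": abs(math.log(student["scale"]) - math.log(teacher["scale"])),
        # Feature distillation, averaged over the feature scales.
        "feat": sum(cosine_distance(s, t) for s, t in feat_pairs) / len(feat_pairs),
    }
    total = sum(weights[name] * value for name, value in terms.items())
    return total, terms

# Toy usage: the teacher values stand in for pseudo-labels from a 3D model.
teacher = {"pose": [0.0, 0.1, 0.0], "depth": [1.0, 2.0, 3.0],
           "scale": 2.0, "feats": [[1.0, 0.0], [0.0, 1.0]]}
student = dict(teacher, scale=4.0)  # identical except the scale prediction
weights = {"pose": 1.0, "depth": 1.0, "scale": 0.5, "feat": 1.0}
total, terms = geometric_loss(student, teacher, weights)
```

In this toy case only the scale term is non-zero, so the total equals the scale weight times the log-ratio of the two scales. In a real training setup the weights would be hyperparameters to tune.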

Key takeaway

For AI scientists and ML engineers building MLLMs for spatial reasoning, GeoVR shows that 3D awareness can be instilled from readily available 2D video data, without depending on scarce 3D datasets. The method adds no inference overhead, which keeps it practical for deployment. Consider adding multi-objective geometric learning to your training pipeline to restructure your MLLM's latent space for spatial intelligence.

Key insights

GeoVR restructures an MLLM's internal representations for 3D spatial intelligence by training on 2D video with distillation from 3D foundation models, and it adds no inference overhead.

Method

During training, GeoVR employs a multi-objective learning strategy in which a frozen 3D foundation model acts as the teacher. The teacher provides pseudo-labels for camera poses, depth maps, and metric scale, along with targets for multi-scale geometric feature alignment. The auxiliary heads are discarded at inference.
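The structure described above, with a frozen teacher supplying targets and auxiliary heads that exist only during training, can be sketched with a toy example. This is a minimal illustration rather than GeoVR's implementation: the class names `FrozenTeacher` and `Student`, their methods, and every placeholder value are hypothetical, and no real model or library API is implied.

```python
class FrozenTeacher:
    """Hypothetical stand-in for a pre-trained 3D foundation model."""

    def pseudo_labels(self, frames):
        # A real teacher would infer geometry from the frames; these fixed
        # placeholder values only keep the sketch runnable.
        return {
            "pose": [[0.0] * 6 for _ in frames[1:]],    # one entry per frame pair
            "depth": [[1.0] * len(f) for f in frames],  # one value per pixel
            "scale": 1.0,                               # metric scale factor
            "feats": [[1.0, 0.0] for _ in frames],      # geometric features
        }


class Student:
    """Toy student: a shared backbone plus training-only auxiliary heads."""

    def __init__(self):
        self.aux_heads = {
            "pose": lambda feats: [[0.0] * 6 for _ in feats[1:]],
            "depth": lambda feats: [list(f) for f in feats],
            "scale": lambda feats: 1.0,
            "feats": lambda feats: [f[:2] for f in feats],
        }

    def backbone(self, frames):
        # Placeholder features: each pixel value scaled by a constant.
        return [[0.5 * px for px in frame] for frame in frames]

    def training_outputs(self, frames):
        # Only the training path runs the auxiliary heads.
        feats = self.backbone(frames)
        return {name: head(feats) for name, head in self.aux_heads.items()}

    def infer(self, frames):
        # Inference reads backbone features only and never touches aux_heads.
        return sum(sum(f) for f in self.backbone(frames))

    def strip_aux_heads(self):
        self.aux_heads = {}


frames = [[0.2, 0.4], [0.6, 0.8]]
teacher, student = FrozenTeacher(), Student()
labels = teacher.pseudo_labels(frames)
preds = student.training_outputs(frames)  # compared against labels in training
before = student.infer(frames)
student.strip_aux_heads()
after = student.infer(frames)
```

Because `infer` depends only on the backbone, removing the heads leaves its output unchanged, which is the sense in which such a design adds no inference cost.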

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.