GeoWorld-VLM: Geometry from World Models for Vision-Language Models

2026-05-04 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

GeoWorld-VLM is a distillation framework for vision-language models (VLMs) that targets their brittleness on elementary spatial relations. It transfers geometric structure from frozen camera-conditioned video world models into a VLM by fine-tuning only the image encoder and the multimodal projector. Training aligns the VLM's post-projector image features with intermediate world-model representations; the world model, driven by sampled camera trajectories, turns a static image into a synthetic multi-view spatial signal. The language model stays frozen, which preserves its linguistic capabilities. GeoWorld-VLM consistently improves performance by approximately 4% on both the What'sUp and VSR benchmarks, outperforming baselines such as the original Gemma-4 and Gemma fine-tuned with DINO features. It shows strong gains on geometry-sensitive relations such as "above," "under," "close," and "far" across the Gemma-4 and InternVL3.5-2B backbones.

Key takeaway

For AI scientists and machine learning engineers who want better VLM spatial reasoning without retraining the language model, GeoWorld-VLM offers a targeted route: distill geometry-aware features from camera-conditioned world models into the visual pathway. This improves handling of elementary spatial relations such as "above" or "far" while the language model stays frozen, so its linguistic capabilities are preserved.

Key insights

VLMs' spatial reasoning improves by distilling geometry-aware features from camera-conditioned world models into their visual pathway.

Principles

Spatial reasoning failures often stem from insufficient 3D structural cues.
World models can generate synthetic multi-view spatial signals for geometry teaching.
Freezing the language model isolates spatial improvements to the visual pathway.

Method

Fine-tune the VLM's image encoder and multimodal projector so that post-projector features align with intermediate representations from a world model conditioned on images, prompts, and sampled camera trajectories. Training uses a combined loss.
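
The alignment objective can be sketched as a small loss function. This is a minimal illustration, not the authors' implementation: the summary says only that a "combined loss" is used, so the choice of a mean-squared-error term plus a cosine term, the weights mse_weight and cos_weight, and the function names are all assumptions. It also assumes the world-model (teacher) features have already been mapped to the same token count and feature dimension as the VLM's post-projector (student) features. Plain Python is used so the sketch runs without dependencies.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero


def alignment_loss(student_tokens, teacher_tokens, mse_weight=1.0, cos_weight=1.0):
    """Hypothetical combined loss: weighted MSE plus (1 - cosine), averaged over tokens.

    student_tokens: post-projector VLM features, one vector per image token.
    teacher_tokens: frozen world-model features, assumed already matched in shape.
    """
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher must have the same number of tokens")
    mse_total = 0.0
    cos_total = 0.0
    for s, t in zip(student_tokens, teacher_tokens):
        if len(s) != len(t):
            raise ValueError("feature dimensions must match")
        mse_total += sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
        cos_total += 1.0 - cosine_similarity(s, t)
    n = len(student_tokens)
    return mse_weight * mse_total / n + cos_weight * cos_total / n


# Identical features give a loss near zero; orthogonal unit vectors do not.
same = alignment_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 4.0]])
orthogonal = alignment_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

In a real training loop, gradients from such a loss would flow only into the image encoder and projector, since the teacher features come from a frozen world model.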

In practice

Employ camera-conditioned world models for geometry-aware visual supervision.
Align VLM post-projector features with world-model representations.
Preserve VLM linguistic capabilities by freezing the language model.
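
The freezing policy in the list above can be sketched as a rule over parameter names. This is an illustrative assumption about how one might organize it, not the repository's code: the prefixes vision_tower., multi_modal_projector., and language_model. are hypothetical module names, and real backbones may name their components differently.

```python
# Hypothetical name prefixes for the two components that are fine-tuned.
TRAINABLE_PREFIXES = ("vision_tower.", "multi_modal_projector.")


def plan_trainable(param_names, trainable_prefixes=TRAINABLE_PREFIXES):
    """Map each parameter name to True (fine-tune) or False (keep frozen)."""
    return {name: name.startswith(trainable_prefixes) for name in param_names}


example_names = [
    "vision_tower.blocks.0.attn.qkv.weight",
    "multi_modal_projector.linear.weight",
    "language_model.layers.0.mlp.up_proj.weight",
]
plan = plan_trainable(example_names)
frozen = [name for name, trainable in plan.items() if not trainable]
```

With PyTorch, such a plan could be applied by iterating over model.named_parameters() and calling requires_grad_(flag) on each parameter, so that only the visual pathway receives gradient updates.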

Topics

Vision-Language Models
Spatial Reasoning
World Models
Feature Distillation
Gemma
InternVL3.5

Code references

Harvard-AI-and-Robotics-Lab/GeoWorld-VLM

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.