GeoWorld-VLM: Geometry from World Models for Vision-Language Models
Summary
GeoWorld-VLM is a distillation framework for vision-language models (VLMs) that targets their brittleness on elementary spatial relations. It transfers geometric structure from frozen, camera-conditioned video world models into a VLM by fine-tuning only the image encoder and the multimodal projector. The world model turns a static image into a synthetic multi-view spatial signal using sampled camera trajectories, and training aligns the VLM's post-projector image features with the world model's intermediate representations. The language model remains frozen, which preserves its linguistic capabilities. GeoWorld-VLM improves performance by approximately 4% on both the What'sUp and VSR benchmarks, outperforming baselines such as the original Gemma-4 and Gemma fine-tuned with DINO features. The gains are strongest on geometry-sensitive relations such as "above," "under," "close," and "far," and they hold across the Gemma-4 and InternVL3.5-2B backbones.
Key takeaway
For AI scientists and machine learning engineers who want to improve VLM spatial reasoning without retraining the language model, GeoWorld-VLM offers a targeted option. Distilling geometry-aware features from camera-conditioned world models into the visual pathway improves the handling of elementary spatial relations such as "above" or "far." Because the language model stays frozen, the model's linguistic capabilities are preserved while its visual understanding improves.
Key insights
A VLM's spatial reasoning improves when geometry-aware features from camera-conditioned world models are distilled into its visual pathway.
Principles
- Spatial reasoning failures often stem from insufficient 3D structural cues.
- World models can generate synthetic multi-view spatial signals for geometry teaching.
- Freezing the language model isolates spatial improvements to the visual pathway.
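To make the "synthetic multi-view" idea concrete, here is a minimal sketch of sampling a camera trajectory. The summary does not say how trajectories are sampled, so the horizontal arc, the function name `sample_orbit_trajectory`, and all of its parameters are illustrative assumptions rather than the authors' procedure.

```python
import math
import random


def sample_orbit_trajectory(num_views=8, radius=2.0, max_yaw_deg=30.0, seed=None):
    """Sample camera positions on a horizontal arc around the origin.

    Hypothetical helper: the arc shape and parameter names are assumptions
    made for illustration, not the sampling scheme used by GeoWorld-VLM.
    """
    rng = random.Random(seed)
    # Pick a random total sweep angle (left or right) for this trajectory.
    sweep = math.radians(rng.uniform(-max_yaw_deg, max_yaw_deg))
    poses = []
    for i in range(num_views):
        # Move from the original viewpoint (yaw 0) to the full sweep.
        frac = i / (num_views - 1) if num_views > 1 else 0.0
        yaw = frac * sweep
        # The camera stays at a fixed distance from the scene center.
        position = (radius * math.sin(yaw), 0.0, radius * math.cos(yaw))
        poses.append({"yaw": yaw, "position": position})
    return poses


trajectory = sample_orbit_trajectory(num_views=5, radius=2.0, seed=0)
for pose in trajectory:
    print(pose)
```

A camera-conditioned world model would take such a pose sequence together with the input image; each pose corresponds to one synthetic view of the same scene.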
Method
Fine-tune the VLM's image encoder and multimodal projector so that post-projector image features align with intermediate representations from a frozen world model. The world model is conditioned on the image, a prompt, and a sampled camera trajectory, and training uses a combined loss.
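The alignment objective can be sketched in a few lines. The summary does not name the terms of the combined loss, so the cosine-plus-MSE mix, the weights, and the function names below are stand-in assumptions. The sketch also assumes the teacher features have already been mapped to the same token count and dimension as the student features.

```python
import math


def cosine_distance(u, v, eps=1e-8):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def alignment_loss(student_tokens, teacher_tokens, cos_weight=1.0, mse_weight=0.5):
    """Average a weighted cosine + MSE term over paired token features.

    student_tokens: post-projector VLM image features (one vector per token).
    teacher_tokens: frozen world-model features, assumed to be pre-aligned
                    to the same shape. The loss terms and weights are
                    assumptions; the source only mentions "a combined loss".
    """
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher must have the same token count")
    total = 0.0
    for s, t in zip(student_tokens, teacher_tokens):
        total += cos_weight * cosine_distance(s, t) + mse_weight * mse(s, t)
    return total / len(student_tokens)


# Toy features: the first token matches the teacher, the second is orthogonal.
student = [[1.0, 0.0], [0.0, 1.0]]
teacher = [[1.0, 0.0], [1.0, 0.0]]
loss = alignment_loss(student, teacher)
print(f"alignment loss: {loss:.4f}")
```

In a real training loop this would be computed on tensors with automatic differentiation, and gradients would flow only into the image encoder and projector.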
In practice
- Employ camera-conditioned world models for geometry-aware visual supervision.
- Align VLM post-projector features with world-model representations.
- Preserve VLM linguistic capabilities by freezing the language model.
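The freezing rule above amounts to a split over parameter names. The sketch below shows that split in plain Python; the prefixes `vision_encoder.`, `projector.`, and `language_model.` are hypothetical and will differ between VLM implementations.

```python
def split_parameters(param_names, trainable_prefixes):
    """Partition parameter names into trainable and frozen groups.

    A parameter is trainable only if its name starts with one of the given
    prefixes; everything else (including the language model) is frozen.
    """
    prefixes = tuple(trainable_prefixes)
    trainable, frozen = [], []
    for name in param_names:
        if name.startswith(prefixes):
            trainable.append(name)
        else:
            frozen.append(name)
    return trainable, frozen


# Hypothetical parameter names for illustration only.
names = [
    "vision_encoder.blocks.0.attn.weight",
    "projector.linear.weight",
    "language_model.layers.0.mlp.weight",
    "language_model.lm_head.weight",
]
trainable, frozen = split_parameters(names, ("vision_encoder.", "projector."))
print("trainable:", trainable)
print("frozen:", frozen)
```

In a PyTorch model, the same decision would typically be applied by iterating over `named_parameters()` and setting `requires_grad` to `False` on every frozen parameter before building the optimizer.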
Topics
- Vision-Language Models
- Spatial Reasoning
- World Models
- Feature Distillation
- Gemma
- InternVL3.5
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.