Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

2026-05-28 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

The GASP (Geometric-Aware Spatial Priors) framework improves the 3D spatial reasoning of Vision-Language Models (VLMs) by injecting fundamental geometric priors directly into the LLM's transformer layers, without 3D VQA fine-tuning or specialized 3D visual encoders. A small correspondence head acts as a deep supervision signal across all layers. It is trained with a dual objective on ground-truth geometry from large-scale video scenes: a contrastive loss for 2D view-invariance, and depth consistency supervision to resolve 3D geometric ambiguity. Standard VLMs typically show internal correspondence matching accuracy below 5%. GASP training raises peak layer-wise correspondence accuracy above 70% and maintains temporal robustness above 85%. These internal gains carry over to downstream benchmarks: +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without 3D VQA data.

Key takeaway

For Machine Learning Engineers building Vision-Language Models that need robust 3D spatial reasoning: inject geometric priors directly into the model's transformer layers. This GASP-like approach offers a more generalizable path than relying solely on 3D VQA datasets, which can lead models to overfit to dataset biases. The reported payoff is a large gain in internal correspondence accuracy and improved downstream spatial benchmarks, such as +18.2% on All-Angles Bench, without training on 3D VQA data.

Key insights

Injecting fundamental geometric priors into VLMs directly enhances 3D spatial reasoning more effectively than VQA fine-tuning.

Principles

Genuine spatial understanding needs fundamental geometric priors.
Deep supervision signals can inject priors across VLM layers.
Dual objectives improve 2D view-invariance and 3D depth consistency.

Method

GASP injects priors through a small correspondence head that provides deep supervision across the LLM's transformer layers. The head is trained with a dual objective: a contrastive loss for 2D view-invariance and depth consistency supervision to resolve 3D ambiguity.
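
The dual objective described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not give the exact loss forms, so the contrastive term is written here as a standard InfoNCE loss, the depth term as a mean absolute error, and the two are averaged over supervised layers. The names `contrastive_loss`, `depth_consistency_loss`, `gasp_like_loss`, `temperature`, and `depth_weight` are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]


def contrastive_loss(feats_a, feats_b, temperature=0.1):
    """InfoNCE (assumed form): feats_a[i] should match feats_b[i] only.

    feats_a and feats_b hold head outputs for the same scene points
    seen from two different views.
    """
    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, cand) / temperature for cand in b]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]  # cross-entropy with target index i
    return total / len(a)


def depth_consistency_loss(pred_depths, true_depths):
    """Mean absolute depth error (assumed form of the depth term)."""
    errors = [abs(p - t) for p, t in zip(pred_depths, true_depths)]
    return sum(errors) / len(errors)


def gasp_like_loss(per_layer_outputs, true_depths, depth_weight=1.0):
    """Deep supervision: average the dual objective over supervised layers.

    per_layer_outputs is a list of (feats_a, feats_b, pred_depths),
    one tuple per transformer layer the head is attached to.
    """
    losses = [
        contrastive_loss(fa, fb)
        + depth_weight * depth_consistency_loss(pd, true_depths)
        for fa, fb, pd in per_layer_outputs
    ]
    return sum(losses) / len(losses)
```

With view-aligned features the contrastive term is near zero, and it grows when the matching pairs are shuffled, which is the behavior the view-invariance objective is meant to reward. In a real training setup these terms would be computed on tensors with automatic differentiation; plain lists are used here only to keep the sketch dependency-free.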

In practice

Improve VLM 3D reasoning without 3D VQA data.
Raise internal correspondence matching accuracy from below 5% to over 70% (peak layer-wise).
Enhance performance on spatial benchmarks like All-Angles Bench.
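
To check whether a model's internal features carry correspondence information, one option is to probe each layer with a nearest-neighbor match. The sketch below is an assumption about how such a metric could be computed, since the summary does not describe the paper's exact evaluation protocol: for each point in view A, it picks the most cosine-similar feature in view B and counts a hit when that equals the ground-truth match. The names `matching_accuracy`, `layerwise_accuracy`, and `gt_match` are hypothetical.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def matching_accuracy(feats_a, feats_b, gt_match):
    """Fraction of view-A points whose nearest view-B feature is correct.

    gt_match[i] is the index in feats_b that truly corresponds to
    feats_a[i], taken from ground-truth geometry.
    """
    correct = 0
    for i, fa in enumerate(feats_a):
        sims = [cosine(fa, fb) for fb in feats_b]
        best = max(range(len(sims)), key=sims.__getitem__)
        correct += best == gt_match[i]
    return correct / len(feats_a)


def layerwise_accuracy(layers, gt_match):
    """Accuracy per layer; layers is a list of (feats_a, feats_b) pairs."""
    return [matching_accuracy(fa, fb, gt_match) for fa, fb in layers]
```

Taking `max(layerwise_accuracy(...))` gives a peak layer-wise figure of the kind quoted in the summary, though the reported numbers depend on the paper's own protocol and data.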

Topics

Vision-Language Models
3D Spatial Reasoning
Geometric Priors
Transformer Architectures
Deep Supervision
Model Performance

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.