OneCanvas: 3D Scene Understanding via Panoramic Reprojection

2026-06-17 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

OneCanvas approaches 3D scene understanding for Vision-Language Models (VLMs) by aggregating patch features onto a single equirectangular panoramic canvas. Each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at its continuous longitude and latitude. Because angular coordinates alone discard distance, a 3D position embedding is added to restore depth information, so the pretrained VLM can process the canvas as a standard image. OneCanvas supports situated reasoning for robotics and embodied AI and enables a spatial pretraining curriculum that procedurally generates diverse spatial reasoning tasks. It achieves state-of-the-art accuracy on SQA3D and VSI-Bench, generalizes to out-of-distribution data on SPBench, and uses significantly less training compute than competing methods.

Key takeaway

For AI scientists and ML engineers building VLMs for 3D scene understanding, OneCanvas offers an efficient and accurate alternative to complex geometry encoders. Its panoramic reprojection and spatial pretraining curriculum reduce training compute while improving generalization across SQA3D, VSI-Bench, and SPBench. Consider this approach if you need stronger spatial reasoning without adding a dedicated geometry encoder.

Key insights

OneCanvas uses panoramic reprojection and 3D position embeddings for efficient, state-of-the-art 3D scene understanding in VLMs.

Principles

Aggregating features onto a shared panoramic canvas simplifies 3D representation.
3D position embeddings restore depth information lost in angular coordinates.
Procedural spatial pretraining can generate diverse supervision on-the-fly.
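
To make the second principle concrete, here is a minimal sketch of why angular coordinates alone are ambiguous: two points on the same ray share one longitude and latitude, while an embedding of their 3D positions still tells them apart. The axis convention (x right, y down, z forward) and the sinusoidal form of `sinusoidal_xyz_embedding` are assumptions for illustration; the summary does not specify how OneCanvas builds its position embedding.

```python
import numpy as np

def angles(p):
    """Longitude and latitude of a 3D point seen from the origin.

    Assumed convention: x right, y down, z forward.
    """
    r = np.linalg.norm(p)
    lon = np.arctan2(p[0], p[2])
    lat = np.arcsin(-p[1] / r)  # y points down, so negate for elevation
    return lon, lat

def sinusoidal_xyz_embedding(p, num_freqs=4):
    """Hypothetical 3D position embedding: sin/cos of x, y, z at several frequencies."""
    freqs = 2.0 ** np.arange(num_freqs)      # 1, 2, 4, 8
    scaled = p[:, None] * freqs[None, :]     # shape (3, num_freqs)
    return np.concatenate([np.sin(scaled), np.cos(scaled)], axis=1).ravel()

near = np.array([1.0, 0.5, 2.0])
far = 3.0 * near  # three times farther along the same ray

# Both points land on the same canvas location...
same_angles = np.allclose(angles(near), angles(far))
# ...but their 3D position embeddings are different.
emb_differs = not np.allclose(
    sinusoidal_xyz_embedding(near), sinusoidal_xyz_embedding(far)
)
```

Any embedding that depends on the full 3D coordinate (or on range) would serve the same purpose; the sinusoidal choice here is only one familiar option.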

Method

Unproject patches to 3D world coordinates, place them on an equirectangular panoramic canvas, add a 3D position embedding, and feed this representation to a pretrained VLM.
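
The first two steps of the method can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: a pinhole camera with intrinsics `K`, depth measured along the camera z-axis, a world frame with x right, y down, z forward, and a canvas centered at a chosen `origin`. The function names `unproject_patches` and `to_canvas` are hypothetical.

```python
import numpy as np

def unproject_patches(uv, depth, K, cam_to_world):
    """Lift patch centers to 3D world points.

    uv: (N, 2) pixel coordinates of patch centers
    depth: (N,) depth along the camera z-axis (assumed)
    K: (3, 3) pinhole intrinsics (assumed camera model)
    cam_to_world: (4, 4) camera pose
    """
    ones = np.ones((uv.shape[0], 1))
    pix = np.concatenate([uv, ones], axis=1)         # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                  # camera-frame rays with z = 1
    pts_cam = rays * depth[:, None]                  # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, ones], axis=1)  # homogeneous 3D points
    return (pts_h @ cam_to_world.T)[:, :3]

def to_canvas(points_world, origin, height, width):
    """Map world points to continuous equirectangular canvas coordinates.

    Returns (col, row, r): fractional column, fractional row, and range.
    The range r is what longitude and latitude alone do not capture.
    """
    d = points_world - origin
    r = np.linalg.norm(d, axis=1)
    lon = np.arctan2(d[:, 0], d[:, 2])               # 0 at +z, grows toward +x
    lat = np.arcsin(np.clip(-d[:, 1] / np.maximum(r, 1e-8), -1.0, 1.0))
    col = (lon / (2 * np.pi) + 0.5) * width          # lon = 0 maps to canvas center
    row = (0.5 - lat / np.pi) * height               # lat = 0 maps to the middle row
    return col, row, r
```

A patch at the image center with depth 2, seen by a camera at the world origin, lands at the center of the canvas with range 2. The returned coordinates are continuous; how features are then written into canvas cells, and how the 3D position embedding is added, is left out because the summary does not describe it.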

In practice

Apply to robotics for situated reasoning from specific viewpoints.
Generate diverse spatial reasoning tasks procedurally for VLM training.
Achieve state-of-the-art 3D understanding with reduced training compute.
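
As a rough illustration of procedural task generation, the sketch below samples an observer pose and derives a left/right question and its answer directly from object positions. The task template, the function names, and the ground-plane convention (x right, z forward, heading 0 facing +z) are all hypothetical; the summary does not list the actual task types in the OneCanvas curriculum.

```python
import math
import random

def left_right_question(objects, observer_xz, heading, rng):
    """Build one situated left/right question.

    objects: dict mapping name -> (x, z) ground-plane position
    observer_xz: (x, z) observer position
    heading: yaw in radians, 0 = facing +z, positive turns toward +x
    """
    name = rng.choice(sorted(objects))
    dx = objects[name][0] - observer_xz[0]
    dz = objects[name][1] - observer_xz[1]
    # Component of the offset along the observer's right-hand direction.
    lateral = dx * math.cos(heading) - dz * math.sin(heading)
    answer = "right" if lateral > 0 else "left"  # exact ties fall to "left"
    question = f"From where you stand, is the {name} on your left or your right?"
    return question, answer

def sample_tasks(objects, n, seed=0):
    """Generate n question/answer pairs from random observer poses."""
    rng = random.Random(seed)
    tasks = []
    for _ in range(n):
        observer = (rng.uniform(-2, 2), rng.uniform(-2, 2))
        heading = rng.uniform(-math.pi, math.pi)
        tasks.append(left_right_question(objects, observer, heading, rng))
    return tasks
```

Because answers are computed from geometry rather than annotated by hand, new supervision can be drawn on the fly for any scene with known object positions.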

Topics

3D Scene Understanding
Vision-Language Models
Panoramic Reprojection
Spatial Reasoning
Embodied AI
SQA3D

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.