PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models
Summary
PROSE (Prompted Scene rEgistration) is a training-free method for egocentric scene registration built on pretrained Vision-Language Models (VLMs). It targets the problem of registering two RGB captures of an indoor space taken with head-mounted cameras. Such captures are typically blurry, fast-moving, and only partially overlapping, which makes them a poor fit for traditional geometric methods and for learned scene-graph methods. PROSE converts each RGB sequence into an object-level 3D scene graph using off-the-shelf foundation models for geometry, segmentation, and language. It then prompts the same VLM to match object instances across the two sequences, using object heights as a prior and verifying matches with same/different queries. Finally, it solves for rigid transforms by selecting candidates with strong geometric consensus. PROSE requires no learned parameters, depth sensors, training, or annotated graphs. On the egocentric Aria Digital Twin and Aria Everyday Activities benchmarks, it outperforms geometric and learned scene-graph baselines in registration accuracy, with ground-truth point clouds as well as RGB-reconstructed ones, and the scene graph it generates is transferable to downstream tasks.
Key takeaway
For robotics engineers and AR system developers building persistent spatial memory, PROSE offers a training-free, RGB-only approach to egocentric scene registration. It registers blurry, fast-moving egocentric captures without depth sensors, training, or annotated graphs. Because it relies only on off-the-shelf foundation models, it has no learned parameters of its own to train before deployment.
Key insights
PROSE enables training-free egocentric scene registration by leveraging pretrained VLMs for scene understanding and cross-scan object matching.
Principles
- Pretrained VLMs can serve as both scene understanding and matching engines.
- Object heights provide a robust prior for egocentric scene registration.
- Geometric consensus can validate object matches for rigid transforms.
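One way a height prior could prune candidate matches before any VLM query is sketched below. This is a minimal illustration, not PROSE's actual procedure: it assumes both scene graphs are gravity-aligned, so that each object's height above the floor is comparable across scans. The function name `height_compatible_pairs` and the tolerance `tol` are hypothetical.

```python
from itertools import product


def height_compatible_pairs(heights_a, heights_b, tol=0.15):
    """Return (i, j) index pairs whose object heights differ by at most `tol` metres.

    heights_a, heights_b: per-object heights above the floor in scan A and scan B.
    Assumption (not from the source): both scans share a gravity-aligned vertical axis.
    """
    return [
        (i, j)
        for (i, ha), (j, hb) in product(enumerate(heights_a), enumerate(heights_b))
        if abs(ha - hb) <= tol
    ]
```

Under these assumptions, only the surviving pairs would be passed on to the VLM for matching and same/different verification, which keeps the number of prompts small.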
Method
Lift each RGB sequence to an object-level 3D scene graph using foundation models. Prompt the VLM for cross-scan object matching, verify the matches with same/different queries, then solve for the rigid transform via geometric consensus.
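The final step, solving for a rigid transform by geometric consensus, can be illustrated with a generic sketch. The summary does not specify PROSE's solver, so the code below swaps in a standard technique: Kabsch least-squares alignment of matched object centroids inside an exhaustive RANSAC-style search over triples of matches. The names `kabsch` and `register_by_consensus`, the use of centroids, and the threshold `inlier_tol` are all assumptions made for illustration.

```python
import itertools

import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def register_by_consensus(centroids_a, centroids_b, matches, inlier_tol=0.25):
    """Pick the transform supported by the most matches (hypothetical helper).

    centroids_a, centroids_b: (N, 3) and (M, 3) object centroids in metres.
    matches: list of (i, j) candidate pairs, e.g. VLM-verified matches.
    Returns (R, t, inlier_mask); R and t are None if fewer than 3 matches exist.
    """
    a = np.asarray(centroids_a, dtype=float)
    b = np.asarray(centroids_b, dtype=float)
    m = np.asarray(matches, dtype=int).reshape(-1, 2)
    src, dst = a[m[:, 0]], b[m[:, 1]]

    best_R, best_t = None, None
    best_inliers = np.zeros(len(m), dtype=bool)
    # Exhaustive search over triples; fine for the small match sets assumed here.
    for triple in itertools.combinations(range(len(m)), 3):
        idx = list(triple)
        R, t = kabsch(src[idx], dst[idx])
        residual = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = residual < inlier_tol
        if inliers.sum() > best_inliers.sum():
            best_R, best_t, best_inliers = R, t, inliers

    # Refit on every match that agrees with the best hypothesis.
    if best_R is not None and best_inliers.sum() >= 3:
        best_R, best_t = kabsch(src[best_inliers], dst[best_inliers])
    return best_R, best_t, best_inliers
```

In this sketch a wrong match is rejected because no single rigid transform can explain it together with the correct ones, so it never joins the largest inlier set. A real pipeline would likely sample triples randomly rather than enumerate them once the number of matches grows.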
In practice
- Implement persistent spatial memory for robots.
- Enhance AR systems with robust scene registration.
- Generate transferable scene graphs for downstream tasks.
Topics
- Egocentric Scene Registration
- Vision-Language Models
- 3D Scene Graphs
- Robotics
- Augmented Reality
- Foundation Models
Best for: Research Scientist, AI Scientist, Robotics Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.