PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models

2026-06-15 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

PROSE (Prompted Scene rEgistration) is a training-free method for egocentric scene registration built on pretrained Vision-Language Models (VLMs). It targets the problem of registering two RGB captures of an indoor space taken with head-mounted cameras, whose blurry, fast-moving, and partially overlapping views are poorly suited to traditional geometric or learned scene-graph methods. PROSE converts each RGB sequence into an object-level 3D scene graph using off-the-shelf foundation models for geometry, segmentation, and language. It then prompts the same VLM used in graph construction to match object instances across the two sequences, using object heights as a prior and verifying matches with same/different queries. Finally, it solves for rigid transforms and selects the candidate with strong geometric consensus. PROSE requires no learned parameters, depth sensors, training, or annotated graphs. On the egocentric Aria Digital Twin and Aria Everyday Activities benchmarks, it outperforms both geometric and learned scene-graph baselines in registration accuracy, with ground-truth and with RGB-reconstructed point clouds alike, and the scene graph it generates transfers to downstream tasks.

Key takeaway

For robotics engineers and AR system developers building persistent spatial memory, PROSE offers a training-free, RGB-only approach to egocentric scene registration. It registers scenes without depth sensors, training, or annotated graphs, despite the blurry, fast-moving, and partially overlapping views typical of head-mounted capture. Because it is assembled from off-the-shelf foundation models, there is no model to train before deployment.

Key insights

PROSE enables training-free egocentric scene registration by leveraging pretrained VLMs for scene understanding and cross-scan object matching.

Principles

Pretrained VLMs can serve as both scene understanding and matching engines.
Object heights provide a robust prior for egocentric scene registration.
Geometric consensus can validate object matches for rigid transforms.
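
The height prior can be illustrated with a small sketch. The source does not say how PROSE applies the prior, so everything here is an assumption: "height" is taken to mean an object's centroid height above the floor in metres, the prior is applied as a hard tolerance that shortlists candidates before any VLM prompt, and `SceneObject`, `shortlist_candidates`, and `tol_m` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical scene-graph node; field names are illustrative only."""
    obj_id: str
    label: str
    height_m: float  # assumed: centroid height above the floor, in metres


def shortlist_candidates(scan_a, scan_b, tol_m=0.3):
    """For each object in scan A, list scan-B objects at a similar height.

    Assumption: an object's height above the floor changes little between
    captures, so pairs with very different heights are unlikely matches and
    need not be sent to the VLM at all.
    """
    shortlist = {}
    for a in scan_a:
        close = [b for b in scan_b if abs(a.height_m - b.height_m) <= tol_m]
        # Closest height first, so the most plausible candidate is prompted first.
        close.sort(key=lambda b: abs(a.height_m - b.height_m))
        shortlist[a.obj_id] = [b.obj_id for b in close]
    return shortlist


scan_a = [SceneObject("a1", "mug", 0.90), SceneObject("a2", "lamp", 1.80)]
scan_b = [
    SceneObject("b1", "cup", 0.95),
    SceneObject("b2", "ceiling light", 2.40),
    SceneObject("b3", "floor lamp", 1.70),
]
candidates = shortlist_candidates(scan_a, scan_b)
```

In this toy input the mug is shortlisted only against the cup and the lamp only against the floor lamp. A soft prior, for example passing the heights to the VLM as part of the prompt, would be an equally plausible reading of the summary.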

Method

Lift each RGB sequence to an object-level 3D scene graph using foundation models. Prompt the VLM to match objects across the two scans, verify the matches with same/different queries, then solve for the rigid transform via geometric consensus.
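
The final consensus step might look roughly like the sketch below. It is not the paper's implementation: it assumes, purely to stay within the standard library, that both scans are gravity-aligned, so the rigid transform reduces to a yaw about the vertical axis plus a translation (a full 6-DoF solve would typically use an SVD-based fit instead). The function names, the exhaustive two-match hypotheses, and the 0.2 m inlier tolerance are all hypothetical.

```python
import math
from itertools import combinations


def fit_yaw_translation(src, dst):
    """Least-squares yaw (about z) plus translation mapping src onto dst."""
    n = len(src)
    cs = [sum(p[i] for p in src) / n for i in range(3)]
    cd = [sum(q[i] for q in dst) / n for i in range(3)]
    s_dot = s_cross = 0.0
    for p, q in zip(src, dst):
        px, py = p[0] - cs[0], p[1] - cs[1]
        qx, qy = q[0] - cd[0], q[1] - cd[1]
        s_dot += px * qx + py * qy
        s_cross += px * qy - py * qx
    yaw = math.atan2(s_cross, s_dot)  # closed-form 2D rotation fit
    c, s = math.cos(yaw), math.sin(yaw)
    t = (cd[0] - (c * cs[0] - s * cs[1]),
         cd[1] - (s * cs[0] + c * cs[1]),
         cd[2] - cs[2])
    return yaw, t


def apply_transform(yaw, t, p):
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * p[0] - s * p[1] + t[0], s * p[0] + c * p[1] + t[1], p[2] + t[2])


def register_by_consensus(matches, inlier_tol=0.2):
    """matches: list of (centroid_in_scan_a, centroid_in_scan_b) pairs.

    Every pair of matches proposes a transform; the proposal that the most
    matches agree with (within inlier_tol metres) wins and is refit on its
    inliers, so wrong matches are outvoted rather than averaged in.
    """
    best = None
    for i, j in combinations(range(len(matches)), 2):
        yaw, t = fit_yaw_translation([matches[i][0], matches[j][0]],
                                     [matches[i][1], matches[j][1]])
        inliers = [k for k, (p, q) in enumerate(matches)
                   if math.dist(apply_transform(yaw, t, p), q) <= inlier_tol]
        if best is None or len(inliers) > len(best):
            best = inliers
    if best is None:
        raise ValueError("need at least two matches")
    yaw, t = fit_yaw_translation([matches[k][0] for k in best],
                                 [matches[k][1] for k in best])
    return yaw, t, best


# Toy data: four correct matches under a known transform, plus one wrong match.
true_yaw, true_t = 0.5, (1.0, -2.0, 0.3)
src_pts = [(0.0, 0.0, 0.5), (2.0, 0.0, 1.0), (0.0, 3.0, 0.2), (1.0, 1.0, 1.5)]
matches = [(p, apply_transform(true_yaw, true_t, p)) for p in src_pts]
matches.append(((4.0, 4.0, 1.0), (-5.0, -5.0, 1.0)))  # wrong match
yaw, t, inliers = register_by_consensus(matches)
```

On this toy input the wrong match is excluded from the inlier set and the known yaw and translation are recovered. Trying every pair of matches is affordable at object level, where a scene graph holds tens of objects rather than thousands of points.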

In practice

Implement persistent spatial memory for robots.
Enhance AR systems with robust scene registration.
Generate transferable scene graphs for downstream tasks.

Topics

Egocentric Scene Registration
Vision-Language Models
3D Scene Graphs
Robotics
Augmented Reality
Foundation Models

Best for: Research Scientist, AI Scientist, Robotics Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.