GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

GeoSAM-3D is an approach to open-vocabulary 3D scene segmentation that takes only monocular video as input. Unlike methods that require RGB-D video or multi-view imagery, it lets a user upload a short monocular video, select an object in one frame by click or by name, and receive a 3D mask propagated across a Gaussian scene. The implementation combines frozen image and video foundation models with a monocular 3D Gaussian Splatting reconstruction and a differentiable graph-geodesic propagation kernel. A central design choice is to propagate prompts by heat-kernel distance on the reconstructed scene graph rather than by Euclidean nearest neighbors, which improves continuity around curved surfaces and reduces leakage between disconnected objects. The associated paper describes the repository, the "geosam3d.propagate" kernel, the Segment Anything-derived feature head, and validation within the codebase, along with an evaluation protocol that assesses propagation quality, leakage control, and interactive latency.

Key takeaway

For computer vision engineers building 3D scene segmentation from monocular video, the part of GeoSAM-3D to evaluate is its geodesic prompt propagation, which improves mask continuity on curved surfaces and reduces leakage between distinct objects. Because it delivers open-vocabulary 3D segmentation without RGB-D or multi-view input, it may simplify data collection and processing pipelines for interactive applications.

Key insights

GeoSAM-3D propagates 3D segmentation prompts using geodesic heat-kernel distance on Gaussian splatting scenes from monocular video, improving mask continuity and reducing leakage.

Method

GeoSAM-3D combines frozen image/video foundation models with monocular 3D Gaussian Splatting. It then applies a differentiable graph-geodesic propagation kernel over Gaussian centroids for mask propagation.
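
To make the propagation step concrete, here is a minimal NumPy sketch of heat-kernel prompt propagation over a k-nearest-neighbour graph of Gaussian centroids. It is an illustration under stated assumptions, not the paper's "geosam3d.propagate" kernel: the function names (knn_affinity, heat_kernel, propagate_prompt), the Gaussian edge weights, the median bandwidth, and the values of k and t are all hypothetical, and this version is neither sparse nor differentiable. The toy scene is a curved arc (one object) plus a small bar (a second object) that sits closer to the seed in Euclidean terms than the far end of the arc does.

```python
import numpy as np


def knn_affinity(centroids, k=4):
    """Symmetric k-nearest-neighbour graph with Gaussian edge weights (assumed form)."""
    n = len(centroids)
    diff = centroids[:, None, :] - centroids[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)           # pairwise squared distances
    np.fill_diagonal(d2, np.inf)            # exclude self-edges
    nbrs = np.argsort(d2, axis=1)[:, :k]    # k nearest neighbours per centroid
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    sigma2 = np.median(d2[rows, cols])      # data-driven bandwidth (assumption)
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(-d2[rows, cols] / sigma2)
    return np.maximum(W, W.T)               # symmetrise


def heat_kernel(W, t):
    """H_t = exp(-t L) for the combinatorial Laplacian L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    evals, evecs = np.linalg.eigh(L)        # L is symmetric, so eigh applies
    return (evecs * np.exp(-t * evals)) @ evecs.T


def propagate_prompt(centroids, seed, k=4, t=200.0):
    """Diffuse a one-hot prompt from `seed`; return scores scaled to a max of 1."""
    H = heat_kernel(knn_affinity(centroids, k), t)
    scores = H[seed]
    return scores / scores.max()


# Toy scene: a curved arc (object A) and a nearby but separate bar (object B).
theta = np.linspace(0.0, np.pi, 40)
arc = np.stack([np.cos(theta), np.sin(theta), np.zeros(40)], axis=1)
bar = np.stack([np.full(10, 1.5), np.linspace(-0.2, 0.2, 10), np.zeros(10)], axis=1)
centroids = np.vstack([arc, bar])

seed = 0                                    # user click lands on one end of the arc
scores = propagate_prompt(centroids, seed)
euclid = np.linalg.norm(centroids - centroids[seed], axis=1)

print("min score on arc:", scores[:40].min())
print("max |score| on bar:", np.abs(scores[40:]).max())
print("Euclidean: bar is nearer than far end of arc:", euclid[40:].max() < euclid[39])
```

In this toy setup the bar forms its own connected component of the graph, so heat from the seed never reaches it, while it does reach the far end of the arc. A Euclidean radius large enough to cover the whole arc would also cover the bar. A production version would presumably use a sparse Laplacian and an autodiff framework to keep the kernel differentiable; those details are not given here.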

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.