GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video

2026-05-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

GeoSAM-3D is an approach to open-vocabulary 3D scene segmentation that operates solely on monocular video. Unlike methods that require RGB-D video or multi-view imagery, it lets a user upload a short monocular video, select an object in one frame by clicking on it or naming it, and receive a 3D mask propagated across a Gaussian scene. The implementation combines frozen image and video foundation models with a monocular 3D Gaussian Splatting reconstruction and a differentiable graph-geodesic propagation kernel. A central design choice is to propagate prompts using heat-kernel distance on the reconstructed scene graph rather than Euclidean nearest neighbors, which improves continuity around curved surfaces and reduces leakage between disconnected objects. The associated paper describes the repository, the "geosam3d.propagate" kernel, the Segment Anything-derived feature head, and their validation within the codebase, along with an evaluation protocol that assesses propagation quality, leakage control, and interactive latency.

Key takeaway

For computer vision engineers building 3D scene segmentation from monocular video, GeoSAM-3D offers a framework worth evaluating. Its geodesic prompt propagation improves mask continuity on curved surfaces and reduces leakage between distinct objects. Because it needs only monocular video as input, it could simplify your data collection and processing pipelines for interactive open-vocabulary 3D segmentation.

Key insights

GeoSAM-3D propagates 3D segmentation prompts using geodesic heat-kernel distance on Gaussian splatting scenes from monocular video, improving mask continuity and reducing leakage.

Principles

Geodesic distance improves 3D mask propagation.
Heat-kernel distance reduces leakage across disconnected objects.
Monocular video enables lighter 3D segmentation.
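
The leakage principle follows from a standard property of the graph heat kernel. As a sketch, assume a graph over Gaussian centroids with symmetric, nonnegative edge weights $W_{ij}$; the summary does not specify GeoSAM-3D's actual graph construction or Laplacian normalization, so the unnormalized Laplacian below is an assumption:

\[
L = D - W, \qquad D_{ii} = \sum_j W_{ij}, \qquad H_t = e^{-tL}, \qquad u_t = H_t\, u_0 .
\]

Here $u_0$ is the indicator of the prompted centroids and $u_t$ is the propagated score after diffusion time $t$. For $t > 0$, $(H_t)_{ij} > 0$ exactly when $i$ and $j$ lie in the same connected component of the graph, so a prompt on one object contributes nothing to centroids the graph does not connect to it, however close they are in Euclidean distance.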

Method

GeoSAM-3D combines frozen image/video foundation models with monocular 3D Gaussian Splatting. It then applies a differentiable graph-geodesic propagation kernel over Gaussian centroids for mask propagation.
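
The propagation step can be illustrated on a toy scene. The following is a minimal NumPy sketch, not the GeoSAM-3D implementation: the function name `heat_kernel_propagate`, the radius graph, the Gaussian edge weights, the threshold, and all parameter values are assumptions chosen for illustration, and a dense eigendecomposition stands in for whatever differentiable kernel the repository actually uses. It diffuses a one-hot prompt over a graph built on centroids and compares the result with a Euclidean ball around the seed.

```python
import numpy as np


def heat_kernel_propagate(centroids, seed_idx, radius=0.12, sigma=0.05, t=50.0):
    """Diffuse a one-hot prompt over a radius graph of centroids for time t.

    Hypothetical helper for illustration; all defaults are assumptions.
    """
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Connect only nearby centroids, with Gaussian edge weights and no self-loops.
    W = np.where(dist <= radius, np.exp(-dist**2 / (2 * sigma**2)), 0.0)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W  # unnormalized graph Laplacian
    lam, U = np.linalg.eigh(L)      # L is symmetric, so eigh applies
    seed = np.zeros(len(centroids))
    seed[seed_idx] = 1.0
    # Apply the heat kernel H_t = U exp(-t * Lambda) U^T to the seed.
    return U @ (np.exp(-t * lam) * (U.T @ seed))


# Toy scene: two parallel rows of centroids, 0.3 apart (more than the radius).
xs_a = np.arange(21) * 0.05
obj_a = np.stack([xs_a, np.zeros(21), np.zeros(21)], axis=1)
xs_b = np.arange(15) * 0.05
obj_b = np.stack([xs_b, np.full(15, 0.3), np.zeros(15)], axis=1)
centroids = np.concatenate([obj_a, obj_b])
is_a = np.arange(len(centroids)) < 21

# Geodesic-style propagation from a seed at one end of object A.
heat = heat_kernel_propagate(centroids, seed_idx=0)
geo_mask = heat > 1e-3 * heat.max()

# Euclidean baseline: a ball around the seed just large enough to cover object A.
d_seed = np.linalg.norm(centroids - centroids[0], axis=1)
euclid_mask = d_seed <= d_seed[is_a].max()

print("heat-kernel mask on A / B:", geo_mask[is_a].sum(), geo_mask[~is_a].sum())
print("Euclidean mask on A / B:  ", euclid_mask[is_a].sum(), euclid_mask[~is_a].sum())
```

In this toy scene the two rows are farther apart than the connection radius, so they form separate graph components. Heat from the seed reaches only centroids connected to it through the graph, while a Euclidean ball wide enough to cover the seeded row also covers the other one. A real reconstruction would need a graph construction that keeps distinct objects from being spuriously linked.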

In practice

Segment objects in 3D from a prompt given in a single video frame.
Generate 3D masks for objects in Gaussian scenes.
Propagate user-defined prompts across 3D scenes.

Topics

3D Scene Segmentation
Monocular Video
Gaussian Splatting
Geodesic Propagation
Open-Vocabulary Segmentation
Computer Vision

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.