GMOS: Grounding Moving Object Segmentation in 3D Space and Time

2026-05-28 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

GMOS is a framework for Moving Object Segmentation (MOS) that targets two limitations of existing methods: they often depend on pre-computed 2D auxiliary data such as optical flow, and they treat motion as a sequence-level attribute, ignoring whether an object is moving at a given instant. GMOS instead processes RGB video directly and produces 3D-aware, temporally fine-grained segmentation of multiple moving objects. A faster variant, GMOS-S, handles foreground-background segmentation. For training and evaluation, the authors curated GMOS-2K, a dataset of 2,210 real-world videos drawn from five Video Object Segmentation (VOS) benchmarks and annotated with per-object temporal motion labels. They also introduce MOS-I, a temporally fine-grained evaluation protocol with three complementary metrics. GMOS achieves leading results across MOS, MOS-I, and Unsupervised VOS benchmarks while running significantly faster, and it supports online inference for streaming applications.

Key takeaway

For computer vision engineers building real-time Moving Object Segmentation (MOS) systems, GMOS is worth evaluating. It produces 3D-aware, temporally fine-grained segmentation directly from RGB video, with no reliance on pre-computed 2D auxiliary data such as optical flow. Its leading benchmark results, faster operation, and support for online inference suit streaming deployments, and removing the auxiliary-data stage may simplify your pipeline and improve accuracy in dynamic scenes.

Key insights

GMOS grounds Moving Object Segmentation in 3D space and time, working directly from RGB video to deliver temporally fine-grained, multi-object segmentation.

Principles

Ground MOS in 3D space and time.
Model instantaneous object motion, not just sequence-level motion.
Process RGB video directly, without pre-computed 2D auxiliary data.
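
The difference between sequence-level and instantaneous motion can be illustrated with a minimal sketch. The annotation layout, object names, and helper functions below are hypothetical, invented for illustration, and are not the GMOS-2K format. The sketch only shows why a single label per object hides when that object is actually moving.

```python
from typing import Dict, List

# Hypothetical per-object, per-frame motion flags (True = moving in that frame).
# This layout is invented for illustration; it is not the GMOS-2K format.
motion_flags: Dict[str, List[bool]] = {
    "car":        [False, False, True, True, True],
    "pedestrian": [True, True, False, False, False],
    "parked_van": [False, False, False, False, False],
}


def sequence_level_labels(flags: Dict[str, List[bool]]) -> Dict[str, bool]:
    """One label per object: moving if it moves in any frame of the video."""
    return {obj: any(per_frame) for obj, per_frame in flags.items()}


def instantaneous_labels(flags: Dict[str, List[bool]]) -> List[List[str]]:
    """One list per frame: the objects that are moving at that instant."""
    num_frames = len(next(iter(flags.values())))
    return [
        sorted(obj for obj, per_frame in flags.items() if per_frame[t])
        for t in range(num_frames)
    ]


seq = sequence_level_labels(motion_flags)
inst = instantaneous_labels(motion_flags)
```

Under the sequence-level view, both the car and the pedestrian count as moving for the whole clip. The per-frame view shows that they never move at the same time, which is the kind of detail a temporally fine-grained protocol is meant to capture.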

Method

GMOS operates directly on RGB video to produce 3D-aware, temporally fine-grained segmentation of multiple moving objects, and it supports online inference. It is trained and evaluated on the GMOS-2K dataset, with temporally fine-grained performance measured under the MOS-I protocol.
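
Online inference means each frame is segmented as it arrives, using only past and current frames. The sketch below shows that causal, stateful interface under stated assumptions: PlaceholderSegmenter, step, and run_online are hypothetical names, and the stand-in model uses plain 2D frame differencing, which is not the GMOS method and is not 3D-aware. Only the shape of the streaming loop is the point.

```python
from typing import Iterable, Iterator, List, Optional

Frame = List[List[int]]   # grayscale frame as rows of pixel intensities
Mask = List[List[bool]]   # True where a pixel is flagged as moving


class PlaceholderSegmenter:
    """Hypothetical stand-in for a streaming segmenter.

    Uses simple frame differencing, which is NOT the GMOS method; it only
    illustrates a model that keeps state between frames.
    """

    def __init__(self, threshold: int = 10) -> None:
        self.threshold = threshold
        self.previous: Optional[Frame] = None  # state carried across frames

    def step(self, frame: Frame) -> Mask:
        if self.previous is None:
            # No history yet, so nothing can be flagged as moving.
            mask = [[False] * len(row) for row in frame]
        else:
            mask = [
                [abs(p - q) > self.threshold for p, q in zip(row, prev_row)]
                for row, prev_row in zip(frame, self.previous)
            ]
        self.previous = frame
        return mask


def run_online(frames: Iterable[Frame], model: PlaceholderSegmenter) -> Iterator[Mask]:
    """Yield one mask per incoming frame without looking at future frames."""
    for frame in frames:
        yield model.step(frame)


stream = [
    [[0, 0], [0, 0]],
    [[0, 50], [0, 0]],
    [[0, 50], [0, 0]],
]
masks = list(run_online(stream, PlaceholderSegmenter()))
```

Because run_online is a generator, a caller can consume masks one frame at a time from a camera or network stream instead of buffering the full video first.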

In practice

Deploy GMOS-S when faster foreground-background MOS is sufficient.
Use GMOS for online inference in streaming applications.
Evaluate temporally fine-grained MOS with the MOS-I protocol.

Topics

Moving Object Segmentation
3D Computer Vision
RGB Video Analysis
Online Inference
GMOS-2K Dataset
Video Object Segmentation

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.