VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation

Source: cs.CV updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, long

Summary

VideoSEG-O3 is a multi-turn reinforcement learning framework for Reasoning Video Object Segmentation (RVOS) that addresses a limitation of existing methods: they reason over a fixed set of input frames. Emulating human "coarse-to-fine" cognition, VideoSEG-O3 uses a multi-turn temporal-spatial chain-of-thought to iteratively locate the critical intervals and keyframes where fine-grained detail must be captured. The framework introduces SEG-aware logit calibration, which integrates pixel-wise segmentation feedback directly into token-level logits so that the policy is sensitive to segmentation quality and not only to text probability. It also uses a decoupled thinking trace that hierarchically decomposes reasoning into temporal, spatial, and linguistic dimensions. With a compact 4B model, VideoSEG-O3 reports gains across eight RVOS benchmarks, including +6.1% on LongRVOS, +4.2% on MeViS, and +4.0% on ReVOS. It also sets a new state of the art on the GroundMoRe benchmark with an Overall 𝒥&ℱ score of 31.96%, an improvement of 8.83% over MORA.

Key takeaway

For machine learning engineers building video object segmentation systems, VideoSEG-O3 points to two techniques worth evaluating: iterative visual exploration and SEG-aware logit calibration. Letting the model actively revisit intervals and keyframes, instead of reasoning once over fixed inputs, is associated with the reported gains on intricate, long-form video tasks. Decoupling the reasoning into temporal, spatial, and linguistic steps is a further option for improving spatio-temporal grounding.

Key insights

Multi-turn RL with calibrated segmentation feedback enables adaptive, precise video object segmentation.
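The summary does not give the calibration formula, so the following is only a toy illustration of the general idea of letting pixel-level mask quality influence a token-level logit. Everything here is an assumption made for illustration: the helper names (`mask_iou`, `calibrate_seg_logit`), the additive shift, the `alpha` scale, and the use of IoU as the feedback signal are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def mask_iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Intersection-over-union of two flattened binary masks (0/1 per pixel)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0  # two empty masks count as a match


def calibrate_seg_logit(
    token_logits: Sequence[float],
    seg_index: int,
    iou: float,
    alpha: float = 2.0,
) -> List[float]:
    """Hypothetical calibration: shift one token's logit by mask quality.

    ASSUMPTION: an additive shift of alpha * (iou - 0.5), so masks better
    than IoU 0.5 raise the logit and worse masks lower it. The paper's
    actual rule may differ.
    """
    out = list(token_logits)
    out[seg_index] += alpha * (iou - 0.5)
    return out


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

In this sketch, a segmentation token whose decoded mask overlaps the reference well becomes more probable under the policy, while a poor mask makes it less probable, even when the text-only logits are identical.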

Principles

Method

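VideoSEG-O3 formulates RVOS as a Markov Decision Process with three components: a multi-turn temporal-spatial chain-of-thought for active exploration of the video, SEG-aware logit calibration to make the policy sensitive to mask quality, and a decoupled thinking trace that separates temporal, spatial, and linguistic reasoning.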
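To make the coarse-to-fine, multi-turn idea concrete, here is a minimal sketch of an exploration loop in which each turn samples a few frames from the current interval and then zooms in around the most relevant one. It is not the paper's policy: the function `coarse_to_fine_keyframe`, the `Interval` state, and the caller-supplied `score_frame` relevance function are all hypothetical stand-ins, and a learned policy would replace the greedy zoom rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Interval:
    """State of the toy decision process: the frame range still in play."""
    start: int  # inclusive
    end: int    # inclusive


def coarse_to_fine_keyframe(
    num_frames: int,
    score_frame: Callable[[int], float],  # hypothetical relevance score
    max_turns: int = 6,
    samples_per_turn: int = 8,
) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (keyframe index, list of intervals visited, one per turn)."""
    if num_frames < 1 or max_turns < 1 or samples_per_turn < 2:
        raise ValueError("need >=1 frame, >=1 turn, and >=2 samples per turn")

    state = Interval(0, num_frames - 1)
    trace: List[Tuple[int, int]] = []
    best = state.start
    for _ in range(max_turns):
        trace.append((state.start, state.end))
        span = state.end - state.start
        # Temporal step: sample the interval sparsely (coarse view).
        step = max(1, span // (samples_per_turn - 1))
        sampled = range(state.start, state.end + 1, step)
        best = max(sampled, key=score_frame)
        if step == 1:
            break  # every frame in the interval was inspected
        # Action: zoom into the neighbourhood of the best sampled frame.
        state = Interval(
            max(state.start, best - step),
            min(state.end, best + step),
        )
    return best, trace
```

Calling `coarse_to_fine_keyframe(1000, lambda t: -abs(t - 137))` narrows a 1000-frame video down to a single keyframe in a handful of turns while scoring only a few dozen frames. The greedy zoom assumes the relevance score has a single peak; handling multiple relevant intervals is part of what a learned policy would need to do.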

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.