VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, long

Summary

VideoSEG-O3 is a multi-turn reinforcement learning framework for Reasoning Video Object Segmentation (RVOS) that addresses the reliance of existing methods on fixed inputs. Emulating human "coarse-to-fine" cognition, VideoSEG-O3 uses a multi-turn temporal-spatial chain-of-thought to iteratively locate the critical intervals and keyframes needed to capture fine-grained detail. The framework introduces SEG-aware logit calibration, which integrates pixel-wise segmentation feedback directly into token-level logits so that the policy can perceive segmentation quality rather than text probability alone. It also uses a decoupled thinking trace that hierarchically decomposes reasoning into temporal, spatial, and linguistic dimensions. With a compact 4B model, VideoSEG-O3 reports strong results across eight RVOS benchmarks, including gains of +6.1% on LongRVOS, +4.2% on MeViS, and +4.0% on ReVOS. It also sets a new state of the art on the GroundMoRe benchmark with an Overall J&F score of 31.96%, a +8.83% improvement over MORA.

Key takeaway

For machine learning engineers building video object segmentation systems, VideoSEG-O3's multi-turn reinforcement learning approach is worth evaluating as an alternative to inference over fixed inputs. Consider iterative visual exploration and SEG-aware logit calibration to improve precision and adaptability: the framework lets a model actively refine its understanding, which the reported results suggest helps on intricate, long-form video tasks. Decoupling the reasoning steps is also worth exploring for better spatio-temporal grounding.

Key insights

Multi-turn RL with calibrated segmentation feedback enables adaptive, precise video object segmentation.

Principles

Iterative visual exploration refines segmentation.
Pixel-wise feedback improves token-level policy.
Decoupling reasoning enhances understanding.

Method

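VideoSEG-O3 formulates RVOS as a Markov Decision Process. It combines three components: a multi-turn temporal-spatial chain-of-thought for active exploration of the video, SEG-aware logit calibration that feeds mask quality back into the token-level policy, and a decoupled thinking trace that separates temporal, spatial, and linguistic reasoning.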
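
To make the coarse-to-fine, multi-turn idea concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm: the function `coarse_to_fine`, the per-frame `score_fn` (a stand-in for whatever relevance signal a learned policy would produce), and the parameters `num_splits`, `min_len`, and `max_turns` are all hypothetical names chosen for illustration. Each turn treats the current frame interval as the state and the choice of a sub-interval as the action, and the episode ends by selecting a keyframe.

```python
from typing import Callable, List, Tuple

Interval = Tuple[int, int]  # half-open frame range [start, end)


def coarse_to_fine(
    num_frames: int,
    score_fn: Callable[[int], float],  # hypothetical per-frame relevance signal
    num_splits: int = 4,
    min_len: int = 4,
    max_turns: int = 8,
) -> Tuple[int, List[Interval]]:
    """Narrow a frame interval turn by turn, then pick a keyframe inside it."""
    start, end = 0, num_frames
    trace: List[Interval] = [(start, end)]  # intervals visited, one per turn

    def mean_score(iv: Interval) -> float:
        s, e = iv
        return sum(score_fn(t) for t in range(s, e)) / (e - s)

    for _ in range(max_turns):
        length = end - start
        if length <= min_len:
            break  # interval is fine-grained enough; stop exploring
        step = -(-length // num_splits)  # ceiling division
        candidates = [(s, min(s + step, end)) for s in range(start, end, step)]
        # Action: zoom into the sub-interval with the highest mean relevance.
        start, end = max(candidates, key=mean_score)
        trace.append((start, end))

    # Final step: choose the single most relevant frame in the last interval.
    keyframe = max(range(start, end), key=score_fn)
    return keyframe, trace


if __name__ == "__main__":
    # Toy signal peaking at frame 137 of a 256-frame video.
    keyframe, trace = coarse_to_fine(256, lambda t: -abs(t - 137))
    print(keyframe, trace)
```

In a real system the greedy `max` would presumably be replaced by a learned policy that only sees frames sampled from each candidate interval, with the reward coming from downstream segmentation quality; the sketch only illustrates the turn structure.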

In practice

Use multi-turn RL for complex video tasks.
Calibrate token logits with mask confidence.
Decompose reasoning into temporal, spatial, and linguistic dimensions.
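
One way to picture calibrating token logits with mask confidence is sketched below. The summary does not give the paper's formula, so everything here is an assumption made for illustration: the confidence measure (mean per-pixel certainty), the additive `alpha * log(confidence)` adjustment, and the names `mask_confidence`, `calibrate_logits`, `seg_index`, and `alpha` are all hypothetical. The point is only that a segmentation token's log-probability can be made to depend on how sharp the predicted mask is, not just on the text distribution.

```python
import math
from typing import List


def mask_confidence(probs: List[List[float]]) -> float:
    """Mean per-pixel certainty max(p, 1 - p) of a foreground-probability map.

    Ranges from 0.5 (maximally uncertain mask) to 1.0 (fully confident mask).
    """
    flat = [p for row in probs for p in row]
    return sum(max(p, 1.0 - p) for p in flat) / len(flat)


def calibrate_logits(
    logits: List[float], seg_index: int, confidence: float, alpha: float = 1.0
) -> List[float]:
    """Return a copy of `logits` with the segmentation token's logit shifted.

    Assumed rule (not from the paper): add alpha * log(confidence), which is
    zero for a perfectly confident mask and negative otherwise.
    """
    out = list(logits)
    out[seg_index] += alpha * math.log(confidence)
    return out


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


if __name__ == "__main__":
    logits = [2.0, 0.5, 0.1]  # index 0 plays the role of the segmentation token
    sharp = [[0.99, 0.01], [0.98, 0.02]]
    blurry = [[0.60, 0.40], [0.55, 0.45]]
    for name, mask in (("sharp", sharp), ("blurry", blurry)):
        conf = mask_confidence(mask)
        logp = log_softmax(calibrate_logits(logits, 0, conf))
        print(name, round(conf, 3), round(logp[0], 3))
```

Under this assumed rule, the same text logits yield a lower log-probability for the segmentation token when the mask is blurry, so a policy-gradient update would see mask quality reflected at the token level.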

Topics

Reasoning Video Object Segmentation
Reinforcement Learning
Multimodal Large Language Models
Chain-of-Thought
Logit Calibration
Temporal-Spatial Reasoning

Code references

Dmmm1997/VideoSEG-O3

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.