Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning

2026-06-10 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Dual-State Slot Attention (DSSA) is a fully self-supervised framework that targets identity instability in slot-based video object-centric learning. Prior approaches often struggle to maintain stable object identity in dynamic scenes, such as those with rapid motion or partial occlusion. They encode both per-frame appearance and cross-frame identity in a single slot vector, which sets the two objectives against each other and leads to slot swapping. DSSA resolves this by decomposing each slot into a local state for per-frame appearance and a separate identity state for temporal stability. The identity state is updated via a learned recurrent transition that acts as a temporal filter, while competition-modulated aggregation (CMA) reduces spurious updates from weakly matching slots. Experiments on the MOVi-C, MOVi-D, and YouTube-VIS datasets show that DSSA consistently improves segmentation quality, temporal consistency, object recognition, and video dynamics prediction.

Key takeaway

For computer vision engineers building robust video analysis systems, Dual-State Slot Attention (DSSA) offers a self-supervised way to reduce object identity instability. Consider decoupling appearance and identity representations in slot-based models: DSSA reports improved segmentation and temporal consistency on challenging video datasets such as MOVi-C, MOVi-D, and YouTube-VIS. The reported gains carry over to downstream object recognition and dynamics prediction, which makes the method a more reliable foundation for complex video understanding tasks.

Key insights

Decoupling appearance and identity states improves the stability and performance of video object-centric learning, with consistent gains reported on MOVi-C, MOVi-D, and YouTube-VIS.

Principles

Separating appearance and identity representations resolves objective conflicts in video object learning.
Recurrent temporal filtering on identity states enhances cross-frame consistency.
Competition-modulated aggregation down-weights updates from weakly matching slots, so they are less likely to absorb tokens and destabilize slot-to-object correspondence.

Method

DSSA decomposes slots into local (appearance) and identity states, updates identity via a learned recurrent transition, and uses competition-modulated aggregation (CMA) to down-weight weak slot updates.
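
The dual-state idea can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `DualStateSlot` container, the `update_slot` function, and in particular the fixed scalar `gate`, which stands in for the learned recurrent transition that this summary does not specify. The sketch only shows why a filtered identity state stays stable when a single frame's appearance changes abruptly.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class DualStateSlot:
    local: Vector     # per-frame appearance, overwritten on every frame
    identity: Vector  # cross-frame identity, updated slowly


def update_slot(slot: DualStateSlot, frame_feature: Vector,
                gate: float = 0.2) -> DualStateSlot:
    """Hypothetical per-frame update (not the paper's learned transition).

    The local state tracks the current frame exactly. The identity state is
    an exponential moving average, a simple fixed-gate temporal filter used
    here in place of a learned recurrent cell.
    """
    new_identity = [(1.0 - gate) * h + gate * x
                    for h, x in zip(slot.identity, frame_feature)]
    return DualStateSlot(local=list(frame_feature), identity=new_identity)


def run(features: List[Vector], gate: float = 0.2) -> DualStateSlot:
    """Initialise both states from the first frame, then filter the rest."""
    slot = DualStateSlot(local=list(features[0]), identity=list(features[0]))
    for feature in features[1:]:
        slot = update_slot(slot, feature, gate)
    return slot


if __name__ == "__main__":
    # Two consistent frames, then one frame with abruptly changed appearance
    # (a toy stand-in for occlusion).
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    final = run(frames)
    print("local:", final.local)
    print("identity:", final.identity)
```

In this toy run the local state follows the last frame completely, while the identity state moves only a fraction of the way toward it. A learned transition would presumably choose that fraction per slot and per frame rather than using a constant.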

In practice

Apply DSSA for robust object segmentation in dynamic video environments.
Utilize DSSA to improve downstream object recognition from video streams.
Implement CMA to mitigate slot-to-object correspondence issues in slot-based models.
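
For readers who want to experiment with the CMA item above, here is a hedged sketch of one way competition could modulate aggregation. The exact CMA formulation is not given in this summary, so the rule used below is an assumption: each slot's aggregated update is scaled by the largest share it wins in the softmax competition over slots. The function name `cma_aggregate` and the `strength` score are hypothetical. In standard slot attention, attention weights are renormalised per slot, so a slot that matches nothing still receives a full-magnitude update; the scaling here is meant to show how that effect could be damped.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cma_aggregate(slots: Matrix, tokens: Matrix) -> Tuple[Matrix, List[float]]:
    """Assumed competition-modulated aggregation (illustrative only).

    Returns one update vector per slot and the competition strength that
    was used to scale it.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    # attn[j][i]: share of token j claimed by slot i (softmax over slots).
    attn = [softmax([scale * dot(s, t) for s in slots]) for t in tokens]

    updates: Matrix = []
    strengths: List[float] = []
    for i in range(len(slots)):
        col = [attn[j][i] for j in range(len(tokens))]
        mass = sum(col) + 1e-8  # epsilon guards against division by zero
        # Standard slot-attention step: attention-weighted mean of tokens.
        mean = [sum(col[j] * tokens[j][d] for j in range(len(tokens))) / mass
                for d in range(dim)]
        # Assumed modulation: scale by the best share this slot wins.
        strength = max(col)
        updates.append([strength * v for v in mean])
        strengths.append(strength)
    return updates, strengths


if __name__ == "__main__":
    slots = [[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]]  # third slot matches nothing
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    updates, strengths = cma_aggregate(slots, tokens)
    for i, (u, s) in enumerate(zip(updates, strengths)):
        print(f"slot {i}: strength={s:.3f} update={u}")
```

With these toy inputs the third slot wins almost no share of any token, so its update is scaled close to zero, while the two well-matched slots keep updates near their full magnitude.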

Topics

Video Object-Centric Learning
Dual-State Slot Attention
Computer Vision
Object Segmentation
Temporal Consistency
Self-Supervised Learning

Best for: Research Scientist, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.