Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

Dual-State Slot Attention (DSSA) is a fully self-supervised framework for video object-centric learning that targets a known weakness of slot-based methods: unstable object identity in dynamic scenes. Prior approaches encode both per-frame appearance and cross-frame identity in a single slot vector. The two objectives conflict, so under rapid motion or partial occlusion slots tend to swap between objects. DSSA splits each slot into a local state that captures per-frame appearance and a separate identity state that carries temporal stability. The identity state is updated by a learned recurrent transition that acts as a temporal filter, and competition-modulated aggregation (CMA) suppresses spurious updates from weakly matching slots. In experiments on MOVi-C, MOVi-D, and YouTube-VIS, DSSA consistently improves segmentation quality, temporal consistency, object recognition, and video dynamics prediction.

Key takeaway

For computer vision engineers building robust video analysis systems, Dual-State Slot Attention (DSSA) offers a self-supervised way to reduce object identity instability. Consider decoupling appearance and identity representations in your own implementations: DSSA shows improved segmentation and temporal consistency on challenging video datasets such as MOVi-C, MOVi-D, and YouTube-VIS. The method also improves downstream object recognition and dynamics prediction, giving a more reliable foundation for complex video understanding tasks.

Key insights

Splitting each slot into separate appearance and identity states reduces slot swapping and improves both the stability and the performance of video object-centric learning.

Method

DSSA decomposes slots into local (appearance) and identity states, updates identity via a learned recurrent transition, and uses competition-modulated aggregation (CMA) to down-weight weak slot updates.
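The three steps above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation, and everything in it is an assumption made for illustration: the function name `dual_state_step`, the use of identity states as attention queries, a fixed gated blend standing in for the learned recurrent transition, and "peak attention weight" as the competition signal that modulates the update. The summary does not specify the actual form of the transition or of CMA.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dual_state_step(identity, features, gate_rate=0.5):
    """One frame of a toy dual-state slot update (illustrative only).

    identity: K identity vectors that persist across frames
    features: N feature vectors extracted from the current frame
    Returns (local, new_identity, strength), each with K entries.
    """
    num_feats, dim = len(features), len(features[0])
    # Slots compete for every feature: softmax over the slot axis.
    # Assumption: identity states serve as the attention queries.
    attn = [softmax([dot(s, f) for s in identity]) for f in features]

    local, new_identity, strength = [], [], []
    for k, ident in enumerate(identity):
        weights = [attn[n][k] for n in range(num_feats)]
        mass = sum(weights) + 1e-8
        # Local state: attention-weighted mean of this frame's features.
        local_k = [
            sum(w * f[d] for w, f in zip(weights, features)) / mass
            for d in range(dim)
        ]
        # Assumed competition signal: how decisively the slot wins
        # its best-matching feature, in (0, 1).
        s_k = max(weights)
        # Fixed gated blend standing in for the learned recurrent
        # transition; weakly matching slots barely move their identity.
        g = gate_rate * s_k
        new_identity.append(
            [(1.0 - g) * i + g * l for i, l in zip(ident, local_k)]
        )
        local.append(local_k)
        strength.append(s_k)
    return local, new_identity, strength


# Toy frame: every feature matches slot 0, none matches slot 1.
identity = [[1.0, 0.0], [0.0, 1.0]]
features = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
local, new_identity, strength = dual_state_step(identity, features)
print("competition strength:", strength)
print("updated identity:", new_identity)
```

In this toy frame the slot that loses the competition keeps its identity state almost unchanged, which is the behavior the down-weighting is meant to produce. A real system would replace the fixed blend with a trained recurrent cell and learn the attention projections.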

Best for: Research Scientist, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.