DIMOS: Disentangling Instance-level Moving Object Segmentation

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

DIMOS (Disentangling Instance-level Moving Object Segmentation) is a framework for multimodal moving instance segmentation that targets small objects and difficult conditions such as fast motion or low light. It introduces a dual-disentangling feature extraction mechanism that separates appearance and motion information within both the image and event modalities, which increases feature density. A multi-granularity cross-modal alignment strategy complements it by enforcing distributional and semantic consistency before the features are fused. Evaluated on the MouseSIS, SEVD-Fixed, and EVIMO datasets, DIMOS reports leading performance, including 70.25% mIoU_ins on MouseSIS, 62.05% mIoU_ins on SEVD-Fixed, and 72.08% mIoU_ins on EVIMO. The system was trained in PyTorch with the Adam optimizer for up to 800K iterations on two A40 GPUs.
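To make the reported metric concrete, here is a minimal sketch of an instance-level mean IoU. The summary does not define how mIoU_ins is computed, so the matching rule below (each ground-truth instance is scored against its best-overlapping prediction) and the names `mask_iou` and `mean_instance_iou` are assumptions for illustration, not the benchmarks' official protocol.

```python
def mask_iou(pred, gt):
    """IoU between two instance masks given as sets of (row, col) pixels."""
    union = pred | gt
    if not union:
        return 0.0
    return len(pred & gt) / len(union)


def mean_instance_iou(pred_masks, gt_masks):
    """Average, over ground-truth instances, of the best IoU with any prediction.

    ASSUMPTION: this best-match rule is illustrative only; the official
    mIoU_ins protocol of each dataset may match instances differently.
    """
    if not gt_masks:
        return 0.0
    scores = []
    for gt in gt_masks:
        # A ground-truth instance with no overlapping prediction scores 0.
        best = max((mask_iou(p, gt) for p in pred_masks), default=0.0)
        scores.append(best)
    return sum(scores) / len(scores)


# Toy example: two ground-truth instances, one mostly found, one missed.
gt_masks = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
pred_masks = [{(0, 0), (0, 1), (1, 0)}]
score = mean_instance_iou(pred_masks, gt_masks)
print(f"instance-level mean IoU: {score:.3f}")
```

Under this rule a missed small instance costs as much as a missed large one, which is why an instance-averaged score is sensitive to small-object performance.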

Key takeaway

Computer vision engineers building moving instance segmentation systems for challenging conditions such as low light or fast motion may find that conventional multimodal fusion falls short. Consider a dual-disentangling mechanism that separates appearance and motion features within both the image and event modalities. Combined with multi-granularity cross-modal alignment, this is reported to improve feature density and fusion effectiveness and to yield higher accuracy on small instances.

Key insights

Disentangling appearance and motion features within each modality, then aligning them across modalities, improves moving instance segmentation, especially for small objects.

Principles

Separate appearance and motion cues per modality.
Align cross-modal features distributionally and semantically.
Strengthen disentanglement with intra-modal contrastive learning.

Method

The DIMOS framework employs dual-disentangling encoders with intra-modal contrastive learning and task-specific supervision. It then uses multi-granularity cross-modal alignment via adversarial domain adaptation and modality translation for robust feature fusion.
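The summary does not give the exact contrastive objective, so the following is a minimal sketch using the standard InfoNCE loss as a stand-in. The pairing is an assumption: a motion feature is pulled toward another motion feature and pushed away from appearance features of the same modality. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: negative log-softmax of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denom - logits[0]


# ASSUMPTION: toy 2-D features; real features would come from the encoders.
motion_a = [1.0, 0.0]
motion_b = [0.9, 0.1]

# Well-disentangled case: the appearance feature is orthogonal to motion.
loss_separated = info_nce(motion_a, motion_b, negatives=[[0.0, 1.0]])

# Entangled case: the appearance feature looks just like the motion feature.
loss_entangled = info_nce(motion_a, motion_b, negatives=[[0.9, 0.1]])

print(f"separated: {loss_separated:.4f}, entangled: {loss_entangled:.4f}")
```

The loss is near zero when the appearance branch carries no motion information and rises to log 2 when the negative is indistinguishable from the positive, which is the pressure that drives the two branches apart.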

In practice

Implement dual-branch encoders for feature extraction.
Utilize contrastive learning for intra-modal separation.
Apply adversarial domain adaptation for cross-modal alignment.
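One common way to implement adversarial domain adaptation is a gradient reversal layer; the summary does not say whether DIMOS uses this particular mechanism, so treat the sketch below as an assumption. It uses a hypothetical one-parameter encoder and discriminator with hand-written gradients so that it runs without a deep learning library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_loss_and_grads(w, v, x, d):
    """Forward and manual backward pass for a 1-D toy model.

    feature f = w * x            (encoder with weight w)
    p = sigmoid(v * f)           (modality discriminator with weight v)
    loss = binary cross-entropy between p and the modality label d
    """
    f = w * x
    p = sigmoid(v * f)
    loss = -(d * math.log(p) + (1 - d) * math.log(1 - p))
    dloss_df = (p - d) * v   # gradient flowing back into the feature
    grad_v = (p - d) * f     # discriminator gradient (not reversed)
    grad_w = dloss_df * x    # encoder gradient before reversal
    return loss, grad_w, grad_v


def reversed_encoder_grad(grad_w, lam=1.0):
    """Gradient reversal: identity forward, gradient scaled by -lam backward."""
    return -lam * grad_w


# ASSUMPTION: toy values; d = 1 stands for one modality (e.g. events).
w, v, x, d, lr = 0.5, 1.0, 2.0, 1, 0.1
loss_before, grad_w, grad_v = domain_loss_and_grads(w, v, x, d)

# The encoder descends on the reversed gradient, i.e. it ascends the
# discriminator's loss, making the modality harder to identify.
w_new = w - lr * reversed_encoder_grad(grad_w)
loss_after, _, _ = domain_loss_and_grads(w_new, v, x, d)

print(f"domain loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

In a full training loop the discriminator would still be updated with its ordinary gradient `grad_v`, so the two parts compete: the discriminator learns to tell modalities apart while the encoder learns features for which it cannot.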

Topics

Moving Instance Segmentation
Event Cameras
Multimodal Fusion
Feature Disentanglement
Cross-Modal Alignment
Autonomous Driving

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.