X2SAM: Any Segmentation in Images and Videos

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Natural Language Processing · Depth: Expert, extended

Summary

X2SAM is a unified segmentation-oriented Multimodal Large Language Model (MLLM) designed to extend "any-segmentation" capabilities from images to videos, addressing the limitations of existing models that are often specialized or lack comprehensive prompt support. It integrates an LLM with a novel Mask Memory module to store guided vision features, ensuring temporally consistent video mask generation. X2SAM supports a wide array of segmentation tasks, including generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation, across both image and video inputs. The framework introduces the Video Visual Grounded (V-VGD) segmentation benchmark for evaluating object track segmentation from interactive visual prompts in videos. Through a unified joint training strategy over heterogeneous image and video datasets, X2SAM achieves strong video segmentation performance, remains competitive on image benchmarks, and preserves general image and video chat abilities, demonstrating a practical baseline for pixel-level spatio-temporal understanding.

Key takeaway

For research scientists developing multimodal AI, X2SAM offers a single framework for image and video segmentation. Its Mask Memory module and joint training strategy provide a blueprint for achieving temporal consistency and broad task coverage. Consider X2SAM's architecture for projects that require pixel-level understanding of dynamic visual data: it performs strongly on reasoning and out-of-domain tasks, though computational cost may become a concern for very long videos.

Key insights

X2SAM unifies image and video segmentation via an MLLM and Mask Memory for consistent, prompt-driven mask generation.

Method

X2SAM processes textual and visual prompts, using a dual-branch visual extraction architecture and an LLM to guide a Mask Decoder. A Mask Memory module maintains temporal coherence via a FIFO cache of guided vision features.
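
To make the FIFO idea concrete, here is a minimal Python sketch of a fixed-capacity cache that evicts the oldest frame feature when a new one arrives. It is an illustration only, not X2SAM's implementation: the class name `MaskMemory`, the `capacity` parameter, the use of plain float lists as stand-ins for guided vision features, and the element-wise mean used to fuse cached features are all assumptions made for this example.

```python
from collections import deque
from typing import Deque, List, Sequence

Feature = List[float]  # stand-in for a per-frame guided vision feature


class MaskMemory:
    """Toy FIFO cache of per-frame features (hypothetical, not the paper's module)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # A deque with maxlen drops its oldest entry when a new one is
        # appended to a full buffer, which gives first-in, first-out eviction.
        self._frames: Deque[Feature] = deque(maxlen=capacity)

    def push(self, feature: Sequence[float]) -> None:
        """Store a copy of the feature for the most recent frame."""
        self._frames.append(list(feature))

    def __len__(self) -> int:
        return len(self._frames)

    def context(self) -> Feature:
        """Element-wise mean of cached features (an assumed, simplistic fusion)."""
        if not self._frames:
            return []
        dim = len(self._frames[0])
        count = len(self._frames)
        return [sum(f[i] for f in self._frames) / count for i in range(dim)]


memory = MaskMemory(capacity=2)
for frame_feature in ([1.0, 0.0], [3.0, 2.0], [5.0, 4.0]):
    memory.push(frame_feature)

# Only the two most recent frames remain; the first one was evicted.
fused = memory.context()
```

In a real model the cached entries would be tensors and the fusion would likely be learned (for example, attention over the memory) rather than a plain mean; the sketch only shows the bounded, oldest-first eviction behaviour that keeps memory use constant as the video grows.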

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.