X2SAM: Any Segmentation in Images and Videos

2026-05-05 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Natural Language Processing · Depth: Expert, extended

Summary

X2SAM is a unified segmentation-oriented Multimodal Large Language Model (MLLM) designed to extend "any-segmentation" capabilities from images to videos, addressing the limitations of existing models that are often specialized or lack comprehensive prompt support. It integrates an LLM with a novel Mask Memory module to store guided vision features, ensuring temporally consistent video mask generation. X2SAM supports a wide array of segmentation tasks, including generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation, across both image and video inputs. The framework introduces the Video Visual Grounded (V-VGD) segmentation benchmark for evaluating object track segmentation from interactive visual prompts in videos. Through a unified joint training strategy over heterogeneous image and video datasets, X2SAM achieves strong video segmentation performance, remains competitive on image benchmarks, and preserves general image and video chat abilities, demonstrating a practical baseline for pixel-level spatio-temporal understanding.

Key takeaway

For research scientists developing multimodal AI, X2SAM offers a robust framework for unified image and video segmentation. Its Mask Memory module and joint training strategy provide a blueprint for achieving temporal consistency and broad task coverage. You should consider X2SAM's architecture for projects requiring precise pixel-level understanding across dynamic visual data, noting its strong performance on reasoning and out-of-domain tasks, while acknowledging potential computational costs for very long videos.

Key insights

X2SAM unifies image and video segmentation via an MLLM and Mask Memory for consistent, prompt-driven mask generation.

Principles

Unified architecture for image and video segmentation.
Mask Memory ensures temporal consistency in videos.
Joint training on image and video data lets one model cover both modalities.

Method

X2SAM processes textual and visual prompts, using a dual-branch visual extraction architecture and an LLM to guide a Mask Decoder. A Mask Memory module maintains temporal coherence via a FIFO cache of guided vision features.
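As a rough illustration of the FIFO caching idea only, the sketch below keeps the most recent per-frame features and evicts the oldest once a fixed capacity is reached. It is not X2SAM's implementation: the class name `MaskMemory`, the `capacity` parameter, the `write`/`read` methods, and the helper `cached_frames_after` are all hypothetical, and plain Python lists of floats stand in for the guided vision feature tensors.

```python
from collections import deque
from typing import Deque, List, Tuple

# Stand-in for a guided vision feature tensor (assumption for illustration).
Feature = List[float]


class MaskMemory:
    """Hypothetical FIFO cache of per-frame guided vision features."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # A deque with maxlen drops its oldest entry automatically when full.
        self._entries: Deque[Tuple[int, Feature]] = deque(maxlen=capacity)

    def write(self, frame_idx: int, feature: Feature) -> None:
        """Store the feature for one frame, evicting the oldest entry if full."""
        self._entries.append((frame_idx, list(feature)))

    def read(self) -> List[Tuple[int, Feature]]:
        """Return the cached entries ordered from oldest to newest."""
        return list(self._entries)

    def __len__(self) -> int:
        return len(self._entries)


def cached_frames_after(num_frames: int, capacity: int) -> List[int]:
    """Feed frames 0..num_frames-1 through the memory and list the survivors."""
    memory = MaskMemory(capacity)
    for idx in range(num_frames):
        memory.write(idx, [float(idx)])
    return [frame_idx for frame_idx, _ in memory.read()]
```

With a capacity of 3, feeding five frames through `cached_frames_after(5, 3)` leaves only frames 2, 3, and 4 in the cache, so a decoder conditioned on the memory would see a bounded window of recent context regardless of video length.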

In practice

Use X2SAM for diverse image and video segmentation tasks.
Apply the V-VGD benchmark to evaluate video object grounding.

Topics

X2SAM
Multimodal Large Language Models
Video Segmentation
Mask Memory Module
Visual Grounded Segmentation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.