SSP-SAM: SAM with Semantic-Spatial Prompt for Referring Expression Segmentation

2026-03-21 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision, Natural Language Processing · Depth: Advanced, extended

Summary

SSP-SAM is a one-stage framework that gives the Segment Anything Model (SAM) the natural-language understanding it needs for Referring Expression Segmentation (RES) and Generalized RES (GRES). Its Semantic-Spatial Prompt (SSP) encoder uses a visual attention adapter to highlight salient objects in the visual features and a linguistic attention adapter to highlight discriminative phrases in the linguistic features. The resulting SSPs guide SAM to produce segmentation masks that match the language description. SSP-SAM achieves state-of-the-art performance on RES and GRES benchmarks, including RefCOCO, RefCOCO+, RefCOCOg, and ReferIt, and shows improved open-vocabulary performance on the PhraseCut dataset. The model is lightweight: SSP-SAM-224 has only 18.37M trainable parameters and an inference time of 26.68 ms per sample. It outperforms many existing methods in both efficiency and segmentation quality, particularly at strict precision thresholds such as Pr@0.9.
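To make the Pr@0.9 figure concrete, here is a minimal sketch of how a precision-at-threshold score is typically computed in RES evaluation: the fraction of samples whose predicted mask reaches an IoU above the threshold. The function names, the strict `>` comparison, and the handling of two empty masks are assumptions for illustration, not taken from the SSP-SAM code.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equally sized 2D lists of 0/1."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def precision_at(ious, threshold):
    """Fraction of samples whose IoU is above the threshold (assumed strict)."""
    return sum(iou > threshold for iou in ious) / len(ious)


# Toy masks: 4 overlapping pixels out of a 5-pixel union.
pred = [[1, 1, 0], [1, 1, 0]]
target = [[1, 1, 1], [1, 1, 0]]
toy_iou = mask_iou(pred, target)

# Made-up per-sample IoUs, only to show how the threshold changes the score.
ious = [0.95, 0.91, 0.72, 0.40]
pr_50 = precision_at(ious, 0.5)
pr_90 = precision_at(ious, 0.9)
```

With these toy values, three of four samples pass the 0.5 threshold but only two pass 0.9, which is why Pr@0.9 is a much stricter test of mask quality than Pr@0.5.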

Key takeaway

For AI Scientists and Research Scientists working on multimodal segmentation, SSP-SAM offers an efficient one-stage solution for language-guided segmentation. Consider adopting its Semantic-Spatial Prompt encoder to pair a pre-trained vision-language model such as CLIP with SAM, especially for tasks that demand high precision and generalization across classic RES, GRES, and open-vocabulary scenarios. Compared with complex multi-stage or MLLM-based methods, this approach can improve both mask quality and inference efficiency.

Key insights

SSP-SAM enhances SAM's language understanding for segmentation by integrating semantic and spatial cues via a novel prompt encoder.

Principles

Integrate semantic and spatial cues for robust language-guided segmentation.
Leverage pre-trained vision-language models (CLIP) to inform SAM's segmentation.
Auxiliary tasks can improve primary task performance and data utilization.

Method

SSP-SAM employs a Semantic-Spatial Prompt encoder with visual and linguistic attention adapters to refine CLIP features, which are then processed by a prompt generator to create language-conditioned prompts for SAM's mask decoder.
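The data flow described above can be sketched in NumPy under strong assumptions. The summary does not specify the adapter internals, so this toy version models each adapter as residual cross-attention (visual tokens attend to words, words attend to visual tokens) and the prompt generator as two linear projections. All names, dimensions, and random weights are hypothetical stand-ins, and the split into one semantic vector and one spatial grid only loosely mirrors the sparse and dense prompt embeddings that SAM's mask decoder accepts.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_adapter(tokens, context, w_q, w_k, w_v):
    """Residual cross-attention: each token is refined by attending to `context`."""
    q, k, v = tokens @ w_q, context @ w_k, context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_tokens, n_context)
    return tokens + weights @ v, weights


rng = np.random.default_rng(0)
dim, prompt_dim = 8, 4


def rand_proj(rows, cols):
    """Random stand-in for a learned projection matrix."""
    return rng.normal(scale=0.1, size=(rows, cols))


visual = rng.normal(size=(16, dim))  # stand-in for a 4x4 grid of CLIP patch features
words = rng.normal(size=(5, dim))    # stand-in for 5 CLIP word features

# Visual adapter: patches attend to words, emphasising language-relevant regions.
visual_refined, vis_attn = attention_adapter(
    visual, words, rand_proj(dim, dim), rand_proj(dim, dim), rand_proj(dim, dim))

# Linguistic adapter: words attend to patches, emphasising discriminative phrases.
words_refined, word_attn = attention_adapter(
    words, visual, rand_proj(dim, dim), rand_proj(dim, dim), rand_proj(dim, dim))

# Hypothetical prompt generator: a pooled semantic vector plus a spatial grid.
semantic_prompt = words_refined.mean(axis=0, keepdims=True) @ rand_proj(dim, prompt_dim)
spatial_prompt = (visual_refined @ rand_proj(dim, prompt_dim)).reshape(4, 4, prompt_dim)

print(semantic_prompt.shape, spatial_prompt.shape)
```

In a real system the projections would be learned and the two prompts would be passed to SAM's mask decoder together with SAM's own image embedding; the sketch stops at the prompt shapes.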

In practice

Use attention adapters to highlight salient objects and discriminative phrases.
Incorporate an auxiliary REC task to exploit bounding box annotations.
Consider fine-tuning SAM's decoder for 1-2% performance gains.
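As an illustration of the auxiliary REC idea in the list above, the sketch below combines a mask loss with a box-regression loss so that bounding box annotations contribute to training. The specific terms (Dice loss for the mask, mean L1 for the box) and the `box_weight` parameter are assumptions chosen for simplicity; the summary does not state which losses or weights SSP-SAM uses.

```python
def dice_loss(mask_prob, mask_gt, eps=1e-6):
    """Dice loss over flattened per-pixel probabilities and 0/1 labels."""
    inter = sum(p * t for p, t in zip(mask_prob, mask_gt))
    return 1.0 - (2.0 * inter + eps) / (sum(mask_prob) + sum(mask_gt) + eps)


def l1_box_loss(box_pred, box_gt):
    """Mean absolute error over normalized (x1, y1, x2, y2) coordinates."""
    return sum(abs(p - t) for p, t in zip(box_pred, box_gt)) / 4.0


def total_loss(mask_prob, mask_gt, box_pred, box_gt, box_weight=1.0):
    """Segmentation loss plus a weighted auxiliary REC (box) loss."""
    return dice_loss(mask_prob, mask_gt) + box_weight * l1_box_loss(box_pred, box_gt)


# Toy sample: the mask is predicted perfectly, the box is slightly off.
mask_gt = [1, 1, 0, 0]
mask_prob = [1.0, 1.0, 0.0, 0.0]
box_gt = (0.2, 0.1, 0.5, 0.7)
box_pred = (0.1, 0.1, 0.5, 0.5)

loss = total_loss(mask_prob, mask_gt, box_pred, box_gt, box_weight=0.5)
```

Here the mask term is zero, so the whole loss comes from the box error, which shows how the auxiliary head can still supply a training signal when the mask is already correct.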

Topics

Referring Expression Segmentation
Segment Anything Model
Semantic-Spatial Prompts
Generalized RES
CLIP Features

Code references

WayneTomas/SSP-SAM

Best for: AI Scientist, Research Scientist, AI Researcher, Deep Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.