SSP-SAM: SAM with Semantic-Spatial Prompt for Referring Expression Segmentation
Summary
SSP-SAM is a one-stage framework that extends the Segment Anything Model (SAM) to follow natural language in Referring Expression Segmentation (RES) and Generalized RES (GRES). It adds a Semantic-Spatial Prompt (SSP) encoder whose visual and linguistic attention adapters highlight salient objects in the visual features and discriminative phrases in the linguistic features. The resulting SSPs guide SAM to produce precise segmentation masks from language descriptions. SSP-SAM achieves state-of-the-art performance on RES and GRES benchmarks, including RefCOCO, RefCOCO+, RefCOCOg, and ReferIt, and improves open-vocabulary performance on the PhraseCut dataset. The model is also lightweight: SSP-SAM-224 has only 18.37M trainable parameters and an inference time of 26.68 ms per sample, and it outperforms many existing methods in both efficiency and segmentation quality, particularly at strict precision thresholds such as Pr@0.9.
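For readers unfamiliar with the Pr@0.9 metric mentioned above, here is a minimal sketch of how precision at an IoU threshold is commonly computed in RES evaluation: the fraction of samples whose predicted mask has an IoU above the threshold. The function names `mask_iou` and `precision_at` are illustrative. The strict `>` comparison and the treatment of two empty masks as a perfect match are assumptions, not details confirmed by the paper.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equally sized 2D lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def precision_at(ious, threshold):
    """Fraction of samples whose IoU is strictly above the threshold."""
    if not ious:
        return 0.0
    return sum(1 for iou in ious if iou > threshold) / len(ious)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [0, 0]]
    print(mask_iou(pred, target))                       # one shared pixel out of two
    print(precision_at([0.95, 0.92, 0.85, 0.5], 0.9))   # two of four samples pass
```

A high Pr@0.9 means most predicted masks match the ground truth almost pixel for pixel, which is why it is treated as a strict measure of mask quality.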
Key takeaway
For AI Scientists and Research Scientists working on multimodal segmentation, SSP-SAM offers a robust and efficient one-stage solution for language-guided segmentation. Consider adopting its Semantic-Spatial Prompt encoder to pair pre-trained vision-language models such as CLIP with SAM, especially for tasks that require high precision and generalization across classic RES, GRES, and open-vocabulary scenarios. This approach can improve mask quality and inference efficiency compared with complex multi-stage or multimodal large language model (MLLM) based methods.
Key insights
SSP-SAM enhances SAM's language understanding for segmentation by integrating semantic and spatial cues via a novel prompt encoder.
Principles
- Integrate semantic and spatial cues for robust language-guided segmentation.
- Leverage pre-trained vision-language models (CLIP) to inform SAM's segmentation.
- Auxiliary tasks can improve primary task performance and data utilization.
Method
SSP-SAM employs a Semantic-Spatial Prompt encoder with visual and linguistic attention adapters to refine CLIP features, which are then processed by a prompt generator to create language-conditioned prompts for SAM's mask decoder.
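The data flow described above can be illustrated with a dependency-free toy sketch. This is not the authors' implementation: the real adapters and prompt generator are learned modules operating on CLIP features, and their internals are not described in this summary. Here `attention_adapter`, `mean_pool`, and `build_prompt` are hypothetical names, the cross-modal query and the residual re-weighting are assumptions, and plain Python lists stand in for feature tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(features):
    """Average a list of equal-length feature vectors into one vector."""
    return [sum(column) / len(features) for column in zip(*features)]


def attention_adapter(features, query):
    """Hypothetical adapter: emphasise features that align with the query.

    Returns the re-weighted features and the attention weights.
    """
    dots = [sum(x * q for x, q in zip(f, query)) for f in features]
    weights = softmax(dots)
    # Assumption: residual re-weighting, so no feature is zeroed out.
    refined = [[x * (1.0 + w) for x in f] for f, w in zip(features, weights)]
    return refined, weights


def build_prompt(visual_feats, text_feats):
    """Toy prompt generator: concatenate pooled refined text and visual features."""
    # Assumption: each modality is queried with a summary of the other one.
    refined_visual, visual_attn = attention_adapter(visual_feats, mean_pool(text_feats))
    refined_text, text_attn = attention_adapter(text_feats, mean_pool(visual_feats))
    prompt = mean_pool(refined_text) + mean_pool(refined_visual)
    return prompt, visual_attn, text_attn


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # stand-in for CLIP image patches
    tokens = [[2.0, 0.0], [1.0, 0.0]]               # stand-in for CLIP text tokens
    prompt, visual_attn, text_attn = build_prompt(patches, tokens)
    print(len(prompt), visual_attn, text_attn)
```

In the actual model, the resulting prompt would be passed to SAM's mask decoder in place of point or box prompts. The sketch stops at prompt construction and makes no claim about how SAM consumes it.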
In practice
- Use attention adapters to highlight salient objects and discriminative phrases.
- Incorporate an auxiliary Referring Expression Comprehension (REC) task to exploit bounding box annotations.
- Consider fine-tuning SAM's decoder for 1-2% performance gains.
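One way to picture the auxiliary REC task is as an extra box-regression term added to the segmentation loss, so that bounding box annotations also contribute a training signal. The sketch below is a generic multi-task loss under stated assumptions. The Dice and L1 terms, the `rec_weight` parameter, and all function names are illustrative choices, not the losses or hyperparameters reported for SSP-SAM.

```python
def dice_loss(pred_mask, gt_mask, eps=1e-6):
    """Dice loss over flattened masks with values in [0, 1]."""
    intersection = sum(p * g for p, g in zip(pred_mask, gt_mask))
    denominator = sum(pred_mask) + sum(gt_mask)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def l1_box_loss(pred_box, gt_box):
    """Mean absolute error between two boxes given as (x1, y1, x2, y2)."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / len(gt_box)


def multitask_loss(pred_mask, gt_mask, pred_box, gt_box, rec_weight=1.0):
    """Segmentation loss plus a weighted auxiliary box-regression (REC) loss.

    rec_weight is a hypothetical balancing hyperparameter.
    """
    seg_term = dice_loss(pred_mask, gt_mask)
    rec_term = l1_box_loss(pred_box, gt_box)
    return seg_term + rec_weight * rec_term


if __name__ == "__main__":
    gt_mask = [1.0, 0.0, 1.0, 1.0]
    pred_mask = [0.9, 0.1, 0.8, 0.7]
    gt_box = (0.1, 0.2, 0.6, 0.8)
    pred_box = (0.15, 0.2, 0.55, 0.9)
    print(multitask_loss(pred_mask, gt_mask, pred_box, gt_box, rec_weight=0.5))
```

Setting `rec_weight` to zero recovers a segmentation-only objective, which makes it easy to check whether the auxiliary task helps in a given setup.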
Topics
- Referring Expression Segmentation
- Segment Anything Model
- Semantic-Spatial Prompts
- Generalized RES
- CLIP Features
Best for: AI Scientist, Research Scientist, AI Researcher, Deep Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.