CPS4: Class Prompt driven Semi-Supervised Spine Segmentation with Class-specific Consistency Constraint

2026-06-14 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

CPS4 is presented as the first text-guided semi-supervised spine segmentation network. It uses Vision Language Models (VLMs) and class prompts to improve pseudo label quality, and it targets a specific difficulty in multi-class segmentation: keeping each textual class prompt consistent with the spine unit region it names. Training runs in two stages. The first is a VLM pretraining phase that uses token- and pixel-level attention loss to enforce semantic coupling between prompts and spine units. The second is a class prompt-driven semi-supervised segmentation stage, in which the pretrained vision-text encoder generates class-specific binary maps for unlabeled images; these maps are then integrated into a unified multi-class segmentation map. On a public spine segmentation dataset, CPS4 reached a Dice score of 80.44% using only 5% labeled data, surpassing other semi-supervised and VLM methods.
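For readers less familiar with the reported metric, the following is a minimal sketch of a mean Dice score over foreground classes. The function name `mean_dice`, the label convention (0 for background, 1..K for spine units), and the choice to skip classes absent from both maps are assumptions made for illustration; the summary does not state how the paper averages Dice.

```python
import numpy as np

def mean_dice(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean Dice over foreground classes 1..num_classes.

    pred, target: integer label maps of identical shape (0 = background).
    Classes missing from both maps are skipped (an illustrative convention).
    """
    scores = []
    for k in range(1, num_classes + 1):
        p = pred == k
        t = target == k
        denom = p.sum() + t.sum()
        if denom == 0:
            continue  # class k appears in neither map
        intersection = np.logical_and(p, t).sum()
        scores.append(2.0 * intersection / denom)
    return float(np.mean(scores)) if scores else float("nan")

pred = np.array([[1, 1], [2, 0]])
target = np.array([[1, 0], [2, 2]])
print(mean_dice(pred, target, num_classes=2))
```

A Dice score of 1.0 means perfect overlap for every class; the 80.44% figure above corresponds to 0.8044 on this scale.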

Key takeaway

For computer vision engineers building medical image segmentation models with limited labeled data, CPS4 points to a concrete direction: text-guided semi-supervised learning in which class prompts are tied to their target regions by explicit consistency constraints, with the aim of improving pseudo label quality. CPS4's 80.44% Dice score with only 5% labeled data indicates that this approach can raise model performance while reducing annotation dependency.

Key insights

CPS4 enhances semi-supervised spine segmentation by using class prompts with explicit consistency constraints in VLMs.

Principles

Textual class prompts can significantly improve pseudo label quality in semi-supervised segmentation.
Explicit consistency constraints between text prompts and target regions are crucial for multi-class VLM segmentation.

Method

CPS4 employs a two-stage training process: first, VLM pretraining with token- and pixel-level attention loss for prompt-unit consistency; second, using the pretrained encoder to generate and integrate class-specific binary segmentation maps.
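The integration step in the second stage can be sketched as follows. This is only one plausible way to merge per-prompt binary maps, assuming each class prompt yields a foreground probability map; the function name `merge_binary_maps`, the argmax rule, and the 0.5 threshold are illustrative assumptions, not the paper's stated procedure.

```python
import numpy as np

def merge_binary_maps(class_probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Merge per-class foreground probabilities into one label map.

    class_probs: array of shape (C, H, W), one probability map per class prompt.
    Returns an (H, W) integer map: 0 = background, k = k-th class (1-indexed).
    Assumed rule: each pixel takes its most confident class, or background
    when no class reaches the threshold.
    """
    if class_probs.ndim != 3:
        raise ValueError("expected an array of shape (C, H, W)")
    best_class = class_probs.argmax(axis=0)
    best_prob = class_probs.max(axis=0)
    return np.where(best_prob >= threshold, best_class + 1, 0).astype(np.int64)

probs = np.array([
    [[0.9, 0.2], [0.6, 0.1]],   # map for class prompt 1
    [[0.3, 0.8], [0.7, 0.4]],   # map for class prompt 2
])
print(merge_binary_maps(probs))
```

Resolving overlaps by argmax keeps the output a valid single-label map even when two binary maps both claim the same pixel.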

In practice

Apply token- and pixel-level attention loss to align text prompts with image regions in VLM-based segmentation.
Integrate class-specific binary maps into a unified multi-class output for improved pseudo label generation.
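As a rough illustration of the first point, a pixel-level alignment loss could look like the sketch below. It assumes the attention between a class prompt embedding and each pixel feature is a sigmoid of their dot product, supervised with binary cross-entropy against that class's mask. The function name `pixel_attention_loss` and this exact formulation are assumptions; the summary does not give the form of CPS4's loss, and the token-level term is not shown.

```python
import numpy as np

def pixel_attention_loss(text_emb: np.ndarray, pixel_feats: np.ndarray,
                         masks: np.ndarray, eps: float = 1e-7) -> float:
    """Binary cross-entropy between prompt-to-pixel attention and class masks.

    text_emb:    (C, D) one embedding per class prompt (hypothetical input).
    pixel_feats: (D, H, W) image feature map.
    masks:       (C, H, W) binary ground-truth mask per class.
    """
    # Dot-product similarity of every prompt with every pixel feature.
    logits = np.einsum("cd,dhw->chw", text_emb, pixel_feats)
    attn = 1.0 / (1.0 + np.exp(-logits))      # sigmoid attention in (0, 1)
    attn = np.clip(attn, eps, 1.0 - eps)      # avoid log(0)
    bce = -(masks * np.log(attn) + (1.0 - masks) * np.log(1.0 - attn))
    return float(bce.mean())

text_emb = np.array([[5.0, 0.0], [0.0, 5.0]])
pixel_feats = np.array([[[1.0, -1.0]], [[-1.0, 1.0]]])   # shape (2, 1, 2)
aligned = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
print(pixel_attention_loss(text_emb, pixel_feats, aligned))
print(pixel_attention_loss(text_emb, pixel_feats, 1.0 - aligned))
```

The loss is small when each prompt attends to its own region and large when the attention lands on another class's region, which is the behavior a consistency constraint of this kind is meant to reward.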

Topics

Semi-Supervised Learning
Spine Segmentation
Vision Language Models
Class Prompts
Medical Imaging
Attention Mechanisms

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.