Align then Refine: Text-Guided 3D Prostate Lesion Segmentation

2026-04-22 · Source: cs.CV updates on arXiv.org · Field: Science & Research — Health & Medical Research, Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

A multi-encoder U-Net architecture is proposed for automated 3D segmentation of prostate lesions from biparametric MRI (bp-MRI), targeting two difficulties: integrating cross-modal information and capturing fine-grained lesion semantics. The model adds three components: an alignment loss that raises foreground text-image similarity so lesion semantics are encoded, a heatmap loss that calibrates the similarity map and suppresses background activations, and a confidence-gated multi-head cross-attention refiner that makes localized boundary edits only in high-confidence regions. A phase-scheduled training regime stabilizes the optimization of these components. On the PI-CAI dataset the method reaches Dice 0.7326 and NSD 0.7541, ahead of nnU-Net (Dice 0.7192, NSD 0.7320) and several transformer- and SAM-based baselines, which is reported as a new state of the art on this benchmark.
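For readers less familiar with the headline metric, the Dice score is the standard overlap measure 2|A∩B| / (|A| + |B|) between a predicted and a reference mask. The snippet below is a minimal sketch on flattened binary masks, not the evaluation code used for the reported numbers; the empty-mask convention is an assumption, and NSD is omitted because it additionally requires surface extraction and a distance tolerance.

```python
def dice_score(pred, target):
    """Dice overlap between two flattened binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy 6-voxel example: 2 voxels overlap, each mask has 3 foreground voxels.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 0, 1, 1, 1, 0]
print(round(dice_score(pred, target), 4))
```

A real 3D volume would be flattened (or handled with an array library) before applying the same formula.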

Key takeaway

For computer vision engineers building medical image segmentation models, this work indicates that combining text-guided semantic alignment with confidence-gated refinement, coordinated through phase-scheduled training, improves 3D prostate lesion segmentation over nnU-Net on PI-CAI by roughly 0.013 Dice and 0.022 NSD. A similar multi-stage, text-guided approach is worth considering to improve both accuracy and interpretability in volumetric segmentation tasks, especially where fine-grained boundary precision is critical.

Key insights

Text-guided, phase-scheduled training improves 3D prostate lesion segmentation by integrating semantic alignment and confidence-aware refinement.

Principles

Decouple semantic grounding from boundary correction.
Calibrate similarity maps to suppress background activations.
Apply localized edits only in high-confidence regions.
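
The first two principles can be illustrated with a small sketch. It assumes, purely for illustration, that the similarity map is the cosine similarity between each voxel feature and a text embedding, that the alignment loss is one minus the mean foreground similarity, and that the heatmap loss is a binary cross-entropy between the temperature-scaled similarity and the lesion mask. These loss forms and all function names are hypothetical; the source does not state the exact formulations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_map(voxel_feats, text_emb):
    """One similarity value per voxel feature vector."""
    return [cosine(f, text_emb) for f in voxel_feats]


def alignment_loss(sim, mask):
    """Assumed form: 1 - mean similarity over foreground voxels."""
    fg = [s for s, m in zip(sim, mask) if m == 1]
    return 1.0 - sum(fg) / len(fg) if fg else 0.0


def heatmap_loss(sim, mask, temperature=1.0):
    """Assumed form: BCE between sigmoid(sim / T) and the lesion mask."""
    eps = 1e-7
    total = 0.0
    for s, m in zip(sim, mask):
        p = 1.0 / (1.0 + math.exp(-s / temperature))
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
        total += -(m * math.log(p) + (1 - m) * math.log(1.0 - p))
    return total / len(sim)


text_emb = [1.0, 0.0]
mask = [1, 0, 0]  # first voxel is lesion, the rest background
good = similarity_map([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], text_emb)
bad = similarity_map([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]], text_emb)
print(alignment_loss(good, mask) < alignment_loss(bad, mask))
print(heatmap_loss(good, mask) < heatmap_loss(bad, mask))
```

In this toy case the "good" features align the lesion voxel with the text and keep the background dissimilar, so both losses are lower than for the "bad" features, where the background lights up instead.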

Method

A multi-encoder U-Net uses a bottleneck similarity head for text-image alignment via alignment and heatmap losses, followed by a final-stage, confidence-gated cross-attention refiner, all optimized with a three-phase training curriculum.
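
One way to picture a three-phase curriculum is as a schedule of loss weights keyed on the epoch. The sketch below is hypothetical throughout: the phase boundaries (`phase1_end`, `phase2_end`), the choice of which losses are active in each phase, and the unit weights are assumptions made for illustration, since the source only states that training is phase-scheduled.

```python
def loss_weights(epoch, phase1_end=50, phase2_end=100):
    """Per-loss weights for a hypothetical three-phase curriculum."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch < phase1_end:
        # Phase 1 (assumed): train the segmentation backbone alone.
        return {"segmentation": 1.0, "alignment": 0.0, "heatmap": 0.0, "refiner": 0.0}
    if epoch < phase2_end:
        # Phase 2 (assumed): switch on the alignment and heatmap losses.
        return {"segmentation": 1.0, "alignment": 1.0, "heatmap": 1.0, "refiner": 0.0}
    # Phase 3 (assumed): additionally train the confidence-gated refiner.
    return {"segmentation": 1.0, "alignment": 1.0, "heatmap": 1.0, "refiner": 1.0}


def total_loss(losses, epoch):
    """Weighted sum of the individual loss values for the given epoch."""
    weights = loss_weights(epoch)
    return sum(weights[name] * value for name, value in losses.items())


losses = {"segmentation": 0.5, "alignment": 0.2, "heatmap": 0.1, "refiner": 0.3}
for epoch in (0, 60, 120):
    print(epoch, round(total_loss(losses, epoch), 3))
```

Staging the objectives this way keeps the refiner from being trained against a similarity map that has not yet been calibrated, which is one plausible reading of why the schedule stabilizes optimization.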

In practice

Use "prostate lesion" as a text prompt for BiomedCLIP.
Set similarity head temperature to 1.
Configure gating hyperparameters $\tau=0.35$ and $\alpha=0.25$.
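
The settings above can be collected in a small configuration object, with a toy version of the gating step alongside. The values come from the list above, but the roles given to them are assumptions: here `tau` is treated as a confidence threshold and `alpha` as the scale of a residual edit, and `RefinerConfig` and `gated_refine` are hypothetical names, not identifiers from the referenced repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefinerConfig:
    text_prompt: str = "prostate lesion"  # prompt passed to the text encoder
    similarity_temperature: float = 1.0   # similarity head temperature
    tau: float = 0.35                     # assumed role: confidence threshold
    alpha: float = 0.25                   # assumed role: residual edit scale


def gated_refine(logits, confidence, delta, cfg=RefinerConfig()):
    """Add alpha * delta to a logit only where confidence >= tau."""
    refined = []
    for z, c, d in zip(logits, confidence, delta):
        gate = 1.0 if c >= cfg.tau else 0.0  # hard gate, an assumption
        refined.append(z + cfg.alpha * gate * d)
    return refined


logits = [0.2, -1.0, 0.5]      # coarse per-voxel segmentation logits
confidence = [0.9, 0.1, 0.35]  # per-voxel confidence in [0, 1]
delta = [1.0, 1.0, -2.0]       # edits proposed by the refiner
print(gated_refine(logits, confidence, delta))
```

The second voxel falls below the threshold, so its logit is left untouched; the other two receive a scaled-down edit. A soft gate (for example a sigmoid around `tau`) would be an equally plausible reading.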

Topics

Prostate Lesion Segmentation
Biparametric MRI
Multi-encoder U-Net
Text-Guided Segmentation
Cross-Attention Refinement

Code references

NUBagciLab/Prostate-Lesion-Segmentation

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.