Benchmarking transferability of SSL pretraining to same and different modality segmentation tasks

2026-05-18 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

A study benchmarked nine self-supervised learning (SSL) methods spanning four pretext-task families, pretraining each on 10,412 3D CT scans (1.89 million 2D axial slices). Each pretrained Swin Transformer encoder was integrated into a SwinUNETR-style segmentation network and fine-tuned on nine public segmentation tasks covering large abdominal organs, head-and-neck structures, and tumors in CT and MRI. The methods were compared on segmentation accuracy, measured by the Dice similarity coefficient (DSC), as well as on convergence speed, cross-modality transferability (CT-to-MRI), and feature-reuse patterns. The self-distilled masked image transformer (SMIT) achieved the highest overall segmentation accuracy, the fastest fine-tuning convergence, and the smallest few-shot-to-many-shot performance gap, indicating better data efficiency and more consistent feature reuse than the other methods tested.
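
For readers less familiar with the accuracy metric, DSC for a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch of that formula on flattened binary masks, not the study's evaluation code. The function name, the plain-list input, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; real pipelines typically compute this on arrays or tensors.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks.

    pred and target are flat sequences of 0/1 labels of equal length
    (for example, a segmentation volume flattened voxel by voxel).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels in each mask, added together.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


# Toy masks, made up for illustration only.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred, target)  # 2 * 2 / (3 + 3)
```

A DSC of 1.0 means the two masks coincide exactly, and 0.0 means they share no foreground voxels.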

Key takeaway

Research scientists developing medical image segmentation models, especially with limited labeled data, should prioritize self-distilled masked image transformer (SMIT) pretraining. In this benchmark, SMIT was more data-efficient and converged faster during fine-tuning than the other SSL methods tested, which makes it a strong candidate for high accuracy and consistent feature reuse across anatomical structures and across CT and MRI.

Key insights

SMIT, which combines masked image modeling (MIM) with self-distillation, transferred best to medical image segmentation in this benchmark.

Principles

MIM and self-distillation outperform contrastive learning.
SSL choice matters most with limited annotation budgets.
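
One way to make the annotation-budget principle concrete is to compare each method's accuracy when fine-tuned on few labeled cases against its accuracy on many. The sketch below assumes the gap is simply the difference in mean DSC between the two regimes; the source does not spell out its exact definition, and the method names and scores are placeholders, not results from the study.

```python
def few_shot_gap(dsc_few_shot, dsc_many_shot):
    """Drop in mean DSC when moving from many labeled cases to few.

    Assumed definition: many-shot DSC minus few-shot DSC.
    """
    return dsc_many_shot - dsc_few_shot


def rank_by_gap(results):
    """Return method names sorted from smallest to largest gap.

    results maps a method name to a (few_shot_dsc, many_shot_dsc) pair.
    """
    return sorted(results, key=lambda name: few_shot_gap(*results[name]))


# Placeholder scores, invented for illustration only.
results = {
    "method_a": (0.70, 0.85),
    "method_b": (0.80, 0.86),
}
ranking = rank_by_gap(results)  # method_b first: it loses less with few labels
```

A smaller gap means the pretrained encoder loses less accuracy when annotations are scarce, which is the sense in which a method is called data-efficient here.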

Method

Pretrain Swin Transformer encoders with SSL on 3D CT scans, then fine-tune within a SwinUNETR-style network for segmentation tasks.
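
The hand-off between the two stages can be sketched as copying pretrained encoder weights into the segmentation network while leaving the decoder at its fresh initialization. The example below uses plain dictionaries as stand-ins for framework state dicts. The function name, the `encoder.` prefix, and every parameter name are hypothetical and do not come from the study or from any library.

```python
def load_pretrained_encoder(model_state, pretrained_state, prefix="encoder."):
    """Copy pretrained encoder weights into a segmentation model's state.

    Both arguments are plain dicts mapping parameter names to values.
    Pretrained entries with no counterpart in the model (for example an
    SSL-only projection head) are skipped. Returns the merged state and
    the list of parameter names that were overwritten.
    """
    merged = dict(model_state)  # leave the caller's dict untouched
    loaded = []
    for name, value in pretrained_state.items():
        target = prefix + name
        if target in merged:
            merged[target] = value
            loaded.append(target)
    return merged, loaded


# Hypothetical parameter names; strings stand in for weight tensors.
model_state = {
    "encoder.layer1.weight": "random-init",
    "encoder.layer2.weight": "random-init",
    "decoder.up1.weight": "random-init",
}
pretrained_state = {
    "layer1.weight": "ssl-pretrained",
    "layer2.weight": "ssl-pretrained",
    "projection_head.weight": "ssl-pretrained",
}
merged, loaded = load_pretrained_encoder(model_state, pretrained_state)
```

After the call, the encoder entries hold the pretrained values and the decoder entry is unchanged, so fine-tuning starts from a pretrained encoder and a randomly initialized decoder.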

In practice

Prioritize SMIT for medical image segmentation.
Focus on SSL method selection for few-shot scenarios.

Topics

Self-Supervised Learning
Image Segmentation
Swin Transformer
Masked Image Modeling
SMIT

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.