Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation
Summary
ViTC-UNet is a framework for domain-adaptive semantic segmentation in biomedical imaging. It targets the performance gap that Vision Transformers (ViTs) show on sparse, fine-structured, and low signal-to-noise targets. It conditions a UNet architecture on frozen, pre-trained ViT representations using learnable tokens and a two-way attention decoder. This combines the ViT's global visual priors with the UNet's local inductive bias and high-resolution decoding capacity, and it avoids computationally expensive end-to-end ViT fine-tuning, even in cross-domain settings. ViTC-UNet, particularly with a DINOv2 backbone, reached an average foreground mIoU of 0.90 across MRI and CT modalities, compared with 0.79 for nnU-Net, and outperformed nnU-Net on 14 of 15 benchmarks. The model also supports incremental learning: new structure tokens can be incorporated without architectural modification.
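The foreground mIoU figures quoted above can be read as the mean of per-class IoU with the background class left out. The following is a minimal sketch of that metric, assuming flat integer label sequences, label 0 as background, and classes absent from both maps being skipped. The paper's exact evaluation protocol may differ, and `foreground_miou` is a hypothetical helper name.

```python
def foreground_miou(pred, target, num_classes, background=0):
    """Mean IoU over non-background classes for two flat label sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for cls in range(num_classes):
        if cls == background:
            continue  # foreground-only metric: background is excluded
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union == 0:
            continue  # class absent from both maps: skipped (an assumption)
        ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with two foreground classes (1 and 2).
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 2, 0]
score = foreground_miou(pred, target, num_classes=3)
print(round(score, 4))
```

In this toy case class 1 has IoU 1/2 and class 2 has IoU 2/3, so the foreground mean is 7/12.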
Key takeaway
For computer vision engineers building biomedical image segmentation systems, ViTC-UNet offers a computationally efficient, data-efficient alternative to full fine-tuning. Consider pairing a frozen ViT backbone with a conditioned UNet decoder to obtain high-fidelity segmentation of intricate anatomical structures, especially when annotations are limited. The label space can be expanded without architectural changes, which simplifies adapting the model to new biomedical targets.
Key insights
ViTC-UNet combines frozen ViT global priors with UNet local inductive biases for efficient, high-precision biomedical semantic segmentation.
Principles
- Frozen ViT backbones can transfer robust visual priors.
- UNets excel at high-resolution, sample-efficient decoding.
- Orthogonal latent spaces improve decoder separability.
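One common way to encourage an orthogonal latent space is to penalize the similarity between distinct embedding vectors. The sketch below computes the mean squared cosine similarity over all pairs of token vectors, which is zero exactly when the vectors are mutually orthogonal. This is an illustrative assumption, not the loss stated in the paper, and `orthogonality_penalty` is a hypothetical name.

```python
import math


def orthogonality_penalty(tokens):
    """Mean squared cosine similarity between all pairs of distinct vectors."""
    normed = []
    for vec in tokens:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            raise ValueError("zero vector has no direction")
        normed.append([x / norm for x in vec])

    total, pairs = 0.0, 0
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            cos = sum(a * b for a, b in zip(normed[i], normed[j]))
            total += cos * cos
            pairs += 1
    return total / pairs if pairs else 0.0


orthogonal = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]  # perpendicular directions
overlapping = [[1.0, 0.0], [1.0, 1.0]]           # 45 degrees apart

print(orthogonality_penalty(orthogonal))
print(orthogonality_penalty(overlapping))
```

A penalty of this kind would typically be added to the segmentation loss with a small weight, so that embeddings for different structures are pushed apart without dominating training.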
Method
ViTC-UNet uses a frozen ViT encoder, a learnable conditioning decoder with two-way attention and structure tokens, and an nnU-Net pixel decoder. It injects target-specific ViT guidance across the UNet's multi-stage and multi-scale decoding path.
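To make the two-way attention idea concrete, here is a minimal, dependency-free sketch in which structure tokens first attend to patch features and the patch features then attend back to the updated tokens. It assumes a single head, no learned projections, and residual updates. All names (`cross_attend`, `two_way_attention`) and the toy values are hypothetical, and the paper's actual decoder is likely more elaborate.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention of queries over context, plus a residual."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    updated = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, c)) for c in context]
        weights = softmax(scores)
        mixed = [
            sum(w * c[d] for w, c in zip(weights, context))
            for d in range(len(q))
        ]
        updated.append([a + m for a, m in zip(q, mixed)])
    return updated


def two_way_attention(tokens, patches):
    """One two-way block: tokens read the image, then the image reads the tokens."""
    tokens = cross_attend(tokens, patches)   # token <- image direction
    patches = cross_attend(patches, tokens)  # image <- token direction
    return tokens, patches


# Two structure tokens and three stand-ins for frozen ViT patch features.
structure_tokens = [[1.0, 0.0], [0.0, 1.0]]
vit_patches = [[0.5, 0.1], [0.2, 0.9], [0.4, 0.4]]

tokens, patches = two_way_attention(structure_tokens, vit_patches)
print(len(tokens), len(patches))
```

In a full model, the updated tokens and target-aware features would be what conditions the UNet decoding path at each stage; only the conditioning decoder and pixel decoder would receive gradients, since the ViT encoder stays frozen.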
In practice
- Use DINOv2 as the ViT backbone; it gave the strongest reported results.
- Training fits on a single NVIDIA A100 GPU and completes in under 12 hours.
- Incorporate new classes by adding learnable structure tokens.
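The last point, adding classes by adding structure tokens, can be pictured as appending a row to a bank of embeddings while the rest of the model keeps its shape. The sketch below is a hypothetical container, not the paper's code. The class name `StructureTokenBank`, the structure names, the initialization scale, and the choice to freeze existing tokens when a new one is added are all assumptions made for illustration.

```python
import random


class StructureTokenBank:
    """Hypothetical store holding one learnable embedding per target structure."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.names = []      # structure name for each token
        self.tokens = []     # one embedding vector per structure
        self.trainable = []  # whether each token would receive gradient updates
        self._rng = random.Random(seed)

    def add_structure(self, name, freeze_existing=True):
        """Append a new token; optionally freeze the ones already present."""
        if name in self.names:
            raise ValueError(f"structure {name!r} already registered")
        if freeze_existing:
            self.trainable = [False] * len(self.trainable)
        self.names.append(name)
        self.tokens.append([self._rng.gauss(0.0, 0.02) for _ in range(self.dim)])
        self.trainable.append(True)
        return len(self.names) - 1  # index of the new output channel


bank = StructureTokenBank(dim=8)
bank.add_structure("vessel", freeze_existing=False)
bank.add_structure("nerve", freeze_existing=False)

# Later, a new target appears: only the bank grows, nothing else changes shape.
new_index = bank.add_structure("tumor")
print(new_index, bank.trainable)
```

Because the decoder produces one output per token, growing the bank extends the label space without touching layer shapes elsewhere.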
Topics
- Vision Transformer
- UNet Architecture
- Semantic Segmentation
- Biomedical Imaging
- Domain Adaptation
Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.