Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation

2026-05-19 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Expert, extended

Summary

ViTC-UNet is a framework for domain-adaptive semantic segmentation in biomedical imaging that addresses the performance gap of Vision Transformers (ViTs) on sparse, fine-structured, and low signal-to-noise targets. It conditions a UNet on frozen, pre-trained ViT representations using learnable tokens and a two-way attention decoder. This combines the ViT's global visual priors with the UNet's local inductive bias and high-resolution decoding capacity, and it avoids computationally expensive end-to-end ViT fine-tuning, even in cross-domain settings. With a DINOv2 backbone, ViTC-UNet achieved an average foreground mIoU of 0.90 across MRI and CT modalities, compared with 0.79 for nnU-Net, and outperformed nnU-Net on 14 of 15 benchmarks. The model also supports incremental learning: new structure tokens can be added without architectural modification.
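To make the reported metric concrete, here is a minimal sketch of a foreground mIoU computation. It assumes that "foreground mIoU" means the mean intersection-over-union over all non-background classes, which the summary does not define; the function name and the toy label maps are illustrative only.

```python
def foreground_miou(pred, target, num_classes, background=0):
    """Mean IoU over non-background classes (assumed definition).

    pred and target are flat sequences of integer class labels.
    Classes absent from both pred and target are skipped.
    """
    ious = []
    for cls in range(num_classes):
        if cls == background:
            continue
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy flattened label maps: 0 = background, 1 and 2 = foreground structures.
target = [0, 1, 1, 0, 2, 2, 0, 0]
pred = [0, 1, 0, 0, 2, 2, 2, 0]
score = foreground_miou(pred, target, num_classes=3)
print(round(score, 4))
```

In this toy case class 1 has IoU 1/2 and class 2 has IoU 2/3, so the foreground mean is 7/12; background pixels do not enter the average.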

Key takeaway

For computer vision engineers building biomedical image segmentation systems, ViTC-UNet offers a computationally efficient, data-sparing alternative to end-to-end fine-tuning. Consider pairing a frozen ViT backbone with a conditioned UNet decoder to segment intricate anatomical structures with high fidelity, especially when annotations are limited. The label space can be expanded without architectural changes, which simplifies adapting the model to new biomedical targets.

Key insights

ViTC-UNet combines frozen ViT global priors with UNet local inductive biases for efficient, high-precision biomedical semantic segmentation.

Principles

Frozen ViT backbones can transfer robust visual priors.
UNets excel at high-resolution, sample-efficient decoding.
Orthogonal latent spaces improve decoder separability.

Method

ViTC-UNet uses a frozen ViT encoder, a learnable conditioning decoder with two-way attention and structure tokens, and an nnU-Net pixel decoder. It injects target-specific ViT guidance across the UNet's multi-stage and multi-scale decoding path.
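The two-way attention step can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes single-head scaled dot-product attention with residual connections and no learned projections, and it uses random vectors as stand-ins for frozen ViT patch features and learnable structure tokens. All function and variable names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def cross_attend(queries, keys_values):
    """Queries attend to keys_values; the result is added back to the queries."""
    dim = len(queries[0])
    scores = matmul(queries, transpose(keys_values))
    scores = [[v / math.sqrt(dim) for v in row] for row in scores]
    attended = matmul(softmax_rows(scores), keys_values)
    return [[q + a for q, a in zip(q_row, a_row)]
            for q_row, a_row in zip(queries, attended)]


def two_way_attention(tokens, patches):
    """Assumed two-way scheme: tokens read the image, then patches read the tokens."""
    tokens = cross_attend(tokens, patches)    # tokens gather image evidence
    patches = cross_attend(patches, tokens)   # patches are conditioned on the tokens
    return tokens, patches


random.seed(0)
dim = 4
# Stand-ins: 2 structure tokens and 6 frozen ViT patch embeddings.
structure_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
patch_features = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
new_tokens, new_patches = two_way_attention(structure_tokens, patch_features)
print(len(new_tokens), len(new_patches))
```

In a full model the conditioned patch features would then be passed to the UNet decoding path at several scales; that wiring is omitted here because the summary does not specify it.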

In practice

Use DINOv2 as the ViT backbone; it gave the best reported results.
Train on a single NVIDIA A100 GPU; training completes in under 12 hours.
Incorporate new classes by adding learnable structure tokens.
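The last point, adding classes through new structure tokens, can be illustrated with a small token registry. This is a hedged sketch: the class `StructureTokenBank`, its methods, and the choice to freeze earlier tokens before adding a new one are assumptions made for illustration, since the summary only states that new tokens can be added without architectural modification.

```python
import random


class StructureTokenBank:
    """Hypothetical container holding one learnable embedding per structure."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self._rng = random.Random(seed)
        self.tokens = {}     # structure name -> embedding vector
        self.trainable = {}  # structure name -> whether it receives updates

    def add(self, name):
        """Register a new structure with a small random embedding."""
        if name in self.tokens:
            raise ValueError(f"token for {name!r} already exists")
        self.tokens[name] = [self._rng.gauss(0.0, 0.02) for _ in range(self.dim)]
        self.trainable[name] = True

    def freeze_existing(self):
        """Mark all current tokens as fixed (an assumed training policy)."""
        for name in self.trainable:
            self.trainable[name] = False

    def trainable_names(self):
        return [name for name, flag in self.trainable.items() if flag]


bank = StructureTokenBank(dim=8)
bank.add("liver")
bank.add("kidney")
liver_before = list(bank.tokens["liver"])

# Incremental step: keep earlier tokens fixed and add one new structure.
bank.freeze_existing()
bank.add("spleen")
print(bank.trainable_names())
```

Because each structure is represented by its own token, extending the label space changes only the size of the registry; the encoder and decoder shapes stay the same.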

Topics

Vision Transformer
UNet Architecture
Semantic Segmentation
Biomedical Imaging
Domain Adaptation

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.