Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Expert, extended

Summary

ViTC-UNet is a framework for domain-adaptive semantic segmentation in biomedical imaging. It targets the performance gap that Vision Transformers (ViTs) show on sparse, fine-structured, and low signal-to-noise targets. The model conditions a UNet architecture on frozen, pre-trained ViT representations using learnable tokens and a two-way attention decoder. This combines the ViT's global visual priors with the UNet's local inductive bias and high-resolution decoding capacity, and it avoids computationally expensive end-to-end ViT fine-tuning, even in cross-domain settings. With a DINOv2 backbone in particular, ViTC-UNet achieved an average foreground mIoU of 0.90 across MRI and CT modalities, versus 0.79 for nnU-Net, and outperformed nnU-Net on 14 of 15 benchmarks. The model also supports incremental learning: new structure tokens can be incorporated without architectural modification.
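
The foreground mIoU figures quoted above can be read with a small worked example. The sketch below assumes that "foreground mIoU" means the intersection-over-union averaged over all non-background classes, and that classes absent from both prediction and ground truth are skipped. The paper's exact averaging convention is not stated in this summary, so the function name `foreground_miou` and these choices are illustrative assumptions.

```python
def foreground_miou(pred, target, num_classes, background=0):
    """Mean IoU over non-background classes (assumed definition, see lead-in)."""
    ious = []
    for cls in range(num_classes):
        if cls == background:
            continue
        pairs = list(zip(pred, target))
        inter = sum(1 for p, t in pairs if p == cls and t == cls)
        union = sum(1 for p, t in pairs if p == cls or t == cls)
        if union == 0:
            continue  # class absent from both maps: skip instead of scoring it
        ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Flattened label maps for a toy 2x4 image: 0 = background, 1 and 2 = structures.
target = [0, 1, 1, 0, 2, 2, 0, 0]
pred = [0, 1, 0, 0, 2, 2, 2, 0]

# Class 1: intersection 1, union 2 -> 0.5. Class 2: intersection 2, union 3 -> 0.667.
score = foreground_miou(pred, target, num_classes=3)
print(round(score, 4))  # 0.5833
```

Excluding the background class matters for sparse targets: a model that predicts mostly background would otherwise score well on the dominant class and mask poor performance on the thin structures of interest.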

Key takeaway

For computer vision engineers building biomedical image segmentation systems, ViTC-UNet offers a computationally efficient and data-sparing alternative to traditional fine-tuning. Consider pairing a frozen ViT backbone with a conditioned UNet decoder to obtain high-fidelity segmentation of intricate anatomical structures, especially when annotations are limited. The label space can be expanded without architectural changes, which simplifies adapting a model to new biomedical targets.
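
One way to see why token-based conditioning permits label-space expansion without architectural changes is that the number of structures becomes a data dimension rather than a weight shape. The NumPy sketch below is a toy illustration under that assumption, not the authors' implementation: `StructureTokenBank`, `mask_logits`, the dot-product mask readout, and the structure names are all hypothetical.

```python
import numpy as np


class StructureTokenBank:
    """Hypothetical container: one learnable embedding per target structure."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.names = []
        self.tokens = np.empty((0, dim))
        self._rng = np.random.default_rng(seed)

    def add_structure(self, name):
        """Append a freshly initialised token; existing tokens are untouched."""
        new_token = self._rng.normal(scale=0.02, size=(1, self.dim))
        self.tokens = np.concatenate([self.tokens, new_token], axis=0)
        self.names.append(name)


def mask_logits(tokens, pixel_features):
    """One logit map per token: (K, D) and (H, W, D) -> (K, H, W)."""
    return np.einsum("kd,hwd->khw", tokens, pixel_features)


bank = StructureTokenBank(dim=8)
bank.add_structure("liver")   # illustrative structure names
bank.add_structure("kidney")

# Stand-in for per-pixel decoder features; random values, not real data.
pixel_features = np.random.default_rng(1).normal(size=(4, 4, 8))

before = mask_logits(bank.tokens, pixel_features)  # two structures
old_tokens = bank.tokens.copy()

bank.add_structure("spleen")                       # label space grows by one
after = mask_logits(bank.tokens, pixel_features)   # three structures

print(before.shape, after.shape)
```

In this toy setup the outputs for the original structures are identical before and after the new token is added, because no shared weight changes shape. Whether previously learned structures stay equally stable in the real model depends on the training procedure, which this summary does not describe.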

Key insights

ViTC-UNet combines frozen ViT global priors with UNet local inductive biases for efficient, high-precision biomedical semantic segmentation.

Principles

Method

ViTC-UNet uses a frozen ViT encoder, a learnable conditioning decoder with two-way attention and structure tokens, and an nnU-Net pixel decoder. It injects target-specific ViT guidance across the UNet's multi-stage and multi-scale decoding path.
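
The interaction between structure tokens and frozen ViT features can be sketched as a single two-way attention step. This is a minimal NumPy illustration that assumes single-head scaled dot-product attention with no learned projections, layer normalisation, or MLP blocks. The function names, dimensions, and random inputs are placeholders rather than details taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention; keys double as values."""
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys_values.T / scale, axis=-1)
    return weights @ keys_values


def two_way_block(tokens, patches):
    """Tokens read from patches, then patches read from the updated tokens."""
    tokens = tokens + cross_attention(tokens, patches)    # token -> image
    patches = patches + cross_attention(patches, tokens)  # image -> token
    return tokens, patches


rng = np.random.default_rng(0)
num_structures, num_patches, dim = 3, 16, 8

# In training only the structure tokens (and decoder weights) would be updated.
structure_tokens = rng.normal(size=(num_structures, dim))
# Stand-in for the frozen ViT's patch embeddings; never updated.
vit_patches = rng.normal(size=(num_patches, dim))

tokens_out, patches_out = two_way_block(structure_tokens, vit_patches)
print(tokens_out.shape, patches_out.shape)  # (3, 8) (16, 8)
```

The token-conditioned patch features would then be resampled and injected at several stages of the UNet decoder. That step is omitted here because the summary does not specify how the injection is performed.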

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.