CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling

2026-04-15 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

Researchers reproduced Merlin, a dual-encoder vision-language model that aligns 3D abdominal CT volumes with radiology reports using a symmetric InfoNCE loss. The reproduced model achieved a zero-shot macro F1 score of 74.45% across 30 findings, slightly surpassing the original's 73.00%. The study then investigated training batch composition, specifically the normal-to-abnormal ratio. Ratio-controlled (balanced) sampling at 25:75, 50:50, and 75:25 consistently underperformed the unbalanced baseline by 2.4 to 2.8 points; 75:25 was the best of these variants at 72.02%. Data scaling ablations on a 4,362-study subset showed sub-linear scaling: macro F1 rose from 65.26% to 71.88% as the training fraction increased from 20% through 40% to 100%. Explicit class balancing on this subset also degraded performance, to 68.01%. The results suggest that random sampling's stochastic diversity and Merlin's alternating batching are more effective than engineered class ratios for 3D medical volumes.
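The headline metric is macro F1, which is the unweighted mean of the per-finding F1 scores, so rare findings count as much as common ones. The sketch below shows that computation under stated assumptions: predictions are already binarized, the finding names and toy labels are hypothetical, and the summary does not say how the study thresholds zero-shot scores.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for one finding; labels and predictions are 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def macro_f1(labels_by_finding, preds_by_finding):
    """Unweighted mean of per-finding F1 scores."""
    scores = [
        f1_score(labels_by_finding[name], preds_by_finding[name])
        for name in labels_by_finding
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two findings, four studies each.
labels = {"finding_a": [1, 1, 0, 0], "finding_b": [1, 0, 0, 0]}
preds = {"finding_a": [1, 0, 0, 0], "finding_b": [1, 1, 0, 0]}
score = macro_f1(labels, preds)
```

Because every finding contributes equally, a change in batch composition that hurts a few rare findings can move macro F1 by several points even when overall accuracy barely changes.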

Key takeaway

Computer vision engineers developing vision-language models for 3D medical imaging should prioritize stochastic random sampling and alternating batching over explicit class-ratio balancing. In this study, engineering specific normal-to-abnormal ratios within training batches lowered zero-shot macro F1 in every tested configuration, including on the 4,362-study subset. Rely on the inherent diversity of random sampling as a regularizer instead.

Key insights

Random sampling and alternating batching are more effective than explicit class balancing for 3D medical vision-language models.

Principles

Stochastic diversity aids regularization.
Performance scales sub-linearly with data.
Explicit class balancing can degrade performance.

Method

The study reproduced a dual-encoder model (Merlin) using symmetric InfoNCE loss to align 3D CT volumes with reports, then varied batch normal-to-abnormal ratios and data scaling.
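The symmetric InfoNCE objective named above can be sketched as follows. This is a minimal, dependency-free illustration rather than the study's implementation: the function names are hypothetical, the temperature of 0.07 is an assumed value, and a real training loop would compute the same quantity on GPU tensors with a framework's cross-entropy.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image InfoNCE over a batch.

    Row i of image_embs and row i of text_embs are a matched pair
    (here: one CT volume and its report); all other rows are negatives.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


# Matched pairs give a loss near zero; swapped pairs give a large loss.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Since every other sample in the batch acts as a negative, the mix of normal and abnormal studies in a batch directly shapes which contrasts the model sees, which is why batch composition is a natural variable to ablate.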

In practice

Prioritize random sampling over class balancing.
Consider alternating batching for 3D medical data.
Evaluate data sensitivity for individual findings.
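To make the two batch-composition strategies concrete, the sketch below contrasts plain random sampling with a fixed normal-to-abnormal ratio. It is illustrative only: the function names, the (label, id) tuples, and the rule for calling a study "normal" are assumptions, and Merlin's alternating batching is not modeled here.

```python
import random


def random_batch(studies, batch_size, rng):
    """Unbalanced baseline: draw studies uniformly, whatever their labels."""
    return rng.sample(studies, batch_size)


def ratio_batch(normal, abnormal, batch_size, normal_fraction, rng):
    """Engineered batch with a fixed normal-to-abnormal ratio.

    normal_fraction=0.75 corresponds to the 75:25 setting.
    """
    n_normal = round(batch_size * normal_fraction)
    n_abnormal = batch_size - n_normal
    batch = rng.sample(normal, n_normal) + rng.sample(abnormal, n_abnormal)
    rng.shuffle(batch)  # avoid a fixed normal-then-abnormal ordering
    return batch


# Hypothetical study pools; each study is a (label, id) tuple.
normal_pool = [("normal", i) for i in range(20)]
abnormal_pool = [("abnormal", i) for i in range(20)]
rng = random.Random(0)

fixed = ratio_batch(normal_pool, abnormal_pool, 8, 0.75, rng)
mixed = random_batch(normal_pool + abnormal_pool, 8, rng)
normal_count = sum(1 for label, _ in fixed if label == "normal")
```

With random_batch the class mix varies from batch to batch, which is the stochastic diversity the study credits; ratio_batch removes that variation by construction.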

Topics

CLIP Architecture
Abdominal CT Imaging
Zero-Shot Learning
Vision-Language Models
Batch Composition

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.