CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling
Summary
Researchers reproduced Merlin, a dual-encoder vision-language model that aligns 3D abdominal CT volumes with radiology reports using a symmetric InfoNCE loss. The reproduced model achieved a zero-shot macro F1 score of 74.45% across 30 findings, slightly surpassing the original's 73.00%. The study then investigated training batch composition, specifically the normal-to-abnormal ratio. Fixed-ratio sampling (25:75, 50:50, 75:25) consistently underperformed the unbalanced baseline by 2.4 to 2.8 points, with 75:25 the best fixed-ratio variant at 72.02%. Data scaling ablations on a 4,362-study subset, using 20%, 40%, and 100% of the data, showed sub-linear scaling, with macro F1 rising from 65.26% to 71.88%. Explicit class balancing on this subset degraded performance further, to 68.01%. The results suggest that the stochastic diversity of random sampling, together with Merlin's alternating batching, is more effective than engineered class ratios for 3D medical volumes.
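As a rough illustration of the headline metric, macro F1 averages the per-finding F1 scores with equal weight, so rare findings count as much as common ones. The sketch below assumes binary labels and predictions per finding; the finding names and the way zero-shot scores are thresholded into predictions are hypothetical, since the summary does not describe them.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for one finding; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(labels_by_finding, preds_by_finding):
    """Unweighted mean of per-finding F1 scores."""
    scores = [
        f1_score(labels_by_finding[name], preds_by_finding[name])
        for name in labels_by_finding
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two findings, four studies each.
labels = {"finding_a": [1, 0, 1, 0], "finding_b": [1, 1, 0, 0]}
preds = {"finding_a": [1, 0, 0, 0], "finding_b": [1, 1, 0, 0]}
score = macro_f1(labels, preds)
```

With 30 findings, as in the study, the same function would simply average 30 per-finding scores.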
Key takeaway
Computer vision engineers developing vision-language models for 3D medical imaging should prioritize stochastic random sampling and alternating batching strategies over explicit class-ratio balancing. The research indicates that engineering specific normal-to-abnormal ratios within training batches can degrade zero-shot diagnostic performance, even when more data is available. Rely on the inherent diversity of random sampling for more robust model regularization.
Key insights
Random sampling and alternating batching are more effective than explicit class balancing for 3D medical vision-language models.
Principles
- Stochastic diversity aids regularization.
- Performance scales sub-linearly with data.
- Explicit class balancing can degrade performance.
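To make the sub-linear scaling concrete, the arithmetic can be worked through with the two endpoint scores reported in the summary. This is a minimal sketch that assumes 65.26% corresponds to the 20% split and 71.88% to the 100% split of the 4,362-study subset; the function name is illustrative.

```python
def scaling_summary(frac_small, f1_small, frac_large, f1_large):
    """Compare how much the data grew with how much macro F1 grew."""
    return {
        "data_multiplier": frac_large / frac_small,
        "f1_gain_points": f1_large - f1_small,
        "f1_multiplier": f1_large / f1_small,
    }


# Endpoints as reported in the summary (assumed mapping of score to split).
summary = scaling_summary(0.20, 65.26, 1.00, 71.88)
```

Under these assumptions, five times the data yields a gain of roughly 6.6 macro F1 points, or about a 10% relative improvement.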
Method
The study reproduced a dual-encoder model (Merlin) trained with a symmetric InfoNCE loss to align 3D CT volumes with their reports, then varied the normal-to-abnormal ratio within training batches and the amount of training data.
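A symmetric InfoNCE loss of the kind used in CLIP-style training can be sketched in plain Python as follows. This is an illustrative sketch, not the study's implementation: the embeddings are assumed to be paired row by row, and the temperature of 0.07 is an assumed value rather than one stated in the summary.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE losses."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy check: correctly paired embeddings give a much lower loss than swapped ones.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Every other item in the batch acts as a negative for a given pair, which is why the composition of each batch can influence what the model learns.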
In practice
- Prioritize random sampling over class balancing.
- Consider alternating batching for 3D medical data.
- Evaluate how sensitive each individual finding is to the amount of training data.
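The two batch-construction strategies contrasted above can be sketched as follows. This is a hypothetical illustration: the function names, the tuple-based study records, and the rule for deciding whether a study counts as normal or abnormal are assumptions, not details from the study.

```python
import random


def random_batch(studies, batch_size, rng):
    """Unbalanced baseline: draw studies uniformly, ignoring their class."""
    return rng.sample(studies, batch_size)


def ratio_batch(normal, abnormal, batch_size, normal_fraction, rng):
    """Fixed-ratio batch, e.g. normal_fraction=0.75 for a 75:25 split."""
    n_normal = round(batch_size * normal_fraction)
    n_abnormal = batch_size - n_normal
    batch = rng.sample(normal, n_normal) + rng.sample(abnormal, n_abnormal)
    rng.shuffle(batch)  # avoid a fixed normal-then-abnormal ordering
    return batch


# Hypothetical study pools, identified by (class, index) tuples.
normal_pool = [("normal", i) for i in range(100)]
abnormal_pool = [("abnormal", i) for i in range(100)]
rng = random.Random(0)

baseline = random_batch(normal_pool + abnormal_pool, 8, rng)
fixed = ratio_batch(normal_pool, abnormal_pool, 8, 0.75, rng)
```

In the baseline the class mix varies from batch to batch, while the fixed-ratio sampler produces the same mix every time; the study's results suggest that this variation is useful rather than harmful.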
Topics
- CLIP Architecture
- Abdominal CT Imaging
- Zero-Shot Learning
- Vision-Language Models
- Batch Composition
Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.