CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling
Summary
A study reproduced and extended the Merlin dual-encoder model, which aligns 3D abdominal CT volumes with radiology reports using a symmetric InfoNCE loss. The reproduction achieved a zero-shot macro F1 of 74.45% across 30 findings, surpassing the original 73.00%. The researchers then investigated two factors: batch composition and data scaling. Fixing the normal-to-abnormal ratio within training batches (25:75, 50:50, 75:25) consistently underperformed the unbalanced baseline by 2.4 to 2.8 points, with the 75:25 ratio yielding the best result among the ratio-controlled variants (72.02%). Data scaling ablations on a 4,362-study subset showed sub-linear scaling, with performance rising from 65.26% to 71.88% as the training data increased from 20% to 100% of the subset. Explicit class balancing degraded performance further, even on this near-balanced subset. The authors take this to indicate that the stochastic diversity of random sampling, combined with Merlin's alternating batching strategy, regularizes 3D medical volume training more effectively than engineered class ratios.
Key takeaway
If you are developing 3D medical vision-language models, prioritize stochastic diversity in your training batches over explicit class balancing. In this study, the original Merlin alternating batching strategy, which cycles between full reports and anatomical subsections, regularized and generalized better than enforcing fixed normal-to-abnormal ratios. Relying on validation loss alone for checkpoint selection may lead to suboptimal zero-shot performance; evaluate checkpoints directly on zero-shot metrics instead.
Key insights
Random sampling and alternating batching outperform explicit class balancing for 3D medical vision-language models.
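To make the comparison concrete, here is a minimal sketch of the two batch-construction styles being contrasted: a batch with a fixed normal-to-abnormal ratio versus a plain random draw. The function names, the ID lists, and the batch size are hypothetical and chosen for illustration; the study's actual sampler is not described in this summary.

```python
import random


def ratio_controlled_batch(normal_ids, abnormal_ids, batch_size, normal_fraction, rng):
    """Build one batch with a fixed share of normal studies (hypothetical helper)."""
    n_normal = round(batch_size * normal_fraction)
    n_abnormal = batch_size - n_normal
    # Draw without replacement from each pool, then shuffle the combined batch.
    batch = rng.sample(normal_ids, n_normal) + rng.sample(abnormal_ids, n_abnormal)
    rng.shuffle(batch)
    return batch


def random_batch(all_ids, batch_size, rng):
    """Build one batch by uniform random sampling, ignoring class labels."""
    return rng.sample(all_ids, batch_size)


rng = random.Random(0)
normal_ids = [f"n{i}" for i in range(20)]    # toy study IDs, not real data
abnormal_ids = [f"a{i}" for i in range(20)]  # toy study IDs, not real data

fixed = ratio_controlled_batch(normal_ids, abnormal_ids, 8, 0.75, rng)  # 75:25 ratio
free = random_batch(normal_ids + abnormal_ids, 8, rng)                  # ratio varies per batch
```

With the fixed-ratio sampler every batch has the same class mix, whereas the random sampler lets the mix vary from batch to batch, which is the stochastic diversity the summary refers to.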
Principles
- Stochastic diversity acts as an implicit regularizer.
- Data scaling shows sub-linear performance gains.
- Validation loss can be an unreliable proxy for zero-shot performance.
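The last principle can be illustrated with a small sketch of checkpoint selection by zero-shot macro F1 instead of validation loss. Macro F1 is the unweighted mean of per-finding F1 scores. The helper names, the dictionary layout, and the checkpoint numbers below are assumptions made up for illustration; they are not results or code from the study.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for one finding; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(true_by_finding, pred_by_finding):
    """Unweighted mean of per-finding F1 scores."""
    scores = [f1_score(true_by_finding[k], pred_by_finding[k]) for k in true_by_finding]
    return sum(scores) / len(scores)


def select_checkpoint(checkpoints):
    """Pick the checkpoint with the highest zero-shot macro F1 (hypothetical record format)."""
    return max(checkpoints, key=lambda c: c["zero_shot_macro_f1"])


# Illustrative, made-up numbers: the lowest validation loss is not the best zero-shot model.
checkpoints = [
    {"name": "epoch_1", "val_loss": 1.10, "zero_shot_macro_f1": 0.68},
    {"name": "epoch_2", "val_loss": 0.95, "zero_shot_macro_f1": 0.70},
    {"name": "epoch_3", "val_loss": 1.02, "zero_shot_macro_f1": 0.73},
]
best = select_checkpoint(checkpoints)
```

In this toy example, selecting on validation loss would pick `epoch_2`, while selecting on the zero-shot metric picks `epoch_3`.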
Method
The Merlin framework uses a 3D ResNet152 I3D vision encoder and a Clinical Longformer text encoder, trained with symmetric InfoNCE loss and an alternating batching strategy for full reports and anatomical subsections.
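As a rough illustration of the symmetric InfoNCE objective named above, the sketch below computes it in plain Python for a small batch of paired image and text embeddings. It assumes cosine similarity and a fixed temperature of 0.07; both are assumptions for this example, and the function names are hypothetical rather than taken from the Merlin code.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE losses."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    # logits[i][j]: scaled similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(im, tx)) / temperature for tx in txts] for im in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-image
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))
```

Matched pairs sit on the diagonal of the similarity matrix, so the loss is small when each volume is most similar to its own report and large when it is more similar to another study's report.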
In practice
- Prioritize natural data distribution over engineered class ratios.
- Implement alternating batching for diverse textual contexts.
- Monitor zero-shot metrics directly for checkpoint selection.
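The alternating batching recommendation can be sketched as a generator that switches the text source on each training step. This summary only says the strategy cycles between full reports and anatomical subsections, so the per-step granularity, the mode labels, and the function name here are assumptions for illustration.

```python
import random


def alternating_batches(study_ids, batch_size, num_steps, seed=0):
    """Yield batches whose text source alternates each step (hypothetical scheme)."""
    rng = random.Random(seed)
    modes = ("full_report", "anatomical_subsection")
    for step in range(num_steps):
        yield {
            # Even steps pair volumes with full reports, odd steps with subsections.
            "text_mode": modes[step % 2],
            # Each batch is an independent random draw; no class ratio is enforced.
            "study_ids": rng.sample(study_ids, batch_size),
        }


batches = list(alternating_batches(list(range(10)), batch_size=4, num_steps=4))
```

A training loop would read `text_mode` to decide which text to tokenize for each study in the batch, while the image side stays unchanged.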
Topics
- Merlin Model
- Abdominal CT Imaging
- Contrastive Learning
- Zero-Shot Classification
- Batch Composition
Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.