CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Expert, extended

Summary

A study reproduced and extended Merlin, a dual-encoder model that aligns 3D abdominal CT volumes with radiology reports using a symmetric InfoNCE loss. The reproduction reached a zero-shot macro F1 of 74.45% across 30 findings, surpassing the original 73.00%. The researchers then investigated two factors: batch composition and data scaling. Controlling the normal-to-abnormal ratio within training batches (25:75, 50:50, 75:25) consistently underperformed the unbalanced baseline by 2.4 to 2.8 points; the 75:25 ratio gave the best result among the ratio-controlled variants (72.02%). Data scaling ablations on a 4,362-study subset showed sub-linear scaling, with macro F1 rising from 65.26% to 71.88% as the data increased from 20% to 100%. Explicit class balancing, even on a near-balanced subset, further degraded performance, which indicates that the stochastic diversity of random sampling and Merlin's alternating batching strategy regularize more effectively than engineered class ratios for 3D medical volumes.
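To make the batch-composition experiment concrete, the following is a minimal sketch of a ratio-controlled batch sampler next to a plain random one. It is not the study's code: the function names, the binary normal/abnormal label per study, and the rule of stopping once either pool runs short are all assumptions made for illustration.

```python
import random


def random_batches(num_studies, batch_size, seed=0):
    """Unbalanced baseline: shuffle all study indices and cut into batches."""
    rng = random.Random(seed)
    indices = list(range(num_studies))
    rng.shuffle(indices)
    # Drop the final partial batch so every batch has the same size.
    for start in range(0, num_studies - batch_size + 1, batch_size):
        yield indices[start:start + batch_size]


def ratio_controlled_batches(labels, batch_size, normal_fraction, seed=0):
    """Yield batches with a fixed share of normal studies.

    labels: one entry per study, 0 = normal, 1 = abnormal (assumed encoding).
    normal_fraction: 0.25, 0.50 or 0.75 for the 25:75, 50:50 and 75:25 settings.
    """
    rng = random.Random(seed)
    normal = [i for i, y in enumerate(labels) if y == 0]
    abnormal = [i for i, y in enumerate(labels) if y == 1]
    rng.shuffle(normal)
    rng.shuffle(abnormal)

    n_normal = round(batch_size * normal_fraction)
    n_abnormal = batch_size - n_normal

    # Stop as soon as either pool can no longer fill its quota (assumption).
    while len(normal) >= n_normal and len(abnormal) >= n_abnormal:
        batch = [normal.pop() for _ in range(n_normal)]
        batch += [abnormal.pop() for _ in range(n_abnormal)]
        rng.shuffle(batch)  # avoid a fixed normal-then-abnormal order
        yield batch


if __name__ == "__main__":
    toy_labels = [0] * 20 + [1] * 20  # synthetic labels, not study data
    for batch in ratio_controlled_batches(toy_labels, batch_size=8, normal_fraction=0.75):
        print(sorted(batch))
```

Note that in this sketch the ratio-controlled sampler discards whatever is left in the larger pool once the smaller one is exhausted, so each epoch covers fewer distinct studies than the random baseline does.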

Key takeaway

Computer vision engineers developing 3D medical vision-language models should prioritize stochastic diversity in training batches over explicit class balancing. The original Merlin alternating batching strategy, which cycles between full reports and anatomical subsections, regularizes and generalizes better than enforcing fixed normal-to-abnormal ratios. Selecting checkpoints on validation loss alone may yield suboptimal zero-shot performance; evaluate checkpoints directly on zero-shot metrics instead.
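The checkpoint-selection advice can be sketched as follows: compute macro F1 as the unweighted mean of per-finding F1 scores, then pick the checkpoint that maximizes it rather than the one that minimizes validation loss. The helper names, the dictionary layout, and the toy `history` values are illustrative assumptions, not numbers from the study.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for one finding; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(true_by_finding, pred_by_finding):
    """Unweighted mean of per-finding F1, so rare findings count as much as common ones."""
    scores = [f1_score(true_by_finding[name], pred_by_finding[name])
              for name in true_by_finding]
    return sum(scores) / len(scores)


def select_checkpoint(history, key, higher_is_better):
    """Return the training step of the best checkpoint under the given metric."""
    pick = max if higher_is_better else min
    return pick(history, key=lambda record: record[key])["step"]


# Toy evaluation log with made-up values, only to show that the two criteria can disagree.
history = [
    {"step": 1000, "val_loss": 0.90, "zero_shot_macro_f1": 0.60},
    {"step": 2000, "val_loss": 0.70, "zero_shot_macro_f1": 0.66},
    {"step": 3000, "val_loss": 0.75, "zero_shot_macro_f1": 0.69},
]

by_loss = select_checkpoint(history, "val_loss", higher_is_better=False)
by_f1 = select_checkpoint(history, "zero_shot_macro_f1", higher_is_better=True)
```

In this toy log the lowest validation loss falls at step 2000 while the best zero-shot macro F1 falls at step 3000, which is the kind of disagreement the takeaway warns about.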

Key insights

Random sampling and alternating batching outperform explicit class balancing for 3D medical vision-language models.

Principles

Method

The Merlin framework pairs a 3D ResNet152 I3D vision encoder with a Clinical Longformer text encoder. It is trained with a symmetric InfoNCE loss and an alternating batching strategy that cycles between full reports and anatomical subsections.
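For readers unfamiliar with the objective, here is a minimal pure-Python sketch of a symmetric InfoNCE loss over a batch of paired image and text embeddings. It is written with plain lists for readability and is not Merlin's implementation; the function names and the temperature of 0.07 are assumptions, and a real training loop would use a tensor library with a learned or tuned temperature.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum_exp for x in row]


def symmetric_info_nce(image_embeddings, text_embeddings, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Row k of each input is assumed to come from the same study, so the
    matching pairs lie on the diagonal of the similarity matrix.
    """
    images = [l2_normalize(v) for v in image_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]
    n = len(images)

    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]

    # Image-to-text: each row should put its mass on the diagonal entry.
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image: same computation on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (image_to_text + text_to_image)
```

As a sanity check on the sketch, a batch in which every embedding is identical gives a loss of log(n) for batch size n, since the model cannot tell matching pairs from non-matching ones.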

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.