CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

Researchers reproduced Merlin, a dual-encoder vision-language model that aligns 3D abdominal CT volumes with radiology reports using a symmetric InfoNCE loss. The reproduction reached a zero-shot macro F1 of 74.45% across 30 findings, slightly above the original's 73.00%. The study then varied training batch composition, specifically the normal-to-abnormal ratio. Balanced sampling at 25:75, 50:50, and 75:25 consistently underperformed the unbalanced baseline by 2.4 to 2.8 points; 75:25 was the best balanced variant at 72.02%. Data-scaling ablations on a 4,362-study subset showed sub-linear gains, with macro F1 rising from 65.26% to 71.88% as the training fraction increased from 20% through 40% to 100%. Explicit class balancing on this subset degraded performance to 68.01%, suggesting that the stochastic diversity of random sampling, together with Merlin's alternating batching, is more effective than engineered class ratios for 3D medical volumes.
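To make the headline metric concrete, here is a minimal sketch of macro F1 over several findings. It assumes macro F1 means the unweighted mean of per-finding binary F1 scores; the summary does not spell out the exact averaging, and the finding names and labels below are hypothetical toy data, not results from the study.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for one finding; labels are 0 (absent) or 1 (present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def macro_f1(labels_by_finding, preds_by_finding):
    """Unweighted mean of per-finding F1, so rare findings count equally."""
    scores = [
        f1_score(labels_by_finding[name], preds_by_finding[name])
        for name in labels_by_finding
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two findings, four studies each.
labels = {"finding_a": [1, 0, 1, 0], "finding_b": [1, 1, 0, 0]}
preds = {"finding_a": [1, 0, 0, 0], "finding_b": [1, 1, 0, 0]}
score = macro_f1(labels, preds)  # (2/3 + 1) / 2 = 5/6
```

Because every finding carries equal weight, a drop on a few rare findings can move the macro score by a point or two even when common findings are unaffected.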

Key takeaway

Computer vision engineers developing vision-language models for 3D medical imaging should prioritize stochastic random sampling and alternating batching over explicit class-ratio balancing. In this study, engineering specific normal-to-abnormal ratios within training batches degraded zero-shot diagnostic performance, both in the main experiments and on the 4,362-study subset. The inherent diversity of randomly sampled batches appears to act as a more robust regularizer than a fixed class ratio.

Key insights

Random sampling and alternating batching are more effective than explicit class balancing for 3D medical vision-language models.
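The two batching strategies being compared can be sketched as follows. This is an illustration only: the function names, the ID lists, and the batch size are hypothetical, and Merlin's alternating batching is not reproduced here because the summary does not describe it.

```python
import random


def ratio_batch(normal_ids, abnormal_ids, batch_size, normal_fraction, rng):
    """Draw a batch with an engineered normal:abnormal ratio (e.g. 0.75 for 75:25)."""
    n_normal = round(batch_size * normal_fraction)
    n_abnormal = batch_size - n_normal
    batch = rng.sample(normal_ids, n_normal) + rng.sample(abnormal_ids, n_abnormal)
    rng.shuffle(batch)  # avoid a fixed normal-then-abnormal ordering
    return batch


def random_batch(all_ids, batch_size, rng):
    """Draw a batch uniformly at random; the class mix varies from batch to batch."""
    return rng.sample(all_ids, batch_size)


rng = random.Random(0)
normal_ids = list(range(0, 100))      # hypothetical study IDs labeled normal
abnormal_ids = list(range(100, 200))  # hypothetical study IDs labeled abnormal

fixed = ratio_batch(normal_ids, abnormal_ids, batch_size=8, normal_fraction=0.75, rng=rng)
free = random_batch(normal_ids + abnormal_ids, batch_size=8, rng=rng)
```

With `ratio_batch`, every batch has the same class mix; with `random_batch`, the mix fluctuates around the dataset's natural prevalence, which is the stochastic diversity the study credits for the better results.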

Method

The study reproduced a dual-encoder model (Merlin) trained with a symmetric InfoNCE loss to align 3D CT volumes with reports, then varied the normal-to-abnormal ratio within training batches and the amount of training data.
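For readers less familiar with the objective, the following is a minimal, dependency-free sketch of a symmetric InfoNCE loss over a batch of paired embeddings. It is not the study's code: the temperature value of 0.07 is an assumed placeholder, the embeddings are toy vectors, and a real training loop would use a tensor library rather than plain lists.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image InfoNCE.

    Row i of image_emb and row i of text_emb are assumed to be a matched
    volume-report pair; all other rows in the batch serve as negatives.
    """
    imgs = [_normalize(v) for v in image_emb]
    txts = [_normalize(v) for v in text_emb]
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


# Toy check: matched pairs give a much lower loss than mismatched pairs.
imgs = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(imgs, imgs)
swapped = symmetric_info_nce(imgs, imgs[::-1])
```

Since the negatives for each pair are simply the other items in the batch, batch composition directly shapes the contrastive signal, which is why the normal-to-abnormal ratio is a meaningful variable to ablate.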

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.