Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology
Summary
RefRad2D is a new large-scale bilingual (German/English) dataset comprising 1.2 million CT and MR image-text pairs derived from clinical practice, designed for training visually grounded vision-language models (VLMs) in radiology without manual spatial annotations. This dataset includes task-specific VQA and spatial grounding subsets, automatically generated via LLM-based curation and automated segmentation. The RadGrounder model, trained on RefRad2D, jointly performs report generation, visual question answering, and spatial grounding through bounding-box detection or segmentation. RadGrounder achieves competitive results on external VQA benchmarks like Slake and VQA-RAD. Integrating this clinical data into training improves open-ended VQA performance over fine-tuning on downstream datasets alone, demonstrating the dataset's transferability. Crucially, adding grounding supervision does not compromise language quality, enabling spatially verifiable outputs without degrading VQA performance.
Key takeaway
For machine learning engineers developing radiology AI, RadGrounder's approach offers a path to scalable, spatially grounded VLMs without extensive manual annotation. Automated LLM-based curation and automated segmentation can be used to build large, task-specific datasets such as RefRad2D, which improves VQA performance and makes outputs spatially verifiable. Consider adding grounding supervision early: in the reported results it adds grounding capability without compromising language quality.
Key insights
Spatially grounded radiology VLMs can be trained at scale without manual spatial annotations by combining LLM-based curation with automated segmentation.
Principles
- Automated curation can generate large-scale, task-specific medical imaging datasets.
- Adding spatial grounding supervision does not degrade VLM language quality.
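- Training on clinical data improves open-ended VQA performance over fine-tuning on downstream datasets alone, which indicates that the data transfers to external benchmarks.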
Method
Generate VQA and spatial grounding subsets automatically via LLM-based curation and automated segmentation from clinical image-text pairs.
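As a rough illustration of how a grounding sample might be derived from an automated segmentation, the sketch below converts a binary mask into a bounding box and wraps it in a training record. It is not the authors' pipeline: the function names, the record fields, and the question template are all hypothetical, and the mask is assumed to be a 2D list of 0/1 values indexed as mask[y][x].

```python
from typing import Dict, List, Optional, Tuple

Mask = List[List[int]]            # 2D binary mask, indexed as mask[y][x]
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive pixel indices


def mask_to_bbox(mask: Mask) -> Optional[Box]:
    """Return the tightest box around all nonzero pixels, or None if the mask is empty."""
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def make_grounding_sample(image_id: str, label: str, mask: Mask) -> Optional[Dict]:
    """Build one grounding record (hypothetical schema) from a segmentation mask."""
    box = mask_to_bbox(mask)
    if box is None:
        # Structure not visible in this slice: skip rather than emit an empty box.
        return None
    return {
        "image_id": image_id,
        "question": f"Where is the {label}?",  # hypothetical prompt template
        "bbox": box,
    }


if __name__ == "__main__":
    example_mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
    ]
    print(make_grounding_sample("slice_0001", "liver", example_mask))
```

Skipping empty masks keeps the sketch from producing boxes for structures that the segmentation model did not find in a given 2D slice.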
In practice
- Develop VLMs for joint radiology report generation and VQA.
- Implement bounding-box detection or segmentation for spatial grounding in medical images.
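When implementing bounding-box grounding, a predicted box is commonly checked against a reference box with intersection over union (IoU). The sketch below is a minimal version of that check and is offered as an assumption about how one might verify outputs; the summary does not state which grounding metric the paper uses. Boxes are assumed to be (x_min, y_min, x_max, y_max) in continuous coordinates, and the function name is illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), continuous coordinates


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes; 0.0 if they do not overlap."""
    # Width and height of the overlap region, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 0.0, 2.0, 2.0)
    reference = (1.0, 1.0, 3.0, 3.0)
    print(f"IoU = {box_iou(predicted, reference):.3f}")
```

A threshold on this score (for example, counting a prediction as correct above some chosen IoU) is a design choice that depends on the evaluation protocol.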
Topics
- Vision-Language Models
- Radiology AI
- Medical Imaging
- Spatial Grounding
- VQA
- Dataset Curation
Best for: NLP Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.