Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology
Summary
RadGrounder is a multi-task Vision-Language Model (VLM) for radiology, built on PaliGemma 2, that jointly performs report generation, visual question answering (VQA), and spatial grounding on CT and MRI slices. It was trained on RefRad2D, a large-scale bilingual (German/English) dataset of 1.2 million image-text pairs from clinical practice, with automatically derived spatial grounding annotations. The dataset includes 945k CT and 321k MRI slices, plus a RefRad2D-Grounded subset of 236,157 grounded slice-text pairs. For spatial grounding, RadGrounder uses a token-based bounding-box detection strategy, which outperformed an auxiliary segmentation head (G-IoU 43.6 vs. 36.9) and did not degrade language quality. On the external VQA benchmarks Slake and VQA-RAD, RadGrounder achieved competitive F1 scores of 87.7 and 50.7, respectively, which suggests that its clinical training data transfers to other settings.
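To make the grounding score more concrete, here is a minimal sketch of plain box intersection-over-union. This summary does not define how G-IoU is computed, so the function below is only the basic overlap measure that grounding metrics are typically built on, not the paper's metric. The function name and the (x_min, y_min, x_max, y_max) box layout are assumptions made for illustration.

```python
def box_iou(box_a, box_b):
    """Plain IoU of two boxes given as (x_min, y_min, x_max, y_max).

    Illustrative only: the exact G-IoU definition used for RadGrounder
    is not given in this summary.
    """
    # Corners of the intersection rectangle (may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes offset by one pixel in each direction overlap in a 1x1 square, so `box_iou((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 1/7.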
Key takeaway
For AI Scientists and Machine Learning Engineers developing medical VLMs, this research demonstrates a scalable way to integrate spatial grounding without manual annotation overhead. Consider adopting automated LLM-driven data curation and token-based bounding-box detection when building models like RadGrounder. This strategy enables verifiable, spatially grounded predictions in radiology reports and VQA, which can strengthen clinical trust without compromising language generation quality.
Key insights
Spatially grounded radiology VLMs can be trained at scale without manual annotations while maintaining language quality.
Principles
- Automated LLM-driven curation scales medical VLM data.
- Token-based bounding box prediction is efficient for grounding.
- Adding grounding supervision does not degrade VQA performance.
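The token-based grounding principle can be illustrated with a small sketch. PaliGemma-family models commonly express a box as four location tokens of the form `<locNNNN>`, ordered y_min, x_min, y_max, x_max over 1024 bins. Whether RadGrounder uses exactly this format is an assumption here, as are the function names and the rounding rule (floor, clamped to the last bin).

```python
import re

NUM_BINS = 1024  # assumed number of location bins per axis


def box_to_loc_tokens(box, width, height):
    """Encode a pixel box (x_min, y_min, x_max, y_max) as four <locNNNN> tokens.

    Assumed order: y_min, x_min, y_max, x_max, each normalized by image size.
    """
    x_min, y_min, x_max, y_max = box
    normalized = (y_min / height, x_min / width, y_max / height, x_max / width)
    # Floor into a bin and clamp to the valid range [0, NUM_BINS - 1].
    bins = [min(NUM_BINS - 1, max(0, int(v * NUM_BINS))) for v in normalized]
    return "".join(f"<loc{b:04d}>" for b in bins)


def loc_tokens_to_box(text, width, height):
    """Decode the first four <locNNNN> tokens in text back to a pixel box."""
    bins = [int(m) for m in re.findall(r"<loc(\d{4})>", text)[:4]]
    if len(bins) != 4:
        raise ValueError("expected four location tokens")
    y_min, x_min, y_max, x_max = (b / NUM_BINS for b in bins)
    return (x_min * width, y_min * height, x_max * width, y_max * height)
```

Because boxes are emitted as ordinary vocabulary tokens, the language model head produces them directly and no separate detection or segmentation head is needed. Decoding recovers coordinates only up to the bin resolution.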
Method
An automated pipeline runs TotalSegmentator for 3D segmentation. GPT-OSS (120B) then extracts anatomical mentions from the captions and maps them to a unified class schema, which links visual regions to text for training.
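The linking step can be sketched as follows, assuming each 2D slice of the 3D segmentation is available as a label map (0 for background, positive integers for classes in the unified schema) and that the LLM step has already returned the class IDs mentioned in the caption. The function names, the `min_pixels` filter, and the data layout are hypothetical and are not taken from the paper.

```python
def boxes_from_label_slice(label_slice, min_pixels=1):
    """Return {class_id: (x_min, y_min, x_max, y_max)} for a 2D label map.

    label_slice is a list of rows of integer class IDs; 0 is background.
    Classes with fewer than min_pixels pixels are dropped (hypothetical filter).
    """
    extents = {}
    counts = {}
    for y, row in enumerate(label_slice):
        for x, cls in enumerate(row):
            if cls == 0:
                continue
            counts[cls] = counts.get(cls, 0) + 1
            if cls not in extents:
                extents[cls] = [x, y, x, y]
            else:
                e = extents[cls]
                e[0], e[1] = min(e[0], x), min(e[1], y)
                e[2], e[3] = max(e[2], x), max(e[3], y)
    return {c: tuple(e) for c, e in extents.items() if counts[c] >= min_pixels}


def ground_mentions(boxes, mentioned_classes):
    """Keep only boxes for classes that the caption actually mentions."""
    return {c: boxes[c] for c in mentioned_classes if c in boxes}


# Toy 4x4 slice with two structures (class IDs are placeholders).
toy_slice = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 2],
    [0, 0, 0, 2],
]
toy_boxes = boxes_from_label_slice(toy_slice)
```

A mention whose class is absent from the slice yields no box, which keeps the grounded pairs limited to structures that are both described in the text and visible in the image.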
In practice
- Use an LLM-as-a-judge setup for robust medical text evaluation.
- Employ dynamic sampling for multi-task VLM training.
- Freeze the vision encoder for memory efficiency and faster training.
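One way to picture task sampling in multi-task training is a size-tempered mixture, sketched below. The summary does not specify how RadGrounder's dynamic sampling works, so the exponent `alpha`, the function names, and the task sizes in the usage note are assumptions made for illustration.

```python
import random


def mixing_probs(sizes, alpha=0.5):
    """Turn per-task dataset sizes into sampling probabilities.

    alpha=1 samples proportionally to size, alpha=0 samples uniformly;
    values in between up-weight small tasks (assumed scheme).
    """
    weights = {task: n ** alpha for task, n in sizes.items()}
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}


def sample_tasks(sizes, steps, alpha=0.5, seed=0):
    """Draw one task name per training step from the tempered mixture."""
    probs = mixing_probs(sizes, alpha)
    tasks = list(probs)
    rng = random.Random(seed)  # seeded for reproducible schedules
    return rng.choices(tasks, weights=[probs[t] for t in tasks], k=steps)
```

For example, `sample_tasks({"report": 1000, "vqa": 500, "grounding": 250}, steps=8)` returns a list of eight task names. Recomputing `mixing_probs` during training with a different `alpha` is one simple way to make the schedule change over time.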
Topics
- Radiology VLMs
- Spatial Grounding
- Medical Imaging
- CT and MRI
- Automated Annotation
- Vision-Language Models
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.