Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

2026-06-18 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research, Data Science & Analytics · Depth: Advanced, quick

Summary

RefRad2D is a new large-scale bilingual (German/English) dataset of 1.2 million CT and MR image-text pairs derived from clinical practice, designed for training visually grounded vision-language models (VLMs) in radiology without manual spatial annotations. It includes task-specific VQA and spatial grounding subsets, generated automatically via LLM-based curation and automated segmentation. The RadGrounder model, trained on RefRad2D, jointly performs report generation, visual question answering, and spatial grounding through bounding-box detection or segmentation. RadGrounder achieves competitive results on external VQA benchmarks such as Slake and VQA-RAD. Adding this clinical data to training improves open-ended VQA performance over fine-tuning on the downstream datasets alone, which indicates that the dataset transfers to other benchmarks. Crucially, adding grounding supervision does not compromise language quality, so the model can produce spatially verifiable outputs without degrading VQA performance.

Key takeaway

For machine learning engineers building radiology AI, RadGrounder shows a path to scalable, spatially grounded VLMs without extensive manual annotation. Automated LLM-based curation and segmentation can produce large, task-specific datasets such as RefRad2D, which improve open-ended VQA performance and make model outputs spatially verifiable. Grounding supervision can be trained jointly with report generation and VQA, since in this work it did not degrade language quality or VQA performance.

Key insights

Spatially grounded radiology VLMs can be trained at scale without manual spatial annotations, using LLM-based curation and automated segmentation.

Principles

Automated curation can generate large-scale, task-specific medical imaging datasets.
Adding spatial grounding supervision does not degrade VLM language quality.
Training on clinical data improves open-ended VQA performance over fine-tuning on downstream datasets alone, and the gains transfer to external benchmarks.

Method

Generate VQA and spatial grounding subsets automatically via LLM-based curation and automated segmentation from clinical image-text pairs.
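
One step in a pipeline like this is turning an automated segmentation mask into a bounding-box target for a grounding sample. The summary does not describe how RefRad2D does this, so the following is only a minimal sketch under assumptions: the mask is a 2D grid of 0/1 values, the box uses inclusive pixel indices, and the names mask_to_bbox and make_grounding_sample as well as the question template are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

# (x_min, y_min, x_max, y_max), inclusive pixel indices (assumed convention)
BBox = Tuple[int, int, int, int]


def mask_to_bbox(mask: Sequence[Sequence[int]]) -> Optional[BBox]:
    """Return the tightest box around all nonzero pixels, or None if the mask is empty."""
    xs = []
    ys = []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def make_grounding_sample(mask: Sequence[Sequence[int]], structure: str) -> Optional[Dict]:
    """Pair a hypothetical question template with the box derived from the mask."""
    bbox = mask_to_bbox(mask)
    if bbox is None:
        # Skip structures the segmenter did not find in this slice.
        return None
    return {"question": f"Where is the {structure}?", "bbox": bbox}


example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
sample = make_grounding_sample(example_mask, "liver")
```

Returning None for empty masks keeps slices without the target structure out of the grounding subset instead of emitting a degenerate box.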

In practice

Develop VLMs for joint radiology report generation and VQA.
Implement bounding-box detection or segmentation for spatial grounding in medical images.
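
When bounding-box grounding is implemented, predicted boxes are commonly compared with reference boxes using intersection over union (IoU). The summary does not say which grounding metric the authors report, so this is an illustrative sketch rather than their evaluation code. It assumes boxes are given as (x_min, y_min, x_max, y_max) corner coordinates with x_max > x_min and y_max > y_min; the function name box_iou is hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed convention


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap width and height, clamped at zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)
reference = (1.0, 1.0, 3.0, 3.0)
score = box_iou(predicted, reference)
```

A prediction is then typically counted as correct when its IoU with the reference exceeds a chosen threshold; the threshold is a design choice and is not specified in the source.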

Topics

Vision-Language Models
Radiology AI
Medical Imaging
Spatial Grounding
VQA
Dataset Curation

Best for: NLP Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.