Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

2018-11-20 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Medical Devices & Health Technology · Depth: Expert, long

Summary

RadGrounder is a PaliGemma 2-based multi-task Vision-Language Model (VLM) for radiology that jointly performs report generation, visual question answering (VQA), and spatial grounding on CT and MRI slices. It was trained on RefRad2D, a large-scale bilingual (German/English) dataset of 1.2 million image-text pairs from clinical practice, with automatically derived spatial grounding annotations. The dataset includes 945k CT and 321k MRI slices, plus a RefRad2D-Grounded subset of 236,157 grounded slice-text pairs. For spatial grounding, RadGrounder predicts bounding boxes as tokens; this strategy outperformed an auxiliary segmentation head (G-IoU 43.6 vs. 36.9) and did not degrade language quality. On the external VQA benchmarks Slake and VQA-RAD, RadGrounder achieved competitive F1 scores of 87.7 and 50.7 respectively, suggesting that its clinical training data transfers to other settings.

Key takeaway

For AI Scientists and Machine Learning Engineers developing medical VLMs, this research demonstrates a scalable way to integrate spatial grounding without manual annotation overhead. Consider adopting automated LLM-driven data curation and token-based bounding-box detection when building models like RadGrounder. This strategy enables verifiable, spatially grounded predictions in radiology reports and VQA, which can strengthen clinical trust without compromising language generation quality.

Key insights

Spatially grounded radiology VLMs can be trained at scale without manual annotations while maintaining language quality.

Principles

Automated LLM-driven curation scales medical VLM data.
Token-based bounding box prediction is efficient for grounding.
Adding grounding supervision does not degrade VQA performance.
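
To make the token-based approach concrete, here is a minimal sketch of how a bounding box can be written as text tokens and read back. It assumes the PaliGemma-style location vocabulary (`<loc0000>` to `<loc1023>`, emitted in the order y_min, x_min, y_max, x_max). The exact quantization rule and the function names `box_to_loc_tokens` and `loc_tokens_to_box` are illustrative assumptions, not details confirmed by the summary.

```python
import re

NUM_BINS = 1024  # assumed location vocabulary size: <loc0000> ... <loc1023>


def box_to_loc_tokens(box, width, height):
    """Encode a pixel box (x_min, y_min, x_max, y_max) as four <locNNNN> tokens.

    Tokens are emitted in the order y_min, x_min, y_max, x_max, with each
    coordinate normalized by the image size and quantized into NUM_BINS bins.
    The quantization rule (floor of value * NUM_BINS) is an assumption.
    """
    x_min, y_min, x_max, y_max = box
    normalized = (y_min / height, x_min / width, y_max / height, x_max / width)
    bins = [min(NUM_BINS - 1, max(0, int(v * NUM_BINS))) for v in normalized]
    return "".join(f"<loc{b:04d}>" for b in bins)


def loc_tokens_to_box(text, width, height):
    """Decode the first four <locNNNN> tokens in `text` back to a pixel box."""
    bins = [int(m) for m in re.findall(r"<loc(\d{4})>", text)[:4]]
    if len(bins) != 4:
        raise ValueError("expected four location tokens")
    y_min, x_min, y_max, x_max = (b / NUM_BINS for b in bins)
    return (x_min * width, y_min * height, x_max * width, y_max * height)


# Hypothetical example: a box on a 512 x 512 slice, followed by a class label.
tokens = box_to_loc_tokens((64, 128, 256, 384), width=512, height=512)
target_text = tokens + " liver"
decoded = loc_tokens_to_box(target_text, width=512, height=512)
```

Because the box is ordinary text in the output sequence, the same decoder and loss serve report generation, VQA, and grounding, which is one plausible reason no separate detection head is needed.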

Method

An automated pipeline first runs TotalSegmentator to segment anatomy in 3D. GPT-OSS (120B) then extracts anatomical mentions from the captions and maps them to a unified class schema, linking visual regions to text for training.
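
The linking step can be sketched as follows, under several assumptions: the 2D label map stands in for one slice of a segmentation volume, the `mentions` list stands in for the LLM's extraction output, and the `schema` dictionary with its class ids is hypothetical. Plain Python lists are used instead of an array library to keep the example self-contained.

```python
def boxes_from_label_slice(label_slice):
    """Return {class_id: (x_min, y_min, x_max, y_max)} for a 2D label map.

    `label_slice` is a list of rows of integer class ids, where 0 is
    background. Box coordinates are inclusive pixel indices.
    """
    boxes = {}
    for y, row in enumerate(label_slice):
        for x, class_id in enumerate(row):
            if class_id == 0:
                continue
            if class_id not in boxes:
                boxes[class_id] = [x, y, x, y]
            else:
                box = boxes[class_id]
                box[0] = min(box[0], x)
                box[1] = min(box[1], y)
                box[2] = max(box[2], x)
                box[3] = max(box[3], y)
    return {class_id: tuple(box) for class_id, box in boxes.items()}


def link_mentions(mentions, schema, boxes):
    """Pair each caption mention with a box when its mapped class is visible."""
    linked = []
    for mention in mentions:
        class_id = schema.get(mention.lower())
        if class_id in boxes:
            linked.append((mention, boxes[class_id]))
    return linked


# Hypothetical inputs: a tiny label slice, a class schema, and extracted mentions.
label_slice = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
]
schema = {"liver": 1, "spleen": 2, "left kidney": 3}
mentions = ["Liver", "left kidney"]

boxes = boxes_from_label_slice(label_slice)
pairs = link_mentions(mentions, schema, boxes)
```

Mentions whose class is absent from the slice are dropped, so only text that can be tied to a visible region becomes a grounded training pair.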

In practice

Use LLM-as-a-judge for robust medical text evaluation.
Employ dynamic sampling for multi-task VLM training.
Freeze the vision encoder for memory efficiency and faster training.
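
The summary does not say how dynamic sampling is implemented. As one possible reading, the sketch below picks the task for each training step from a mixture whose weights shift linearly over training. The schedule, the task names, and the weights are all assumptions made for illustration.

```python
import random


def task_weights(step, total_steps, start, end):
    """Linearly interpolate per-task sampling weights from `start` to `end`."""
    t = min(1.0, max(0.0, step / total_steps))
    return {task: (1 - t) * start[task] + t * end[task] for task in start}


def sample_task(step, total_steps, start, end, rng):
    """Draw the task for this step according to the current mixture weights."""
    weights = task_weights(step, total_steps, start, end)
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[k] for k in tasks], k=1)[0]


# Hypothetical mixture: grounding is rare early on and more frequent later.
start = {"report": 0.6, "vqa": 0.3, "grounding": 0.1}
end = {"report": 0.4, "vqa": 0.3, "grounding": 0.3}

rng = random.Random(0)  # seeded for reproducibility
schedule = [sample_task(step, 1000, start, end, rng) for step in range(1000)]
```

Each batch would then be drawn from the dataset of the sampled task, so the task mix can change during training without rebuilding the underlying datasets.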

Topics

Radiology VLMs
Spatial Grounding
Medical Imaging
CT and MRI
Automated Annotation
Vision-Language Models

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.