Vision-language models for chest radiography do not always need the image
Summary
A recent study of vision-language models (VLMs) for chest radiography finds that many models reach high accuracy by exploiting finding-name priors in the text rather than analyzing the image. The researchers introduce a causal audit that intervenes on the image, either by occluding relevant or irrelevant regions or by swapping in same-label scans from other patients, and scores the result with three behavioral metrics. Across nine systems, a text-only model with no image access came within 5.7 accuracy points of the best multimodal VLM, and a 119-billion-parameter multimodal model was statistically indistinguishable from a 7-billion-parameter text-only baseline. The audit sorted the nine systems into three that ignore the image, one that is unstable, and five that use the image selectively for specific findings. In a comparison with board-certified radiologists, a text-only model matched radiologist accuracy with zero grounding, while the image-using models showed grounding rates comparable to the radiologists'.
Key takeaway
If you evaluate medical vision-language models for clinical deployment, accuracy metrics alone are insufficient and can mislead. Run causal grounding audits to verify that a model actually interprets the image rather than exploiting text-based priors. Prioritize models with radiologist-comparable grounding rates, and note that reported confidence flags ungrounded answers only when the model actively uses the image. These checks give stronger evidence that a model is robust and safe for real-world diagnostic use.
Key insights
Many medical vision-language models for chest radiography achieve high accuracy by exploiting text-based finding-name priors rather than actual image interpretation.
Principles
- Accuracy alone is insufficient for VLM clinical deployment.
- Grounding audits are crucial for VLM reliability assessment.
- Text-only priors can mimic image-based performance.
Method
A causal audit intervenes on the image by occluding relevant or irrelevant regions or by swapping in same-label scans from other patients, and combines three behavioral metrics to test whether a model's answers depend on the image.
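The three interventions can be illustrated with a minimal Python sketch. This is not the study's implementation: the summary does not name the three behavioral metrics, so the single `flip_rate` metric, the case fields (`relevant_box`, `irrelevant_box`, `swap_image`), and both toy models are assumptions made for illustration. The working assumption is that a grounded model should change its answer when the relevant region is occluded and keep it under irrelevant occlusion or a same-label swap.

```python
import numpy as np

def occlude(image, box, fill=0.0):
    """Return a copy of `image` with box (top, left, bottom, right) filled."""
    top, left, bottom, right = box
    out = image.copy()
    out[top:bottom, left:right] = fill
    return out

def flip_rate(model, cases, intervene):
    """Fraction of cases whose answer changes after the intervention."""
    flips = 0
    for case in cases:
        before = model(case["image"], case["question"])
        after = model(intervene(case), case["question"])
        flips += int(before != after)
    return flips / len(cases)

def audit(model, cases):
    """Apply the three interventions and report a flip rate for each."""
    return {
        "relevant_occlusion": flip_rate(
            model, cases, lambda c: occlude(c["image"], c["relevant_box"])),
        "irrelevant_occlusion": flip_rate(
            model, cases, lambda c: occlude(c["image"], c["irrelevant_box"])),
        "same_label_swap": flip_rate(
            model, cases, lambda c: c["swap_image"]),
    }

# Hypothetical toy models: one reads the top-left region, one ignores the image.
def image_reader(image, question):
    return "present" if image[0:4, 0:4].mean() > 0.5 else "absent"

def text_only(image, question):
    return "present"

def make_case(seed):
    """Build a synthetic 8x8 'scan' with a bright finding in the top-left."""
    rng = np.random.default_rng(seed)
    image = rng.uniform(0.0, 0.1, size=(8, 8))
    image[0:4, 0:4] = 1.0
    swap = rng.uniform(0.0, 0.1, size=(8, 8))  # other patient, same label
    swap[0:4, 0:4] = 1.0
    return {
        "image": image,
        "swap_image": swap,
        "question": "Is there a pleural effusion?",
        "relevant_box": (0, 0, 4, 4),
        "irrelevant_box": (4, 4, 8, 8),
    }

cases = [make_case(seed) for seed in range(5)]
grounded_report = audit(image_reader, cases)
ungrounded_report = audit(text_only, cases)
```

In this toy setup, a model that never changes its answer under any intervention, like `text_only`, shows the signature of ignoring the image. A real audit would replace the toy models with calls to the VLM under test and the boxes with annotated regions.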
In practice
- Implement causal audits for VLM evaluation.
- Prioritize grounding metrics over raw accuracy.
- Test VLM robustness across datasets and resolutions.
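One way to act on the second point is to check whether a multimodal model's accuracy is distinguishable from a text-only baseline before crediting the image. The sketch below uses a paired percentile bootstrap over the same test cases. The summary does not say which statistical test the study used, so the method, the function name `paired_bootstrap_gap`, and its parameters are assumptions.

```python
import numpy as np

def paired_bootstrap_gap(correct_a, correct_b, n_boot=2000, seed=0):
    """Accuracy gap (a minus b) with a 95% paired-bootstrap interval.

    `correct_a` and `correct_b` are per-case 0/1 correctness flags for
    two models evaluated on the same cases, in the same order.
    """
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    if a.shape != b.shape or a.ndim != 1 or a.size == 0:
        raise ValueError("inputs must be non-empty 1-D arrays of equal length")
    rng = np.random.default_rng(seed)
    # Resample case indices with replacement, keeping the pairing intact.
    idx = rng.integers(0, a.size, size=(n_boot, a.size))
    gaps = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    low, high = np.percentile(gaps, [2.5, 97.5])
    return float(a.mean() - b.mean()), float(low), float(high)
```

If the interval contains zero, the accuracy numbers alone give no evidence that the image is being used, and a grounding audit is needed to settle the question.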
Topics
- Vision-Language Models
- Chest Radiography
- Causal Audits
- Medical AI Evaluation
- Model Grounding
- Diagnostic Accuracy
Best for: AI Architect, CTO, VP of Engineering/Data, AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.