Vision-language models for chest radiography do not always need the image

2026-06-16 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Expert, quick

Summary

A recent study of vision-language models (VLMs) for chest radiography finds that many models reach high accuracy by exploiting finding-name priors in the text rather than by analyzing the image. The researchers introduce a causal audit that intervenes on the image, either by occluding relevant or irrelevant regions or by swapping in same-label scans from other patients, and scores the results with three behavioral metrics. Across nine systems, a text-only model with no image access came within 5.7 accuracy points of the best multimodal VLM, and a 119-billion-parameter multimodal model was statistically indistinguishable from a 7-billion-parameter text-only baseline. The audit sorts the nine systems into three that ignore the image, one that is unstable, and five that use the image selectively for specific findings. In a comparison with board-certified radiologists, a text-only model matched radiologist accuracy with zero grounding, while image-using models showed grounding rates comparable to the radiologists'.

Key takeaway

For AI scientists evaluating medical vision-language models for clinical deployment, accuracy metrics alone are insufficient and can mislead. Run causal grounding audits to verify that a model actually interprets the image rather than exploiting text-based priors. Prioritize models whose grounding rates are comparable to radiologists', because a model's reported confidence flags ungrounded answers only when the model actively uses the image. Accuracy combined with grounding evidence gives a sounder basis for judging whether a model is robust and safe for real-world diagnostic use.

Key insights

Many medical vision-language models for chest radiography achieve high accuracy by exploiting text-based finding-name priors rather than actual image interpretation.

Principles

Accuracy alone is insufficient for VLM clinical deployment.
Grounding audits are crucial for VLM reliability assessment.
Text-only priors can mimic image-based performance.
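
One way to check whether a multimodal model really beats a text-only baseline is to put an interval around the accuracy gap on the same test items. The sketch below uses a paired percentile bootstrap. This is an assumption for illustration, since the summary does not say which statistical test the study used, and the function name and inputs are hypothetical.

```python
import numpy as np

def paired_bootstrap_gap(correct_a, correct_b, n_boot=2000, seed=0):
    """Accuracy gap (A minus B, in points) with a 95% percentile bootstrap interval.

    correct_a, correct_b: 0/1 correctness of two models on the SAME items,
    in the same order (hypothetical inputs, not data from the study).
    """
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both models must be scored on the same items")
    rng = np.random.default_rng(seed)
    n = a.size
    # Resample item indices with replacement; the same indices are used for
    # both models so the comparison stays paired.
    idx = rng.integers(0, n, size=(n_boot, n))
    gaps = (a[idx] - b[idx]).mean(axis=1) * 100.0
    lo, hi = np.percentile(gaps, [2.5, 97.5])
    return float((a.mean() - b.mean()) * 100.0), float(lo), float(hi)
```

If the interval contains zero, the two models are not separated by accuracy on that test set. That result says nothing about whether either model uses the image.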

Method

A causal audit intervenes on the image by occluding relevant or irrelevant regions or by swapping in same-label scans from other patients, then combines three behavioral metrics to test whether the model's answers depend on the image.
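
The interventions described above can be sketched as a small harness. This is a minimal illustration, not the study's implementation. The summary does not name the three behavioral metrics, so the sketch uses a single stand-in, the fraction of answers that change after an intervention. The case fields (`relevant_box`, `irrelevant_box`, `swap_image`) and both toy models are hypothetical.

```python
import numpy as np

def occlude(image, box, fill=0.0):
    """Return a copy of `image` with the (row0, row1, col0, col1) box filled."""
    r0, r1, c0, c1 = box
    out = image.copy()
    out[r0:r1, c0:c1] = fill
    return out

def flip_rate(model, cases, intervene):
    """Fraction of cases whose answer changes once the image is intervened on."""
    flips = sum(
        model(c["image"], c["question"]) != model(intervene(c), c["question"])
        for c in cases
    )
    return flips / len(cases)

def audit(model, cases):
    """Run the three interventions and report a flip rate for each."""
    return {
        "relevant_occlusion": flip_rate(
            model, cases, lambda c: occlude(c["image"], c["relevant_box"])),
        "irrelevant_occlusion": flip_rate(
            model, cases, lambda c: occlude(c["image"], c["irrelevant_box"])),
        "same_label_swap": flip_rate(model, cases, lambda c: c["swap_image"]),
    }

# Toy stand-ins for real systems (assumptions for illustration only).
def text_prior_model(image, question):
    return "yes"  # answers from the question alone and never reads the image

def pixel_model(image, question):
    return "yes" if image.max() > 0.5 else "no"  # answer depends on pixels

image = np.zeros((8, 8))
image[2:4, 2:4] = 1.0          # synthetic "finding"
swap = np.zeros((8, 8))
swap[2:4, 3:5] = 1.0           # another "patient" with the same label
cases = [{
    "image": image,
    "question": "Is there an opacity?",
    "relevant_box": (2, 4, 2, 4),
    "irrelevant_box": (5, 7, 5, 7),
    "swap_image": swap,
}]

text_report = audit(text_prior_model, cases)
pixel_report = audit(pixel_model, cases)
```

On this reading, a model that uses the image should change its answer when the relevant region is hidden and hold steady under irrelevant occlusion or a same-label swap. A model that never changes its answer under any intervention is answering from the text. In the toy data, `pixel_model` flips only under relevant occlusion, and `text_prior_model` flips under none.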

In practice

Implement causal audits for VLM evaluation.
Prioritize grounding metrics over raw accuracy.
Test VLM robustness across datasets and resolutions.

Topics

Vision-Language Models
Chest Radiography
Causal Audits
Medical AI Evaluation
Model Grounding
Diagnostic Accuracy

Best for: AI Architect, CTO, VP of Engineering/Data, AI Scientist, Research Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.