Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning

2026-03-05 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Advanced, extended

Summary

A new counterfactual evaluation framework assesses visual grounding in multimodal medical Visual Question Answering (VQA) models and finds that Reinforcement Learning with Verifiable Rewards (RLVR) improves accuracy while degrading the models' actual dependence on the image. Researchers from MD Anderson Cancer Center, Eisai Inc., and CORD.ai introduce metrics including Visual Reliance Score (VRS), Image Sensitivity (IS), and Hallucinated Visual Reasoning Rate (HVRR), and apply them to four medical VQA benchmarks: PathVQA, PMC-VQA, SLAKE, and VQA-RAD. Text-only RLVR reaches a negative VRS on PathVQA, meaning it performs better with mismatched images than with the correct ones, and retains 81% performance with blank images on VQA-RAD. Image-text RLVR reduces image sensitivity to 39.8% overall. Models generate ungrounded visual claims in 38-43% of responses, and image-text RLVR shows a 61% conditional hallucination probability. These results suggest that accuracy-only rewards let models exploit shortcuts, which calls for grounding-aware evaluation and training objectives.

Key takeaway

If you develop or deploy multimodal medical VQA models, move beyond accuracy-only metrics. Your evaluation protocols should incorporate grounding-aware metrics such as Visual Reliance Score (VRS), Image Sensitivity (IS), and Hallucinated Visual Reasoning Rate (HVRR) to confirm that models genuinely use visual information. Otherwise, you risk deploying models that appear accurate but rely on text shortcuts, which can lead to critical errors in clinical settings where visual dependence is paramount.

Key insights

Accuracy-only RLVR in medical VQA degrades visual grounding despite improving benchmark accuracy.

Principles

Accuracy metrics alone cannot assess visual grounding.
Text shortcuts in VQA benchmarks enable spurious correlations.
Visual claims can be generated without actual image dependence.

Method

A counterfactual evaluation framework runs each question with real, blank, and shuffled images and measures Visual Reliance Score (VRS), Image Sensitivity (IS), and Hallucinated Visual Reasoning Rate (HVRR) to detect ungrounded visual claims.
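
This digest does not reproduce the paper's formulas, so the following is only a minimal sketch under stated assumptions. VRS is taken as accuracy with real images minus accuracy with shuffled images (so it turns negative when mismatched images help), IS as the share of questions whose answer changes under a blank or shuffled image, and HVRR as the share of responses that make a visual claim while the answer is identical under all three image conditions. The `Record` fields and all function names are hypothetical, and the `visual_claim` flag is assumed to come from a separate annotation step.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Record:
    """One question evaluated under three image conditions (hypothetical schema)."""
    gold: str            # reference answer
    pred_real: str       # model answer with the real image
    pred_blank: str      # model answer with a blank image
    pred_shuffled: str   # model answer with another sample's image
    visual_claim: bool   # response text asserts something "seen" in the image


def _accuracy(records: Sequence[Record], field: str) -> float:
    return sum(getattr(r, field) == r.gold for r in records) / len(records)


def _image_invariant(r: Record) -> bool:
    # The answer is the same no matter which image was supplied.
    return r.pred_real == r.pred_blank == r.pred_shuffled


def visual_reliance_score(records: Sequence[Record]) -> float:
    # Assumed definition: accuracy lost when the image is mismatched.
    return _accuracy(records, "pred_real") - _accuracy(records, "pred_shuffled")


def image_sensitivity(records: Sequence[Record]) -> float:
    # Assumed definition: fraction of answers that change under a counterfactual image.
    return sum(not _image_invariant(r) for r in records) / len(records)


def hallucinated_visual_reasoning_rate(records: Sequence[Record]) -> float:
    # Assumed definition: visual claim made, yet the answer ignores the image.
    return sum(r.visual_claim and _image_invariant(r) for r in records) / len(records)


def conditional_hallucination(records: Sequence[Record]) -> float:
    # Assumed definition: among responses with a visual claim, share that are image-invariant.
    claims = [r for r in records if r.visual_claim]
    if not claims:
        return 0.0
    return sum(_image_invariant(r) for r in claims) / len(claims)
```

Under these assumed definitions, a model can score perfect accuracy on real images and still show a high HVRR, which is the gap between accuracy and grounding that the summary describes.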

In practice

Use VRS, IS, and HVRR for comprehensive model evaluation.
Curate benchmarks to ensure questions require visual analysis.
Implement training objectives enforcing explicit image dependence.
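
The last two recommendations can be illustrated with a short sketch. Neither function comes from the paper: `needs_image` is a hypothetical curation filter that keeps a question only when counterfactual images do not already yield the reference answer, and `grounded_reward` is a hypothetical reward that discounts a correct answer the model also produces from a blank image. The `penalty` value is an arbitrary placeholder.

```python
def needs_image(gold: str, pred_blank: str, pred_shuffled: str) -> bool:
    """Hypothetical curation filter: keep questions that are not solved without the right image.

    A single lucky guess (for example on a yes/no question) can trip this check,
    so in practice one would likely aggregate over several models or samples.
    """
    return pred_blank != gold and pred_shuffled != gold


def grounded_reward(gold: str, pred_real: str, pred_blank: str, penalty: float = 0.5) -> float:
    """Hypothetical reward: full credit only when the correct answer depends on the image."""
    if pred_real != gold:
        return 0.0  # wrong answer earns nothing
    if pred_blank == gold:
        return 1.0 - penalty  # correct, but reachable without looking at the image
    return 1.0  # correct and image-dependent
```

The design choice here is to keep the verifiable accuracy signal and subtract from it, rather than replace it, so that a model is never rewarded for a wrong answer that merely varies with the image.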

Topics

Medical VQA
Visual Grounding
Reinforcement Learning
Large Vision Language Models
Hallucinated Reasoning

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.