Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning
Summary
A new counterfactual evaluation framework assesses visual grounding in multimodal medical Visual Question Answering (VQA) models, revealing that Reinforcement Learning with Verifiable Rewards (RLVR) improves accuracy while degrading actual visual dependence. Researchers from MD Anderson Cancer Center, Eisai Inc., and CORD.ai introduced metrics like Visual Reliance Score (VRS), Image Sensitivity (IS), and Hallucinated Visual Reasoning Rate (HVRR) across four medical VQA benchmarks: PathVQA, PMC-VQA, SLAKE, and VQA-RAD. Findings indicate that text-only RLVR achieves negative VRS on PathVQA (performing better with mismatched images) and retains 81% performance with blank images on VQA-RAD. Image-text RLVR reduces image sensitivity to 39.8% overall, and models generate ungrounded visual claims in 38-43% of responses, with image-text RLVR showing a 61% conditional hallucination probability. This suggests accuracy-only rewards enable shortcut exploitation, necessitating grounding-aware evaluation and training objectives.
Key takeaway
Research scientists developing or deploying multimodal medical VQA models must move beyond accuracy-only metrics. Evaluation protocols should incorporate grounding-aware metrics like Visual Reliance Score (VRS), Image Sensitivity (IS), and Hallucinated Visual Reasoning Rate (HVRR) to verify that models genuinely use visual information. Failing to do so risks deploying models that appear accurate but rely on text shortcuts, which could lead to critical errors in clinical settings where visual dependence is paramount.
Key insights
Accuracy-only RLVR in medical VQA degrades visual grounding despite improving benchmark accuracy.
Principles
- Accuracy metrics alone cannot assess visual grounding.
- Text shortcuts in VQA benchmarks enable spurious correlations.
- Visual claims can be generated without actual image dependence.
Method
A counterfactual evaluation framework compares model outputs on real, blank, and shuffled images to compute Visual Reliance Score (VRS), Image Sensitivity (IS), and Hallucinated Visual Reasoning Rate (HVRR), the last of which targets ungrounded visual claims.
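The counterfactual comparison can be sketched in a few lines of Python. This summary does not give the paper's exact formulas, so every definition below is an assumption made for illustration: VRS is taken as the accuracy drop when the real image is replaced by a mismatched one, IS as the share of questions whose answer changes under either perturbation, and HVRR as the share of responses that cite visual evidence even though the answer is unchanged without the real image. The `Record` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One VQA item evaluated under three image conditions (hypothetical schema)."""
    gold: str                 # reference answer
    pred_real: str            # model answer with the original image
    pred_blank: str           # model answer with a blank image
    pred_shuffled: str        # model answer with a mismatched image
    makes_visual_claim: bool  # real-image response cites visual evidence


def accuracy(records, field):
    """Exact-match accuracy for one image condition; expects a non-empty list."""
    return sum(getattr(r, field) == r.gold for r in records) / len(records)


def visual_reliance_score(records):
    # ASSUMED definition: accuracy with real images minus accuracy with
    # mismatched images. Negative means mismatched images scored higher.
    return accuracy(records, "pred_real") - accuracy(records, "pred_shuffled")


def image_sensitivity(records):
    # ASSUMED definition: share of items whose answer changes when the
    # real image is replaced by a blank or mismatched one.
    changed = sum(
        r.pred_real != r.pred_blank or r.pred_real != r.pred_shuffled
        for r in records
    )
    return changed / len(records)


def hallucinated_visual_reasoning_rate(records):
    # ASSUMED definition: share of items where the response cites visual
    # evidence although the answer is identical under both perturbations.
    ungrounded = sum(
        r.makes_visual_claim
        and r.pred_real == r.pred_blank
        and r.pred_real == r.pred_shuffled
        for r in records
    )
    return ungrounded / len(records)


if __name__ == "__main__":
    # Toy data, not results from the paper.
    demo = [
        Record("yes", "yes", "yes", "yes", True),
        Record("no", "no", "yes", "yes", True),
        Record("liver", "liver", "liver", "lung", False),
        Record("ct", "mri", "mri", "ct", False),
    ]
    print("VRS :", visual_reliance_score(demo))
    print("IS  :", image_sensitivity(demo))
    print("HVRR:", hallucinated_visual_reasoning_rate(demo))
```

In this sketch the `makes_visual_claim` flag is supplied by the caller; how such claims are detected in practice (manual annotation, a classifier, or another method) is left open here. Under these assumed definitions, a model that answers from text shortcuts alone would show VRS near zero or below and a low IS.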
In practice
- Use VRS, IS, and HVRR for comprehensive model evaluation.
- Curate benchmarks to ensure questions require visual analysis.
- Implement training objectives enforcing explicit image dependence.
Topics
- Medical VQA
- Visual Grounding
- Reinforcement Learning
- Large Vision Language Models
- Hallucinated Reasoning
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.