A Benchmark for Hallucination Detection in VLMs for Gastrointestinal Endoscopy
Summary
A new benchmark study addresses the critical issue of hallucination detection in Vision-Language Models (VLMs) for gastrointestinal (GI) endoscopy, an area largely underexplored compared to radiology. Researchers evaluated nine hallucination detection methods across three categories (black-box, gray-box, and white-box) on the Gut-VLM dataset, which comprises 4,392 test VQA pairs. Five VLMs were tested: MedGemma-4B, MedGemma-27B, LLaVA-Med-7B, LLaVA-v1.6-7B, and Lingshu-32B. The white-box method, ReXTrust, consistently achieved the highest AUC, peaking at 93.0 on MedGemma-4B, and demonstrated a statistically significant advantage of 19.5 AUC points on average over other methods. Token-level gray-box statistics (MaxEnt, MaxProb) were identified as the strongest non-white-box alternatives. The study also highlighted "confident confabulation" as a systemic failure mode for existing detection approaches.
Key takeaway
If you are deploying VLMs in clinical settings, particularly for GI endoscopy, prioritize models that allow white-box access for hallucination detection. ReXTrust, the white-box method in this benchmark, achieved the highest AUC on every model tested, with an average advantage of 19.5 AUC points over the other methods. If white-box access is not feasible, token-level gray-box statistics such as MaxEnt or MaxProb are the next best option. In either case, remain vigilant for "confident confabulation," which the study identifies as a systemic failure mode of existing detection approaches.
Key insights
White-box methods offer a significant advantage for detecting hallucinations in VLMs used for gastrointestinal endoscopy.
Principles
- White-box access consistently improves VLM hallucination detection.
- Token-level gray-box statistics are strong non-white-box alternatives.
- Confident confabulation is a systemic VLM failure mode.
Method
Benchmarking nine hallucination detection methods (black-box, gray-box, white-box) on the Gut-VLM dataset (4,392 VQA pairs) across five diverse VLMs to evaluate performance and identify failure modes.
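The AUC figures reported in this summary measure how well a detector's score ranks hallucinated answers above correct ones. As a minimal sketch of that metric, assuming each VQA pair has a binary hallucination label and a detector score where higher means "more likely hallucinated", AUC can be computed by pairwise comparison. The function name `roc_auc` and the toy data are illustrative and not taken from the study.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Return ROC AUC on a 0-100 scale.

    scores: detector outputs, higher = more likely hallucinated (assumption).
    labels: 1 for a hallucinated answer, 0 for a correct one (assumption).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    # Count the fraction of (hallucinated, correct) pairs that the detector
    # ranks in the right order; ties count as half a win.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return 100.0 * wins / (len(pos) * len(neg))


# Toy example: four answers, two of them hallucinated.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
toy_auc = roc_auc(toy_scores, toy_labels)  # 3 of 4 pairs ordered correctly
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration. A rank-based implementation would be preferable at the scale of several thousand test pairs.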
In practice
- Prioritize VLMs allowing white-box access for clinical deployment.
- Consider token-level gray-box methods if white-box access is unavailable.
- Be aware of "confident confabulation" in VLM outputs.
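To make the gray-box option concrete, here is a minimal sketch of two token-level uncertainty scores. The summary does not give the study's exact definitions of MaxEnt and MaxProb, so this assumes one common reading: MaxEnt is the largest per-token entropy in the generated answer, and MaxProb is the largest per-token negative log-probability, that is, the least confident token. All function names are hypothetical, and the inputs are assumed to be probabilities already extracted from the model's output logits.

```python
import math
from typing import List, Sequence


def token_entropy(dist: Sequence[float]) -> float:
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def max_entropy_score(dists: List[Sequence[float]]) -> float:
    """Assumed MaxEnt: the highest entropy over all generated positions.

    dists holds one probability distribution per generated token.
    """
    if not dists:
        raise ValueError("need at least one token distribution")
    return max(token_entropy(d) for d in dists)


def max_prob_score(chosen_probs: Sequence[float]) -> float:
    """Assumed MaxProb: the highest negative log-probability among the
    tokens the model actually emitted (its least confident token)."""
    if not chosen_probs:
        raise ValueError("need at least one token probability")
    return max(-math.log(p) for p in chosen_probs)


# Toy answer of two tokens over a two-word vocabulary.
toy_dists = [[1.0, 0.0], [0.5, 0.5]]  # second position is maximally uncertain
toy_chosen = [1.0, 0.5]               # probability of each emitted token
toy_maxent = max_entropy_score(toy_dists)
toy_maxprob = max_prob_score(toy_chosen)
```

Under these assumed definitions, a higher score flags an answer as more likely hallucinated. Both scores stay low when the model is confidently wrong, which is consistent with the "confident confabulation" failure mode noted above.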
Topics
- VLM Hallucination Detection
- Gastrointestinal Endoscopy
- Clinical AI
- ReXTrust
- Model Benchmarking
- Vision-Language Models
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.