A Benchmark for Hallucination Detection in VLMs for Gastrointestinal Endoscopy

2026-06-23 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research, Medical Devices & Health Technology · Depth: Expert, quick

Summary

A new benchmark study addresses the critical issue of hallucination detection in Vision-Language Models (VLMs) for gastrointestinal (GI) endoscopy, an area largely underexplored compared to radiology. Researchers evaluated nine hallucination detection methods across three categories (black-box, gray-box, and white-box) on the Gut-VLM dataset, which comprises 4,392 test VQA pairs. Five VLMs were tested: MedGemma-4B, MedGemma-27B, LLaVA-Med-7B, LLaVA-v1.6-7B, and Lingshu-32B. The white-box method, ReXTrust, consistently achieved the highest AUC, peaking at 93.0 on MedGemma-4B, and demonstrated a statistically significant advantage of 19.5 AUC points on average over other methods. Token-level gray-box statistics (MaxEnt, MaxProb) were identified as the strongest non-white-box alternatives. The study also highlighted "confident confabulation" as a systemic failure mode for existing detection approaches.

Key takeaway

AI scientists deploying VLMs in clinical settings, particularly for GI endoscopy, should prioritize models that allow white-box access for hallucination detection. Methods like ReXTrust detect hallucinations substantially better, with an average advantage of 19.5 AUC points over the alternatives. If white-box access is not feasible, token-level gray-box statistics such as MaxEnt or MaxProb are the next best option, but "confident confabulation" remains a persistent failure mode to watch for.

Key insights

White-box methods offer a significant advantage for detecting hallucinations in VLMs used for gastrointestinal endoscopy.

Principles

White-box access consistently improves VLM hallucination detection.
Token-level gray-box statistics are strong non-white-box alternatives.
Confident confabulation is a systemic VLM failure mode.
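
To make the token-level gray-box idea concrete, here is a minimal sketch. The summary does not define MaxEnt or MaxProb, so the formulas below are assumptions based on common usage in the uncertainty literature: MaxEnt is taken as the largest per-token entropy in the answer, and MaxProb as the largest negative log-probability of any generated token. The function names and toy inputs are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy_score(step_distributions):
    """Assumed MaxEnt: entropy of the most uncertain generation step."""
    return max(token_entropy(dist) for dist in step_distributions)


def max_prob_score(chosen_token_probs):
    """Assumed MaxProb: negative log-probability of the least confident token."""
    return max(-math.log(p) for p in chosen_token_probs)


# Toy answers: each inner list is the model's distribution at one step.
confident_answer = [[0.97, 0.02, 0.01], [0.99, 0.005, 0.005]]
uncertain_answer = [[0.97, 0.02, 0.01], [0.40, 0.35, 0.25]]

# Higher scores mean "more likely hallucinated" under this assumed convention.
print(max_entropy_score(confident_answer), max_entropy_score(uncertain_answer))
```

Both scores stay low whenever the model is sure of itself, including when it is sure and wrong. That is one plausible reading of why confident confabulation would slip past uncertainty-based detectors.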

Method

The study benchmarks nine hallucination detection methods (black-box, gray-box, and white-box) on the Gut-VLM dataset (4,392 test VQA pairs) across five VLMs to compare detection performance and identify failure modes.
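
As an illustration of how a detector could be scored in a benchmark like this, the sketch below computes AUC from per-answer detector scores. It assumes "AUC" means area under the ROC curve, that label 1 marks a hallucinated answer, and that higher scores mean "more likely hallucinated"; none of this is stated in the summary. The function name and toy data are hypothetical.

```python
def auroc(labels, scores):
    """Rank-based AUROC: chance that a hallucinated answer outscores a faithful one.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one hallucinated and one faithful example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two hallucinated (1) and two faithful (0) answers.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(100 * auroc(labels, scores))  # scaled to 0-100, as in the reported 93.0
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a few thousand VQA pairs; a sort-based variant would be the usual choice at larger scale.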

In practice

Prioritize VLMs allowing white-box access for clinical deployment.
Consider token-level gray-box methods if white-box access is unavailable.
Be aware of "confident confabulation" in VLM outputs.
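
The practical guidance above amounts to a simple preference order, sketched below. The mapping from access level to detector family is an assumption based on common usage (internals for white-box, token probabilities for gray-box, text only for black-box); the summary does not spell it out. All names here are hypothetical.

```python
from enum import Enum


class Access(Enum):
    OUTPUT_ONLY = 1   # only the generated text is available
    TOKEN_PROBS = 2   # per-token probabilities or logits are exposed
    INTERNALS = 3     # hidden states or activations can be read


def choose_detector_family(access: Access) -> str:
    """Pick the strongest detector family the deployment's access level allows."""
    if access is Access.INTERNALS:
        return "white-box"   # e.g. ReXTrust, the top performer in the benchmark
    if access is Access.TOKEN_PROBS:
        return "gray-box"    # e.g. MaxEnt or MaxProb token-level statistics
    return "black-box"       # fallback when only outputs are visible


print(choose_detector_family(Access.TOKEN_PROBS))
```

Whichever family is chosen, the confident-confabulation caveat still applies, so detector output is best treated as one signal rather than a guarantee.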

Topics

VLM Hallucination Detection
Gastrointestinal Endoscopy
Clinical AI
ReXTrust
Model Benchmarking
Vision-Language Models

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.