Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA

Source: Computation and Language · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision, Natural Language Processing) · Depth: Expert, quick

Summary

A new training-based framework addresses the overconfidence of multimodal large language models (MLLMs) in Medical Visual Question Answering (VQA), where existing calibration methods fail to account for multimodal input. The framework finetunes MLLMs using a composite loss function that includes a Brier-style calibration term, an anchor regularizer to prevent extreme confidence, a contrastive image-text alignment term, and a KL-based model stabilization term. The alignment signal is generated via a $2 \times 2$ factorial perturbation design, crossing image presence with text integrity to probe reliance on the visual modality versus language priors. Additionally, a top-K KL divergence regularizer protects the model's answering ability during finetuning. Applied to three Medical VQA benchmarks and two architectures, MedGemma 4B IT and Qwen2 VL 7B Instruct, the method reduces calibration error by 60% or more and improves discrimination by 26% or more, while maintaining predictive accuracy. It consistently outperforms other prompting, sampling, and training-based approaches, and all experimental code is publicly available.
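
The $2 \times 2$ factorial design mentioned above can be illustrated with a minimal sketch that enumerates the four input conditions. The summary does not say how the image is removed or how text integrity is broken, so `perturb_text` (a word shuffle), the use of `None` for an absent image, and the function names are all assumptions made for illustration, not the authors' implementation.

```python
import itertools
import random


def perturb_text(question: str, seed: int = 0) -> str:
    """Hypothetical text corruption: shuffle the question's words."""
    words = question.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def factorial_conditions(image, question: str) -> dict:
    """Build the four (image, text) inputs of a 2x2 design.

    Keys are (image_present, text_intact) pairs. None stands in for
    a removed image; this placeholder is an assumption.
    """
    conditions = {}
    for image_present, text_intact in itertools.product([True, False], repeat=2):
        img = image if image_present else None
        txt = question if text_intact else perturb_text(question)
        conditions[(image_present, text_intact)] = (img, txt)
    return conditions
```

Comparing the model's stated confidence across these four cells is one way to check whether confidence tracks the image or only the language prior.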

Key takeaway

For AI scientists and machine learning engineers developing Medical VQA systems, this research offers a training-based method for addressing MLLM overconfidence. Finetuning with the proposed composite loss reduced calibration error by 60% or more and improved discrimination by 26% or more across the reported benchmarks and architectures, while maintaining predictive accuracy. Consider adopting this approach when building medical AI applications in which stated confidence must be trustworthy, especially where diagnostic certainty is critical.
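
To make "calibration error" and "discrimination" concrete, the sketch below computes two commonly used measures from verbalized confidences and correctness labels: expected calibration error (ECE) with equal-width bins, and AUROC of confidence against correctness. The summary does not name the exact metrics used, so treating them as ECE and AUROC is an assumption, as are the function names and the default of 10 bins.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean |avg confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


def auroc(confidences, correct):
    """Probability that a correct answer gets higher confidence than an incorrect one."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    # Ties count as half a win, matching the usual rank-based definition.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A model that always says "90% sure" but is right half the time has a large ECE and an AUROC of 0.5, which is the overconfident, non-discriminating pattern the framework targets.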

Key insights

Finetuning with a composite loss improves the calibration of MLLMs' verbalized uncertainty in Medical VQA.

Principles

Method

Finetune MLLMs with a composite loss combining Brier calibration, anchor regularization, contrastive image-text alignment, and KL stabilization, plus a top-K KL divergence regularizer.
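
As a rough illustration of how such terms might combine for a single example, here is a minimal sketch in plain Python. Only the Brier term follows a standard definition. The anchor value of 0.5, the margin-based stand-in for the contrastive alignment term, the renormalized top-K KL, the weights, and every function name are assumptions chosen for illustration; the summary does not give the exact formulas.

```python
import math


def brier_term(confidence, correct):
    """Squared gap between stated confidence and the 0/1 correctness label."""
    return (confidence - float(correct)) ** 2


def anchor_term(confidence, anchor=0.5):
    """Assumed form: penalize drift toward extreme confidence values."""
    return (confidence - anchor) ** 2


def alignment_term(conf_clean, conf_perturbed, margin=0.2):
    """Assumed hinge: clean-input confidence should exceed perturbed-input confidence."""
    return max(0.0, margin - (conf_clean - conf_perturbed))


def top_k_kl(ref_probs, model_probs, k=2, eps=1e-12):
    """KL(ref || model) restricted to the reference model's top-k tokens, renormalized."""
    top = sorted(range(len(ref_probs)), key=lambda i: ref_probs[i], reverse=True)[:k]
    ref_mass = sum(ref_probs[i] for i in top)
    model_mass = max(sum(model_probs[i] for i in top), eps)
    kl = 0.0
    for i in top:
        p = ref_probs[i] / ref_mass
        q = max(model_probs[i] / model_mass, eps)
        if p > 0:
            kl += p * math.log(p / q)
    return kl


def composite_loss(conf_clean, correct, conf_perturbed, ref_probs, model_probs,
                   weights=(1.0, 0.1, 0.5, 0.1), k=2):
    """Weighted sum of the four terms; the weights are hypothetical."""
    w_brier, w_anchor, w_align, w_kl = weights
    return (w_brier * brier_term(conf_clean, correct)
            + w_anchor * anchor_term(conf_clean)
            + w_align * alignment_term(conf_clean, conf_perturbed)
            + w_kl * top_k_kl(ref_probs, model_probs, k))
```

In this sketch the KL term is zero when the finetuned answer distribution matches the reference model on its top-K tokens, which is the sense in which it would protect answering ability while the confidence terms are optimized.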

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.