Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA

2026-06-25 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision, Natural Language Processing · Depth: Expert, quick

Summary

A new training-based framework targets the overconfidence of multimodal large language models (MLLMs) in Medical Visual Question Answering (VQA), where existing calibration methods fail to account for multimodal input. The framework finetunes MLLMs with a composite loss that combines a Brier-style calibration term, an anchor regularizer that discourages extreme confidence, a contrastive image-text alignment term, and a KL-based model stabilization term. The alignment signal comes from a $2 \times 2$ factorial perturbation design that crosses image presence with text integrity to probe how much the model relies on the visual modality versus language priors. Additionally, a top-K KL divergence regularizer protects the model's answering ability during finetuning. Evaluated on three Medical VQA benchmarks and two architectures, MedGemma 4B IT and Qwen2 VL 7B Instruct, the method reduces calibration error by 60% or more and improves discrimination by 26% or more while maintaining predictive accuracy. It consistently outperforms prompting, sampling, and other training-based approaches, and all experimental code is publicly available.
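The factorial design described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: the summary does not say how text integrity is perturbed, so `shuffle_words` (word-order shuffling) is a hypothetical stand-in, and `Condition` and `build_conditions` are names invented here.

```python
import random
from dataclasses import dataclass
from itertools import product
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class Condition:
    """One cell of the 2 x 2 design (hypothetical name)."""
    image_present: bool
    text_intact: bool


def shuffle_words(question: str, seed: int = 0) -> str:
    # Assumption: corrupt the text by shuffling word order.
    # The source does not specify the actual corruption used.
    words = question.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def build_conditions(
    image: Any, question: str, seed: int = 0
) -> Dict[Condition, Tuple[Optional[Any], str]]:
    """Return the four (image, question) variants of one VQA sample."""
    variants = {}
    for image_present, text_intact in product([True, False], repeat=2):
        variants[Condition(image_present, text_intact)] = (
            image if image_present else None,
            question if text_intact else shuffle_words(question, seed),
        )
    return variants
```

Comparing the model's stated confidence across the four cells gives a rough read on whether confidence tracks the image or only the language prior.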

Key takeaway

For AI scientists and machine learning engineers building Medical VQA systems, this research offers a training-based method for reducing MLLM overconfidence. Finetuning with the proposed composite loss reduces calibration error by 60% or more and improves discrimination by 26% or more while preserving predictive accuracy. Consider adopting this approach to build more reliable medical AI applications, especially where diagnostic certainty is critical.

Key insights

Finetuning with a composite loss improves MLLM uncertainty calibration in Medical VQA.

Principles

MLLMs in Medical VQA are prone to overconfidence.
Calibration methods for MLLMs must account for multimodal input, which existing methods fail to do.
Composite loss functions can balance multiple training objectives.

Method

Finetune MLLMs with a composite loss combining Brier-style calibration, anchor regularization, contrastive image-text alignment, and KL-based stabilization, plus a top-K KL divergence regularizer.
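To make the shape of such a loss concrete, here is a minimal per-sample sketch in plain Python. Everything beyond the four term names is an assumption: the summary gives no formulas, so the anchor value of 0.5, the hinge-style alignment term, the top-K renormalization, and the weights are all hypothetical choices, not the paper's definitions.

```python
import math
from typing import Sequence


def brier_term(confidence: float, correct: bool) -> float:
    # Squared gap between stated confidence and the 0/1 correctness label.
    return (confidence - float(correct)) ** 2


def anchor_term(confidence: float, anchor: float = 0.5) -> float:
    # Assumption: pull confidence toward a fixed anchor to avoid extremes.
    return (confidence - anchor) ** 2


def alignment_term(conf_full: float, conf_perturbed: float,
                   margin: float = 0.2) -> float:
    # Assumption: hinge penalty when confidence on the intact input does not
    # exceed confidence on a perturbed input by at least `margin`.
    return max(0.0, margin - (conf_full - conf_perturbed))


def top_k_kl(ref_probs: Sequence[float], new_probs: Sequence[float],
             k: int = 3, eps: float = 1e-12) -> float:
    # Assumption: KL(ref || new) restricted to the reference model's top-k
    # tokens, with both distributions renormalized over that subset.
    top = sorted(range(len(ref_probs)), key=lambda i: ref_probs[i],
                 reverse=True)[:k]
    ref_mass = sum(ref_probs[i] for i in top)
    new_mass = max(sum(new_probs[i] for i in top), eps)
    kl = 0.0
    for i in top:
        p = ref_probs[i] / ref_mass
        q = max(new_probs[i] / new_mass, eps)
        if p > 0.0:
            kl += p * math.log(p / q)
    return kl


def composite_loss(conf_full: float, correct: bool, conf_perturbed: float,
                   ref_probs: Sequence[float], new_probs: Sequence[float],
                   weights=(1.0, 0.1, 0.5, 0.1)) -> float:
    # Hypothetical weights; a real setup would tune them.
    w_cal, w_anchor, w_align, w_kl = weights
    return (w_cal * brier_term(conf_full, correct)
            + w_anchor * anchor_term(conf_full)
            + w_align * alignment_term(conf_full, conf_perturbed)
            + w_kl * top_k_kl(ref_probs, new_probs))
```

In an actual finetuning run these terms would be computed on tensors with gradients; the scalar version here only shows how the pieces combine.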

In practice

Apply $2 \times 2$ factorial perturbation for image-text alignment.
Use a composite loss for MLLM finetuning.
Evaluate calibration error and discrimination metrics.
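For the evaluation step, a minimal sketch follows. The summary names "calibration error" and "discrimination" without specifying metrics, so expected calibration error (ECE) with equal-width bins and AUROC are assumed here as common choices; they may differ from what the paper reports.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """ECE: weighted mean |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def auroc(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Probability that a correct answer gets higher confidence than a wrong one."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A model that says 0.9 on every answer but is right half the time has high ECE and an AUROC of 0.5, which is the overconfidence pattern the framework is meant to reduce.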

Topics

Medical VQA
MLLM Calibration
Uncertainty Quantification
Composite Loss Functions
Finetuning
MedGemma

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.