Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA
Summary
A new training-based framework addresses the overconfidence of multimodal large language models (MLLMs) in Medical Visual Question Answering (VQA), where existing calibration methods fail to account for multimodal input. The framework finetunes MLLMs with a composite loss function that includes a Brier-style calibration term, an anchor regularizer to prevent extreme confidence, a contrastive image-text alignment term, and a KL-based model stabilization term. The alignment signal is generated via a $2 \times 2$ factorial perturbation design that crosses image presence with text integrity to probe how much the model relies on the visual modality versus language priors. Additionally, a top-K KL divergence regularizer protects the model's answering ability during finetuning. Applied to three Medical VQA benchmarks and two architectures, MedGemma 4B IT and Qwen2 VL 7B Instruct, the method reduces calibration error by 60% or more and improves discrimination by 26% or more, while maintaining predictive accuracy. It consistently outperforms other prompting-based, sampling-based, and training-based approaches. All experimental code is publicly available.
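To make the $2 \times 2$ factorial design concrete, the following is a minimal sketch that enumerates the four cells obtained by crossing image presence with text integrity. The summary does not say how text integrity is broken or how the cells are turned into an alignment signal, so `corrupt_text` (word shuffling), `factorial_conditions`, and the example file name are hypothetical choices made only for illustration.

```python
import random
from itertools import product


def corrupt_text(question: str, seed: int = 0) -> str:
    """Hypothetical text perturbation: shuffle the words of the question."""
    words = question.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def factorial_conditions(image, question: str, seed: int = 0) -> dict:
    """Return the four (image, text) inputs of a 2x2 design.

    Keys are (image_present, text_intact) pairs. A missing image is
    represented as None; this representation is an assumption.
    """
    cells = {}
    for image_present, text_intact in product([True, False], repeat=2):
        img = image if image_present else None
        txt = question if text_intact else corrupt_text(question, seed)
        cells[(image_present, text_intact)] = (img, txt)
    return cells


question = "Is there a fracture in the left femur?"
cells = factorial_conditions("scan_001.png", question)
for condition, inputs in cells.items():
    print(condition, inputs)
```

Comparing the model's stated confidence across these cells is one way to see whether it changes when the image is removed (visual reliance) or stays the same (language prior). The exact comparison used by the authors is not given in this summary.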
Key takeaway
For AI Scientists and Machine Learning Engineers developing Medical VQA systems, this research offers a training-based method to address MLLM overconfidence. Finetuning with the proposed composite loss reduces calibration error by 60% or more and improves discrimination by 26% or more on the reported benchmarks, while preserving predictive accuracy. Consider adopting this approach to build more reliable and trustworthy medical AI applications, especially where diagnostic certainty is critical.
Key insights
Finetuning with a composite loss improves the verbalized uncertainty calibration of MLLMs in Medical VQA.
Principles
- MLLMs in Medical VQA are prone to overconfidence.
- Existing calibration methods do not account for multimodal input, so calibrating MLLMs must consider both image and text.
- Composite loss functions can balance multiple training objectives.
Method
Finetune MLLMs with a composite loss combining Brier-style calibration, anchor regularization, contrastive image-text alignment, and KL-based stabilization, plus a top-K KL divergence regularizer.
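The method above can be sketched per example as a weighted sum of terms. Only the Brier-style term has a standard form. The anchor value, the hinge used for the alignment term, the way the top-K KL is restricted and renormalized, the weights, and all function names are assumptions for illustration, not the authors' definitions. The sketch also folds the KL stabilization and the top-K KL regularizer into a single term, which is a simplification.

```python
import math


def brier_term(conf: float, correct: bool) -> float:
    """Squared gap between stated confidence and 0/1 correctness."""
    return (conf - float(correct)) ** 2


def anchor_term(conf: float, anchor: float = 0.5) -> float:
    """Assumed form: penalize distance from a mid-range anchor."""
    return (conf - anchor) ** 2


def alignment_term(conf_intact: float, conf_perturbed: float,
                   margin: float = 0.2) -> float:
    """Assumed hinge: confidence on intact input should exceed
    confidence on perturbed input by at least `margin`."""
    return max(0.0, margin - (conf_intact - conf_perturbed))


def topk_kl(ref_probs, new_probs, k: int = 2, eps: float = 1e-12) -> float:
    """KL(ref || new) restricted to the reference model's top-k tokens,
    with both distributions renormalized over that subset (assumed)."""
    top = sorted(range(len(ref_probs)),
                 key=lambda i: ref_probs[i], reverse=True)[:k]
    ref_mass = sum(ref_probs[i] for i in top)
    new_mass = max(sum(new_probs[i] for i in top), eps)
    kl = 0.0
    for i in top:
        p = ref_probs[i] / ref_mass
        q = max(new_probs[i] / new_mass, eps)
        if p > 0:
            kl += p * math.log(p / q)
    return kl


def composite_loss(conf, correct, conf_perturbed, ref_probs, new_probs,
                   weights=(1.0, 0.1, 0.5, 0.1)) -> float:
    """Weighted sum of the four terms; the weights are placeholders."""
    w_brier, w_anchor, w_align, w_kl = weights
    return (w_brier * brier_term(conf, correct)
            + w_anchor * anchor_term(conf)
            + w_align * alignment_term(conf, conf_perturbed)
            + w_kl * topk_kl(ref_probs, new_probs))


loss = composite_loss(conf=0.9, correct=True, conf_perturbed=0.85,
                      ref_probs=[0.7, 0.2, 0.1], new_probs=[0.6, 0.3, 0.1])
print(f"composite loss: {loss:.4f}")
```

In an actual finetuning run these terms would be computed on differentiable tensors in a deep learning framework. Plain floats are used here only to keep the arithmetic visible.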
In practice
- Apply $2 \times 2$ factorial perturbation for image-text alignment.
- Use a composite loss for MLLM finetuning.
- Evaluate calibration error and discrimination metrics.
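For the evaluation step, a minimal sketch of two commonly used metrics follows: expected calibration error (ECE) for calibration and AUROC for discrimination. The summary does not name the exact metrics the authors report, so treating "calibration error" as ECE and "discrimination" as AUROC is an assumption, as are the function names and the bin count.

```python
def expected_calibration_error(confs, correct, n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted mean of |mean confidence - accuracy|."""
    n = len(confs)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confs, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


def auroc(confs, correct) -> float:
    """Probability that a correct answer gets higher confidence than an
    incorrect one, counting ties as one half."""
    pos = [c for c, y in zip(confs, correct) if y]
    neg = [c for c, y in zip(confs, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


confs = [0.9, 0.8, 0.3, 0.2]   # verbalized confidences (made-up values)
correct = [1, 1, 0, 0]         # 1 if the answer was right
print("ECE:", expected_calibration_error(confs, correct))
print("AUROC:", auroc(confs, correct))
```

Lower ECE means stated confidence tracks accuracy more closely. Higher AUROC means confidence separates correct from incorrect answers better. The two can move independently, which is why the list above asks for both.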
Topics
- Medical VQA
- MLLM Calibration
- Uncertainty Quantification
- Composite Loss Functions
- Finetuning
- MedGemma
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.