Multimodal foundation models exploit text to make medical image predictions

Source: Machine learning : nature.com subject feeds · Field: Health & Wellbeing (Artificial Intelligence & Machine Learning, Medical Devices & Health Technology, Clinical Care & Medical Practice) · Depth: Expert, short

Summary

A study evaluated 8 proprietary and open-source multimodal foundation models on 1090 medical cases to understand how they integrate image and text data when interpreting medical images. The researchers found that the models' image predictions are predominantly influenced by text, with accuracy improving as the amount of informative text increases. This reliance is a "double-edged sword": even minor misleading textual suggestions can drastically reduce image-based classification performance. For instance, accuracy fell from 84% to 28% when a misleading clinical vignette was introduced. Furthermore, in physician evaluations of 69 long-form clinicopathological conferences, adding images to highly informative text either reduced or failed to improve diagnostic accuracy for models such as GPT-4V. The findings indicate that while multimodal AI models hold promise for medical diagnostics, their performance is heavily, and sometimes detrimentally, driven by textual input.
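To put the reported drop in perspective, the short calculation below works out its size in absolute and relative terms. It uses only the two accuracy figures quoted in the summary; the variable names are illustrative.

```python
# Accuracy figures quoted in the summary, in percent.
baseline_pct = 84   # accuracy without the misleading vignette
misled_pct = 28     # accuracy with the misleading vignette

# Absolute drop in percentage points.
absolute_drop_pts = baseline_pct - misled_pct

# Relative drop: the share of the original accuracy that was lost.
relative_drop = absolute_drop_pts / baseline_pct

print(f"Absolute drop: {absolute_drop_pts} percentage points")
print(f"Relative drop: {relative_drop:.1%}")
```

In other words, the misleading vignette removed about two thirds of the original accuracy.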

Key takeaway

AI scientists and machine learning engineers developing multimodal medical diagnostic tools should critically assess the influence of textual input. Model accuracy is highly sensitive to text quality: a misleading clinical vignette can drastically reduce performance. Prioritize robust text validation, and consider scenarios where image input adds no value, or even degrades results, because the text is already highly informative.

Key insights

Multimodal medical AI models rely heavily on text, which can either enhance or degrade image-based diagnostic accuracy.

Method

Researchers evaluated 8 multimodal foundation models on 1090 medical cases. They measured how predictions changed as the accompanying text became more informative and when it was misleading, and they assessed whether adding images changed accuracy on cases where the text was already highly informative.
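The comparison described above can be sketched as a small harness that scores the same images under different text conditions. This is a minimal illustration under stated assumptions, not the study's actual code: the `Case` structure, the condition names, the `Model` callable interface, and the toy `text_following_stub` model are all hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class Case(NamedTuple):
    image_id: str                            # hypothetical handle for the image
    label: str                               # ground-truth diagnosis
    text_variants: Dict[str, Optional[str]]  # condition name -> text (None = image only)


# Hypothetical model interface: (image handle, optional text) -> predicted label.
Model = Callable[[str, Optional[str]], str]


def accuracy_by_condition(model: Model, cases: List[Case]) -> Dict[str, float]:
    """Score the same images under each text condition and return accuracy per condition."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for case in cases:
        for condition, text in case.text_variants.items():
            prediction = model(case.image_id, text)
            total[condition] = total.get(condition, 0) + 1
            correct[condition] = correct.get(condition, 0) + int(prediction == case.label)
    return {condition: correct[condition] / total[condition] for condition in total}


def text_following_stub(image_id: str, text: Optional[str]) -> str:
    """Toy stand-in for a text-dominated model: echoes any diagnosis named in the text."""
    if text and "suggests " in text:
        return text.split("suggests ", 1)[1].rstrip(".")
    return "pneumonia"  # fixed guess when no text is given


# Two made-up cases, each with an image-only, informative, and misleading variant.
cases = [
    Case("img-001", "pneumonia", {
        "image_only": None,
        "informative": "History suggests pneumonia.",
        "misleading": "History suggests pneumothorax.",
    }),
    Case("img-002", "pneumothorax", {
        "image_only": None,
        "informative": "History suggests pneumothorax.",
        "misleading": "History suggests pneumonia.",
    }),
]

results = accuracy_by_condition(text_following_stub, cases)
for condition, acc in results.items():
    print(f"{condition}: {acc:.0%}")
```

Holding the images fixed and varying only the text isolates the text's contribution: any gap between conditions comes from the text alone. In practice the stub would be replaced by a wrapper around a real multimodal model.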

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine learning : nature.com subject feeds.