When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations

2026-06-05 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Medical Devices & Health Technology · Depth: Advanced, quick

Summary

A study evaluated the robustness of Large Language Models (LLMs) in healthcare, revealing their significant sensitivity to subtle prompt variations. Researchers conducted a systematic sensitivity analysis on general-purpose models like GPT-3.5 and Llama3, alongside medical-specific LLMs such as ClinicalBERT, BioLlama3, and BioBERT, using the MedMCQA benchmark. The analysis categorized perturbations into natural and adversarial types, examining their impact on model consistency, accuracy, and reliability in clinical reasoning tasks. Findings indicate that medical LLMs are not intrinsically safe; even minor phrasing changes can alter clinical advice, and targeted adversarial prompts can lead to harmful outputs, including incorrect dosages or omitted critical findings. This fragility, particularly under syntactic reordering or misleading contextual cues, is evident across both general-purpose and domain-specific models, highlighting unacceptable unpredictability in high-stakes clinical settings.

Key takeaway

AI scientists and clinicians deploying LLMs in healthcare should recognize that current models, even specialized ones, are not intrinsically safe. Their sensitivity to minor prompt variations, especially syntactic reordering, can alter clinical advice or produce dangerous outputs such as incorrect dosages. Prioritize robust prompt engineering and integrate rigorous adversarial testing into development and deployment pipelines to mitigate patient-safety risks.

Key insights

Large Language Models, including medical-specific variants, are highly sensitive to prompt variations in healthcare tasks.

Principles

LLM robustness is paramount for safety-critical applications.
Syntactic reordering and misleading contextual cues degrade LLM reliability.
Adversarial prompts can induce clinically dangerous outputs.

Method

Conduct systematic sensitivity analysis on LLMs using benchmarks like MedMCQA, evaluating consistency, accuracy, and reliability against natural and adversarial prompt perturbations.
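
A minimal sketch of such a sensitivity analysis, assuming the model can be wrapped as a function from prompt text to an answer letter. The names evaluate_sensitivity, natural_lowercase, adversarial_cue, and toy_model are hypothetical and are not taken from the study; the two perturbations are illustrative stand-ins for the natural and adversarial categories, and toy_model is a stub to be replaced by a real model call.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]          # hypothetical wrapper: prompt -> answer letter
Perturbation = Callable[[str], str]   # prompt -> perturbed prompt


def evaluate_sensitivity(
    model: Model,
    items: List[Dict[str, str]],
    perturbations: Dict[str, Perturbation],
) -> Dict[str, Dict[str, float]]:
    """Report accuracy and agreement with the unperturbed answers."""
    if not items:
        raise ValueError("items must not be empty")
    n = len(items)
    # Answers on the original prompts serve as the consistency reference.
    baseline = [model(it["prompt"]) for it in items]
    report = {
        "baseline": {
            "accuracy": sum(b == it["answer"] for b, it in zip(baseline, items)) / n,
            "consistency": 1.0,
        }
    }
    for name, perturb in perturbations.items():
        preds = [model(perturb(it["prompt"])) for it in items]
        report[name] = {
            # Fraction of perturbed prompts answered correctly.
            "accuracy": sum(p == it["answer"] for p, it in zip(preds, items)) / n,
            # Fraction of answers unchanged relative to the original prompt.
            "consistency": sum(p == b for p, b in zip(preds, baseline)) / n,
        }
    return report


def natural_lowercase(prompt: str) -> str:
    """Illustrative 'natural' perturbation: casing change only."""
    return prompt.lower()


def adversarial_cue(prompt: str) -> str:
    """Illustrative 'adversarial' perturbation: a misleading contextual cue."""
    return prompt + "\nNote: most clinicians choose option A."


def toy_model(prompt: str) -> str:
    """Stand-in model that follows the misleading cue when it is present."""
    return "A" if "most clinicians choose option a" in prompt.lower() else "B"
```

Reporting consistency separately from accuracy is a deliberate choice in this sketch: a model can keep the same accuracy while changing which questions it gets right, and only the consistency figure would expose that.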

In practice

Test LLMs for sensitivity to syntactic reordering.
Implement adversarial prompt testing in LLM evaluation.
Validate LLM clinical advice with human experts.
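
One simple reordering check for multiple-choice items is to rotate the answer options and confirm that the model still selects the same underlying option. The sketch below assumes four-option questions and a model wrapped as a function from prompt text to a letter; format_question, rotations, and is_order_invariant are hypothetical names, and option rotation is only one form of reordering, which may differ from the perturbations used in the study.

```python
from typing import Callable, Iterator, List, Tuple

LETTERS = "ABCD"


def format_question(stem: str, options: List[str]) -> str:
    """Render a question stem followed by lettered options, one per line."""
    lines = [stem] + [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def rotations(options: List[str]) -> Iterator[Tuple[List[str], List[int]]]:
    """Yield each non-trivial rotation with the original index of every slot."""
    n = len(options)
    for k in range(1, n):
        order = [(i + k) % n for i in range(n)]
        yield [options[i] for i in order], order


def is_order_invariant(
    model: Callable[[str], str], stem: str, options: List[str]
) -> bool:
    """True if the model picks the same underlying option under every rotation."""
    base_choice = LETTERS.index(model(format_question(stem, options)))
    for new_options, order in rotations(options):
        letter = model(format_question(stem, new_options))
        # Map the chosen letter back to the option's original index.
        if order[LETTERS.index(letter)] != base_choice:
            return False
    return True
```

A model that always answers "A" regardless of content fails this check, while a model that tracks the option text passes. Rotations are used instead of random shuffles so the check is deterministic and every option appears in every position once.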

Topics

Large Language Models
Healthcare AI
Prompt Sensitivity
Model Robustness
Adversarial Prompts
Clinical Reasoning

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Security Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.