CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine
Summary
The CLinical Evaluation of Ambiguity and Reliability (CLEAR) framework assesses how decision-space presentation, ambiguity, and uncertainty affect the medical reasoning of Large Language Models (LLMs). The framework systematically perturbs three things: the number of plausible answer options, the presence of a ground truth or abstention option, and the semantic framing of the answer options. Applied to 17 LLMs on three medical benchmarks (MedMCQA, MedQA, and JAMA Clinical Challenges), CLEAR shows that increasing the number of plausible answers degrades a model's ability both to identify correct answers and to abstain from incorrect ones. This lack of caution intensifies when the abstention framing shifts from "None of the Above" to "I don't know" (IDK), and the mere presence of an IDK option increases incorrect selections. The study formalizes a "humility deficit" as the performance gap between identifying correct answers and abstaining from incorrect ones, and notes that the deficit worsens with model scale. These findings expose significant limitations in standard medical benchmarks and indicate that scaling alone does not resolve LLM reliability issues in complex, real-world medical scenarios.
Key takeaway
AI scientists developing medical LLMs should move beyond simplified, exam-style benchmarks. Evaluation paradigms should incorporate the CLEAR framework's systematic perturbations, including varied distractor counts and nuanced abstention options such as "I don't know" or "I need assistance." Doing so reveals critical "humility deficits" and helps prevent the deployment of models that guess aggressively rather than abstain cautiously, which is crucial for patient safety in real-world clinical settings.
Key insights
LLMs exhibit a "humility deficit" in medical contexts, struggling with ambiguity and preferring incorrect answers over admitting uncertainty.
Principles
- LLM reliability degrades with increased plausible distractors.
- Abstention framing impacts LLM caution and accuracy.
- Model scaling does not resolve reliability or humility issues.
Method
The CLEAR framework systematically perturbs medical benchmarks by varying distractor count, ground truth/abstention options, and semantic framing of answers to evaluate LLM reliability.
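The perturbations described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `MCQItem` type, the `perturb` function, its parameters, and the rule that the abstention option becomes the correct choice when the ground truth is removed are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MCQItem:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: List[str]
    answer: str  # text of the option counted as correct


def perturb(item: MCQItem,
            distractor_pool: List[str],
            n_distractors: int,
            keep_truth: bool = True,
            abstain_label: Optional[str] = "None of the Above",
            seed: int = 0) -> MCQItem:
    """Build a perturbed variant of `item` (illustrative, assumed design).

    - n_distractors: how many plausible wrong options to show
    - keep_truth:    whether the ground-truth answer stays in the option set
    - abstain_label: framing of the abstention option, or None to omit it
    """
    rng = random.Random(seed)
    pool = [d for d in distractor_pool if d != item.answer]
    if n_distractors > len(pool):
        raise ValueError("not enough distractors in the pool")

    options = rng.sample(pool, n_distractors)
    if keep_truth:
        options.append(item.answer)
    rng.shuffle(options)

    # The abstention option is appended last so its position is stable.
    if abstain_label is not None:
        options.append(abstain_label)

    # Assumption: with the truth removed, abstaining is the correct choice.
    if keep_truth:
        target = item.answer
    elif abstain_label is not None:
        target = abstain_label
    else:
        raise ValueError("no correct option would remain")

    return MCQItem(item.question, options, target)
```

Calling `perturb(item, pool, 3, keep_truth=False, abstain_label="I don't know")` would, under these assumptions, yield a four-option item whose only correct choice is the IDK option, which is the kind of variant used to probe whether a model abstains or guesses.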
In practice
- Evaluate LLMs with diverse abstention options like "I don't know."
- Test LLMs on real-world, ambiguous clinical scenarios.
- Prioritize humility metrics alongside accuracy in medical LLM development.
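One way to track a humility metric alongside accuracy is sketched below. It follows the summary's description of the humility deficit as the gap between identifying correct answers and abstaining from incorrect ones; the exact formula used in the paper may differ, and the function name `humility_deficit` and the record layout are hypothetical.

```python
from typing import Iterable, Tuple


def humility_deficit(records: Iterable[Tuple[bool, bool]]) -> float:
    """Gap between answer accuracy and abstention rate (assumed definition).

    Each record is (truth_present, model_was_correct):
    - truth_present=True:  correct means the model picked the ground truth
    - truth_present=False: correct means the model chose the abstention option
    """
    records = list(records)
    present = [ok for truth_present, ok in records if truth_present]
    absent = [ok for truth_present, ok in records if not truth_present]
    if not present or not absent:
        raise ValueError("need items both with and without the ground truth")

    accuracy = sum(present) / len(present)        # identifying correct answers
    abstention_rate = sum(absent) / len(absent)   # abstaining when none is correct
    return accuracy - abstention_rate


# Toy data, invented for illustration only (not results from the study).
toy = [(True, True), (True, True), (True, False), (True, True),
       (False, True), (False, False), (False, False), (False, False)]
print(humility_deficit(toy))  # 0.75 - 0.25 = 0.5
```

A larger positive value means the model is much better at picking a listed correct answer than at recognizing that no listed answer is correct.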
Topics
- LLMs for Medicine
- Medical Benchmarking
- CLEAR Framework
- Model Reliability
- Epistemic Humility
Best for: AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.