CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

2026-05-05 · Source: cs.CL updates on arXiv.org · Field: Health & Wellbeing — Artificial Intelligence & Machine Learning, Medical Devices & Health Technology, Clinical Care & Medical Practice · Depth: Expert, extended

Summary

The CLinical Evaluation of Ambiguity and Reliability (CLEAR) framework assesses how decision-space presentation, ambiguity, and uncertainty affect the medical reasoning of large language models (LLMs). The framework systematically perturbs three things: the number of plausible answer options, the presence of a ground truth or abstention option, and the semantic framing of answer options. Applied to 17 LLMs on three medical benchmarks (MedMCQA, MedQA, and JAMA Clinical Challenges), CLEAR shows that increasing the number of plausible answers degrades a model's ability both to identify correct answers and to abstain from incorrect ones. This lack of caution intensifies when the abstention framing shifts from "None of the Above" to "I don't know" (IDK), and the mere presence of an IDK option increases incorrect selections. The study formalizes a "humility deficit" as the performance gap between identifying correct answers and abstaining from incorrect ones, and reports that the deficit worsens with model scale. These findings expose significant limitations in standard medical benchmarks and indicate that scaling alone does not resolve LLM reliability issues in complex, real-world medical scenarios.
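
As a rough illustration of the "humility deficit", the sketch below computes the gap under one assumed reading: accuracy on items where the ground truth is among the options, minus the abstention rate on items where it has been removed. The summary does not give the exact formula, so the `Trial` record, its field names, and this particular subtraction are assumptions for illustration, not the paper's definition.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Trial:
    """One model response to one perturbed question (hypothetical record)."""
    has_ground_truth: bool  # True if the correct answer was among the options
    chose_correct: bool     # model selected the correct option
    chose_abstain: bool     # model selected the abstention option


def humility_deficit(trials: Iterable[Trial]) -> float:
    """Assumed form: accuracy (truth present) minus abstention rate (truth absent)."""
    trials = list(trials)
    with_gt = [t for t in trials if t.has_ground_truth]
    without_gt = [t for t in trials if not t.has_ground_truth]
    if not with_gt or not without_gt:
        raise ValueError("need trials from both conditions")
    # How often the model finds the right answer when one is offered.
    accuracy = sum(t.chose_correct for t in with_gt) / len(with_gt)
    # How often the model abstains when no offered answer is right.
    abstention = sum(t.chose_abstain for t in without_gt) / len(without_gt)
    return accuracy - abstention
```

Under this reading, a value near zero would mean a model abstains about as reliably as it answers, while a large positive value indicates guessing when no correct option is available.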

Key takeaway

AI scientists developing medical LLMs need to move beyond simplified, exam-style benchmarks. Evaluation paradigms should incorporate the CLEAR framework's systematic perturbations, including varied distractor counts and nuanced abstention options such as "I don't know" or "I need assistance." Doing so reveals "humility deficits" and helps prevent the deployment of models that guess aggressively rather than abstain cautiously, a distinction that is crucial for patient safety in real-world clinical settings.

Key insights

LLMs exhibit a "humility deficit" in medical contexts, struggling with ambiguity and preferring incorrect answers over admitting uncertainty.

Principles

LLM reliability degrades as the number of plausible distractors increases.
Abstention framing impacts LLM caution and accuracy.
Model scaling does not resolve reliability or humility issues.

Method

The CLEAR framework systematically perturbs medical benchmarks by varying distractor count, ground truth/abstention options, and semantic framing of answers to evaluate LLM reliability.
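
The perturbations described above can be sketched as a single function that rebuilds the option list for one multiple-choice item. This is a minimal sketch, not the authors' implementation: the function name `perturb_item`, its parameters, the returned dictionary layout, and the choice to treat the abstention option as the target when the ground truth is removed are all assumptions made for illustration.

```python
import random
from typing import Optional


def perturb_item(
    question: str,
    correct: str,
    distractors: list[str],
    n_distractors: int,
    include_ground_truth: bool = True,
    abstention_label: Optional[str] = None,  # e.g. "None of the Above" or "I don't know"
    seed: int = 0,
) -> dict:
    """Build one perturbed multiple-choice item (hypothetical helper)."""
    if n_distractors > len(distractors):
        raise ValueError("not enough distractors available")
    rng = random.Random(seed)  # seeded so each variant is reproducible
    # Vary the number of plausible-but-wrong options.
    options = rng.sample(distractors, n_distractors)
    # Keep or remove the ground-truth answer.
    if include_ground_truth:
        options.append(correct)
    rng.shuffle(options)
    # Add the abstention option last, with the chosen semantic framing.
    if abstention_label is not None:
        options.append(abstention_label)
    # Assumption: with no ground truth present, abstaining is the desired choice.
    # The target is None when neither the ground truth nor an abstention is offered.
    target = correct if include_ground_truth else abstention_label
    return {"question": question, "options": options, "target": target}
```

Calling the function repeatedly on the same item with different `n_distractors`, `include_ground_truth`, and `abstention_label` values yields a family of variants whose results can be compared side by side.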

In practice

Evaluate LLMs with diverse abstention options like "I don't know."
Test LLMs on real-world, ambiguous clinical scenarios.
Prioritize humility metrics alongside accuracy in medical LLM development.
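
One way to act on these recommendations is to report the incorrect-selection rate separately for each abstention framing, rather than a single pooled accuracy. The sketch below assumes each response has already been labeled as "correct", "abstain", or "incorrect"; the function name and the record format are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from typing import Iterable

VALID_OUTCOMES = {"correct", "abstain", "incorrect"}


def incorrect_rate_by_framing(records: Iterable[tuple[str, str]]) -> dict[str, float]:
    """Map each abstention framing to the share of responses that were incorrect.

    `records` is an iterable of (framing, outcome) pairs, where framing is the
    abstention label shown to the model and outcome is one of VALID_OUTCOMES.
    """
    totals: dict[str, int] = defaultdict(int)
    incorrect: dict[str, int] = defaultdict(int)
    for framing, outcome in records:
        if outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        totals[framing] += 1
        if outcome == "incorrect":
            incorrect[framing] += 1
    # Rate per framing, so different framings can be compared directly.
    return {framing: incorrect[framing] / totals[framing] for framing in totals}
```

Reporting the rate per framing keeps an effect like the one described for "I don't know" from being averaged away in an overall score.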

Topics

LLMs for Medicine
Medical Benchmarking
CLEAR Framework
Model Reliability
Epistemic Humility

Best for: AI Scientist, Research Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.