Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text
Summary
A new benchmark evaluates how well large language models (LLMs) preserve diagnostic uncertainty in clinical text, a requirement for safe clinical workflows. The benchmark comprises 1,200 clinical documents with 9,184 proposition-level uncertainty annotations across five distinct levels. Researchers evaluated three LLMs (gpt-oss-120b, gemini-2.5-flash, and claude-haiku-4.5) on two tasks: clinician handoff summarization and patient-friendly revision. The models preserved the original uncertainty cues poorly, often in fewer than half of cases, and struggled to distinguish adjacent uncertainty levels. They also showed a systematic bias towards certainty assertion: 37.88% to 44.96% of retained targets were rewritten as definite claims. "Guarded" prompting improved uncertainty retention by 7.08 to 19.19 percentage points, but the best performance reached only 62.53% URR, so the problem persists even with explicit instructions.
Key takeaway
NLP engineers building clinical LLM applications should evaluate diagnostic uncertainty preservation explicitly. Fluency and factual-consistency metrics are not sufficient, because LLMs systematically convert uncertain clinical statements into definite assertions, which risks patient harm. Add dedicated uncertainty preservation metrics and consider fine-tuning or reward signals, since prompt engineering alone offers only partial mitigation.
Key insights
LLMs systematically distort diagnostic uncertainty in clinical text, often converting nuanced statements into definite assertions.
Principles
- Clinical meaning depends on both the condition named and the certainty with which it is expressed.
- LLMs show a systematic bias towards certainty assertion.
- Explicit instructions alone do not prevent uncertainty distortion.
Method
The study constructed a 1,200-document benchmark with 9,184 proposition-level uncertainty annotations across five levels. It then evaluated three LLMs using indirect (text transformation) and direct (classification/ranking) assessments.
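A direct (classification) assessment of this kind can be scored with a few lines of Python. The sketch below is illustrative only and is not the study's evaluation code: the integer encoding of the five levels (0 to 4), the function name `direct_assessment_scores`, and the "adjacent error share" statistic are all assumptions introduced here to show how confusion between neighbouring levels might be measured.

```python
from collections import Counter

# The benchmark uses five uncertainty levels; encoding them as the
# integers 0..4 (ordered by certainty) is an assumption of this sketch.
NUM_LEVELS = 5


def direct_assessment_scores(gold, predicted):
    """Score predicted uncertainty levels against gold annotations.

    Returns exact-match accuracy and the share of errors that are off
    by exactly one level (confusion between adjacent levels).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one labelled proposition")
    for level in list(gold) + list(predicted):
        if not 0 <= level < NUM_LEVELS:
            raise ValueError(f"level out of range: {level}")

    # Count how far each prediction is from its gold level.
    distances = Counter(abs(g - p) for g, p in zip(gold, predicted))
    total = len(gold)
    errors = total - distances[0]
    return {
        "accuracy": distances[0] / total,
        "adjacent_error_share": distances[1] / errors if errors else 0.0,
    }


if __name__ == "__main__":
    # Toy labels, not data from the benchmark.
    print(direct_assessment_scores([0, 1, 2, 3, 4], [0, 2, 2, 4, 1]))
```

A high adjacent error share would suggest that a model ranks propositions roughly correctly but cannot separate neighbouring levels, which is the failure pattern the summary describes.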
In practice
- Evaluate LLMs for uncertainty preservation as a distinct metric.
- Be aware of LLM bias towards certainty assertion in clinical outputs.
- Consider fine-tuning or reward signals beyond prompt engineering.
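The first two recommendations above can be turned into a small tracking metric. The following is a minimal sketch under stated assumptions, not the paper's metric definitions: the `Target` record, the 0 to 4 level encoding with 4 as "definite", and the exact numerators and denominators of both rates are hypothetical choices made for illustration. In particular, the retention rate here is not guaranteed to match the benchmark's URR.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed encoding: 0 = least certain ... 4 = definite.
DEFINITE = 4


@dataclass
class Target:
    """One annotated proposition and what happened to it in the output."""
    source_level: int            # annotated level in the source document
    output_level: Optional[int]  # level in the model output, None if dropped


def preservation_metrics(targets):
    """Compute two illustrative uncertainty-preservation rates.

    retention_rate: share of all targets whose output level equals the
        source level (dropped targets count as not preserved).
    certainty_assertion_rate: among retained targets that were not
        definite in the source, the share rewritten as definite.
    """
    if not targets:
        raise ValueError("need at least one target")

    preserved = sum(1 for t in targets if t.output_level == t.source_level)
    retained_uncertain = [
        t for t in targets
        if t.output_level is not None and t.source_level != DEFINITE
    ]
    asserted = sum(1 for t in retained_uncertain if t.output_level == DEFINITE)

    return {
        "retention_rate": preserved / len(targets),
        "certainty_assertion_rate": (
            asserted / len(retained_uncertain) if retained_uncertain else 0.0
        ),
    }


if __name__ == "__main__":
    # Toy example, not data from the benchmark.
    sample = [
        Target(1, 1),     # preserved
        Target(2, 4),     # uncertain claim rewritten as definite
        Target(3, None),  # proposition dropped from the output
        Target(0, 4),     # uncertain claim rewritten as definite
        Target(2, 2),     # preserved
    ]
    print(preservation_metrics(sample))
```

Reporting the two rates separately keeps outright omission apart from overstatement, which matters because the summary identifies rewriting as definite as the systematic failure.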
Topics
- Clinical NLP
- Diagnostic Uncertainty
- LLM Evaluation
- Medical Text Summarization
- AI Safety in Healthcare
- Model Bias
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.