Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research, Clinical Care & Medical Practice · Depth: Expert, extended

Summary

A new benchmark evaluates how well large language models (LLMs) preserve diagnostic uncertainty in clinical text, a critical requirement for safe clinical workflows. The benchmark comprises 1,200 clinical documents with 9,184 proposition-level uncertainty annotations across five distinct levels. Researchers evaluated three LLMs (gpt-oss-120b, gemini-2.5-flash, and claude-haiku-4.5) on two tasks: clinician handoff summarization and patient-friendly revision. The models preserved the original uncertainty cues poorly, often less than half the time, and struggled to distinguish adjacent uncertainty levels. A notable finding is a systematic bias toward asserting certainty: 37.88% to 44.96% of retained targets were rewritten as definite claims. "Guarded" prompting improved uncertainty retention by 7.08 to 19.19 percentage points, but the best performance still reached only 62.53% URR, so the problem persists even with explicit instructions.

Key takeaway

NLP engineers developing clinical LLM applications should evaluate diagnostic uncertainty preservation explicitly. Metrics for fluency and factual consistency are not sufficient on their own, because LLMs systematically convert uncertain clinical statements into definite assertions, which risks patient harm. Implement dedicated uncertainty preservation metrics and consider fine-tuning or reward signals, since prompt engineering alone offers only partial mitigation.

Key insights

LLMs systematically distort diagnostic uncertainty in clinical text, often converting nuanced statements into definite assertions.

Principles

Clinical meaning depends on both the condition named and the certainty with which it is expressed.
LLMs show a systematic bias toward asserting certainty.
Explicit instructions alone do not prevent uncertainty distortion.

Method

The study constructed a 1,200-document benchmark with 9,184 proposition-level uncertainty annotations across five levels. It then evaluated three LLMs using indirect (text transformation) and direct (classification/ranking) assessments.
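As a rough illustration of what a proposition-level annotation and its before/after comparison might look like, here is a minimal Python sketch. It is not the benchmark's actual schema: the `Certainty` scale, the `Proposition` fields, and the outcome labels are hypothetical names chosen for this example, and the sketch assumes that the highest of the five levels means "definite".

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Certainty(IntEnum):
    # Hypothetical five-point scale; the benchmark's real labels may differ.
    LEVEL_1 = 1
    LEVEL_2 = 2
    LEVEL_3 = 3
    LEVEL_4 = 4
    DEFINITE = 5  # assumption: the top level denotes a definite claim


@dataclass(frozen=True)
class Proposition:
    doc_id: str
    text: str
    source_level: Certainty
    # Level observed in the model's rewrite; None if the proposition was dropped.
    output_level: Optional[Certainty]


def outcome(p: Proposition) -> str:
    """Classify what happened to one annotated proposition after rewriting."""
    if p.output_level is None:
        return "dropped"
    if p.output_level == p.source_level:
        return "preserved"
    if p.output_level == Certainty.DEFINITE:
        return "asserted_definite"
    return "shifted"


if __name__ == "__main__":
    example = Proposition("doc-001", "possible pneumonia",
                          Certainty.LEVEL_2, Certainty.DEFINITE)
    print(outcome(example))
```

Separating "asserted_definite" from other shifts keeps the certainty bias described in the summary visible as its own category instead of folding it into a generic mismatch count.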

In practice

Evaluate uncertainty preservation as a metric in its own right.
Expect LLM outputs to overstate certainty in clinical text, and check for it.
Consider fine-tuning or reward signals in addition to prompt engineering.
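One way to track uncertainty preservation as a separate metric is sketched below in Python. The definitions are assumptions made for illustration and may not match how the paper computes URR: `preservation_rate` is the share of all source propositions whose level is unchanged in the output, and `definite_shift_rate` is the share of retained propositions (those still present in the output) that moved from a non-definite level to the definite one. The function name and the integer encoding (1 to 5, with 5 as definite) are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

DEFINITE = 5  # assumption: levels are encoded 1..5 and 5 means "definite"

# Each pair is (source_level, output_level); output_level is None when the
# proposition does not appear in the model's output at all.
LevelPair = Tuple[int, Optional[int]]


def retention_report(pairs: Sequence[LevelPair]) -> Dict[str, float]:
    """Compute two illustrative rates over annotated propositions."""
    total = len(pairs)
    retained = [(s, o) for s, o in pairs if o is not None]
    preserved = sum(1 for s, o in retained if s == o)
    to_definite = sum(1 for s, o in retained if s != DEFINITE and o == DEFINITE)
    return {
        "preservation_rate": preserved / total if total else 0.0,
        "definite_shift_rate": to_definite / len(retained) if retained else 0.0,
    }


if __name__ == "__main__":
    # Made-up toy data, not benchmark results.
    toy_pairs = [(2, 2), (3, 5), (1, None), (4, 4)]
    print(retention_report(toy_pairs))
```

Running the same report on outputs from a baseline prompt and from a "guarded" prompt gives a difference in percentage points, which is the kind of comparison the summary describes.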

Topics

Clinical NLP
Diagnostic Uncertainty
LLM Evaluation
Medical Text Summarization
AI Safety in Healthcare
Model Bias

Code references

HongboD/Clinical-Uncertainty

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.