Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Advanced, short

Summary

The paper "Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries," submitted on June 4, 2026, investigates how the output of Large Language Model (LLM) structured extraction from clinical free-text notes varies with configuration choices. The study used MIMIC-IV v3.1 discharge summaries with a fixed schema of 17 clinical documentation flags (each yes/no/not_documented) and a 47-tag vocabulary for the primary admission reason. Three prompt variants were tested across two model sizes. Cross-prompt agreement was measured with Cohen's kappa on ICD-stratified subsets. For the three-way flags, both models achieved similar pooled agreement (median kappa 0.69 and 0.68), with the larger model redistributing agreement rather than raising it overall. Collapsing the schema to binary resolved most cross-prompt disagreement, which points to the "absence-versus-silence" distinction (a finding recorded as absent versus one that is simply not mentioned) as a key source. For multi-class admission categorization, changing the model reassigned the dominant tag on nearly half of notes, whereas changing the prompt phrasing did so on about one in eight; the larger model also reduced reliance on catch-all categories from 44% to 26%. The work offers a reusable methodology for auditing extraction reproducibility at population scale.

Key takeaway

NLP engineers building LLM-based clinical data extraction systems should evaluate model and schema choices carefully, because both substantially affect output and reproducibility. Prioritize a schema that explicitly handles the "absence-versus-silence" distinction, which is a major source of cross-prompt disagreement. For multi-class categorization, model selection changes the dominant tag far more often than prompt phrasing does (nearly half of notes versus about one in eight), and the larger model in the study relied less on catch-all categories.
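The two multi-class measurements mentioned above (how often the dominant tag changes between configurations, and how much of the output falls into catch-all categories) can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the tag names, and the `CATCH_ALL` set are hypothetical, and the toy lists are invented rather than drawn from MIMIC-IV.

```python
def reassignment_rate(tags_a, tags_b):
    """Fraction of notes whose dominant tag differs between two configurations."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty tag lists of equal length")
    changed = sum(a != b for a, b in zip(tags_a, tags_b))
    return changed / len(tags_a)


def catch_all_share(tags, catch_all):
    """Fraction of notes whose dominant tag is a catch-all category."""
    if not tags:
        raise ValueError("need a non-empty tag list")
    return sum(t in catch_all for t in tags) / len(tags)


# Hypothetical catch-all tags; the paper's 47-tag vocabulary is not reproduced here.
CATCH_ALL = {"other", "unspecified"}

# Invented dominant tags for the same eight notes under two model sizes.
small_model = ["other", "sepsis", "other", "heart_failure",
               "unspecified", "sepsis", "other", "pneumonia"]
large_model = ["pneumonia", "sepsis", "other", "heart_failure",
               "sepsis", "sepsis", "heart_failure", "pneumonia"]

print("reassigned:", reassignment_rate(small_model, large_model))
print("catch-all (small):", catch_all_share(small_model, CATCH_ALL))
print("catch-all (large):", catch_all_share(large_model, CATCH_ALL))
```

Running the same two functions on prompt pairs instead of model pairs gives the prompt-side figure, so the two sources of variation can be compared on one scale.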

Key insights

The output of LLM-based clinical data extraction is sensitive to model and schema choices, especially where the schema must separate absence from silence.

Method

The study measured sensitivity by fixing the extraction task and varying one choice at a time (prompt, model, or schema), using Cohen's kappa for cross-prompt agreement together with paired comparisons.
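As a rough sketch of the agreement measurement, the snippet below computes Cohen's kappa between the outputs of two prompt variants for one three-way flag, then recomputes it after collapsing the schema to binary. It is an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the labels are invented toy data, and the collapse rule (merging `no` and `not_documented` into one class) is an assumption about how the binary schema is formed.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: share of items given the same label by both runs.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each run's marginal label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs used a single identical label throughout
    return (observed - expected) / (1 - expected)


def collapse_to_binary(labels):
    """Assumed collapse: 'yes' versus everything else (no + not_documented)."""
    return ["yes" if x == "yes" else "not_yes" for x in labels]


# Invented outputs of two prompt variants for one flag on the same eight notes.
prompt_a = ["yes", "no", "not_documented", "yes",
            "no", "not_documented", "yes", "no"]
prompt_b = ["yes", "not_documented", "no", "yes",
            "no", "not_documented", "yes", "not_documented"]

three_way = cohen_kappa(prompt_a, prompt_b)
binary = cohen_kappa(collapse_to_binary(prompt_a), collapse_to_binary(prompt_b))
print(f"three-way kappa: {three_way:.2f}")
print(f"binary kappa:    {binary:.2f}")
```

In this toy data every disagreement is between `no` and `not_documented`, so the binary kappa rises to perfect agreement. That mirrors the absence-versus-silence pattern the study describes, though real data would rarely be this clean.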

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.