Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries
Summary
The paper "Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries," submitted on June 4, 2026, investigates how Large Language Model (LLM) output for structured extraction from clinical free-text notes varies with configuration choices. The study uses MIMIC-IV v3.1 discharge summaries with a fixed schema of 17 clinical documentation flags (yes/no/not_documented) and a 47-tag vocabulary for primary admission reasons. Three prompt variants were tested across two model sizes. Cross-prompt agreement was measured with Cohen's kappa on ICD-stratified subsets. For the three-way flags, both models reached similar pooled agreement (median kappa 0.69 and 0.68), with the larger model redistributing agreement rather than raising the pooled median. Collapsing the schema to binary resolved most cross-prompt disagreement, which pinpoints the "absence-versus-silence" distinction as a key source. For multi-class admission categorization, changing the model reassigned the dominant tag on nearly half of notes, while changing the prompt phrasing did so on about one in eight. The larger model also cut reliance on catch-all categories from 44% to 26%. The research offers a reusable methodology for auditing extraction reproducibility at population scale.
Key takeaway
NLP engineers building LLM-based clinical data extraction systems should evaluate model and schema choices carefully, because both materially affect how sensitive and reproducible the output is. Prioritize refining the schema so that it handles the "absence-versus-silence" distinction explicitly, since that distinction is a major source of disagreement. For multi-class categorization, model selection has a greater effect than prompt phrasing on which dominant tag is assigned, and larger models can reduce reliance on generic categories.
Key insights
The output of LLM-based clinical data extraction is sensitive to model and schema choices, most visibly where the schema has to separate documented absence from silence ("absence-versus-silence").
Principles
- Model choice dominates multi-class categorization.
- Separating "no" from "not_documented" in the schema drives most cross-prompt disagreement.
- Larger models reduce catch-all category use.
Method
The study measured sensitivity by fixing the extraction task and varying one choice (prompt, model, schema) at a time, using Cohen's kappa for cross-prompt agreement and paired comparisons.
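The agreement measurement described above can be sketched in a few lines of Python. This is a minimal illustration and not the paper's code: the label lists are invented toy data, the function names are hypothetical, and the binary collapse (merging "no" and "not_documented" into a single "not_yes" class) is an assumption about how such a collapse could be done, since the summary does not specify it.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of notes where both runs give the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal frequencies, summed per label.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs used one identical label throughout
    return (observed - expected) / (1 - expected)


def collapse_to_binary(labels):
    """Assumed collapse: everything that is not 'yes' becomes 'not_yes'."""
    return ["yes" if x == "yes" else "not_yes" for x in labels]


# Toy outputs for one flag under two prompt variants (invented data).
prompt_a = ["yes", "no", "not_documented", "no", "yes", "not_documented"]
prompt_b = ["yes", "not_documented", "not_documented", "no", "yes", "no"]

kappa_three_way = cohen_kappa(prompt_a, prompt_b)
kappa_binary = cohen_kappa(collapse_to_binary(prompt_a),
                           collapse_to_binary(prompt_b))
print(f"three-way kappa: {kappa_three_way:.2f}")
print(f"binary kappa:    {kappa_binary:.2f}")
```

In this toy case the only disagreements are between "no" and "not_documented", so the binary kappa is higher than the three-way kappa. That mirrors, in miniature, the absence-versus-silence effect the summary describes.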
In practice
- Audit LLM extraction reproducibility at scale.
- Refine schemas to clarify "absence-versus-silence".
- Prioritize model choice for multi-class tasks.
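For the multi-class side of such an audit, two simple quantities correspond to the figures quoted in the summary: the share of notes whose dominant tag changes between two runs, and the share of notes assigned to a catch-all tag. The sketch below is illustrative only. The tag names, the set of catch-all tags, and the function names are hypothetical, because the paper's 47-tag vocabulary is not listed here.

```python
def reassignment_rate(tags_a, tags_b):
    """Fraction of notes whose dominant tag differs between two runs."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(x != y for x, y in zip(tags_a, tags_b)) / len(tags_a)


def catch_all_share(tags, catch_all_tags):
    """Fraction of notes whose dominant tag is one of the catch-all tags."""
    if not tags:
        raise ValueError("need a non-empty sequence of tags")
    return sum(t in catch_all_tags for t in tags) / len(tags)


# Hypothetical catch-all tags and toy per-note outputs (invented data).
CATCH_ALL = {"other", "unspecified"}
small_model = ["other", "cardiac", "other", "infection"]
large_model = ["cardiac", "cardiac", "other", "respiratory"]

model_change_rate = reassignment_rate(small_model, large_model)
small_catch_all = catch_all_share(small_model, CATCH_ALL)
large_catch_all = catch_all_share(large_model, CATCH_ALL)
print(f"dominant tag reassigned: {model_change_rate:.0%}")
print(f"catch-all share: {small_catch_all:.0%} -> {large_catch_all:.0%}")
```

Running the same two functions on pairs of prompt variants instead of pairs of models gives the prompt-side comparison, so the model effect and the prompt effect can be read on the same scale.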
Topics
- LLM Structured Extraction
- Clinical Discharge Summaries
- Prompt Engineering
- Model Sensitivity
- Schema Design
- Cohen's Kappa
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.