Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries
Summary
This work investigates how sensitive large language model (LLM)-based structured extraction from clinical free-text notes is to prompt, model, and schema configuration. Working with MIMIC-IV v3.1 discharge summaries, the study used a fixed schema comprising 17 clinical documentation flags (yes/no/not_documented) and a 47-tag vocabulary for primary admission reasons. Three prompt variants were tested across two model sizes. Both models achieved similar pooled cross-prompt agreement on the three-way flags (median kappa 0.69 and 0.68), with the larger model redistributing agreement rather than eliminating prompt effects. Collapsing the schema to binary largely resolved cross-prompt disagreement, which pinpoints the "absence-versus-silence" distinction as a key source. For multi-class admission categorization, changing the model reassigned the dominant tag on nearly half of notes, far more often than changing the prompt phrasing did (about one in eight). The larger model also reduced reliance on catch-all categories from 44% to 26%. Together, these results point to schema-driven disagreement on the flags and to model choice outweighing prompt phrasing in multi-class tasks.
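The effect of collapsing the schema can be illustrated with a minimal sketch. It assumes the binary collapse merges "no" and "not_documented" into a single class, which the summary does not state explicitly; the function names, the `not_yes` label, and the toy labels are hypothetical and not taken from the study.

```python
from typing import Sequence

VALID_LABELS = {"yes", "no", "not_documented"}


def collapse_to_binary(label: str) -> str:
    """Map a three-way flag to yes / not_yes (assumed collapse rule)."""
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return "yes" if label == "yes" else "not_yes"


def agreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of notes on which two runs emit the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("runs must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy outputs for one flag under two prompt variants (illustrative only).
prompt_a = ["yes", "no", "not_documented", "no", "yes"]
prompt_b = ["yes", "not_documented", "no", "no", "yes"]

three_way = agreement_rate(prompt_a, prompt_b)
binary = agreement_rate(
    [collapse_to_binary(x) for x in prompt_a],
    [collapse_to_binary(x) for x in prompt_b],
)
print(three_way, binary)
```

In this toy case every disagreement is a swap between "no" and "not_documented", so the collapsed labels agree on all notes while the three-way labels do not.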
Key takeaway
Machine learning engineers deploying LLMs for clinical discharge summary extraction should rigorously evaluate schema design, particularly how "not documented" states are handled, because this strongly affects extraction agreement. Model choice heavily influences multi-class categorization outcomes, often more than prompt phrasing does. Prioritize testing different models and refine how the schema represents absence to get reliable, reproducible results in production.
Key insights
LLM extraction sensitivity varies significantly with schema design and model choice, especially on absence-versus-silence distinctions.
Principles
- Schema design affects cross-prompt agreement more than prompt phrasing does.
- Model choice strongly influences multi-class categorization.
- "Absence-versus-silence" is a key disagreement source.
Method
The study measured extraction sensitivity by varying one configuration choice (prompt, model, or schema) at a time on a fixed task. In place of human ground truth, it relied on cross-prompt agreement (Cohen's kappa) and paired comparisons.
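As a rough illustration of the agreement metric, the sketch below computes Cohen's kappa between every pair of prompt variants for a single flag and takes the median. The pairwise pooling, the function names, and the toy labels are assumptions made for illustration; the summary does not describe how the study pooled its kappa values.

```python
from collections import Counter
from itertools import combinations
from statistics import median
from typing import Dict, List, Tuple


def cohen_kappa(a: List[str], b: List[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("runs must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each run's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both runs emit one identical constant label; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


def pairwise_kappas(runs: Dict[str, List[str]]) -> Dict[Tuple[str, str], float]:
    """Kappa for every unordered pair of prompt variants."""
    return {
        (p, q): cohen_kappa(runs[p], runs[q])
        for p, q in combinations(sorted(runs), 2)
    }


# Toy outputs for one flag under three hypothetical prompt variants.
runs = {
    "prompt_a": ["yes", "yes", "no", "no"],
    "prompt_b": ["yes", "no", "no", "no"],
    "prompt_c": ["yes", "yes", "no", "no"],
}
kappas = pairwise_kappas(runs)
median_kappa = median(kappas.values())
print(kappas, median_kappa)
```

Because no run is treated as the reference, the resulting number describes how consistent the configurations are with one another, not how accurate any of them is.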
In practice
- Audit LLM extraction reproducibility on population data.
- Prioritize schema design, especially for "not documented" states.
- Test multiple models for multi-class extraction tasks.
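One way to start such an audit for the multi-class task is to compare the dominant tag per note across two configurations, as in the sketch below. The note IDs, tag names, catch-all set, and function names are all hypothetical; this is a paired comparison under stated assumptions, not the study's code.

```python
from typing import Dict, Set


def reassignment_rate(tags_a: Dict[str, str], tags_b: Dict[str, str]) -> float:
    """Share of shared notes whose dominant tag differs between two runs."""
    shared = tags_a.keys() & tags_b.keys()
    if not shared:
        raise ValueError("the two runs share no note IDs")
    changed = sum(tags_a[note] != tags_b[note] for note in shared)
    return changed / len(shared)


def catch_all_share(tags: Dict[str, str], catch_all: Set[str]) -> float:
    """Share of notes assigned to a catch-all category."""
    if not tags:
        raise ValueError("no notes to score")
    return sum(tag in catch_all for tag in tags.values()) / len(tags)


# Toy dominant tags per note for two hypothetical model configurations.
small_model = {"n1": "sepsis", "n2": "other", "n3": "chest_pain", "n4": "other"}
large_model = {"n1": "sepsis", "n2": "pneumonia", "n3": "chest_pain", "n4": "other"}
CATCH_ALL = {"other"}  # assumed catch-all vocabulary entry

rate = reassignment_rate(small_model, large_model)
small_share = catch_all_share(small_model, CATCH_ALL)
large_share = catch_all_share(large_model, CATCH_ALL)
print(rate, small_share, large_share)
```

Running the same comparison once with the model varied and once with the prompt varied puts the two sources of instability on the same scale.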
Topics
- LLM Structured Extraction
- Clinical Text Analysis
- Prompt Engineering
- Model Sensitivity
- Schema Design
- MIMIC-IV
Best for: NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.