Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Expert, quick

Summary

This work investigates how sensitive large language model (LLM)-based structured extraction from clinical free-text notes is to prompt, model, and schema configuration. Working with MIMIC-IV v3.1 discharge summaries, the study used a fixed schema of 17 clinical documentation flags (yes/no/not_documented) and a 47-tag vocabulary for primary admission reasons. Three prompt variants were tested across two model sizes. Both models achieved similar pooled cross-prompt agreement on the three-way flags (median kappa 0.69 and 0.68); the larger model redistributed agreement rather than eliminating prompt effects. Collapsing the schema to binary largely resolved cross-prompt disagreement, which points to the "absence-versus-silence" distinction (an explicit "no" versus "not_documented") as a key source of it. For multi-class admission categorization, changing the model reassigned the dominant tag on nearly half of the notes, far more often than changing the prompt phrasing did (about one note in eight). The larger model also reduced reliance on catch-all categories from 44% to 26%. Together, these results show that disagreement on the flags is largely schema-driven and that model choice outweighs prompt phrasing in multi-class tasks.
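
The paired comparisons described above (how often the dominant tag changes, and how much of the output lands in catch-all categories) can be sketched as follows. This is a minimal illustration, not the study's code: the function names, the tag names, and the choice of "other" as the catch-all tag are all hypothetical, since the summary does not list the 47-tag vocabulary.

```python
def reassignment_rate(tags_a, tags_b):
    """Fraction of notes whose dominant tag differs between two runs."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty, equal-length tag lists")
    changed = sum(a != b for a, b in zip(tags_a, tags_b))
    return changed / len(tags_a)


def catch_all_share(tags, catch_all):
    """Fraction of notes assigned to any tag in the catch-all set."""
    if not tags:
        raise ValueError("need a non-empty tag list")
    return sum(t in catch_all for t in tags) / len(tags)


# Hypothetical dominant tags for the same four notes under two models.
small_model = ["other", "sepsis", "other", "heart_failure"]
large_model = ["pneumonia", "sepsis", "other", "heart_failure"]
CATCH_ALL = {"other"}  # assumed catch-all tag, for illustration only

model_effect = reassignment_rate(small_model, large_model)
small_share = catch_all_share(small_model, CATCH_ALL)
large_share = catch_all_share(large_model, CATCH_ALL)
```

Running the same `reassignment_rate` once across models (prompt held fixed) and once across prompts (model held fixed) gives the two quantities the summary contrasts.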

Key takeaway

If you are a machine learning engineer deploying LLMs to extract structured data from clinical discharge summaries, evaluate your schema design rigorously, particularly how it handles "not documented" states, because this strongly affects extraction agreement. Your choice of model will heavily influence multi-class categorization outcomes, often more than prompt phrasing does. Prioritize testing different models, and refine how your schema represents absence, to get reliable and reproducible results in production.

Key insights

LLM extraction sensitivity varies significantly with schema design and model choice, especially on absence-versus-silence distinctions.

Method

The study measured extraction sensitivity by varying one configuration choice (prompt, model, or schema) at a time on a fixed task. Because no human ground truth was used, it relied on cross-prompt agreement (Cohen's kappa) and paired comparisons between configurations.
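
As a rough illustration of the agreement measure, the sketch below computes Cohen's kappa between the outputs of two prompt variants for a single flag, first on the three-way labels and then after collapsing to binary. The labels are made up, and the collapse rule (merging "no" and "not_documented" into one class) is an assumption; the summary does not say exactly how the study collapsed the schema.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty, equal-length label lists")
    n = len(labels_a)
    # Observed agreement: share of notes where both runs give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each run's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()
    ) / (n * n)
    if expected == 1.0:  # both runs emit one identical constant label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def collapse_to_binary(labels):
    """Assumed collapse: 'no' and 'not_documented' both become 'not_yes'."""
    return ["yes" if label == "yes" else "not_yes" for label in labels]


# Made-up outputs of two prompt variants on the same six notes, one flag.
prompt_a = ["yes", "no", "not_documented", "yes", "no", "not_documented"]
prompt_b = ["yes", "not_documented", "no", "yes", "no", "not_documented"]

kappa_three_way = cohen_kappa(prompt_a, prompt_b)
kappa_binary = cohen_kappa(
    collapse_to_binary(prompt_a), collapse_to_binary(prompt_b)
)
```

In this toy data the two prompts disagree only on "no" versus "not_documented", so the three-way kappa is 0.5 while the binary kappa is 1.0, the same pattern of absence-versus-silence disagreement the study reports.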

Best for: NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.