RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
Summary
RealICU is a new hindsight-annotated benchmark designed to evaluate large language models (LLMs) in realistic intensive care unit (ICU) conditions, moving beyond traditional benchmarks that use historical clinician actions as ground truth. Its labels are created by senior physicians who review each patient's full trajectory, so the benchmark avoids treating clinician actions made under incomplete information, which can be suboptimal, as the reference. The benchmark features four physician-motivated tasks: Patient Status, Acute Problems, Recommended Actions, and Red Flag actions. It partitions patient trajectories into 30-minute windows and includes two datasets: RealICU-Gold, with 930 window annotations from 94 MIMIC-IV patients, and RealICU-Scale, which extends coverage to 11,862 windows labeled by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs, even memory-augmented ones, perform poorly: they exhibit a recall-safety tradeoff in clinical recommendations and an anchoring bias toward early interpretations of a patient. The authors also introduce ICU-Evo, a structured-memory agent designed to improve long-horizon reasoning.
Key takeaway
For AI Scientists and Machine Learning Engineers developing clinical decision support systems, RealICU highlights critical shortcomings in current LLM performance, particularly regarding safety and bias. You should prioritize developing LLM agents with improved long-horizon reasoning and structured memory, specifically addressing the identified recall-safety tradeoff and anchoring bias to ensure reliable and safe recommendations in high-stakes ICU environments.
Key insights
RealICU is a new benchmark for LLMs in ICU settings, using hindsight-annotated data to overcome limitations of historical clinician actions.
Principles
- Hindsight annotation over full patient trajectories gives more reliable ground truth than historical clinician actions.
- LLMs show a recall-safety tradeoff in clinical recommendations.
- LLMs anchor on early interpretations of a patient.
Method
RealICU formulates four physician-motivated tasks (Patient Status, Acute Problems, Recommended Actions, Red Flag actions) and uses 30-minute windows from MIMIC-IV patient trajectories, with labels created after senior physician review.
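The 30-minute windowing described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual preprocessing: the function name `partition_into_windows`, the `(timestamp, payload)` event format, and the example measurements are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Window length taken from the summary; everything else here is an assumption.
WINDOW = timedelta(minutes=30)


def partition_into_windows(events, start, window=WINDOW):
    """Group (timestamp, payload) events into consecutive fixed-length windows.

    Window k covers the half-open interval
    [start + k * window, start + (k + 1) * window).
    Returns a dict mapping window index -> list of payloads.
    """
    windows = defaultdict(list)
    for timestamp, payload in events:
        if timestamp < start:
            continue  # skip events recorded before the trajectory starts
        index = (timestamp - start) // window  # timedelta // timedelta -> int
        windows[index].append(payload)
    return dict(windows)


# Made-up trajectory for illustration only (not MIMIC-IV data).
admit = datetime(2024, 1, 1, 8, 0)
events = [
    (admit + timedelta(minutes=5), "HR 112"),
    (admit + timedelta(minutes=29), "MAP 58"),
    (admit + timedelta(minutes=30), "lactate 4.1"),
    (admit + timedelta(minutes=95), "norepinephrine started"),
]
windows = partition_into_windows(events, admit)
print(windows)
```

In this sketch an event at exactly minute 30 falls into the second window, and windows with no events are simply absent from the result. Whether RealICU handles boundaries and empty windows this way is not stated in the summary.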
In practice
- Evaluate LLMs on RealICU for ICU decision support.
- Develop agents to mitigate anchoring bias.
- Address the recall-safety tradeoff in clinical LLMs.
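One way to make the recall-safety tradeoff concrete is to score a model's suggested actions for a window against both the Recommended Actions and the Red Flag actions. The sketch below assumes exact string matching and two simple ratios; these metric definitions, the function name `recall_and_red_flag_rate`, and the example actions are illustrative assumptions, not the scoring used by RealICU.

```python
def recall_and_red_flag_rate(predicted, recommended, red_flags):
    """Score one window's suggested actions (hypothetical metric definitions).

    recall        = share of recommended actions that the model suggested
    red_flag_rate = share of the model's suggestions that are red-flag actions
    """
    predicted = set(predicted)
    recommended = set(recommended)
    red_flags = set(red_flags)
    recall = len(predicted & recommended) / len(recommended) if recommended else 1.0
    red_flag_rate = len(predicted & red_flags) / len(predicted) if predicted else 0.0
    return recall, red_flag_rate


# Made-up labels and model outputs for illustration only.
recommended = {"fluid bolus", "start vasopressor"}
red_flags = {"give beta blocker"}

cautious = {"fluid bolus"}
aggressive = {"fluid bolus", "start vasopressor", "give beta blocker"}

cautious_scores = recall_and_red_flag_rate(cautious, recommended, red_flags)
aggressive_scores = recall_and_red_flag_rate(aggressive, recommended, red_flags)
print(cautious_scores, aggressive_scores)
```

In this toy case the cautious output misses half of the recommended actions but suggests nothing unsafe, while the aggressive output covers every recommended action at the cost of including a red-flag action. That is the shape of the tradeoff the summary describes.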
Topics
- RealICU Benchmark
- LLM Agents
- Intensive Care Units
- Clinical Decision Support
- Long-Context Understanding
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.