RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics) · Depth: Expert, quick

Summary

RealICU is a new hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic intensive care unit (ICU) conditions. Traditional benchmarks treat historical clinician actions as ground truth, even though those actions were taken with incomplete information and may be suboptimal. RealICU instead uses labels created by senior physicians after they review the full patient trajectory. The benchmark defines four physician-motivated tasks: assessing Patient Status, Acute Problems, Recommended Actions, and Red Flag actions. It partitions patient trajectories into 30-minute windows and comes in two datasets: RealICU-Gold, with 930 window annotations from 94 MIMIC-IV patients, and RealICU-Scale, with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs, including memory-augmented ones, perform poorly on the benchmark, showing a recall-safety tradeoff in clinical recommendations and an anchoring bias toward early interpretations of the patient. The authors also introduce ICU-Evo, a structured-memory agent designed to improve long-horizon reasoning.
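To make the recall-safety tradeoff concrete, here is a minimal Python sketch of how one window's recommendations could be scored. The summary does not describe the benchmark's actual metrics, so everything here is an assumption for illustration: the function name `score_window`, the two measures (recall against hindsight-recommended actions and the share of proposed actions that are red flags), and the placeholder action labels.

```python
def score_window(proposed, recommended, red_flags):
    """Score one window's proposed actions (hypothetical metric definitions).

    recall        = share of hindsight-recommended actions the agent proposed
    red_flag_rate = share of the agent's proposed actions that are red flags
    """
    proposed = set(proposed)
    recommended = set(recommended)
    red_flags = set(red_flags)

    # With nothing to recommend, recall is trivially perfect.
    recall = len(proposed & recommended) / len(recommended) if recommended else 1.0
    # With nothing proposed, no red-flag action was taken.
    red_flag_rate = len(proposed & red_flags) / len(proposed) if proposed else 0.0
    return recall, red_flag_rate


# Placeholder labels, not real clinical actions.
recommended = {"A", "B", "C"}
red_flags = {"X"}

cautious = score_window(["A"], recommended, red_flags)
aggressive = score_window(["A", "B", "C", "X"], recommended, red_flags)
```

In this toy case the cautious agent proposes little and misses recommended actions, while the aggressive agent recovers all of them but also proposes a red-flag action. That is one way to read the tradeoff the summary describes.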

Key takeaway

For AI scientists and machine learning engineers building clinical decision support systems, RealICU exposes critical shortcomings in current LLM performance, particularly around safety and bias. Prioritize agents with stronger long-horizon reasoning and structured memory, and address the recall-safety tradeoff and the anchoring bias directly, so that recommendations are reliable and safe in high-stakes ICU environments.

Key insights

RealICU is a new benchmark for LLMs in ICU settings, using hindsight-annotated data to overcome limitations of historical clinician actions.

Principles

Method

RealICU formulates four physician-motivated tasks (Patient Status, Acute Problems, Recommended Actions, Red Flag actions) and uses 30-minute windows from MIMIC-IV patient trajectories, with labels created after senior physician review.
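The windowing step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's code: the function name `partition_into_windows`, the `(timestamp, payload)` event format, and the choice to start window 0 at the first event are all hypothetical, and the benchmark may anchor windows differently (for example, to ICU admission time).

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)


def partition_into_windows(events, window=WINDOW):
    """Group (timestamp, payload) events into consecutive fixed-width windows.

    Returns a dict mapping window index -> list of payloads, where window 0
    starts at the earliest event timestamp (an assumption of this sketch).
    """
    if not events:
        return {}
    ordered = sorted(events, key=lambda e: e[0])
    start = ordered[0][0]
    windows = defaultdict(list)
    for ts, payload in ordered:
        index = (ts - start) // window  # timedelta // timedelta yields an int
        windows[index].append(payload)
    return dict(windows)


# Made-up events for illustration only.
events = [
    (datetime(2024, 1, 1, 8, 0), "heart rate recorded"),
    (datetime(2024, 1, 1, 8, 10), "lab result"),
    (datetime(2024, 1, 1, 8, 30), "nursing note"),
    (datetime(2024, 1, 1, 9, 5), "medication order"),
]
windows = partition_into_windows(events)
```

Each resulting window would then carry its own hindsight labels for the four tasks. Windows with no events simply do not appear in the returned dict.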

In practice

Topics

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.