RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

2026-05-13 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, quick

Summary

RealICU is a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic intensive care unit (ICU) conditions. Traditional benchmarks treat historical clinician actions as ground truth, but those actions were taken under incomplete information and can be suboptimal. RealICU instead uses labels created by senior physicians after reviewing full patient trajectories. The benchmark defines four physician-motivated tasks: Patient Status, Acute Problems, Recommended Actions, and Red Flag actions. It partitions patient trajectories into 30-minute windows and includes two datasets: RealICU-Gold, with 930 window annotations from 94 MIMIC-IV patients, and RealICU-Scale, which extends the annotations to 11,862 windows using Oracle, a physician-validated LLM hindsight labeler. Existing LLMs, including memory-augmented ones, performed poorly: they showed a recall-safety tradeoff in clinical recommendations and an anchoring bias toward early interpretations of the patient. The authors also introduce ICU-Evo, a structured-memory agent designed to improve long-horizon reasoning.

Key takeaway

For AI Scientists and Machine Learning Engineers developing clinical decision support systems, RealICU highlights critical shortcomings in current LLM performance, particularly regarding safety and bias. You should prioritize developing LLM agents with improved long-horizon reasoning and structured memory, specifically addressing the identified recall-safety tradeoff and anchoring bias to ensure reliable and safe recommendations in high-stakes ICU environments.

Key insights

RealICU is a new benchmark for LLMs in ICU settings, using hindsight-annotated data to overcome limitations of historical clinician actions.

Principles

Hindsight annotation gives more reliable ground truth than historical clinician actions.
LLMs show recall-safety tradeoff in clinical tasks.
LLMs anchor on early interpretations of the patient.
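
One way to make the recall-safety tradeoff concrete is to score a single window's predicted actions against its hindsight labels. The sketch below is illustrative only: the function name, the exact-string matching, and the two metrics are assumptions, not the scoring procedure defined by RealICU, and the action strings are made up.

```python
def recall_and_red_flag_rate(predicted, recommended, red_flags):
    """Score one window's predicted actions against hindsight labels.

    Hypothetical metrics (not taken from the benchmark):
    - recall: share of physician-recommended actions the agent proposed
    - red_flag_rate: share of the agent's proposals that are red-flag actions
    """
    predicted = set(predicted)
    recommended = set(recommended)
    red_flags = set(red_flags)
    recall = len(predicted & recommended) / len(recommended) if recommended else 1.0
    red_flag_rate = len(predicted & red_flags) / len(predicted) if predicted else 0.0
    return recall, red_flag_rate


# Made-up labels for a single window.
recommended = ["start vasopressor", "order lactate", "fluid bolus"]
red_flags = ["extubate"]

# A cautious agent proposes little: low recall, no red flags.
cautious = recall_and_red_flag_rate(["order lactate"], recommended, red_flags)

# An aggressive agent proposes everything: full recall, but one red flag.
aggressive = recall_and_red_flag_rate(
    ["start vasopressor", "order lactate", "fluid bolus", "extubate"],
    recommended,
    red_flags,
)
```

Under these assumptions, an agent that proposes more actions raises its recall but also risks proposing a red-flag action, which is the tension the principle describes.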

Method

RealICU formulates four physician-motivated tasks (Patient Status, Acute Problems, Recommended Actions, Red Flag actions) and uses 30-minute windows from MIMIC-IV patient trajectories, with labels created after senior physician review.
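
The windowing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Window` class, its field names, the `partition` helper, and the `(timestamp, description)` event format are all hypothetical. Only the 30-minute window length and the four task names come from the summary.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

WINDOW = timedelta(minutes=30)  # window length stated in the summary


@dataclass
class Window:
    start: datetime
    end: datetime
    events: list = field(default_factory=list)  # (timestamp, description) pairs
    # Slots for the four hindsight-labeled tasks; field names are hypothetical.
    patient_status: Optional[str] = None
    acute_problems: list = field(default_factory=list)
    recommended_actions: list = field(default_factory=list)
    red_flag_actions: list = field(default_factory=list)


def partition(events, admit_time, window=WINDOW):
    """Group timestamped events into consecutive fixed-length windows.

    Windows that contain no events are skipped in this sketch.
    """
    windows = {}
    for ts, desc in sorted(events):
        idx = (ts - admit_time) // window  # timedelta // timedelta gives an int
        if idx not in windows:
            start = admit_time + idx * window
            windows[idx] = Window(start=start, end=start + window)
        windows[idx].events.append((ts, desc))
    return [windows[i] for i in sorted(windows)]
```

A hindsight labeler, human or LLM, would then fill the four label fields for each window after seeing the whole trajectory, while the agent under evaluation sees only events up to the end of that window.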

In practice

Evaluate LLMs on RealICU for ICU decision support.
Develop agents to mitigate anchoring bias.
Address recall-safety tradeoff in clinical LLMs.

Topics

RealICU Benchmark
LLM Agents
Intensive Care Units
Clinical Decision Support
Long-Context Understanding

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.