EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

2026-06-14 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Health & Medical Research · Depth: Expert, quick

Summary

EHRNote-ChatQA is introduced as the first benchmark for evidence-grounded multi-turn clinical question answering using patients' longitudinal discharge summaries. Built from de-identified MIMIC-IV data, it comprises 967 patient-level multi-turn samples, spanning one to five notes, and 16,072 medical-expert-verified QA pairs across eight clinical categories. The benchmark's construction involved an expert-informed pipeline combining structuring schema, curated QA templates, LLM-based generation, and review by 11 medical experts. Initial benchmarking of 22 open- and closed-source LLMs revealed significant challenges: models struggle more with evidence grounding than content, multi-turn errors compound, and single-turn performance does not reliably transfer to this complex setting. The dataset will be publicly available via PhysioNet credentialed access.

Key takeaway

For AI Scientists and Machine Learning Engineers developing clinical NLP systems, you should prioritize evaluating your models against multi-turn, evidence-grounded benchmarks like EHRNote-ChatQA. Your current single-turn QA performance may not reflect real-world clinical utility, especially concerning evidence grounding and error propagation across turns. Consider integrating robust evidence retrieval and multi-turn reasoning mechanisms to address these identified challenges.

Key insights

EHRNote-ChatQA benchmarks LLMs on evidence-grounded, multi-turn clinical QA over longitudinal discharge summaries, revealing current limitations.

Principles

LLM evidence grounding is harder than content answering.
Multi-turn errors compound in clinical QA.
Single-turn QA performance does not transfer.

Method

The benchmark uses an expert-informed pipeline: discharge-summary structuring schema, expert-curated multi-turn QA templates, LLM generation, and 11 medical expert review for every sample.

In practice

Evaluate LLMs on multi-turn clinical QA.
Focus on evidence grounding capabilities.
Access dataset via PhysioNet.

Topics

Clinical Question Answering
Large Language Models
EHRNote-ChatQA Benchmark
Discharge Summaries
Evidence Grounding
MIMIC-IV

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.