EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries
Summary
EHRNote-ChatQA is introduced as the first benchmark for evidence-grounded, multi-turn clinical question answering over patients' longitudinal discharge summaries. Built from de-identified MIMIC-IV data, it comprises 967 patient-level multi-turn samples, each spanning one to five notes, and 16,072 medical-expert-verified QA pairs across eight clinical categories. The benchmark was built with an expert-informed pipeline that combines a discharge-summary structuring schema, curated QA templates, LLM-based generation, and review by 11 medical experts. Initial benchmarking of 22 open- and closed-source LLMs revealed three challenges: models struggle more with grounding answers in evidence than with producing the answer content, errors compound across turns, and single-turn performance does not reliably transfer to the multi-turn setting. The dataset will be publicly available via PhysioNet credentialed access.
Key takeaway
If you develop clinical NLP systems as an AI scientist or machine learning engineer, prioritize evaluating your models against multi-turn, evidence-grounded benchmarks such as EHRNote-ChatQA. Your current single-turn QA performance may not reflect real-world clinical utility, particularly with respect to evidence grounding and error propagation across turns. Consider integrating robust evidence retrieval and multi-turn reasoning mechanisms to address these challenges.
Key insights
EHRNote-ChatQA benchmarks LLMs on evidence-grounded, multi-turn clinical QA over longitudinal discharge summaries, revealing current limitations.
Principles
- Grounding answers in evidence is harder for LLMs than producing the answer content.
- Errors compound across turns in multi-turn clinical QA.
- Single-turn QA performance does not reliably transfer to the multi-turn setting.
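A rough way to see why multi-turn errors compound is the sketch below. It assumes, purely for illustration, that each turn is answered correctly with the same probability and independently of the other turns. The function name and the numbers are hypothetical and are not taken from the benchmark. In a real conversation, earlier mistakes can also contaminate later turns, so this simple model is not a faithful description of the paper's findings.

```python
def all_turns_correct_probability(per_turn_accuracy: float, num_turns: int) -> float:
    """Probability that every turn of a conversation is answered correctly.

    Illustrative assumption: turns are independent and share one accuracy.
    """
    if not 0.0 <= per_turn_accuracy <= 1.0:
        raise ValueError("per_turn_accuracy must be between 0 and 1")
    if num_turns < 0:
        raise ValueError("num_turns must be non-negative")
    return per_turn_accuracy ** num_turns


# Hypothetical numbers: a model that is right 90% of the time per turn.
for turns in (1, 3, 5):
    print(turns, round(all_turns_correct_probability(0.9, turns), 3))
```

Under these assumptions, a per-turn accuracy that looks strong in isolation still leaves a noticeably lower chance of a fully correct conversation as the number of turns grows.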
Method
The benchmark uses an expert-informed pipeline: a discharge-summary structuring schema, expert-curated multi-turn QA templates, LLM-based generation, and medical-expert review of every sample (11 experts in total).
In practice
- Evaluate LLMs on multi-turn clinical QA.
- Focus on evidence grounding capabilities.
- Access the dataset through PhysioNet credentialed access once it is released.
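The practices above can be sketched as a small evaluation loop that scores answer content and evidence grounding separately for each turn. This is a minimal sketch under stated assumptions, not the benchmark's official protocol: the `Turn` structure, the evidence ID format, the exact-match answer check, and the set-based evidence F1 are all hypothetical choices made for illustration, and the summary does not specify the actual data schema or metrics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Turn:
    question: str
    gold_answer: str
    gold_evidence: Set[str]  # hypothetical IDs of supporting note sentences


# Hypothetical model interface:
# (notes, history of (question, answer) pairs, question) -> (answer, cited IDs)
Model = Callable[[List[str], List[Tuple[str, str]], str], Tuple[str, Set[str]]]


def evidence_f1(predicted: Set[str], gold: Set[str]) -> float:
    """Set-overlap F1 between cited and gold evidence IDs."""
    if not predicted or not gold:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(predicted == gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def evaluate_conversation(
    model: Model, notes: List[str], turns: List[Turn]
) -> Dict[str, float]:
    """Run one multi-turn sample and report content and grounding scores."""
    if not turns:
        raise ValueError("a conversation needs at least one turn")
    history: List[Tuple[str, str]] = []
    answer_hits: List[float] = []
    grounding: List[float] = []
    for turn in turns:
        answer, cited = model(notes, history, turn.question)
        # Content score: case-insensitive exact match (a simplification).
        is_match = answer.strip().lower() == turn.gold_answer.strip().lower()
        answer_hits.append(float(is_match))
        # Grounding score: computed independently of the content score.
        grounding.append(evidence_f1(cited, turn.gold_evidence))
        # The model's own answer is fed forward, so early errors can propagate.
        history.append((turn.question, answer))
    n = len(turns)
    return {
        "answer_accuracy": sum(answer_hits) / n,
        "evidence_f1": sum(grounding) / n,
    }
```

Reporting the two scores separately makes it visible when a model gives the right answer while citing the wrong part of the record, which a single combined score would hide.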
Topics
- Clinical Question Answering
- Large Language Models
- EHRNote-ChatQA Benchmark
- Discharge Summaries
- Evidence Grounding
- MIMIC-IV
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.