EDEN: A Large-Scale Corpus of Clinical Notes for Italian
Summary
EDEN (Emergency Department Electronic Notes) is a new, large-scale corpus of approximately 4 million anonymized clinical notes from Italian Emergency Departments. The dataset covers diverse phases of patient care and includes a subset of about six thousand notes manually annotated by clinical experts using a structured Case Report Form (CRF) with 132 items, focusing on cases of dyspnea and loss of consciousness. The annotation process involved multiple clinicians and iterative revision. EDEN aims to address a significant data gap and to support the development and application of Large Language Models in real clinical settings. The authors detail the data collection, the anonymization pipeline, corpus statistics, and the annotation scheme, and they propose CRF-filling as a novel benchmark for structured information extraction. Baseline results are provided using Gemma-27B and MedGemma-27B. According to the authors, EDEN is currently the largest freely available corpus of Italian clinical notes.
Key takeaway
For NLP engineers and research scientists developing medical AI solutions for Italian healthcare, EDEN is a resource of unusual scale. Consider using this anonymized corpus to train and validate Large Language Models, particularly for structured information extraction tasks such as CRF-filling. The dataset fills a critical gap in Italian clinical text and could enable more accurate, context-aware AI applications in emergency department settings, potentially improving patient care workflows.
Key insights
EDEN provides the largest freely available Italian clinical note corpus, crucial for medical LLM development.
Method
The authors combine a data collection protocol, an on-site anonymization pipeline, and an iterative annotation scheme in which multiple clinicians annotate and revise a subset of notes.
In practice
- Develop Italian medical LLMs
- Benchmark structured information extraction
- Analyze emergency department patient care
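To make the benchmarking use case concrete, here is a minimal Python sketch of how CRF-filling predictions might be scored against clinician annotations. It is an illustration under stated assumptions, not the authors' evaluation protocol: the item names (`dyspnea_present`, `loss_of_consciousness`), the answer values, the function `score_crf_filling`, and the choice of exact-match accuracy are all hypothetical, since the summary does not specify the 132 CRF items or the metric used.

```python
from collections import defaultdict


def normalize(value: str) -> str:
    """Lowercase and trim an answer so trivial formatting differences do not count as errors."""
    return value.strip().lower()


def score_crf_filling(gold_forms, predicted_forms):
    """Score predicted CRFs against gold CRFs with exact-match accuracy.

    Each form is a dict mapping a CRF item name to its answer (hypothetical schema).
    An item missing from a prediction counts as wrong.
    Returns (overall_accuracy, per_item_accuracy).
    """
    if len(gold_forms) != len(predicted_forms):
        raise ValueError("gold and predicted lists must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in zip(gold_forms, predicted_forms):
        for item, gold_value in gold.items():
            total[item] += 1
            pred_value = pred.get(item)
            if pred_value is not None and normalize(pred_value) == normalize(gold_value):
                correct[item] += 1

    per_item = {item: correct[item] / total[item] for item in total}
    n_total = sum(total.values())
    overall = sum(correct.values()) / n_total if n_total else 0.0
    return overall, per_item


# Toy data: two notes, two hypothetical CRF items each.
gold_forms = [
    {"dyspnea_present": "yes", "loss_of_consciousness": "no"},
    {"dyspnea_present": "no", "loss_of_consciousness": "yes"},
]
predicted_forms = [
    {"dyspnea_present": "Yes", "loss_of_consciousness": "no"},
    {"dyspnea_present": "no"},  # second item left unfilled by the model
]

overall, per_item = score_crf_filling(gold_forms, predicted_forms)
print(overall, per_item)
```

Reporting per-item accuracy alongside the overall figure shows which form fields a model fills unreliably, which a single aggregate number would hide.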
Topics
- Clinical Notes
- Italian Language
- Large Language Models
- Medical NLP
- Dataset
- Anonymization
- Emergency Department
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.