EDEN: A Large-Scale Corpus of Clinical Notes for Italian

2026-06-10 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Cybersecurity & Data Privacy · Depth: Advanced, quick

Summary

EDEN (Emergency Department Electronic Notes) is a new large-scale corpus of approximately 4 million anonymized clinical notes from Italian Emergency Departments. The notes cover diverse phases of patient care. A subset of about six thousand notes was manually annotated by clinical experts using a structured Case Report Form (CRF) with 132 items, focusing on cases of dyspnea and loss of consciousness; the annotation involved multiple clinicians and iterative revision. EDEN aims to close a significant data gap and to support the development and application of Large Language Models in real medical settings. The authors describe the data collection, the anonymization pipeline, the corpus statistics, and the annotation scheme, and propose CRF-filling as a novel structured information extraction benchmark, with baseline results for Gemma-27B and MedGemma-27B. According to the authors, EDEN is currently the largest freely available corpus of Italian clinical notes.

Key takeaway

For NLP engineers and research scientists building medical AI for Italian healthcare, EDEN offers a resource at a scale not previously freely available. Its anonymized notes can be used to train and validate Large Language Models, particularly for structured information extraction tasks such as CRF-filling. By filling this data gap, the corpus could enable more accurate and contextually relevant AI applications in emergency department settings and may improve patient care workflows.

Key insights

EDEN provides the largest freely available Italian clinical note corpus, crucial for medical LLM development.

Method

The method involves a data collection protocol, an on-site anonymization pipeline, and a multi-clinician iterative annotation scheme for a subset of notes.

In practice

Develop Italian medical LLMs
Benchmark structured information extraction
Analyze emergency department patient care
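
To make the benchmarking use case concrete, here is a minimal Python sketch of how a CRF-filling prediction might be scored against a clinician's annotation. Everything in it is an assumption made for illustration: the item names, the yes/no/empty value encoding, and the two metrics are hypothetical and are not taken from the EDEN paper, whose actual CRF schema and evaluation protocol are not described in this summary.

```python
from typing import Dict, Optional

# A filled CRF is modeled as item name -> value, with None meaning "left empty".
# Hypothetical encoding: the real form has 132 items whose names are not given here.
CRF = Dict[str, Optional[str]]


def score_crf(gold: CRF, pred: CRF) -> Dict[str, float]:
    """Compare a predicted CRF with a gold CRF, item by item (exact match)."""
    items = list(gold)
    correct = sum(1 for k in items if pred.get(k) == gold[k])

    # Items the annotator actually filled in; most CRF items are typically empty,
    # so plain accuracy alone can look high for a model that predicts nothing.
    filled = [k for k in items if gold[k] is not None]
    filled_correct = sum(1 for k in filled if pred.get(k) == gold[k])

    return {
        "accuracy": correct / len(items) if items else 0.0,
        "filled_recall": filled_correct / len(filled) if filled else 0.0,
    }


# Toy example with four made-up items.
gold = {
    "dyspnea_present": "yes",
    "loss_of_consciousness": "no",
    "chest_pain": None,
    "fever": None,
}
pred = {
    "dyspnea_present": "yes",
    "loss_of_consciousness": "yes",
    "chest_pain": None,
    "fever": None,
}
scores = score_crf(gold, pred)
print(scores)
```

Reporting a score restricted to filled items alongside overall accuracy is one possible design choice for sparse forms; the benchmark itself may define its metrics differently.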

Topics

Clinical Notes
Italian Language
Large Language Models
Medical NLP
Dataset
Anonymization
Emergency Department

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.