EDEN: A Large-Scale Corpus of Clinical Notes for Italian

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Science & Research — Health & Medical Research, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

EDEN (Emergency Department Electronic Notes) is a new corpus of approximately 4 million anonymized Italian clinical notes from Emergency Departments, collected between 2021 and 2023. Clinical experts manually annotated a subset of 5,746 notes, covering dyspnea and loss-of-consciousness cases, using a structured Case Report Form (CRF) with 132 items. The dataset addresses the scarcity of real clinical data for languages other than English, and for Italian in particular. It supports the development and evaluation of Large Language Models for medical applications and introduces CRF-filling as a novel structured information extraction benchmark. In initial zero-shot baselines with Gemma-27B and MedGemma-27B, MedGemma-27B achieved the highest Macro-F1, 0.702, with group prompting.

Key takeaway

For NLP Engineers and Research Scientists building clinical information extraction systems for Italian, EDEN is a key resource. Use this large-scale, expert-annotated corpus to train and benchmark models for CRF-filling, and explore few-shot and fine-tuning approaches beyond the reported zero-shot baselines. The findings suggest prioritizing biomedically adapted LLMs such as MedGemma-27B and using group-level prompting to balance extraction quality against inference efficiency, which matters for real-world deployment.

Key insights

EDEN provides the largest Italian clinical note corpus and a CRF-filling benchmark for LLM-based information extraction.

Principles

Biomedical pre-training enhances clinical NLP performance.
Grouping items improves LLM extraction efficiency and quality.
Macro-F1 is crucial for imbalanced clinical extraction tasks.
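
To make the last principle concrete, here is a minimal sketch of why Macro-F1 is more informative than accuracy when one answer dominates. The labels below are invented toy data, not EDEN results, and the function names are illustrative. The source does not specify how EDEN averages its scores, so this uses the standard definition: the unweighted mean of per-class F1.

```python
def f1_for_label(gold, pred, label):
    """F1 of a single class, treating `label` as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0  # no correct positive prediction: F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


def accuracy(gold, pred):
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


# Toy, imbalanced item: most notes say nothing about it (hypothetical data).
gold = ["unknown"] * 8 + ["yes"] * 2
pred = ["unknown"] * 10  # a model that always answers "unknown"

print(f"accuracy = {accuracy(gold, pred):.3f}")   # accuracy = 0.800
print(f"macro-F1 = {macro_f1(gold, pred):.3f}")   # macro-F1 = 0.444
```

The always-"unknown" model looks strong on accuracy but scores zero on the rare "yes" class, and the macro average exposes that.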

Method

In the CRF-filling task, a model reads a clinical note and predicts a value for each of the 132 CRF items (binary, categorical, numerical, or mixed), or "unknown" when the note does not provide sufficient evidence.
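
One way to picture the task's output space is a small schema plus a validator that maps any raw model answer to a legal value or to "unknown". This is a sketch under stated assumptions, not the EDEN format: the item names, the option lists, the yes/no encoding of binary items, and the reading of "mixed" as "either a listed option or a number" are all hypothetical.

```python
from dataclasses import dataclass

UNKNOWN = "unknown"


@dataclass(frozen=True)
class CRFItem:
    name: str            # hypothetical item identifier
    kind: str            # "binary", "categorical", "numerical", or "mixed"
    options: tuple = ()  # allowed labels for categorical/mixed items


def _as_number(text):
    """Parse a number, accepting the Italian decimal comma; None on failure."""
    try:
        return float(text.replace(",", "."))
    except ValueError:
        return None


def normalize_answer(item, raw):
    """Map a raw model answer to a legal value for `item`, else "unknown"."""
    text = raw.strip().lower()
    if not text or text == UNKNOWN:
        return UNKNOWN
    if item.kind == "binary":
        return text if text in ("yes", "no") else UNKNOWN
    if item.kind == "categorical":
        return text if text in item.options else UNKNOWN
    if item.kind == "numerical":
        number = _as_number(text)
        return number if number is not None else UNKNOWN
    if item.kind == "mixed":
        # Assumption: a mixed item accepts a listed option or a number.
        if text in item.options:
            return text
        number = _as_number(text)
        return number if number is not None else UNKNOWN
    raise ValueError(f"unsupported item kind: {item.kind}")


# Hypothetical items, for illustration only.
dyspnea = CRFItem("dyspnea_present", "binary")
triage = CRFItem("triage_code", "categorical", ("white", "green", "yellow", "red"))
temperature = CRFItem("body_temperature", "numerical")

print(normalize_answer(dyspnea, " Yes "))      # yes
print(normalize_answer(triage, "blue"))        # unknown
print(normalize_answer(temperature, "37,5"))   # 37.5
```

Falling back to "unknown" on any malformed answer keeps every prediction inside the label set, so scoring never has to handle free-form strings.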

In practice

Use MedGemma-27B for Italian clinical text extraction.
Implement group-level prompting for efficiency.
Prioritize Macro-F1 for evaluating imbalanced extraction.
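
The group-level prompting advice can be sketched as follows: instead of one model call per item, several items share one prompt and one structured response. Everything here is an assumption for illustration. The summary does not say how EDEN forms its groups or words its prompts, so the fixed-size chunking, the prompt text, the JSON answer format, and the `generate` callable (a stand-in for whatever model client is used) are hypothetical.

```python
import json

UNKNOWN = "unknown"


def chunk(items, size):
    """Split a list into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_group_prompt(note, item_names):
    """One prompt covering a whole group of CRF items (hypothetical wording)."""
    return "\n".join([
        "Read the clinical note and fill in the listed items.",
        'Reply with a single JSON object; use "unknown" when evidence is missing.',
        "Items: " + ", ".join(item_names),
        "Note: " + note,
    ])


def parse_group_response(response, item_names):
    """Keep only the requested items; anything missing or malformed is unknown."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {name: data.get(name, UNKNOWN) for name in item_names}


def fill_crf(note, item_names, generate, group_size=12):
    """Fill every item with one model call per group rather than per item."""
    filled = {}
    for group in chunk(item_names, group_size):
        response = generate(build_group_prompt(note, group))
        filled.update(parse_group_response(response, group))
    return filled


# Stub model so the sketch runs without any LLM; it only ever answers item_0.
calls = []

def fake_generate(prompt):
    calls.append(prompt)
    return '{"item_0": "yes"}'

items = [f"item_{i}" for i in range(132)]
result = fill_crf("nota di esempio", items, fake_generate, group_size=12)
print(len(calls))  # 11
```

With 132 items and groups of 12, the note is sent 11 times instead of 132, which is where the efficiency gain comes from. Whether quality holds up at a given group size is an empirical question to check against Macro-F1.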

Topics

Clinical NLP
Italian Language Resources
Information Extraction
Case Report Forms
Large Language Models
MedGemma-27B

Code references

PlanTL-GOB-ES/SPACCC

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.