EDEN: A Large-Scale Corpus of Clinical Notes for Italian
Summary
EDEN (Emergency Department Electronic Notes) is a new, large-scale corpus of approximately 4 million anonymized Italian clinical notes from Emergency Departments, collected between 2021 and 2023. Clinical experts manually annotated a subset of 5,746 notes, covering dyspnea and loss-of-consciousness cases, using a structured Case Report Form (CRF) with 132 items. The dataset addresses the scarcity of real clinical data in languages other than English, and in Italian in particular. It supports the development and use of Large Language Models in medical applications and introduces CRF-filling as a novel benchmark for structured information extraction. In initial zero-shot baselines with Gemma-27B and MedGemma-27B, MedGemma-27B with group prompting achieved the highest Macro-F1, 0.702.
Key takeaway
For NLP Engineers and Research Scientists building clinical information extraction systems for Italian, EDEN is a key resource. Use this large-scale, expert-annotated corpus to train and benchmark models for CRF-filling, and explore few-shot and fine-tuning approaches beyond the reported zero-shot baselines. The findings suggest prioritizing biomedically adapted LLMs such as MedGemma-27B and using group-level prompting to balance extraction quality against inference efficiency, which matters for real-world deployment.
Key insights
EDEN provides the largest Italian clinical note corpus and a CRF-filling benchmark for LLM-based information extraction.
Principles
- Biomedical pre-training enhances clinical NLP performance.
- Grouping items improves LLM extraction efficiency and quality.
- Macro-F1 is crucial for imbalanced clinical extraction tasks.
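To illustrate the last principle, here is a minimal sketch of class-averaged Macro-F1 on a toy item where most gold labels are "unknown". The labels and counts are invented for illustration, and the averaging shown is the standard per-class definition; the summary does not state exactly how the paper aggregates Macro-F1 across the 132 items, so treat that part as an assumption.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for a single label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct positives: F1 is 0 by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over every label seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction equals gold."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (invented): 8 of 10 gold labels are "unknown".
y_true = ["unknown"] * 8 + ["yes", "no"]
# A degenerate model that always answers "unknown".
y_pred = ["unknown"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

On this toy data the always-"unknown" model reaches an accuracy of 0.8 but a Macro-F1 of roughly 0.296, because the two rare classes each contribute an F1 of zero to the average.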
Method
In the CRF-filling task, a model reads a clinical note and predicts a value for each of the 132 CRF items (binary, categorical, numerical, or mixed), returning "unknown" when the note provides insufficient evidence.
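The task format can be sketched as a small data structure plus a normalization step that maps a raw model answer to an allowed value or to "unknown". Everything below is hypothetical: the `CRFItem` class, the item names (`fever`, `spo2`, `triage_color`), and the allowed values are illustrative and not taken from the EDEN CRF. The "mixed" item type is left out because the summary does not define it.

```python
from dataclasses import dataclass
from typing import Tuple

UNKNOWN = "unknown"


@dataclass(frozen=True)
class CRFItem:
    """One form item (hypothetical schema, not the actual EDEN CRF)."""
    name: str
    kind: str  # "binary", "categorical", or "numerical"
    choices: Tuple[str, ...] = ()  # allowed values, used for categorical items


def normalize_answer(item: CRFItem, raw: str) -> str:
    """Map a raw model answer to a valid value for the item, else "unknown"."""
    text = raw.strip().lower()
    if item.kind == "binary":
        return text if text in ("yes", "no") else UNKNOWN
    if item.kind == "categorical":
        return text if text in item.choices else UNKNOWN
    if item.kind == "numerical":
        try:
            float(text)  # accept anything Python can parse as a number
        except ValueError:
            return UNKNOWN
        return text
    return UNKNOWN  # unsupported item kinds fall back to "unknown"


# Hypothetical items for illustration only.
fever = CRFItem("fever", "binary")
spo2 = CRFItem("spo2", "numerical")
triage = CRFItem("triage_color", "categorical", ("red", "yellow", "green"))

print(normalize_answer(fever, " Yes "))
print(normalize_answer(spo2, "n/a"))
print(normalize_answer(triage, "Yellow"))
```

Falling back to "unknown" on any unparseable answer keeps every prediction inside the label space, so malformed outputs are scored as abstentions rather than dropped.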
In practice
- Use MedGemma-27B for Italian clinical text extraction.
- Implement group-level prompting for efficiency.
- Prioritize Macro-F1 for evaluating imbalanced extraction.
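Group-level prompting can be sketched as follows: instead of one model call per item, several items share one prompt and the model returns a single JSON object. This is a sketch under stated assumptions, not the paper's implementation. The fixed-size chunking, the prompt wording, the JSON answer format, and the function names are all hypothetical, and no model is called here; the response string is a stand-in.

```python
import json

UNKNOWN = "unknown"


def make_groups(item_names, group_size):
    """Split the item list into consecutive groups (assumed fixed-size chunking)."""
    return [item_names[i:i + group_size]
            for i in range(0, len(item_names), group_size)]


def build_group_prompt(note, group):
    """Build one prompt that asks for every item in the group at once."""
    item_list = "\n".join(f"- {name}" for name in group)
    return (
        "Read the clinical note and fill in the listed items.\n"
        "Answer with a JSON object mapping each item to its value, "
        f'or "{UNKNOWN}" if the note gives insufficient evidence.\n\n'
        f"Note:\n{note}\n\nItems:\n{item_list}\n"
    )


def parse_group_response(response, group):
    """Parse the model's JSON answer; missing or malformed entries become "unknown"."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    result = {}
    for name in group:
        value = data.get(name)
        result[name] = UNKNOWN if value is None else str(value)
    return result


# 132 placeholder item names, grouped 12 at a time: 11 calls instead of 132.
items = [f"item_{i}" for i in range(132)]
groups = make_groups(items, 12)
print(len(groups))

# Stand-in for a model response covering only part of the first group.
fake_response = '{"item_0": "yes", "item_1": 37.5}'
print(parse_group_response(fake_response, groups[0])["item_2"])
```

The efficiency gain comes from the call count: the note is encoded once per group rather than once per item. The group size is a tuning knob that this sketch does not attempt to choose.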
Topics
- Clinical NLP
- Italian Language Resources
- Information Extraction
- Case Report Forms
- Large Language Models
- MedGemma-27B
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.