EDEN: A Large-Scale Corpus of Clinical Notes for Italian

Source: cs.AI updates on arXiv.org · Field: Science & Research — Health & Medical Research, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

EDEN (Emergency Department Electronic Notes) is a new, large-scale corpus of approximately 4 million anonymized Italian clinical notes from Emergency Departments, collected between 2021 and 2023. Clinical experts manually annotated a subset of 5,746 notes using a structured Case Report Form (CRF) with 132 items, covering cases of dyspnea and loss of consciousness. The dataset addresses the scarcity of real clinical data for languages other than English, and for Italian in particular. It supports the development and use of Large Language Models in medical applications and introduces CRF-filling as a novel structured information extraction benchmark. In initial zero-shot baselines with Gemma-27B and MedGemma-27B, MedGemma-27B achieved the highest Macro-F1, 0.702, with group prompting.
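As a reminder of what a Macro-F1 score measures, the sketch below computes it for a single CRF item: F1 is calculated per class and then averaged with equal weight, so rare values count as much as frequent ones. This is one common reading of the metric, not necessarily the exact averaging used for the reported 0.702; the function name and the toy labels are illustrative assumptions.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or pred.

    Assumption: this is the textbook definition for one item; how scores are
    aggregated across the 132 CRF items is not specified here.
    """
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy example (invented labels, not EDEN data).
gold = ["yes", "no", "unknown", "yes"]
pred = ["yes", "unknown", "unknown", "no"]
score = macro_f1(gold, pred)
```

Here the classes "yes" and "unknown" each reach an F1 of 2/3 and "no" reaches 0, so the macro average is 4/9.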

Key takeaway

For NLP engineers and research scientists building clinical information extraction systems for Italian, EDEN is a key resource. Use this large-scale, expert-annotated corpus to train and benchmark models for CRF-filling, and explore few-shot and fine-tuning approaches beyond the zero-shot baselines. The findings suggest prioritizing biomedically adapted LLMs such as MedGemma-27B and using group-level prompting to balance extraction quality against inference efficiency, which matters for real-world deployment.
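To make the efficiency argument for group-level prompting concrete, here is a minimal sketch that bundles related CRF items into one prompt per group, so a note needs one model call per group rather than one per item. The item names, group names, and prompt wording are hypothetical; EDEN's actual item grouping and prompts are not described in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CRFItem:
    name: str      # hypothetical item identifier
    group: str     # hypothetical thematic group
    question: str  # text shown to the model


def group_items(items):
    """Bucket items by group, preserving input order within each group."""
    groups = {}
    for item in items:
        groups.setdefault(item.group, []).append(item)
    return groups


def build_group_prompt(note, group, items):
    """Build one prompt that asks for every item of a group at once."""
    lines = [
        "Read the clinical note and fill in the form items below.",
        'Answer "unknown" when the note gives insufficient evidence.',
        "",
        f"Note: {note}",
        "",
        f"Items ({group}):",
    ]
    lines += [f"- {item.name}: {item.question}" for item in items]
    lines += ["", "Return one JSON object mapping each item name to its value."]
    return "\n".join(lines)


# Invented items and note, for illustration only.
items = [
    CRFItem("dyspnea_present", "respiratory", "Is dyspnea reported? (yes/no)"),
    CRFItem("spo2", "respiratory", "Oxygen saturation in percent?"),
    CRFItem("heart_rate", "vitals", "Heart rate in beats per minute?"),
]
note = "Paziente con dispnea da due giorni, SpO2 91%."
prompts = [build_group_prompt(note, g, its) for g, its in group_items(items).items()]
```

With three items in two groups, this produces two prompts instead of three. The trade-off is that longer prompts may make individual answers less precise.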

Key insights

EDEN provides the largest Italian clinical note corpus and a CRF-filling benchmark for LLM-based information extraction.

Method

The CRF-filling task asks a model to predict, from a clinical note, a value for each of the 132 CRF items (binary, categorical, numerical, or mixed), or "unknown" when the note provides insufficient evidence.
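The task definition above can be sketched as a small post-processing step that turns a model's raw answer into one value per item and falls back to "unknown" whenever the answer is missing or invalid. The schema layout, item names, the yes/no encoding of binary items, and the JSON output format are all assumptions made for illustration; they are not taken from the EDEN release.

```python
import json

UNKNOWN = "unknown"


def normalize_value(raw, kind, choices=()):
    """Map a raw model answer onto the allowed values for one item type."""
    if raw is None:
        return UNKNOWN
    text = str(raw).strip().lower()
    if text in ("", UNKNOWN):
        return UNKNOWN
    if kind == "binary":
        # Assumption: binary items are encoded as "yes"/"no".
        return text if text in ("yes", "no") else UNKNOWN
    if kind == "categorical":
        return text if text in choices else UNKNOWN
    if kind == "numerical":
        try:
            # Assumption: a comma is a decimal separator, as in Italian notes.
            return float(text.replace(",", "."))
        except ValueError:
            return UNKNOWN
    return text  # "mixed" items are kept as free text in this sketch


def parse_crf_output(model_output, schema):
    """Return one value per schema item; anything unparseable is 'unknown'."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {
        name: normalize_value(data.get(name), kind, choices)
        for name, (kind, choices) in schema.items()
    }


# Hypothetical three-item schema: name -> (type, allowed categories).
schema = {
    "dyspnea_present": ("binary", ()),
    "spo2": ("numerical", ()),
    "triage_code": ("categorical", ("red", "yellow", "green")),
}
filled = parse_crf_output(
    '{"dyspnea_present": "Yes", "spo2": "91", "triage_code": "blue"}', schema
)
```

Defaulting to "unknown" on any parsing failure keeps every note's output the same shape, which simplifies scoring against the expert annotations.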

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.