MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers
Summary
MultiGraSCCo is a multilingual anonymization benchmark that extends the German-language Graz Synthetic Clinical text Corpus (GraSCCo) with annotations for both direct identifiers (protected health information, PHI) and indirect personal identifiers (IPIs). Using GPT-4.1, the researchers translated the annotated corpus into nine additional languages (English, French, Arabic, Persian, Italian, Polish, Russian, Ukrainian, and Turkish) while adapting names and locations to each target culture and context. The resulting benchmark contains over 2,500 personal information annotations across 10 languages and 3 writing systems, and it targets the scarcity of privacy-compliant datasets for developing and testing anonymization systems. In a human evaluation, medical professionals confirmed the high quality of the translations and of the cultural adaptation of personal information. Monolingual, cross-lingual, and multilingual experiments demonstrate the benchmark's utility for training and evaluating de-identification models, and show performance gains from even limited in-language supervision.
Key takeaway
For NLP engineers building privacy-enhancing technologies for clinical data, MultiGraSCCo is a useful resource. Consider using it to train and validate anonymization systems, especially for non-English languages: it provides culturally adapted data with preserved annotations and contains no real patient information. This can speed up development and compliance work, particularly for detecting subtle indirect personal identifiers, which are often hard for de-identification models to catch.
Key insights
Synthetic, culturally adapted multilingual datasets can offset the scarcity of patient data for privacy-preserving AI development.
Principles
- Machine translation can preserve annotations while adapting cultural context.
- Detecting indirect personal identifiers, not only direct ones, is crucial for robust anonymization.
- Even a small amount of in-language training data significantly boosts multilingual model performance.
Method
The method has four steps: annotate PHI and IPIs in the source corpus; preprocess the text to handle typos and abbreviations; use GPT-4.1 to translate it into each target language while preserving annotations and adapting cultural references; and validate the result through human evaluation and experiments.
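The summary does not say how annotations were carried through translation. One common approach, shown here purely as an assumption, is to wrap each annotated span in inline tags before translation and read the tags back afterwards. In this minimal Python sketch, `to_tagged`, `from_tagged`, `label_counts`, the `Span` tuple, and the label names `NAME` and `LOCATION` are all hypothetical, and the English sentence is hand-written for illustration rather than actual model output.

```python
import re
from collections import Counter
from typing import List, Tuple

# Hypothetical annotation format: (start offset, end offset, label).
Span = Tuple[int, int, str]

# Matches <LABEL>text</LABEL>; nested tags are not supported in this sketch.
TAG_RE = re.compile(r"<(?P<label>[A-Z_]+)>(?P<body>.*?)</(?P=label)>", re.DOTALL)


def to_tagged(text: str, spans: List[Span]) -> str:
    """Wrap each non-overlapping annotated span in inline tags."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"<{label}>{text[start:end]}</{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def from_tagged(tagged: str) -> Tuple[str, List[Span]]:
    """Strip inline tags and recover character offsets in the plain text."""
    parts: List[str] = []
    spans: List[Span] = []
    cursor, plain_len = 0, 0
    for match in TAG_RE.finditer(tagged):
        before = tagged[cursor:match.start()]
        body = match.group("body")
        parts.extend([before, body])
        start = plain_len + len(before)
        spans.append((start, start + len(body), match.group("label")))
        plain_len = start + len(body)
        cursor = match.end()
    parts.append(tagged[cursor:])
    return "".join(parts), spans


def label_counts(tagged: str) -> Counter:
    """Count tags per label, e.g. to check that a translation kept them all."""
    return Counter(m.group("label") for m in TAG_RE.finditer(tagged))


# German source sentence with two annotated spans.
source_text = "Herr Müller wurde in Graz behandelt."
source_spans = [(5, 11, "NAME"), (21, 25, "LOCATION")]
source_tagged = to_tagged(source_text, source_spans)

# Hand-written stand-in for a translated, culturally adapted sentence.
target_tagged = "Mr <NAME>Miller</NAME> was treated in <LOCATION>Leeds</LOCATION>."
target_text, target_spans = from_tagged(target_tagged)

# A basic sanity check: the translation should keep the same labels.
labels_preserved = label_counts(source_tagged) == label_counts(target_tagged)
```

Comparing label counts between source and translation is a cheap first check that no annotation was dropped or duplicated. It does not verify that the tagged text is a faithful or well-adapted rendering, which still needs human review.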
In practice
- Use GPT-4.1 for annotation-preserving translation of clinical texts.
- Include IPIs beyond HIPAA categories for stronger anonymization.
- Train de-identification models with multilingual data for low-resource languages.
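To compare de-identification models across languages, a span-level score is a common choice. The sketch below assumes exact-match scoring on (start, end, label) triples; the summary does not state which metric the authors used, and the function name `span_prf` and the labels in the example are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical annotation format: (start offset, end offset, label).
Span = Tuple[int, int, str]


def span_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over annotated spans.

    A predicted span counts as correct only if its start, end, and label
    all match a gold span.
    """
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative data only: one direct identifier found, one span mislabeled,
# and one indirect identifier (here called PROFESSION) missed entirely.
gold_spans = [(0, 5, "NAME"), (10, 14, "LOCATION"), (20, 30, "PROFESSION")]
pred_spans = [(0, 5, "NAME"), (10, 14, "DATE")]
scores = span_prf(gold_spans, pred_spans)
```

Computing these scores separately per language and per identifier type makes it easier to see where in-language training data helps and whether indirect identifiers lag behind direct ones.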
Topics
- Multilingual Anonymization
- Personal Identifiers
- Clinical Data Privacy
- Machine Translation
- De-identification Benchmarks
Best for: NLP Engineer, AI Scientist, AI Engineer, AI Researcher, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.