MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers

2026-03-11 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Advanced, extended

Summary

The MultiGraSCCo project introduces a multilingual anonymization benchmark. It extends the German-language Graz Synthetic Clinical text Corpus (GraSCCo) with annotations for both direct identifiers (protected health information, PHI) and indirect personal identifiers (IPIs). Using GPT-4.1, the researchers translated the annotated corpus into nine additional languages (English, French, Arabic, Persian, Italian, Polish, Russian, Ukrainian, and Turkish), adapting names and locations to the cultural context of each target language. The resulting benchmark contains over 2,500 personal information annotations across 10 languages and 3 writing systems, and aims to address the scarcity of privacy-compliant datasets for developing and testing anonymization systems. A human evaluation by medical professionals confirmed the high quality of the translations and of the cultural adaptation of personal information. The study also reports monolingual, cross-lingual, and multilingual experiments that demonstrate the benchmark's utility for training and evaluating de-identification models, with performance gains from even limited in-language supervision.

Key takeaway

For NLP engineers building privacy-enhancing technologies for clinical data, MultiGraSCCo is a ready-made training and evaluation resource. Consider using it to train and validate anonymization systems, especially for non-English languages: it provides culturally adapted data with annotations preserved across translations and contains no real patient information. This can speed up development and compliance work, particularly for detecting the indirect personal identifiers that de-identification models often miss.

Key insights

Synthetic, culturally adapted multilingual datasets can overcome patient data scarcity for privacy-preserving AI development.

Principles

Machine translation can preserve annotations while adapting cultural context.
Indirect personal identifiers are crucial for robust anonymization.
Even limited in-language supervision improves de-identification performance over purely cross-lingual training.

Method

The method involves annotating PHI/IPIs in a source corpus, preprocessing for typos/abbreviations, then using GPT-4.1 for annotation-preserving and culturally-adaptive translation into target languages, followed by human and experimental validation.
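
One common way to keep annotations attached to their text through machine translation is to wrap each annotated span in inline markers before translating and to recover character offsets from the markers afterwards. The sketch below illustrates that idea only; this summary does not describe the prompt or markup the authors actually used with GPT-4.1, so the tag format, the label names (NAME, LOCATION), and the function names are all assumptions, and the translation call itself is left out. It also assumes spans do not overlap.

```python
import re
from typing import List, Tuple

# (start, end, label) with end exclusive; labels here are hypothetical examples.
Span = Tuple[int, int, str]


def add_inline_tags(text: str, spans: List[Span]) -> str:
    """Wrap each non-overlapping span in <LABEL>...</LABEL> markers."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"<{label}>{text[start:end]}</{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Matches one tagged span; the closing tag must repeat the opening label.
TAG_RE = re.compile(r"<(?P<label>[A-Z_]+)>(?P<body>.*?)</(?P=label)>", re.DOTALL)


def strip_inline_tags(tagged: str) -> Tuple[str, List[Span]]:
    """Remove the markers and return the plain text plus recovered spans."""
    pieces, spans, cursor, length = [], [], 0, 0
    for match in TAG_RE.finditer(tagged):
        before = tagged[cursor:match.start()]
        pieces.append(before)
        length += len(before)
        body = match.group("body")
        spans.append((length, length + len(body), match.group("label")))
        pieces.append(body)
        length += len(body)
        cursor = match.end()
    pieces.append(tagged[cursor:])
    return "".join(pieces), spans


# Illustrative German sentence (not taken from the corpus).
source = "Herr Müller wurde in Graz behandelt."
source_spans = [(5, 11, "NAME"), (21, 25, "LOCATION")]
tagged_source = add_inline_tags(source, source_spans)

# A hypothetical translated output with culturally adapted name and place.
tagged_translation = "Mr. <NAME>Miller</NAME> was treated in <LOCATION>Leeds</LOCATION>."
translation, translation_spans = strip_inline_tags(tagged_translation)
```

Because offsets are recomputed from the translated string, the spans stay valid even when the translator substitutes a culturally adapted name or place of a different length.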

In practice

Use GPT-4.1 for annotation-preserving translation of clinical texts.
Include IPIs beyond HIPAA categories for stronger anonymization.
Train de-identification models with multilingual data for low-resource languages.
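
The last point can be pictured as a data-mixing step: train on everything available in other languages and add a small number of documents in the target language. The sketch below is an illustration under assumptions, not the paper's protocol; the function name, the `k` parameter, and the sampling strategy are hypothetical, and the summary does not say how the authors sampled in-language data.

```python
import random
from typing import Dict, List


def build_training_set(
    docs_by_lang: Dict[str, List[str]],
    target_lang: str,
    k: int,
    seed: int = 0,
) -> List[str]:
    """Return all non-target-language documents plus at most k target-language ones.

    k = 0 corresponds to a purely cross-lingual setup.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    train = [
        doc
        for lang, docs in docs_by_lang.items()
        if lang != target_lang
        for doc in docs
    ]
    target_docs = docs_by_lang.get(target_lang, [])
    train.extend(rng.sample(target_docs, min(k, len(target_docs))))
    rng.shuffle(train)
    return train


# Placeholder document IDs standing in for annotated texts.
corpus = {
    "de": ["de_1", "de_2", "de_3"],
    "en": ["en_1", "en_2"],
    "tr": ["tr_1", "tr_2", "tr_3", "tr_4"],
}
few_shot_train = build_training_set(corpus, target_lang="tr", k=2)
cross_lingual_train = build_training_set(corpus, target_lang="tr", k=0)
```

Sweeping `k` upward from zero is one way to measure how much in-language supervision a given target language needs.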

Topics

Multilingual Anonymization
Personal Identifiers
Clinical Data Privacy
Machine Translation
De-identification Benchmarks

Code references

MantisAI/nervaluate
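
nervaluate scores named-entity predictions at the entity level rather than the token level. To show what the strictest form of such scoring measures, here is a minimal pure-Python sketch that counts a prediction as correct only when both its boundaries and its label match a gold span exactly. It deliberately does not use nervaluate's API, and the function name, span format, and labels are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# (start, end, label) with end exclusive; labels are hypothetical examples.
Span = Tuple[int, int, str]


def strict_scores(gold: List[Set[Span]], pred: List[Set[Span]]) -> Dict[str, float]:
    """Entity-level precision, recall, and F1 with exact boundary and label match.

    gold and pred hold one set of spans per document, in the same order.
    """
    true_positives = sum(len(g & p) for g, p in zip(gold, pred))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)
    precision = true_positives / n_pred if n_pred else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# One document: the NAME span matches, the LOCATION span is one character short.
gold_spans = [{(5, 11, "NAME"), (21, 25, "LOCATION")}]
pred_spans = [{(5, 11, "NAME"), (21, 24, "LOCATION")}]
scores = strict_scores(gold_spans, pred_spans)
```

Under strict matching the truncated LOCATION span counts as both a false positive and a false negative, which is why boundary errors are penalized more heavily here than in token-level scoring.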

Best for: NLP Engineer, AI Scientist, AI Engineer, AI Researcher, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.