REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection
Summary
REDACT is a systematically controlled multilingual benchmark for detecting personally identifiable information (PII), designed to address limitations in existing corpora. It comprises 13,427 records, 324,078 entity annotations, 51 entity types, and 4,127 surface-form patterns, and covers 25 languages across 9 scripts. A strength-2 covering-array sampler controls nine generation axes: domain, format, difficulty, length, density, code-switching, language, adjacency, and co-occurrence. Three entity-level metadata fields (disclosure status, disclosure form, and GDPR-aligned sensitivity tier) support stratified evaluation. An evaluation of five detectors (Presidio, GLiNER, OpenAI Privacy Filter, GPT-4.1, and Claude Sonnet 4.6) on a 1,000-record sample showed that aggregate F1 scores mask architecture-dependent failures. Rule-based detectors performed poorly on HIGH-sensitivity data (recall 0.07) and on non-verbatim disclosure forms, while LLM detectors were more robust, particularly on the HIGH sensitivity tier. A reference-free LLM-as-judge assessment corroborated that sensitivity-tier assignment is the most challenging task axis. The benchmark, schema, prompts, and evaluation harness are publicly released.
Key takeaway
NLP engineers developing PII detection systems should prioritize LLM-based architectures, especially when handling high-sensitivity personal information or non-verbatim disclosure forms. Evaluation should go beyond aggregate F1 and stratify results by sensitivity tier to expose detector weaknesses: in the reported evaluation, rule-based detectors reached a recall of only 0.07 on HIGH-sensitivity data. The REDACT benchmark supports this kind of rigorous, controlled testing across languages and disclosure types.
Key insights
REDACT is a controlled multilingual PII benchmark; its evaluation shows that LLM detectors outperform rule-based systems on high-sensitivity data.
Principles
- Aggregate F1 masks architecture-dependent PII detection failures.
- LLM detectors are more robust than rule-based detectors on HIGH-sensitivity PII.
- Sensitivity-tier assignment is the most challenging task axis, as corroborated by the LLM-as-judge assessment.
Method
REDACT uses a strength-2 covering-array sampler to control nine generation axes for PII data, enabling stratified evaluation with three entity-level metadata fields.
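A strength-2 covering array guarantees that every pair of values drawn from any two axes appears together in at least one sampled row, which needs far fewer rows than the full cross-product. The sketch below illustrates the idea with a simple greedy construction. It is not the benchmark's own sampler: the function names, the greedy strategy, and the axis values are assumptions made for illustration, and only four of the nine axes are shown.

```python
from itertools import combinations
import random


def pairs_of(row, names):
    """Return every (axis, value) pair combination present in one row."""
    return {((a, row[a]), (b, row[b])) for a, b in combinations(names, 2)}


def pairwise_covering_sample(axes, seed=0, candidates=50):
    """Greedy strength-2 covering array (illustrative, not REDACT's sampler).

    axes maps an axis name to a list of string values. The returned rows
    jointly contain every pair of values from any two distinct axes.
    """
    rng = random.Random(seed)
    names = list(axes)
    # Every value pair across every pair of axes must be covered once.
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va in axes[a]
        for vb in axes[b]
    }
    rows = []
    while uncovered:
        # Fix one still-uncovered pair so each new row makes progress.
        (a, va), (b, vb) = min(uncovered)
        best_row, best_gain = None, -1
        for _ in range(candidates):
            row = {n: rng.choice(axes[n]) for n in names}
            row[a], row[b] = va, vb
            gain = len(pairs_of(row, names) & uncovered)
            if gain > best_gain:
                best_row, best_gain = row, gain
        rows.append(best_row)
        uncovered -= pairs_of(best_row, names)
    return rows


# Hypothetical axis values; the real benchmark defines its own.
AXES = {
    "domain": ["medical", "finance", "legal"],
    "format": ["email", "chat", "form"],
    "difficulty": ["easy", "hard"],
    "language": ["en", "de", "ja"],
}

rows = pairwise_covering_sample(AXES)
print(f"{len(rows)} rows cover all pairs (full cross-product: 54)")
```

Each generated row would then serve as the specification for one record, so no pairwise interaction between axes goes untested.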
In practice
- Evaluate PII detectors using stratified sensitivity tiers.
- Prioritize LLM-based detectors for high-sensitivity PII.
- Use LLM-as-judge for complex PII task assessment.
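The stratified evaluation recommended above can be sketched as a per-tier recall computation. This is a minimal illustration, not the released evaluation harness: the tuple layout, the exact-span matching rule, the entity type names, and the "LOW" tier label are assumptions, and the toy annotations are invented purely to show how a reasonable overall score can hide a failing tier.

```python
from collections import defaultdict


def recall_by_tier(gold, predicted):
    """Compute overall recall and recall per sensitivity tier.

    gold: list of (record_id, start, end, entity_type, tier) tuples.
    predicted: set of (record_id, start, end, entity_type) tuples.
    A gold entity counts as found only on an exact span and type match
    (an assumed matching rule for this sketch).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record_id, start, end, entity_type, tier in gold:
        totals[tier] += 1
        if (record_id, start, end, entity_type) in predicted:
            hits[tier] += 1
    per_tier = {tier: hits[tier] / totals[tier] for tier in totals}
    total = sum(totals.values())
    overall = sum(hits.values()) / total if total else 0.0
    return overall, per_tier


# Toy annotations, not drawn from the benchmark.
gold = [
    ("r1", 0, 8, "NAME", "LOW"),
    ("r1", 20, 32, "EMAIL", "LOW"),
    ("r2", 5, 15, "PHONE", "LOW"),
    ("r2", 30, 41, "NAME", "LOW"),
    ("r3", 12, 40, "HEALTH_CONDITION", "HIGH"),
    ("r4", 7, 25, "RELIGION", "HIGH"),
]
predicted = {
    ("r1", 0, 8, "NAME"),
    ("r1", 20, 32, "EMAIL"),
    ("r2", 5, 15, "PHONE"),
    ("r2", 30, 41, "NAME"),
}

overall, per_tier = recall_by_tier(gold, predicted)
print(f"overall recall: {overall:.2f}")
for tier, value in sorted(per_tier.items()):
    print(f"{tier}: {value:.2f}")
```

In this toy case the detector finds two thirds of all entities yet none in the HIGH tier, which is the kind of gap an aggregate score conceals.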
Topics
- PII Detection
- Multilingual NLP
- LLM Evaluation
- Data Privacy
- Benchmark Development
- GDPR Compliance
Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, AI Security Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.