REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection
Summary
REDACT is a multilingual benchmark for detecting personally identifiable information (PII), built to address limitations of existing corpora. It contains 13,427 records with 324,078 entity annotations spanning 51 entity types and 25 languages. A strength-2 covering-array sampler controls nine generation axes, so diversity is systematic rather than incidental. Each entity also carries metadata for disclosure status, disclosure form (complete, partial, or obfuscated), and a GDPR-aligned sensitivity tier (HIGH, MEDIUM, or LOW), which enables stratified evaluation beyond aggregate F1. Five detectors (Presidio, GLiNER, OpenAI Privacy Filter, GPT-4.1, and Claude Sonnet 4.6) were evaluated on a 1,000-record sample. Rule-based systems such as Presidio performed poorly on HIGH-sensitivity data (recall 0.07) and on non-verbatim disclosure forms, whereas the LLM detectors remained robust and even excelled in the HIGH-sensitivity tier. The benchmark, schema, prompts, and evaluation harness are publicly released.
Key takeaway
For AI Security Engineers evaluating PII detection systems, an aggregate F1 score alone is insufficient and risky. Adopt stratified evaluation, reporting recall per sensitivity tier and per disclosure form as REDACT does. This exposes architecture-dependent failure modes that a single score hides, such as rule-based systems' poor recall on HIGH-sensitivity data (0.07) and on non-verbatim PII. In these high-stakes scenarios, prioritize LLM-based detectors for their robustness, which reduces compliance and data-exposure risk.
Key insights
Aggregate F1 scores mask critical PII detector failures, especially for high-stakes, non-verbatim data.
Principles
- PII evaluation needs stratified metrics.
- Rule-based detectors struggle with high-sensitivity PII.
- LLMs show robustness on diverse PII forms.
Method
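REDACT builds its benchmark with a seven-stage pipeline that includes a strength-2 covering-array sampler across nine axes, single-call LLM generation, deterministic offset alignment, and conditional verification. Strength 2 means that every pair of values drawn from any two axes appears in at least one sampled configuration, which yields broad coverage without enumerating every combination.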
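To make the strength-2 covering-array idea concrete, here is a minimal sketch of a greedy pairwise sampler. It is not REDACT's sampler: the greedy strategy, the function names `pairwise_cover` and `row_pairs`, and the three example axes with their values are all hypothetical, since the summary does not name the nine axes or the construction algorithm.

```python
from itertools import combinations


def row_pairs(row, names):
    """Return every (axis_a, value_a, axis_b, value_b) pair present in one row."""
    return {(a, row[a], b, row[b]) for a, b in combinations(names, 2)}


def pairwise_cover(axes):
    """Greedy strength-2 covering array (illustrative, not REDACT's algorithm).

    Guarantees that every value pair from any two axes appears in some row.
    """
    names = list(axes)
    # Every cross-axis value pair that still needs to appear in a row.
    uncovered = {
        (a, va, b, vb)
        for a, b in combinations(names, 2)
        for va in axes[a]
        for vb in axes[b]
    }
    rows = []
    while uncovered:
        # Seed the row with one uncovered pair so each row makes progress.
        a, va, b, vb = min(uncovered)
        row = {a: va, b: vb}
        for name in names:
            if name in row:
                continue

            def gain(value):
                # Count uncovered pairs this value would add to the partial row.
                trial = dict(row, **{name: value})
                fixed = [n for n in names if n in trial]
                return len(row_pairs(trial, fixed) & uncovered)

            row[name] = max(axes[name], key=gain)
        rows.append(row)
        uncovered -= row_pairs(row, names)
    return rows


# Hypothetical axes for illustration only.
example_axes = {
    "language": ["en", "de", "ja"],
    "domain": ["medical", "finance"],
    "form": ["complete", "partial", "obfuscated"],
}
for config in pairwise_cover(example_axes):
    print(config)
```

The full Cartesian product of these toy axes has 18 configurations; a pairwise design needs at least 9 (the product of the two largest axes) and typically far fewer than the full product as the number of axes grows, which is the appeal of the approach for nine axes.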
In practice
- Use REDACT for compliance-aware PII audits.
- Stratify PII recall by sensitivity tier.
- Test detectors on partial/obfuscated forms.
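The practices above can be sketched as a small stratified-recall helper. This is an illustrative sketch, not REDACT's released harness: the `Entity` record, the `stratified_recall` function, the exact-span matching rule, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical gold annotation carrying REDACT-style metadata."""
    start: int
    end: int
    label: str
    tier: str  # "HIGH", "MEDIUM", or "LOW"
    form: str  # "complete", "partial", or "obfuscated"


def stratified_recall(gold, predicted_spans, key):
    """Recall per stratum, where `key` is "tier" or "form".

    Assumes exact (start, end) span matching; a real harness may use
    a different matching criterion.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for entity in gold:
        stratum = getattr(entity, key)
        totals[stratum] += 1
        if (entity.start, entity.end) in predicted_spans:
            hits[stratum] += 1
    return {stratum: hits[stratum] / totals[stratum] for stratum in totals}


# Toy data, invented for illustration (not benchmark results).
gold = [
    Entity(0, 10, "NAME", "LOW", "complete"),
    Entity(15, 25, "EMAIL", "LOW", "partial"),
    Entity(30, 45, "DIAGNOSIS", "HIGH", "complete"),
    Entity(50, 62, "DIAGNOSIS", "HIGH", "obfuscated"),
]
predicted = {(0, 10), (15, 25), (30, 45)}

overall = sum((e.start, e.end) in predicted for e in gold) / len(gold)
print("overall recall:", overall)
print("by tier:", stratified_recall(gold, predicted, "tier"))
print("by form:", stratified_recall(gold, predicted, "form"))
```

In this toy case the overall recall looks acceptable while the HIGH tier and the obfuscated form are noticeably worse, which is the kind of gap an aggregate score hides.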
Topics
- PII Detection
- Multilingual Benchmarking
- GDPR Compliance
- Large Language Models
- Stratified Evaluation
- Data Anonymization
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.