REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

2026-06-18 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

REDACT is a new, systematically controlled multilingual benchmark for detecting personally identifiable information (PII), built to address limitations in existing corpora. It comprises 13,427 records and 324,078 entity annotations spanning 51 entity types and 4,127 surface-form patterns, and covers 25 languages across 9 scripts. A strength-2 covering-array sampler controls nine generation axes: domain, format, difficulty, length, density, code-switching, language, adjacency, and co-occurrence. Three entity-level metadata fields (disclosure status, disclosure form, and GDPR-aligned sensitivity tier) support stratified evaluation. An evaluation of five detectors (Presidio, GLiNER, OpenAI Privacy Filter, GPT-4.1, and Claude Sonnet 4.6) on a 1,000-record sample showed that aggregate F1 scores mask architecture-dependent failures. Rule-based detectors performed poorly on HIGH-sensitivity data (recall 0.07) and on non-verbatim disclosure forms, while LLM detectors were more robust, particularly on the HIGH sensitivity tier. A reference-free LLM-as-judge assessment corroborated that sensitivity-tier assignment is the hardest task axis. The benchmark, schema, prompts, and evaluation harness are publicly released.

Key takeaway

NLP engineers developing PII detection systems should prioritize LLM-based architectures, especially when handling high-sensitivity personal information or non-verbatim disclosure forms. Evaluation should go beyond aggregate F1 and stratify results by sensitivity tier to expose detector weaknesses: rule-based detectors reached a recall of only 0.07 on HIGH-sensitivity data. Consider using the REDACT benchmark for rigorous, controlled testing across diverse languages and disclosure types.

Key insights

REDACT offers a controlled multilingual PII benchmark, and its results show that LLM detectors outperform rule-based systems on high-sensitivity data.

Principles

Aggregate F1 masks architecture-dependent PII detection failures.
LLMs are more robust for high-stakes PII detection.
Sensitivity-tier assignment is a critical, hard PII detection axis.

Method

REDACT uses a strength-2 covering-array sampler to control nine generation axes for PII data, enabling stratified evaluation with three entity-level metadata fields.
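
To make "strength-2 covering array" concrete: such a sample guarantees that every pair of values from any two axes appears together in at least one row, without enumerating the full cross-product. The sketch below is a simple greedy construction, not the sampler REDACT ships. The function names, the three axes shown, and all axis values are hypothetical and chosen only for illustration.

```python
import itertools
import random


def pairs_of(row, names):
    """All (axis_a, value_a, axis_b, value_b) pairs that one row covers."""
    return {(a, row[a], b, row[b]) for a, b in itertools.combinations(names, 2)}


def pairwise_sample(axes, seed=0, candidates=50):
    """Greedy strength-2 covering array over a dict of axis -> list of values.

    Assumes axis names and values are strings (they are sorted to pick
    the next uncovered pair deterministically).
    """
    rng = random.Random(seed)
    names = list(axes)
    # Every value pair across every two axes must end up in some row.
    uncovered = {
        (a, va, b, vb)
        for a, b in itertools.combinations(names, 2)
        for va in axes[a]
        for vb in axes[b]
    }
    rows = []
    while uncovered:
        # Force one still-uncovered pair into each candidate so every row makes progress.
        a, va, b, vb = min(uncovered)
        best, best_gain = None, -1
        for _ in range(candidates):
            row = {n: rng.choice(axes[n]) for n in names}
            row[a], row[b] = va, vb
            gain = len(pairs_of(row, names) & uncovered)
            if gain > best_gain:
                best, best_gain = row, gain
        rows.append(best)
        uncovered -= pairs_of(best, names)
    return rows


# Hypothetical axis values; REDACT's real axes have their own value sets.
axes = {
    "domain": ["medical", "legal", "finance"],
    "difficulty": ["easy", "hard"],
    "language": ["en", "de", "ja"],
}
rows = pairwise_sample(axes)
print(len(rows), "rows cover all pairs; the full cross-product has 18")
```

The point of the design is that the row count grows far more slowly than the cross-product as axes are added, which is what makes controlling nine axes at once tractable.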

In practice

Evaluate PII detectors using stratified sensitivity tiers.
Prioritize LLM-based detectors for high-sensitivity PII.
Use LLM-as-judge for complex PII task assessment.
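
A stratified evaluation of the kind recommended above can be sketched as follows. This is a minimal illustration, not REDACT's evaluation harness: it assumes exact-span matching, and the tuple layout, the entity type names, the "LOW" tier label, and the toy data are all invented for the example. Only the "HIGH" tier name comes from the source.

```python
from collections import defaultdict


def recall_by_tier(gold, predicted):
    """Recall per sensitivity tier under exact-span matching (an assumption).

    gold:      iterable of (record_id, start, end, entity_type, tier)
    predicted: set of (record_id, start, end, entity_type)
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec_id, start, end, etype, tier in gold:
        totals[tier] += 1
        if (rec_id, start, end, etype) in predicted:
            hits[tier] += 1
    return {tier: hits[tier] / totals[tier] for tier in totals}


# Toy gold annotations (hypothetical): four LOW-tier and two HIGH-tier entities.
gold = [
    ("r1", 0, 8, "PERSON", "LOW"),
    ("r1", 20, 32, "EMAIL", "LOW"),
    ("r2", 5, 15, "PHONE", "LOW"),
    ("r2", 30, 41, "PERSON", "LOW"),
    ("r3", 12, 27, "HEALTH_CONDITION", "HIGH"),
    ("r4", 3, 19, "RELIGIOUS_BELIEF", "HIGH"),
]
# A hypothetical detector that finds every LOW-tier entity and no HIGH-tier one.
predicted = {(r, s, e, t) for r, s, e, t, tier in gold if tier == "LOW"}

overall = sum(g[:4] in predicted for g in gold) / len(gold)
per_tier = recall_by_tier(gold, predicted)
print(f"overall recall: {overall:.2f}")  # 4 of 6 entities found
print(per_tier)                          # LOW: 4 of 4, HIGH: 0 of 2
```

In this toy case the overall figure looks moderate while the HIGH tier is missed entirely, which is the masking effect that per-tier reporting is meant to expose.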

Topics

PII Detection
Multilingual NLP
LLM Evaluation
Data Privacy
Benchmark Development
GDPR Compliance

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, AI Security Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.