REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Expert, extended

Summary

REDACT is a new multilingual benchmark for detecting personally identifiable information (PII). It addresses limitations in existing corpora by offering 13,427 records with 324,078 entity annotations across 51 entity types and 25 languages. A strength-2 covering-array sampler controls nine generation axes, so that diversity is systematic rather than incidental. REDACT also attaches entity-level metadata for disclosure status, disclosure form (complete, partial, obfuscated), and a GDPR-aligned sensitivity tier (HIGH, MEDIUM, LOW), which enables stratified evaluation beyond aggregate F1 scores. Five detectors (Presidio, GLiNER, OpenAI Privacy Filter, GPT-4.1, and Claude Sonnet 4.6) were evaluated on a 1,000-record sample. Rule-based systems such as Presidio perform poorly on HIGH-sensitivity data (recall 0.07) and on non-verbatim disclosure forms, whereas the LLM detectors remain robust and even excel in the HIGH-sensitivity tier. The benchmark, schema, prompts, and evaluation harness are publicly released.

Key takeaway

For AI Security Engineers evaluating PII detection systems, relying solely on aggregate F1 scores is insufficient and risky. Adopt stratified evaluation instead, using metrics such as REDACT's per-sensitivity-tier and per-disclosure-form recall. Stratification exposes architecture-dependent failure modes, such as a rule-based system's poor recall on HIGH-sensitivity data (0.07 for Presidio) and on non-verbatim PII. In the reported 1,000-record evaluation, LLM-based detectors were the more robust choice in these high-stakes cases, so favor them where missed HIGH-sensitivity entities carry compliance risk.

Key insights

Aggregate F1 scores mask critical PII detector failures, especially for high-stakes, non-verbatim data.

Principles

PII evaluation needs stratified metrics.
Rule-based detectors struggle with high-sensitivity PII.
LLMs show robustness on diverse PII forms.

Method

REDACT uses a seven-stage pipeline with a strength-2 covering-array sampler across nine axes, single-call LLM generation, deterministic offset alignment, and conditional verification to create a diverse PII benchmark.
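
To make the strength-2 (pairwise) covering-array idea concrete, here is a minimal Python sketch of a greedy sampler. It is not REDACT's implementation: the axis names and values are hypothetical, only four axes are shown rather than nine, and the greedy best-of-N row selection is an assumed strategy chosen for brevity. The property it illustrates is the one the method relies on: every pair of values from any two axes appears together in at least one sampled row, using far fewer rows than the full cross product.

```python
import itertools
import random

# Hypothetical generation axes; the benchmark's actual nine axes are not listed here.
AXES = {
    "language": ["en", "de", "ja"],
    "document_type": ["email", "chat", "form"],
    "register": ["formal", "casual"],
    "pii_density": ["sparse", "dense"],
}


def pairs_of(row, names):
    """All (axis, value) pairs that a single row covers, in a fixed axis order."""
    return [((a, row[a]), (b, row[b])) for a, b in itertools.combinations(names, 2)]


def pairwise_covering_sample(axes, seed=0, candidates=50):
    """Greedy strength-2 covering array over the given axes (illustrative only)."""
    rng = random.Random(seed)
    names = list(axes)
    # Every cross-axis value pair that still has to appear in some row.
    uncovered = {
        ((a, va), (b, vb))
        for a, b in itertools.combinations(names, 2)
        for va in axes[a]
        for vb in axes[b]
    }
    rows = []
    while uncovered:
        # Pin one uncovered pair so each new row is guaranteed to make progress.
        (a, va), (b, vb) = min(uncovered)
        best_row, best_gain = None, -1
        for _ in range(candidates):
            row = {n: rng.choice(axes[n]) for n in names}
            row[a], row[b] = va, vb
            gain = sum(1 for p in pairs_of(row, names) if p in uncovered)
            if gain > best_gain:
                best_row, best_gain = row, gain
        rows.append(best_row)
        uncovered -= set(pairs_of(best_row, names))
    return rows


rows = pairwise_covering_sample(AXES)
print(f"{len(rows)} rows cover all value pairs (full cross product: 36 rows)")
```

Each returned row would correspond to one generation configuration. Pairwise coverage does not guarantee that every three-way combination appears, which is the trade-off accepted in exchange for a much smaller sample.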

In practice

Use REDACT for compliance-aware PII audits.
Stratify PII recall by sensitivity tier.
Test detectors on partial/obfuscated forms.
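
The practices above amount to grouping gold entities by a metadata field before computing recall. The following Python sketch shows one way to do that. The `Entity` fields, the entity type names, and the exact-span matching rule are assumptions made for illustration, not the released REDACT schema or harness, and the records are toy data invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    # Hypothetical field names; consult the released schema for the real ones.
    record_id: str
    start: int
    end: int
    entity_type: str
    tier: str  # "HIGH", "MEDIUM", or "LOW"
    form: str  # "complete", "partial", or "obfuscated"


def stratified_recall(gold, predicted_spans, key):
    """Recall per stratum of `key`, assuming exact (record_id, start, end) matches."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ent in gold:
        stratum = getattr(ent, key)
        totals[stratum] += 1
        if (ent.record_id, ent.start, ent.end) in predicted_spans:
            hits[stratum] += 1
    return {s: hits[s] / totals[s] for s in totals}


# Toy gold annotations and detector output, invented for this example.
gold = [
    Entity("r1", 0, 10, "EMAIL", "LOW", "complete"),
    Entity("r1", 20, 31, "HEALTH_CONDITION", "HIGH", "partial"),
    Entity("r2", 5, 16, "HEALTH_CONDITION", "HIGH", "complete"),
    Entity("r2", 30, 39, "PHONE", "MEDIUM", "obfuscated"),
]
predicted = {("r1", 0, 10), ("r2", 5, 16)}

by_tier = stratified_recall(gold, predicted, "tier")
by_form = stratified_recall(gold, predicted, "form")
print(by_tier)  # {'LOW': 1.0, 'HIGH': 0.5, 'MEDIUM': 0.0}
print(by_form)  # {'complete': 1.0, 'partial': 0.0, 'obfuscated': 0.0}
```

In this toy case the overall recall is 0.5, which hides the fact that every partial and obfuscated entity was missed. That gap between the aggregate and the per-stratum numbers is what stratified reporting is meant to surface.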

Topics

PII Detection
Multilingual Benchmarking
GDPR Compliance
Large Language Models
Stratified Evaluation
Data Anonymization

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.