PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection
Summary
PIIBench is a unified benchmark corpus for detecting Personally Identifiable Information (PII) in natural language text. It addresses the fragmentation and incompatibility of existing domain-specific PII datasets. It consolidates ten public datasets, including synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and financial domain text, yielding 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. A normalization pipeline maps over 80 source-specific label variants to a standardized BIO tagging scheme and creates stratified 80/10/10 train/validation/test splits. Initial evaluations of eight published systems, including Microsoft Presidio, spaCy, BERT-base NER, and Piiranha DeBERTa, show every system scoring below 0.14 span-level F1, with the best system (Presidio) at F1=0.1385. The result suggests that PIIBench is a substantially harder evaluation than the domain-specific datasets it consolidates.
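To make the span-level F1 figures above concrete, here is a minimal sketch of how such a score can be computed from BIO tag sequences. It assumes exact matching on both span boundaries and entity type, micro-averaged over all sequences; the summary does not state the exact matching criterion used by the PIIBench evaluation code, so treat this as an illustration rather than the official scorer. The function names `bio_to_spans` and `span_f1` are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence.

    Assumption: an I- tag that does not continue a span of the same
    type is ignored (strict decoding).
    """
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact (boundary + type) span matches."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)
```

Under exact matching, a prediction that finds the right entity but misses one token of it counts as both a false positive and a false negative, which is one reason span-level scores can be much lower than token-level ones.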
Key takeaway
For AI engineers developing or evaluating PII detection systems, PIIBench offers a unified benchmark for assessing model performance across multiple domains and entity types. Existing models, even specialized ones, are likely to score poorly on it (the best of the eight evaluated systems reached a span-level F1 of only 0.1385), which highlights the need for more robust, generalized PII detection approaches. Use the publicly available dataset and evaluation code to benchmark your solutions and identify areas for improvement.
Key insights
PIIBench unifies fragmented PII datasets into a challenging benchmark, revealing current detection systems' limitations.
Principles
- Standardize diverse PII labels for unified evaluation.
- Consolidate multiple data sources for comprehensive PII coverage.
Method
The PIIBench pipeline normalizes 80+ source-specific PII labels to a BIO tagging scheme, suppresses rare entity types, and generates stratified 80/10/10 train/validation/test splits from ten consolidated datasets.
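The three steps described above can be sketched as follows. This is an illustration under stated assumptions, not the PIIBench implementation: the label map entries, the handling of unmapped labels (demoted to `O`), the rare-type threshold, and the choice to stratify by a caller-supplied key are all hypothetical, since the summary does not specify them. Source tags are assumed to be BIO-prefixed already.

```python
import random
from collections import Counter, defaultdict

# Hypothetical excerpt of a source-to-canonical label map.
LABEL_MAP = {"PER": "PERSON", "PERSON_NAME": "PERSON",
             "EMAIL_ADDRESS": "EMAIL", "EMAIL": "EMAIL"}


def normalize_tags(tags, label_map):
    """Rewrite source labels to canonical ones, keeping the B-/I- prefix.

    Assumption: labels missing from the map are demoted to "O".
    """
    out = []
    for tag in tags:
        prefix, _, label = tag.partition("-")
        canonical = label_map.get(label)
        out.append(f"{prefix}-{canonical}" if canonical else "O")
    return out


def suppress_rare(corpus, min_count):
    """Demote entity types with fewer than min_count mentions to "O"."""
    counts = Counter(t[2:] for tags in corpus for t in tags
                     if t.startswith("B-"))
    keep = {label for label, n in counts.items() if n >= min_count}
    return [[t if t == "O" or t[2:] in keep else "O" for t in tags]
            for tags in corpus]


def stratified_split(items, key, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Split each stratum (as defined by key) into train/val/test."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for _, group in sorted(groups.items()):
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to test
    return train, val, test
```

Splitting within each stratum, rather than over the pooled corpus, keeps the proportions of each group roughly equal across the three splits, so a small source is not accidentally left out of validation or test.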
In practice
- Use PIIBench for robust PII detection model evaluation.
- Access the dataset and code at the provided GitHub link.
Topics
- PIIBench
- PII Detection
- Benchmark Corpus
- Data Normalization
- Baseline Evaluation
Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.