PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection
Summary
PIIBench is a unified benchmark corpus for detecting Personally Identifiable Information (PII) in natural language text. It addresses the fragmentation of existing PII detection resources by consolidating ten publicly available datasets, including synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and annotated financial-domain text. The corpus comprises 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. A normalization pipeline maps over 80 source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression, and creates stratified 80/10/10 train/validation/test splits. Initial evaluations of eight systems, including Microsoft Presidio, spaCy, BERT-base NER, and Piiranha DeBERTa, show that every system scores below 0.14 span-level F1. Presidio performs best (F1 = 0.1385) but still has zero recall on most entity types. These results indicate that PIIBench is a substantially harder evaluation target than single-source PII datasets.
Key takeaway
For AI Architects and Research Scientists developing PII detection systems, PIIBench exposes how poorly current models generalize across diverse data sources. Use PIIBench to evaluate your models rigorously, with particular attention to recall across the 48 canonical PII entity types. With all eight evaluated systems scoring below 0.14 span-level F1, there is substantial room for improvement, and existing approaches deserve re-examination.
Key insights
PIIBench unifies fragmented PII datasets into a challenging benchmark, revealing current models' poor detection performance.
Principles
- Standardized annotation schemes improve system comparability.
- Diverse data sources are crucial for robust PII detection.
Method
PIIBench consolidates ten datasets, normalizes 80+ label variants to BIO tagging, suppresses rare entity types, and creates stratified 80/10/10 train/validation/test splits.
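The three pipeline steps named above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `LABEL_MAP` entries, the `min_count` threshold, and the choice of stratification key are all assumptions, since the summary does not state the actual label mapping, the suppression cutoff, or what the splits are stratified by. Input tags are assumed to already be in BIO form.

```python
import random
from collections import Counter, defaultdict

# Hypothetical mapping from source-specific labels to canonical types.
# The real corpus maps 80+ variants to 48 types; these entries are made up.
LABEL_MAP = {
    "PER": "PERSON",
    "PERSON_NAME": "PERSON",
    "EMAIL_ADDRESS": "EMAIL",
    "MAIL": "EMAIL",
}

def normalize_tag(tag, label_map):
    """Map a BIO tag such as 'B-PER' to its canonical form, e.g. 'B-PERSON'."""
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")
    return f"{prefix}-{label_map.get(label, label)}"

def suppress_rare(sequences, min_count):
    """Replace tags of entity types with fewer than min_count mentions by 'O'.

    A mention is counted once per 'B-' tag.
    """
    counts = Counter(
        tag[2:] for tags in sequences for tag in tags if tag.startswith("B-")
    )
    keep = {label for label, n in counts.items() if n >= min_count}
    return [
        [t if t == "O" or t[2:] in keep else "O" for t in tags]
        for tags in sequences
    ]

def stratified_split(items, key, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split items into train/val/test so each stratum keeps roughly the same ratios."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    train, val, test = [], [], []
    for group in groups.values():
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to test
    return train, val, test
```

In this sketch the stratum could be, for example, the source dataset of each sequence, so that each of the ten sources is represented proportionally in all three splits.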
In practice
- Use PIIBench for comprehensive PII model evaluation.
- Focus development on low-recall PII entity types.
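To find the low-recall entity types in your own system, one option is to score predictions at the span level and break recall down by type. The sketch below assumes exact-match, micro-averaged scoring over BIO tag sequences. The summary reports span-level F1 but does not specify the matching or averaging rules, so treat those choices, and the function names, as assumptions.

```python
from collections import defaultdict

def extract_spans(tags):
    """Return the set of (start, end, type) spans in a BIO sequence; end is exclusive.

    'I-' tags that do not continue a span opened by 'B-' are ignored.
    """
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == label
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

def span_scores(gold_seqs, pred_seqs):
    """Compute micro-averaged exact-match span F1 and per-type recall."""
    tp, n_gold, n_pred = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(gold), extract_spans(pred)
        for span in g:
            n_gold[span[2]] += 1
        for span in p:
            n_pred[span[2]] += 1
        for span in g & p:  # a hit requires identical boundaries and type
            tp[span[2]] += 1
    total_tp = sum(tp.values())
    precision = total_tp / sum(n_pred.values()) if n_pred else 0.0
    recall = total_tp / sum(n_gold.values()) if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    per_type_recall = {label: tp[label] / n for label, n in n_gold.items()}
    return f1, per_type_recall
```

Entity types whose recall comes back as 0.0 are the ones a system never detects, which is the failure pattern the summary reports for most types even with the best-performing baseline.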
Topics
- PII Detection
- Benchmark Corpus
- Named Entity Recognition
- Data Normalization
- Evaluation Metrics
Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.