PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Cybersecurity & Data Privacy) · Depth: Expert, quick

Summary

PIIBench is a unified benchmark corpus for detecting Personally Identifiable Information (PII) in natural language text. It addresses the fragmentation of existing PII detection resources by consolidating ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and annotated financial-domain text. The corpus comprises 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. A normalization pipeline maps more than 80 source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of rare entity types, and creates stratified 80/10/10 train/validation/test splits. Initial evaluations of eight systems, including Microsoft Presidio, spaCy, BERT-base NER, and Piiranha DeBERTa, show that all of them score below 0.14 span-level F1. Presidio performs best (F1 = 0.1385) but still has zero recall on most entity types. These results indicate that PIIBench is a substantially harder evaluation target than single-source PII datasets.

Key takeaway

For AI architects and research scientists building PII detection systems, PIIBench exposes how poorly current models hold up across diverse data sources. Evaluate your models on PIIBench and prioritize recall across the 48 canonical PII entity types, since even the best baseline has zero recall on most of them. With all eight evaluated systems scoring below 0.14 span-level F1, existing approaches need re-evaluation and there is substantial room for improvement.
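
As a rough illustration of the kind of evaluation described here, the sketch below computes span-level precision, recall, and F1 from BIO tag sequences, plus recall per entity type. It assumes exact-match spans and micro-averaging, which the summary does not confirm as the paper's protocol. The function names `bio_to_spans` and `span_scores` are hypothetical.

```python
from collections import Counter

def bio_to_spans(tags):
    """Return the set of (start, end, type) spans in one BIO sequence (end exclusive).

    Assumption: strict decoding, so an I- tag with no matching open span is ignored.
    """
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

def span_scores(gold_seqs, pred_seqs):
    """Micro-averaged exact-match span scores and per-type recall."""
    tp, n_gold, n_pred = Counter(), Counter(), Counter()
    for gold_tags, pred_tags in zip(gold_seqs, pred_seqs):
        gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        n_gold.update(label for _, _, label in gold)
        n_pred.update(label for _, _, label in pred)
        tp.update(label for _, _, label in gold & pred)  # exact boundary and type match
    total_tp = sum(tp.values())
    total_gold, total_pred = sum(n_gold.values()), sum(n_pred.values())
    precision = total_tp / total_pred if total_pred else 0.0
    recall = total_tp / total_gold if total_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    recall_by_type = {label: tp[label] / n for label, n in n_gold.items()}
    return {"precision": precision, "recall": recall, "f1": f1,
            "recall_by_type": recall_by_type}

# Toy example with made-up tags, not PIIBench data.
gold = [["B-PERSON", "I-PERSON", "O", "B-EMAIL"]]
pred = [["B-PERSON", "I-PERSON", "O", "O"]]
scores = span_scores(gold, pred)
```

Reporting `recall_by_type` alongside the aggregate F1 makes entity types with zero recall visible instead of letting them disappear into a single number.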

Key insights

PIIBench unifies fragmented PII datasets into a challenging benchmark, revealing current models' poor detection performance.

Method

PIIBench consolidates ten datasets, normalizes more than 80 source-specific label variants to a standardized BIO tagging scheme over 48 canonical entity types, suppresses rare entity types by frequency, and creates stratified 80/10/10 train/validation/test splits.
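
The steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the `LABEL_MAP` entries, the `min_count` threshold, and the stratification key are all assumptions, since the summary does not give the actual mapping, cutoff, or stratification variable.

```python
import random
from collections import Counter

# Hypothetical examples of source-specific labels mapped to canonical types.
LABEL_MAP = {"PER": "PERSON", "PERSON_NAME": "PERSON",
             "EMAIL_ADDRESS": "EMAIL", "E-MAIL": "EMAIL", "IBAN_CODE": "IBAN"}

def normalize_tag(tag):
    """Map a source BIO tag such as 'B-PER' to its canonical form 'B-PERSON'."""
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")  # split on the first hyphen only
    return f"{prefix}-{LABEL_MAP.get(label, label)}"

def suppress_rare(sequences, min_count):
    """Relabel entity types with fewer than min_count mentions as 'O'."""
    counts = Counter(t[2:] for tags in sequences for t in tags if t.startswith("B-"))
    keep = {label for label, n in counts.items() if n >= min_count}
    return [[t if t == "O" or t[2:] in keep else "O" for t in tags]
            for tags in sequences]

def stratified_split(items, key, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Split items 80/10/10 within each stratum defined by key(item)."""
    rng = random.Random(seed)
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    train, val, test = [], [], []
    for group in groups.values():
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to test
    return train, val, test

# Toy usage: items are (source_dataset, index) pairs, stratified by source.
items = [("a", i) for i in range(10)] + [("b", i) for i in range(20)]
train, val, test = stratified_split(items, key=lambda item: item[0])
```

Stratifying by source dataset is only one plausible choice; stratifying by dominant entity type would use the same function with a different `key`.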

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.