PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy) · Depth: Expert, quick

Summary

PIIBench is a unified benchmark corpus for detecting Personally Identifiable Information (PII) in natural language text. It addresses the fragmentation and mutual incompatibility of existing domain-specific PII datasets by consolidating ten public datasets, including synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and financial-domain text, into 2,369,883 annotated sequences with 3.35 million entity mentions across 48 canonical PII entity types. A normalization pipeline maps more than 80 source-specific label variants to a standardized BIO tagging scheme and creates stratified 80/10/10 train/validation/test splits. Initial evaluations of eight published systems, including Microsoft Presidio, spaCy, BERT-base NER, and Piiranha DeBERTa, put every system below a span-level F1 of 0.14; the best, Presidio, reaches F1 = 0.1385. These scores indicate that PIIBench poses a substantially harder test than these systems can currently handle.
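To make the span-level F1 figures concrete, here is a minimal sketch of how such a score can be computed from BIO tag sequences. It assumes exact-match scoring (a predicted span counts only if its boundaries and entity type both match a gold span) and micro-averaging over all sequences; the summary does not say which matching rule PIIBench's evaluation code uses, so treat both as assumptions. The function names bio_to_spans and span_f1 are illustrative, not part of the PIIBench release.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence.

    A span opens at a B- tag and extends over following I- tags of the
    same type. I- tags with no matching open span are ignored.
    """
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged exact-match span F1 (assumed scoring rule)."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)      # spans matching on boundaries and type
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [["B-EMAIL", "I-EMAIL", "O", "B-NAME"]]
pred = [["B-EMAIL", "I-EMAIL", "O", "O"]]
score = span_f1(gold, pred)  # precision 1.0, recall 0.5
```

Under this rule a system that finds the right entity but misplaces one boundary token gets no credit for that span, which is one reason span-level scores can sit far below token-level accuracy.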

Key takeaway

For AI engineers developing or evaluating PII detection systems, PIIBench provides a single benchmark for assessing model performance across domains and entity types. Existing models, even specialized ones, are likely to perform poorly on it, which points to the need for more robust, generalized PII detection approaches. Use the publicly available dataset and evaluation code to benchmark your own systems and identify where they fall short.

Key insights

PIIBench unifies fragmented PII datasets into a challenging benchmark, revealing current detection systems' limitations.

Method

The PIIBench pipeline normalizes 80+ source-specific PII labels to a BIO tagging scheme, suppresses rare entity types, and generates stratified 80/10/10 train/validation/test splits from ten consolidated datasets.
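The three pipeline steps can be sketched as follows. This is an illustration under stated assumptions, not PIIBench's actual code: the label map entries, the min_count threshold, the choice to map unmapped or rare labels to O, and stratifying by a per-record key such as source dataset are all hypothetical, since the summary does not specify them.

```python
import random
from collections import Counter, defaultdict

# Hypothetical label map: these source labels and canonical names are
# illustrative only; the real pipeline maps 80+ variants to 48 types.
LABEL_MAP = {
    "PER": "PERSON",
    "PERSON_NAME": "PERSON",
    "EMAIL_ADDRESS": "EMAIL",
    "EMAIL": "EMAIL",
}


def normalize_tag(tag, label_map):
    """Map one source BIO tag to its canonical BIO tag."""
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")
    canonical = label_map.get(label)
    # Assumption: labels with no canonical mapping fall back to O.
    return f"{prefix}-{canonical}" if canonical else "O"


def suppress_rare(sequences, min_count):
    """Replace entity types with fewer than min_count mentions by O."""
    counts = Counter(t[2:] for seq in sequences for t in seq if t.startswith("B-"))
    keep = {label for label, c in counts.items() if c >= min_count}
    return [[t if t == "O" or t[2:] in keep else "O" for t in seq] for seq in sequences]


def stratified_split(records, key, seed=0):
    """Split 80/10/10 within each stratum so every split keeps the same mix."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    train, val, test = [], [], []
    for items in groups.values():
        rng.shuffle(items)
        n_train = len(items) * 8 // 10
        n_val = len(items) // 10
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # remainder goes to test
    return train, val, test


records = [{"source": "a", "id": i} for i in range(10)]
records += [{"source": "b", "id": i} for i in range(20)]
train, val, test = stratified_split(records, key=lambda r: r["source"])
```

Splitting inside each stratum, rather than over the pooled corpus, keeps a small source from landing entirely in one split.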

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.