Subject-level Inference for Realistic Text Anonymization Evaluation

2025-04-14 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

The SPIA (Subject-level PII Inference Assessment) benchmark evaluates text anonymization at the level of individual data subjects rather than masked spans. It comprises 675 documents across legal and online domains and targets two gaps in span-based evaluation: adversarial inference from unmasked context, and documents that mention multiple subjects. Experiments with 4 anonymization methods and 6 LLM backbones show that even when over 90% of PII spans are masked, subject-level inference protection can be as low as 33%, indicating significant residual privacy risk. Anonymization strategies focused on a single target subject also leave non-target subjects substantially more exposed, with protection gaps of up to 11 percentage points. Because effectiveness varies significantly by document type, the study argues for domain-aware evaluation.

Key takeaway

For engineering teams developing or deploying text anonymization solutions, relying solely on span-based metrics like token or entity recall is insufficient and creates a false sense of security. You should integrate subject-level inference evaluation using benchmarks like SPIA to accurately assess residual privacy risks, especially in multi-subject documents. Prioritize anonymization techniques that explicitly protect all individuals and adapt strategies based on document domain and length to ensure robust and equitable privacy safeguards.

Key insights

Subject-level inference evaluation is crucial for realistic text anonymization, as span masking alone is insufficient.

Principles

Span-based metrics overestimate privacy protection.
Single-subject anonymization creates protection inequality.
Anonymization effectiveness is domain-dependent.
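
To make the first principle concrete, here is a minimal sketch of exact-match span recall, the kind of span-based metric the summary contrasts with subject-level evaluation. The function name `span_recall` and the character offsets are hypothetical illustrations, not taken from the SPIA code. The score only counts masked spans; it carries no information about what an adversary can still infer from the unmasked context.

```python
def span_recall(gold_spans, masked_spans):
    """Fraction of gold PII spans that the anonymizer masked (exact match)."""
    gold = set(gold_spans)
    if not gold:
        return 1.0  # nothing to mask, so nothing was missed
    return len(gold & set(masked_spans)) / len(gold)


# Hypothetical (start, end) character offsets for ten annotated PII spans.
gold_spans = [(i * 20, i * 20 + 8) for i in range(10)]
# The anonymizer masks nine of the ten spans.
masked_spans = gold_spans[:9]

recall = span_recall(gold_spans, masked_spans)
print(f"span recall: {recall:.2f}")
```

A recall of 0.90 here looks reassuring, yet the one unmasked span, or contextual cues outside any annotated span, may still be enough to re-identify a subject.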

Method

SPIA employs a two-stage framework: identifying all data subjects within a document, then inferring 15 PII categories for each subject, using novel Individual Protection Rate (IPR) and Collective Protection Rate (CPR) metrics.
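
This summary does not give formulas for IPR and CPR, so the following is only a sketch under stated assumptions. It assumes IPR is the share of one subject's PII attributes, inferable from the original text, that are no longer correctly inferred after anonymization, and that CPR pools the same count over every subject in the document. The class `SubjectInference`, the function names, and the sample values are hypothetical; consult the paper and the maisonOP/spia repository for the actual definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubjectInference:
    subject_id: str
    before: dict  # PII category -> value inferred from the original text
    after: dict   # PII category -> value inferred from the anonymized text


def _protected_and_total(subject):
    """Count attributes that the adversary can no longer infer correctly."""
    total = len(subject.before)
    protected = sum(
        1 for category, value in subject.before.items()
        if subject.after.get(category) != value
    )
    return protected, total


def individual_protection_rate(subject):
    """ASSUMED definition: protected share of one subject's attributes."""
    protected, total = _protected_and_total(subject)
    return protected / total if total else 1.0


def collective_protection_rate(subjects):
    """ASSUMED definition: protected share pooled over all subjects."""
    protected = total = 0
    for subject in subjects:
        p, t = _protected_and_total(subject)
        protected += p
        total += t
    return protected / total if total else 1.0


# Made-up example: the anonymizer focused on the target subject only.
target = SubjectInference(
    "target",
    before={"name": "Alice Kim", "age": "34", "occupation": "nurse", "location": "Busan"},
    after={"occupation": "nurse"},
)
bystander = SubjectInference(
    "non-target",
    before={"name": "Bob Lee", "occupation": "teacher"},
    after={"name": "Bob Lee", "occupation": "teacher"},
)

for s in (target, bystander):
    print(s.subject_id, individual_protection_rate(s))
print("document", collective_protection_rate([target, bystander]))
```

Under these assumed definitions, the per-subject rate exposes the gap between the target and the non-target subject, while the pooled rate summarizes the document as a whole.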

In practice

Use SPIA for multi-subject, inference-based privacy assessment.
Prioritize LLM-based anonymization for better contextual protection.
Tailor anonymization strategies to document characteristics.
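
One way to act on the domain-dependence finding is to report protection per domain instead of a single pooled average. The sketch below assumes each evaluated document already has a protection score between 0 and 1; the function name `protection_by_domain` and the scores are made up for illustration and are not results from the paper.

```python
from collections import defaultdict
from statistics import mean


def protection_by_domain(results):
    """Group (domain, score) pairs and return the mean score per domain."""
    grouped = defaultdict(list)
    for domain, score in results:
        grouped[domain].append(score)
    return {domain: mean(scores) for domain, scores in grouped.items()}


# Hypothetical per-document protection scores, not figures from the study.
results = [
    ("legal", 0.5),
    ("legal", 0.75),
    ("online", 0.25),
    ("online", 0.75),
]

per_domain = protection_by_domain(results)
for domain, score in sorted(per_domain.items()):
    print(f"{domain}: {score:.3f}")
```

Reporting the breakdown keeps a weak domain from being hidden behind a strong one in an overall mean.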

Topics

Text Anonymization
SPIA Benchmark
Subject-level PII Inference
Privacy Protection Metrics
Large Language Models

Code references

maisonOP/spia

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, NLP Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.