SoK: Reconstruction Attacks on Synthetic Tabular Data (Insights from Winning the NIST CRC)
Summary
A new systematization of reconstruction attacks on de-identified and synthetic tabular data brings together previously scattered work on this threat. It introduces what the authors describe as the first comprehensive taxonomy, organizing attacks by the structure they exploit, and presents the most systematic empirical evaluation to date. The authors ran fourteen attacks against nine synthetic data generation (SDG) methods across five benchmark datasets and identified CoBP-RA as the strongest attack measured. A new methodology interprets attack success through a memorization test that distinguishes reconstruction of the population distribution from memorization of training records, and it reduces reconstruction and membership inference to a single comparable scale. Three findings stand out: the choice of SDG method determines risk more than the choice of attack; differential privacy offers protection primarily at small budgets (ε≲1), after which the benefit plateaus; and de-identification methods are the most vulnerable. Most reconstruction reflects distributional structure, with individual risk concentrated on atypical records. The work was externally validated by a first-place finish in the 2025 National Institute of Standards and Technology (NIST) Collaborative Research Cycle (CRC).
Key takeaway
If you are an AI security engineer evaluating synthetic data solutions, treat the choice of synthetic data generation (SDG) method as the primary determinant of reconstruction risk; attack sophistication matters less. Prioritize SDG methods with demonstrated resilience and assess differential privacy budgets carefully, since most of the measured protection comes at small budgets (ε≲1) and plateaus beyond that. Focus defenses on atypical records, where individual risk concentrates, and test de-identified data rigorously before release.
Key insights
Synthetic data's privacy claims are challenged by systematized reconstruction attacks, revealing critical vulnerabilities in current generation methods.
Principles
- SDG method choice dictates reconstruction risk.
- DP protection is concentrated at small budgets (ε≲1) and plateaus beyond that.
- De-identification methods are highly exposed.
Method
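Attack success is interpreted through a memorization test, which separates reconstruction of the population distribution from memorization of training records, together with a reduction that places reconstruction and membership inference on a single comparable scale.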
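The idea behind a memorization test can be illustrated with a minimal sketch. The summary does not give the paper's exact procedure, so everything here is an assumption: the sketch supposes a holdout set drawn from the same population that the generator never saw, and the helper names `match_rate` and `memorization_gap` are hypothetical. If an attack reconstructs holdout records about as well as training records, its success is explained by distributional structure; a large positive gap points to memorization of training records.

```python
from statistics import mean


def match_rate(guesses, truth):
    """Fraction of records whose reconstructed target value equals the true value."""
    if len(guesses) != len(truth) or not truth:
        raise ValueError("guesses and truth must be non-empty and the same length")
    return mean(1.0 if g == t else 0.0 for g, t in zip(guesses, truth))


def memorization_gap(train_guesses, train_truth, holdout_guesses, holdout_truth):
    """Compare attack accuracy on training records with accuracy on holdout records.

    Assumption (not from the source): the holdout records come from the same
    population as the training data but were never shown to the generator.
    Returns (train_accuracy, holdout_accuracy, gap).
    """
    train_acc = match_rate(train_guesses, train_truth)
    holdout_acc = match_rate(holdout_guesses, holdout_truth)
    return train_acc, holdout_acc, train_acc - holdout_acc


# Toy values for illustration only; these are not results from the paper.
train_acc, holdout_acc, gap = memorization_gap(
    train_guesses=[1, 0, 1, 1],
    train_truth=[1, 0, 1, 0],
    holdout_guesses=[1, 0, 0, 1],
    holdout_truth=[1, 1, 0, 0],
)
print(train_acc, holdout_acc, gap)
```

A gap near zero would suggest the attack is mostly recovering population-level structure, while a clearly positive gap would suggest record-level leakage. In practice the gap would need a significance check over many records rather than a four-row toy comparison.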
In practice
- Evaluate SDG methods for inherent risk.
- Prefer DP budgets of ε≲1, where most of the measured protection lies.
- Scrutinize de-identified data releases.
Topics
- Synthetic Tabular Data
- Reconstruction Attacks
- Attribute Inference
- Differential Privacy
- Synthetic Data Generation
- NIST CRC
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.