SoK: Reconstruction Attacks on Synthetic Tabular Data (Insights from Winning the NIST CRC)

2026-06-06 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

A new systematization of knowledge (SoK) consolidates the previously scattered literature on reconstruction attacks against de-identified and synthetic tabular data. The work introduces what it describes as the first comprehensive taxonomy, organizing attacks by the structure they exploit, and presents the most systematic empirical evaluation to date: fourteen attacks run against nine synthetic data generation (SDG) methods across five benchmark datasets, with CoBP-RA identified as the strongest measured attack. A new methodology interprets attack success through a memorization test, which distinguishes reconstruction of the population distribution from memorization of training records, and through a reduction that places reconstruction and membership inference on a single comparable scale. Key findings: the choice of SDG method determines risk more than the choice of attack; differential privacy offers protection primarily at small budgets (ε≲1), with little variation in risk at larger budgets; and de-identification methods are the most vulnerable. Most reconstruction reflects distributional structure, with individual risk concentrated on atypical records. The research was externally validated by a first-place finish in the 2025 National Institute of Standards and Technology (NIST) Collaborative Research Cycle (CRC).

Key takeaway

If you are an AI Security Engineer evaluating synthetic data solutions, the chosen synthetic data generation (SDG) method, not the sophistication of the attack, is the primary determinant of reconstruction risk. Prioritize SDG methods with demonstrated resilience and assess differential privacy budgets carefully, because protection is concentrated at small budgets (ε≲1) and risk plateaus beyond that. Focus defenses on atypical records, which concentrate individual risk, and rigorously test de-identified data for vulnerabilities before release.

Key insights

Synthetic data's privacy claims are challenged by systematized reconstruction attacks, revealing critical vulnerabilities in current generation methods.

Principles

SDG method choice dictates reconstruction risk.
DP protects mainly at small budgets (ε≲1); risk plateaus beyond that.
De-identification methods are highly exposed.

Method

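Attack success is interpreted in two steps: a memorization test, which separates reconstruction of the population distribution from memorization of training records, and a reduction that places reconstruction and membership inference on a single comparable scale.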
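
The summary does not spell out how the memorization test is computed. As a minimal sketch of the general idea, and assuming (this is an assumption, not the paper's procedure) that the test compares attack success on training records with success on held-out records drawn from the same population, one could measure a train-minus-holdout gap. The names reconstruction_rate and memorization_gap and the toy data are hypothetical.

```python
def reconstruction_rate(guesses, truth):
    """Fraction of records whose target attribute the attack guessed correctly."""
    if len(guesses) != len(truth):
        raise ValueError("guesses and truth must have the same length")
    if not truth:
        raise ValueError("need at least one record")
    return sum(g == t for g, t in zip(guesses, truth)) / len(truth)


def memorization_gap(train_guesses, train_truth, holdout_guesses, holdout_truth):
    """Train-minus-holdout reconstruction rate (hypothetical metric).

    A gap near zero suggests the attack mostly recovers population-level
    structure; a clearly positive gap suggests success depends on the
    records having been in the training set.
    """
    train_rate = reconstruction_rate(train_guesses, train_truth)
    holdout_rate = reconstruction_rate(holdout_guesses, holdout_truth)
    return train_rate - holdout_rate


# Toy data (illustrative only): a binary target attribute for ten training
# records and ten held-out records, with the attack's guess for each.
train_truth = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
train_guesses = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]      # 9 of 10 correct
holdout_truth = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
holdout_guesses = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]    # 7 of 10 correct

gap = memorization_gap(train_guesses, train_truth, holdout_guesses, holdout_truth)
print(f"memorization gap: {gap:.2f}")
```

In this toy case the attack is right on 9 of 10 training records and 7 of 10 held-out records, so the gap is 0.20. Under the stated assumption, the 0.70 holdout rate is what the attack achieves from distributional structure alone, and only the excess on training records would count as evidence of memorization.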

In practice

Evaluate SDG methods for inherent risk.
Prioritize small DP budgets (ε≲1).
Scrutinize de-identified data releases.

Topics

Synthetic Tabular Data
Reconstruction Attacks
Attribute Inference
Differential Privacy
Synthetic Data Generation
NIST CRC

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.