DP-CDA: An Algorithm for Enhanced Privacy Preservation in Dataset Synthesis Through Randomized Mixing

Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Expert, extended

Summary

DP-CDA is a data publishing algorithm that generates synthetic datasets with formal differential privacy guarantees and is aimed particularly at high-dimensional data. It builds each synthetic sample by randomly mixing $l$ data samples from the same class and adding Gaussian noise, whose scale is set by the noise parameters $\sigma_{x}$ and $\sigma_{y}$. The authors' privacy accounting yields a stronger privacy guarantee than existing methods, which translates into better utility at the same privacy level. Utility is measured by training predictive models on the synthetic data and reporting their accuracy. The paper also identifies an optimal order of mixing, $l^{*}$, that balances the privacy guarantee against predictive accuracy.

Key takeaway

For research scientists developing privacy-preserving machine learning models, DP-CDA offers a computationally efficient way to generate synthetic datasets with strong differential privacy guarantees. Consider implementing DP-CDA, especially for high-dimensional data, and determine the optimal order of mixing $l^{*}$ empirically to maximize model utility within a strict privacy budget. Under equivalent privacy constraints, this approach can give better predictive accuracy than existing methods.
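The empirical search for $l^{*}$ can be sketched as a simple sweep over candidate mixing orders. This is a minimal illustration, not the paper's procedure: `make_synthetic` is a hypothetical callable that returns a synthetic dataset for a given $l$ with noise already calibrated to the same privacy budget (the calibration itself is not reproduced here), the nearest-centroid classifier is a stand-in for whatever predictive model is actually used, and the held-out set is assumed to be available for evaluation.

```python
import numpy as np


def nearest_centroid_accuracy(X_train, Y_train, X_test, y_test):
    """Fit class centroids on synthetic data with soft labels; score on held-out data.

    Stand-in utility metric (assumption): any predictive model could be used here.
    """
    labels = Y_train.argmax(axis=1)          # harden the (possibly noisy) label vectors
    classes = np.unique(labels)
    centroids = np.stack([X_train[labels == c].mean(axis=0) for c in classes])
    # Distance from every test point to every centroid, shape (n_test, n_classes).
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    predictions = classes[dists.argmin(axis=1)]
    return float(np.mean(predictions == y_test))


def select_mixing_order(candidates, make_synthetic, X_test, y_test):
    """Return the candidate l with the highest held-out accuracy, plus all scores.

    make_synthetic(l) is a hypothetical callable returning (X_syn, Y_syn) generated
    with mixing order l under a fixed privacy budget.
    """
    scores = {}
    for l in candidates:
        X_syn, Y_syn = make_synthetic(l)
        scores[l] = nearest_centroid_accuracy(X_syn, Y_syn, X_test, y_test)
    best_l = max(scores, key=scores.get)
    return best_l, scores
```

Keeping the generator behind a callable keeps the privacy budget fixed across candidates, so the scores are compared under the same privacy requirement.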

Key insights

DP-CDA generates privacy-preserving synthetic datasets by mixing randomly chosen samples within each class and adding Gaussian noise; the order of mixing controls the trade-off between utility and privacy.

Method

DP-CDA preprocesses the data with z-score normalization and $\ell_{2}$-norm clipping. It then selects $l$ data points uniformly at random from a class, averages them, and adds Gaussian noise to produce a synthetic feature vector and a noisy label vector derived from the one-hot encoded labels.
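The steps above can be sketched in NumPy as follows. This is an illustrative reading of the description, not the authors' reference implementation. Several details are assumptions: the function name `dp_cda_sketch`, the clipping threshold `clip_norm`, drawing the class for each synthetic sample uniformly at random, and sampling without replacement within a class. The values of `sigma_x` and `sigma_y` are taken as inputs; calibrating them to a target privacy level requires the paper's privacy accounting, which is not reproduced here.

```python
import numpy as np


def dp_cda_sketch(X, y, num_classes, l, sigma_x, sigma_y, num_synthetic,
                  clip_norm=1.0, rng=None):
    """Illustrative DP-CDA-style generator (assumed details, see lead-in)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)  # integer class labels in [0, num_classes)

    # Step 1: z-score normalization per feature (guarding against zero variance).
    mean, std = X.mean(axis=0), X.std(axis=0)
    Xn = (X - mean) / np.where(std > 0, std, 1.0)

    # Step 2: l2-norm clipping so every row has norm at most clip_norm.
    norms = np.linalg.norm(Xn, axis=1, keepdims=True)
    Xn = Xn * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

    Y = np.eye(num_classes)[y]  # one-hot encoded labels

    # Only classes with at least l samples can be mixed (assumption).
    classes = [c for c in range(num_classes) if np.sum(y == c) >= l]
    if not classes:
        raise ValueError("no class has at least l samples")

    X_syn = np.empty((num_synthetic, X.shape[1]))
    Y_syn = np.empty((num_synthetic, num_classes))
    for i in range(num_synthetic):
        c = rng.choice(classes)  # assumption: class drawn uniformly
        idx = rng.choice(np.flatnonzero(y == c), size=l, replace=False)
        # Step 3: average the l selected samples, then add Gaussian noise.
        X_syn[i] = Xn[idx].mean(axis=0) + rng.normal(0.0, sigma_x, X.shape[1])
        Y_syn[i] = Y[idx].mean(axis=0) + rng.normal(0.0, sigma_y, num_classes)
    return X_syn, Y_syn
```

Clipping bounds each sample's norm, so a single sample can shift an average of $l$ samples by a limited amount that shrinks as $l$ grows. Note that the normalization statistics in this sketch are computed from the raw data; whether and how they are accounted for in the privacy guarantee is not covered by the description above.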

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.