AugMask: Training Diffusion Models on Incomplete Tabular Data via Stochastic Augmentation and Masking

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

AugMask is a plug-and-play training framework that adapts score-based diffusion models to incomplete tabular data. Real-world datasets often contain missing values, whereas diffusion model backbones typically assume fully specified inputs. AugMask separates conditioning from supervision through two mechanisms: it constructs numeric inputs by conditional stochastic augmentation with lightweight auxiliary models, and it applies denoising supervision only to observed coordinates. Augmented missing entries therefore act as uncertain conditioning context rather than as training targets. The method connects to a Rao-Blackwellized objective, which yields a variance-weighted sensitivity penalty that discourages over-reliance on uncertain completions. With AugMask, standard diffusion-based tabular generators are reported to outperform specialized missing-aware baselines across diverse datasets and missingness regimes.

Key takeaway

For machine learning engineers building generative models on real-world tabular data with missing values, AugMask offers a training framework that lets standard diffusion models outperform specialized missing-aware baselines. It does so by supervising only on observed entries and treating augmented values as uncertain context rather than ground truth. Consider integrating this plug-and-play method when training on incomplete datasets.

Key insights

AugMask adapts diffusion models for incomplete tabular data by separating conditioning from supervision during training.

Principles

Method

Construct numeric inputs via conditional stochastic augmentation using auxiliary models, then apply denoising supervision only to observed coordinates, treating augmented missing entries as uncertain conditioning.
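The two mechanisms can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `augment_missing` and `masked_denoising_loss` are hypothetical, a per-column Gaussian fitted on observed values stands in for the lightweight auxiliary models, the denoiser is a placeholder that predicts zeros, and the noise level `alpha_bar` is an arbitrary choice.

```python
import numpy as np

def augment_missing(x, mask, rng):
    """Fill missing entries with fresh random draws; observed entries are kept.

    Assumption: a per-column Gaussian fitted on observed values stands in
    for the auxiliary conditional models described in the summary.
    """
    n_obs = np.maximum(mask.sum(axis=0), 1)
    mean = np.where(mask, x, 0.0).sum(axis=0) / n_obs
    var = np.where(mask, (x - mean) ** 2, 0.0).sum(axis=0) / n_obs
    draws = mean + np.sqrt(var) * rng.standard_normal(x.shape)
    return np.where(mask, x, draws)

def masked_denoising_loss(eps_pred, eps_true, mask):
    """Mean squared noise-prediction error over observed coordinates only."""
    sq_err = (eps_pred - eps_true) ** 2
    return (sq_err * mask).sum() / max(mask.sum(), 1)

rng = np.random.default_rng(0)
x = np.array([[1.0, np.nan], [2.0, 4.0], [np.nan, 6.0]])
mask = ~np.isnan(x)                      # True where a value was observed

x_aug = augment_missing(x, mask, rng)    # resampled on every training step
eps = rng.standard_normal(x.shape)
alpha_bar = 0.5                          # hypothetical noise-schedule value
x_t = np.sqrt(alpha_bar) * x_aug + np.sqrt(1.0 - alpha_bar) * eps

eps_pred = np.zeros_like(x_t)            # placeholder for a real denoiser(x_t, t)
loss = masked_denoising_loss(eps_pred, eps, mask)
print(float(loss))
```

Because the mask zeroes the error on augmented coordinates, those entries influence training only through the noised input the denoiser sees, never through the target it is scored against.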

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.