Data Balancing Strategies: A Systematic Survey of Resampling and Augmentation Methods

2024-07-12 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

This survey reviews data balancing strategies for imbalanced datasets in machine learning, where unequal class distributions skew predictions toward the majority class. It groups methods into synthetic oversampling (e.g., SMOTE, Borderline SMOTE), adaptive techniques, generative models (e.g., GANs, VAEs), ensemble-based strategies (e.g., Balanced Bagging, RUSBoost, CSBBoost), hybrid approaches, undersampling (e.g., Random Undersampling, Cluster-Based Under-Sampling, Tomek Links, Near Miss methods), and neighbor-based methods (e.g., Edited Nearest Neighbours (ENN), Repeated ENN (RENN), Neighbourhood Cleaning Rule (NCR), Condensed Nearest Neighbour (CNN), One-Sided Selection (OSS)). The paper spans techniques from early ensemble methods in 1995 to the 2024 CSBBoost algorithm. For each, it discusses how the method works and how well it suits dataset characteristics such as size, feature types, distribution, dimensionality, and noise, and it highlights practical implementations and future research directions.

Key takeaway

For AI engineers and research scientists developing models on imbalanced datasets, selecting a data balancing strategy carefully is critical. Evaluate methods such as SMOTE, generative models, or ensemble techniques against your dataset's specific characteristics, such as size, feature types, and noise levels. Matching the technique to your data's properties can improve model performance and reduce prediction bias toward the majority class.

Key insights

Effective data balancing is crucial for accurate machine learning models on imbalanced datasets.

Principles

No single balancing method suits all imbalanced datasets.
Generative models can create high-quality synthetic data.
Ensemble methods often improve predictive accuracy.

Method

Data balancing involves oversampling minority classes, undersampling majority classes, or combining both, often with clustering or generative models to create synthetic, representative samples.
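The two basic moves described above, oversampling the minority class and undersampling the majority class, can be sketched in their simplest random form. This is a minimal illustration using only the Python standard library; the function names (`group_by_label`, `random_oversample`, `random_undersample`) and the toy data are hypothetical and are not taken from the survey.

```python
import random
from collections import Counter, defaultdict


def group_by_label(X, y):
    """Collect the rows of X under their class label."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return groups


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    groups = group_by_label(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest one."""
    rng = random.Random(seed)
    groups = group_by_label(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        # Sample without replacement so no row is kept twice.
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out


# Toy data (assumed for illustration): 8 majority rows, 2 minority rows.
X = [[float(i), float(i % 3)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))
print(Counter(y_under))
```

Random oversampling only repeats existing rows and random undersampling discards information, which is one reason the synthetic, cluster-based, and generative variants mentioned above exist.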

In practice

Use SMOTE for general oversampling of continuous data.
Apply Borderline SMOTE to concentrate synthetic samples near the decision boundary.
Consider Balanced Random Forest for robust ensemble classification.
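To make the first recommendation in the list above concrete, here is a simplified sketch of the interpolation idea behind SMOTE: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. This is why the method suits continuous features. The function name `smote_sketch`, its parameters, and the toy points are assumptions for illustration; this is not a reference implementation and it omits the majority-class check that Borderline SMOTE adds when choosing base points.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Simplified illustration of the SMOTE idea, not a reference implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (brute force, Euclidean).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class (assumed for illustration): four 2-D points.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6)
for point in new_points:
    print(point)
```

For real projects, a maintained library such as imbalanced-learn is likely a better choice than hand-rolled code; it provides `SMOTE`, `BorderlineSMOTE`, and `BalancedRandomForestClassifier`.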

Topics

Imbalanced Data
Oversampling Techniques
Undersampling Techniques
SMOTE Algorithm
Generative Models

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.