Data Balancing Strategies: A Systematic Survey of Resampling and Augmentation Methods
Summary
This survey reviews a wide array of data balancing strategies designed to mitigate the challenges of imbalanced datasets in machine learning, where unequal class distributions lead to skewed predictions. It categorizes methods into synthetic oversampling (e.g., SMOTE, Borderline SMOTE), adaptive techniques, generative models (e.g., GANs, VAEs), ensemble-based strategies (e.g., Balanced Bagging, RUSBoost, CSBBoost), hybrid approaches, undersampling (e.g., Random Undersampling, Cluster-Based Under-Sampling, Tomek Links, Near Miss Methods), and neighbor-based methods (e.g., ENN, RENN, NCR, CNN, OSS). The paper covers techniques from early ensemble methods in 1995 to the 2024 CSBBoost algorithm, discussing their functionality, suitability for various dataset characteristics like size, feature types, distribution, dimensionality, and noise, and highlighting practical implementations and future research directions.
Key takeaway
For AI engineers and research scientists developing models on imbalanced datasets, carefully selecting a data balancing strategy is critical. You should evaluate methods like SMOTE, generative models, or ensemble techniques based on your dataset's specific characteristics, such as size, feature types, and noise levels. Matching the technique to your data's properties will significantly improve model performance and reduce prediction bias towards the majority class.
Key insights
Effective data balancing is crucial for accurate machine learning models on imbalanced datasets.
Principles
- No single balancing method suits all imbalanced datasets.
- Generative models can create high-quality synthetic data.
- Ensemble methods often improve predictive accuracy.
Method
Data balancing involves oversampling minority classes, undersampling majority classes, or combining both, often with clustering or generative models to create synthetic, representative samples.
In practice
- Use SMOTE for general oversampling of continuous data.
- Apply Borderline SMOTE for samples near decision boundaries.
- Consider Balanced Random Forest for robust ensemble classification.
Topics
- Imbalanced Data
- Oversampling Techniques
- Undersampling Techniques
- SMOTE Algorithm
- Generative Models
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Data Scientist
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.