StatQuest: Random Forests Part 2: Missing data and clustering
Summary
StatQuest: Random Forests Part 2 explains how random forests handle missing data and how they can be used to cluster samples. It covers two scenarios: missing data in the original dataset used to build the forest, and missing data in a new sample that needs to be categorized.
For the original dataset, the process starts with an initial guess (the most common value for a categorical feature, the median for a numeric one) and then refines it iteratively. Each refinement builds a random forest, runs all of the data down every tree, and records which samples end up in the same leaf node. These counts form a proximity matrix: each entry is the number of trees in which two samples share a leaf, divided by the total number of trees. The proximities then produce better guesses, using a proximity-weighted frequency for categorical values and a proximity-weighted average for numeric values. The cycle is repeated, typically six or seven times, until the imputed values stop changing. The proximity matrix can also be converted into a distance matrix (1 minus proximity), which can be drawn as a heat map or an MDS plot.
For a new sample, two copies are made, one for each possible category. The missing values in each copy are imputed with the same iterative method, both copies are run down the trees, and the copy that the forest labels correctly most often determines the final categorization.
Key takeaway
Machine Learning Engineers and Data Scientists working with incomplete datasets should consider random forest-based imputation. The technique iteratively refines missing values in both training data and new samples, using the forest's own measure of sample similarity rather than a single global guess. The proximity matrix it produces can also be turned into heat maps or MDS plots that show how samples relate to one another.
Key insights
Random forests can impute missing data and cluster samples by leveraging leaf node co-occurrence for proximity.
Principles
- Sample similarity is defined by co-occurrence in leaf nodes.
- Iterative refinement improves missing value imputation.
- Proximity matrices enable diverse data visualizations.
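The first principle can be sketched in a few lines of plain Python. This is a minimal illustration, not the video's code: the function name `proximity_matrix` and the input layout (`leaves[t][i]` is the leaf that sample `i` reaches in tree `t`) are assumptions, and the leaf ids are made-up values. Here the diagonal comes out as 1.0, since every sample shares a leaf with itself.

```python
def proximity_matrix(leaves):
    """Fraction of trees in which each pair of samples lands in the same leaf.

    leaves[t][i] is the (hypothetical) leaf id reached by sample i in tree t.
    """
    n_trees = len(leaves)
    n_samples = len(leaves[0])
    counts = [[0] * n_samples for _ in range(n_samples)]
    for tree_leaves in leaves:
        for i in range(n_samples):
            for j in range(n_samples):
                if tree_leaves[i] == tree_leaves[j]:
                    counts[i][j] += 1
    # Normalize the co-occurrence counts by the number of trees.
    return [[c / n_trees for c in row] for row in counts]


# Made-up leaf assignments: 4 trees (rows), 4 samples (columns).
leaves = [
    [1, 2, 2, 1],
    [3, 3, 4, 3],
    [5, 6, 6, 6],
    [7, 8, 8, 7],
]
prox = proximity_matrix(leaves)
print(prox[1][2])  # samples 1 and 2 share a leaf in 3 of 4 trees
```

With a real forest, the leaf assignments would come from running every sample down every fitted tree.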
Method
The method for missing data imputation involves initial guessing, building a random forest, generating a proximity matrix from leaf node co-occurrences, and iteratively refining guesses using proximity-weighted averages until convergence.
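One refinement step might look like the sketch below. It assumes you already have the proximities between the sample with the missing value and every other sample; the function names and the numbers are illustrative only. For a categorical feature, each category's score is its frequency among the other samples multiplied by the share of total proximity held by samples in that category. For a numeric feature, the new guess is the proximity-weighted average.

```python
from collections import defaultdict


def weighted_average(values, proximities):
    """Numeric guess: mean of the other samples' values, weighted by proximity."""
    total = sum(proximities)  # assumes at least one nonzero proximity
    return sum(v * p for v, p in zip(values, proximities)) / total


def weighted_category(values, proximities):
    """Categorical guess: category with the largest proximity-weighted frequency."""
    total = sum(proximities)  # assumes at least one nonzero proximity
    counts = defaultdict(int)
    weight = defaultdict(float)
    for v, p in zip(values, proximities):
        counts[v] += 1
        weight[v] += p
    n = len(values)
    # Score = (frequency of the category) * (its share of the total proximity).
    scores = {c: (counts[c] / n) * (weight[c] / total) for c in counts}
    return max(scores, key=scores.get)


# Illustrative values for three other samples and their proximity
# to the sample whose value is missing.
proximities = [0.1, 0.1, 0.8]
new_category = weighted_category(["No", "Yes", "No"], proximities)
new_number = weighted_average([125, 180, 210], proximities)
print(new_category, round(new_number, 1))
```

In the full procedure, the forest is rebuilt with the updated guesses, the proximities are recomputed, and this step is repeated until the guesses stop changing.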
In practice
- Impute missing values in training data with iterative random forest-based refinement.
- Classify new samples with missing features by comparing imputed copies.
- Visualize sample relationships with heat maps or MDS plots from proximity.
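For the visualization step, the proximity matrix is first turned into a distance matrix by taking 1 minus each proximity, so samples that always share a leaf have distance 0. A minimal sketch follows, with a hypothetical function name and made-up proximities; the result is what a heat map or MDS routine would take as input.

```python
def distance_matrix(prox):
    """Convert proximities (1 = always in the same leaf) to distances (0 = identical)."""
    return [[1.0 - p for p in row] for row in prox]


# Made-up proximities for three samples.
prox = [
    [1.0, 0.75, 0.0],
    [0.75, 1.0, 0.25],
    [0.0, 0.25, 1.0],
]
dist = distance_matrix(prox)
for row in dist:
    print(row)
```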
Topics
- Random Forests
- Missing Data Imputation
- Proximity Matrix
- Sample Clustering
- Data Visualization
- Machine Learning
Best for: AI Student, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by StatQuest with Josh Starmer.