StatQuest: Random Forests Part 2: Missing data and clustering

Source: StatQuest with Josh Starmer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, medium

Summary

StatQuest: Random Forests Part 2 explains how random forests handle missing data and how they can be used to cluster samples. It covers two scenarios: missing values in the original dataset used to build the forest, and missing values in a new sample that needs to be classified. For the original dataset, start with an initial guess (the most common value for a categorical variable, the median for a numeric one), then refine it iteratively. Each refinement builds a random forest, runs all of the samples down every tree, and records the results in a proximity matrix: each time two samples end up in the same leaf node, their proximity goes up by 1, and the final counts are divided by the total number of trees. The proximities then produce better guesses for the missing values, using a proximity-weighted frequency for categorical variables and a proximity-weighted average for numeric ones. This cycle is repeated, typically 6 or 7 times, until the imputed values stop changing. Because 1 minus the proximity behaves like a distance, the proximity matrix can also be turned into a distance matrix and drawn as a heat map or an MDS plot. For a new sample with an unknown label, two copies are made (one for each possible category), the missing values in each copy are imputed with the same iterative method, and both copies are run down the trees; the copy whose assumed label the forest predicts most often determines the final classification.
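The proximity matrix described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the video's code: the function name `proximity_matrix` and the leaf assignments are made up for the example, and each sample's proximity to itself is assumed to be 1 (it always shares a leaf with itself). In practice the leaf indices would come from a fitted forest, for example from scikit-learn's `RandomForestClassifier.apply`.

```python
from typing import List


def proximity_matrix(leaf_ids: List[List[int]]) -> List[List[float]]:
    """Build a proximity matrix from leaf assignments.

    leaf_ids[t][i] is the leaf node that sample i lands in for tree t.
    (Hypothetical input layout chosen for this sketch.)
    """
    n_trees = len(leaf_ids)
    n_samples = len(leaf_ids[0])
    counts = [[0] * n_samples for _ in range(n_samples)]

    for tree_leaves in leaf_ids:
        for i in range(n_samples):
            for j in range(n_samples):
                # Two samples are "close" in this tree if they share a leaf.
                if tree_leaves[i] == tree_leaves[j]:
                    counts[i][j] += 1

    # Divide by the number of trees so every value lies between 0 and 1.
    return [[c / n_trees for c in row] for row in counts]


# Made-up leaf assignments: 3 trees, 4 samples.
leaf_ids = [
    [0, 1, 2, 2],
    [0, 0, 1, 1],
    [3, 1, 2, 2],
]
prox = proximity_matrix(leaf_ids)
```

Here samples 2 and 3 (counting from 0) share a leaf in all three trees, so their proximity is 1.0, while samples 0 and 1 share a leaf in only one tree, giving 1/3.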

Key takeaway

Machine learning engineers and data scientists facing incomplete datasets should consider random forest-based imputation. The technique iteratively refines missing values in both the training data and new samples, so incomplete rows can still be used for model building and classification. The proximity matrix it produces can also be used to visualize relationships between samples through heat maps or MDS plots, which gives insight into the structure of the data.
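As a small sketch of the visualization step, a proximity matrix can be converted to a distance matrix by taking 1 minus each proximity. The function name `distance_matrix` and the values below are illustrative assumptions, not taken from the video.

```python
def distance_matrix(prox):
    """Convert proximities (1 = always in the same leaf) to distances."""
    return [[1.0 - p for p in row] for row in prox]


# Made-up proximity matrix for three samples.
prox = [
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.8],
    [0.1, 0.8, 1.0],
]
dist = distance_matrix(prox)
```

The resulting matrix could then be passed to any heat-map function or to an MDS routine that accepts precomputed distances.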

Key insights

Random forests can impute missing data and cluster samples by leveraging leaf node co-occurrence for proximity.

Method

Missing data imputation proceeds in four steps: make an initial guess, build a random forest, derive a proximity matrix from leaf node co-occurrences, and refine the guesses using proximity-weighted averages. The last three steps are repeated until the imputed values converge.
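The refinement step can be sketched as follows. This is one plausible reading of the proximity-weighted update, not a definitive implementation: the function names, the sample values, and the fallback to a plain mean when all proximities are zero are assumptions made for illustration. For categorical values, the sketch picks the category with the largest total proximity to the sample that has the missing value.

```python
from collections import defaultdict


def refine_numeric(values, proximities):
    """Proximity-weighted average of the known values."""
    total = sum(proximities)
    if total == 0:
        # Assumption for this sketch: fall back to a plain mean
        # when the sample is not close to any other sample.
        return sum(values) / len(values)
    return sum(v * p for v, p in zip(values, proximities)) / total


def refine_categorical(labels, proximities):
    """Category whose samples have the largest summed proximity."""
    scores = defaultdict(float)
    for label, p in zip(labels, proximities):
        scores[label] += p
    return max(scores, key=scores.get)


# Made-up data: three samples with known values, plus their
# proximities to a fourth sample whose values are missing.
known_weights = [125.0, 180.0, 210.0]
known_labels = ["no", "yes", "no"]
prox_to_missing = [0.1, 0.1, 0.8]

weight_guess = refine_numeric(known_weights, prox_to_missing)
label_guess = refine_categorical(known_labels, prox_to_missing)
```

The third sample dominates both guesses because it has by far the highest proximity to the sample being imputed. A full imputation loop would rebuild the forest and the proximity matrix with these new guesses and repeat.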

Best for: AI Student, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by StatQuest with Josh Starmer.