Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees
Summary
Lumbermark is a new robust divisive clustering algorithm designed to identify clusters of varying sizes, densities, and shapes. It operates by iteratively "chopping off" large limbs from a dataset's mutual reachability minimum spanning tree (MST); the mutual reachability distance smooths the data distribution and reduces the influence of noise. The algorithm is an alternative to HDBSCAN that lets users specify the desired number of clusters. A fast, open-source implementation is available as Python and R packages. In benchmarks on 61 datasets, Lumbermark with smoothing parameter M=5 and min_cluster_factor f=0.25 achieved the best results, outperforming the previously top-ranked Genie algorithm. The study also found that small smoothing factors (M≤10) generally perform best, and that the benefit of the mutual reachability distance over the Euclidean distance is modest.
Key takeaway
For data scientists and AI engineers needing to detect clusters with a pre-defined count, Lumbermark offers a robust and efficient solution. Its ability to handle varying cluster shapes, densities, and sizes, combined with its resistance to outliers, makes it a strong alternative to HDBSCAN when explicit control over the number of clusters is critical. You should consider integrating the open-source Python or R "lumbermark" package into your workflow, especially for low-to-medium intrinsic dimensionality datasets, and experiment with a min_cluster_factor of 0.25 and a smoothing parameter M around 5.
Key insights
Lumbermark is a robust, divisive clustering algorithm that leverages mutual reachability MSTs to detect varied clusters with user-specified counts.
Principles
- Mutual reachability distance smooths the data distribution.
- Removing the leaves of the spanning tree makes the clustering more robust to outliers.
- Small smoothing factors (M≤10) generally yield optimal results.
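As an illustration of the first principle, here is a minimal sketch of a mutual reachability distance matrix. It assumes the common HDBSCAN-style definition, max(d(i, j), core(i), core(j)), with the core distance taken as the distance to the M-th nearest neighbour, not counting the point itself. Conventions differ on that last detail, and the function name `mutual_reachability` is hypothetical rather than part of the lumbermark package.

```python
import numpy as np
from scipy.spatial.distance import cdist


def mutual_reachability(X, M):
    """Dense M-mutual reachability matrix (illustrative, O(n^2) memory)."""
    n = X.shape[0]
    if not 1 <= M < n:
        raise ValueError("M must satisfy 1 <= M < number of points")
    D = cdist(X, X)  # pairwise Euclidean distances
    # Assumption: core distance = distance to the M-th nearest neighbour,
    # excluding the point itself (column 0 of each sorted row is 0).
    core = np.sort(D, axis=1)[:, M]
    # Each pairwise distance is raised to at least both core distances.
    R = np.maximum(D, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(R, 0.0)
    return R


X = np.array([[0.0], [1.0], [3.0], [7.0]])
R = mutual_reachability(X, M=2)
print(R[0, 1])  # 3.0: the raw distance 1.0 is lifted to the core distance of point 0
```

Points in sparse regions have large core distances, so they are pushed further from everything else, while distances inside dense regions become more uniform.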
Method
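Lumbermark constructs the minimum spanning tree with respect to the M-mutual reachability distance and removes its leaves to obtain a pruned tree T'. It then considers edges in decreasing order of weight, cutting an edge only if the resulting components each contain at least s = f * |T'| / k points, and stops once k clusters are formed.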
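The procedure above can be sketched in plain Python as follows. This is a simplified reading of the summary, not the authors' implementation: the function name `chop_tree` is hypothetical, the spanning tree is passed in as an edge list, component sizes are recomputed by breadth-first search (quadratic time), and the rule that reattaches each removed leaf to its neighbour's cluster is an assumption.

```python
from collections import defaultdict, deque


def chop_tree(n, edges, k, f=0.25):
    """edges: list of (u, v, weight) forming a spanning tree on nodes 0..n-1."""
    degree = [0] * n
    for u, v, _ in edges:
        degree[u] += 1
        degree[v] += 1
    inner = {i for i in range(n) if degree[i] > 1}  # nodes of T' (leaves removed)

    adj = defaultdict(set)
    inner_edges = []
    for u, v, w in edges:
        if u in inner and v in inner:
            adj[u].add(v)
            adj[v].add(u)
            inner_edges.append((w, u, v))

    s = f * len(inner) / k  # minimum component size, s = f * |T'| / k

    def reach(start):
        # All nodes connected to `start` through edges not yet cut.
        seen, queue = {start}, deque([start])
        while queue:
            a = queue.popleft()
            for b in adj[a]:
                if b not in seen:
                    seen.add(b)
                    queue.append(b)
        return seen

    clusters = 1
    for w, u, v in sorted(inner_edges, reverse=True):  # heaviest edges first
        if clusters == k:
            break
        adj[u].discard(v)
        adj[v].discard(u)
        if len(reach(u)) >= s and len(reach(v)) >= s:
            clusters += 1  # keep the cut
        else:
            adj[u].add(v)  # a side would be too small: undo the cut
            adj[v].add(u)

    labels = [-1] * n
    next_label = 0
    for i in sorted(inner):
        if labels[i] == -1:
            for j in reach(i):
                labels[j] = next_label
            next_label += 1
    # Assumption: each removed leaf joins the cluster of its only neighbour.
    for u, v, _ in edges:
        if labels[u] == -1 and labels[v] != -1:
            labels[u] = labels[v]
        elif labels[v] == -1 and labels[u] != -1:
            labels[v] = labels[u]
    return labels


# A path 0-1-...-7 with a heavy leaf edge (0-1) and two heavy inner edges.
tree = [(0, 1, 20.0), (1, 2, 15.0), (2, 3, 1.0), (3, 4, 10.0),
        (4, 5, 1.0), (5, 6, 1.0), (6, 7, 1.0)]
print(chop_tree(8, tree, k=2, f=1.0))   # [0, 0, 0, 0, 1, 1, 1, 1]
print(chop_tree(8, tree, k=2, f=0.25))  # [0, 0, 1, 1, 1, 1, 1, 1]
```

In the toy tree, the heaviest edge (0-1) is never a cut candidate because node 0 is a leaf. With f=1.0 the threshold is s=3, so cutting edge 1-2 is rejected (one side would hold a single inner node) and the balanced cut at 3-4 is taken instead; with f=0.25 the threshold drops to 0.75 and the unbalanced cut at 1-2 is accepted.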
In practice
- Use Lumbermark when a specific cluster count is needed.
- Set min_cluster_factor f=0.25 for imbalanced cluster sizes.
- Consider M≤10 for the smoothing parameter.
Topics
- Lumbermark Algorithm
- Mutual Reachability MSTs
- Divisive Clustering
- HDBSCAN Alternative
- Outlier Resistance
Code references
- TutteInstitute/fast_hdbscan
- wangyiqiu/hdbscan
- arborx/ArborX
- gagolews/clustering-results-v1
- amueller/information-theoretic-mst
Best for: AI Engineer, Research Scientist, AI Scientist, Data Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.