Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees

· Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

Lumbermark is a new robust divisive clustering algorithm designed to identify clusters of varying sizes, densities, and shapes. It operates by iteratively "chopping off" large limbs from a dataset's mutual reachability minimum spanning tree, which helps smooth data distribution and reduce noise influence. The algorithm offers an alternative to HDBSCAN, specifically allowing users to specify the desired number of clusters. A fast, open-source implementation is available in Python and R packages. Benchmarking against 61 datasets, Lumbermark with a smoothing parameter M=5 and min_cluster_factor f=0.25 achieved the best results, outperforming the previous top-ranked Genie algorithm. The study also found that small smoothing factors (M≤10) generally perform best, with the benefit of mutual reachability distance over Euclidean distance being modest.

Key takeaway

For data scientists and AI engineers needing to detect clusters with a pre-defined count, Lumbermark offers a robust and efficient solution. Its ability to handle varying cluster shapes, densities, and sizes, combined with its resistance to outliers, makes it a strong alternative to HDBSCAN when explicit control over the number of clusters is critical. You should consider integrating the open-source Python or R "lumbermark" package into your workflow, especially for low-to-medium intrinsic dimensionality datasets, and experiment with a min_cluster_factor of 0.25 and a smoothing parameter M around 5.

Key insights

Lumbermark is a robust, divisive clustering algorithm that leverages mutual reachability MSTs to detect varied clusters with user-specified counts.

Principles

Method

Lumbermark constructs an M-mutual reachability MST, removes leaves, then iteratively cuts edges in decreasing weight order, ensuring resulting components meet a minimum size (s = f * |T'| / k) until k clusters are formed.

In practice

Topics

Code references

Best for: AI Engineer, Research Scientist, AI Scientist, Data Scientist, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.