Geometry by Division
Summary
This article introduces "partition geometry," a third way to define geometry in machine learning, contrasting it with coordinate and similarity geometries. Partition geometry, exemplified by decision trees and ReLU networks, defines locality through shared membership in discrete regions rather than continuous distance. Decision trees recursively divide input space into leaves, where local rules apply, making complex global problems locally simple. The article explains how random forests aggregate multiple partitions to create smoother, more stable predictions and induce a "forest proximity" similarity function. It also highlights that ReLU networks, despite their different mechanism, operate on the same principle of partitioning input space into convex polyhedral regions. The `geomlearn` Python library provides tools for implementing and analyzing partition geometry, including impurity measures, optimal split search, and partition quality diagnostics.
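As a minimal sketch of the ReLU case, assuming a hypothetical one-hidden-layer network with hand-picked weights (none of these names or values come from the article or from `geomlearn`), two inputs can be said to share a region when they switch on the same set of hidden units. Each unit contributes one half-space constraint, so a region with a fixed on/off pattern is an intersection of half-spaces, which is why the regions are convex polyhedra.

```python
def activation_pattern(x, weights, biases):
    """Return a tuple of booleans: which hidden ReLU units are active at x."""
    pattern = []
    for w, b in zip(weights, biases):
        pre_activation = sum(wi * xi for wi, xi in zip(w, x)) + b
        pattern.append(pre_activation > 0)
    return tuple(pattern)


# Hypothetical network: three hidden units on 2-D inputs (illustrative values).
WEIGHTS = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
BIASES = [0.0, 0.0, -1.0]


def same_region(a, b):
    """Two points are 'local' to each other if their activation patterns match."""
    return (activation_pattern(a, WEIGHTS, BIASES)
            == activation_pattern(b, WEIGHTS, BIASES))


near_a = (0.2, 0.3)
near_b = (0.4, 0.1)
far_c = (0.9, 0.8)
print(same_region(near_a, near_b))  # same pattern: units 1 and 2 on, unit 3 off
print(same_region(near_a, far_c))   # unit 3 switches on for far_c
```

Locality here is membership, not distance: the check never measures how far apart the points are, only whether any unit's boundary separates them.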
Key takeaway
For machine learning engineers working with data that has clear thresholds or conditional interactions, partition geometry is a practical design consideration. The model you choose, whether a decision tree or a ReLU network, implicitly defines which points it treats as local to one another. Consider partition-based methods when interpretability is key or when the problem exhibits regime heterogeneity, and validate the quality of your partitions with tools like `geomlearn`'s diagnostics before making structural claims about the data.
Key insights
Partition geometry defines data locality and structure through discrete regions and their boundaries, so that a problem that is complex globally becomes simple within each region.
Principles
- Structure becomes tractable when the right geometry makes it simple.
- Tree boundaries are geometrically meaningful, encoding where behavior changes.
- Ensembles of partitions create smoother, more stable, and soft-boundary geometries.
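The third principle can be illustrated with a small sketch. It assumes the usual definition of forest proximity, the fraction of trees in which two points land in the same leaf, and stands in for real trees with hypothetical 1-D threshold partitions. The names and threshold values are illustrative only and are not taken from `geomlearn`.

```python
import bisect


def cell_index(x, thresholds):
    """Index of the interval that x falls into, given sorted thresholds."""
    return bisect.bisect_right(thresholds, x)


def proximity(a, b, partitions):
    """Fraction of partitions in which a and b share a cell."""
    shared = sum(cell_index(a, t) == cell_index(b, t) for t in partitions)
    return shared / len(partitions)


# Hypothetical ensemble: each inner list is one partition of the real line.
PARTITIONS = [[0.5], [0.4, 0.8], [0.3], [0.6]]

print(proximity(0.10, 0.20, PARTITIONS))  # never separated
print(proximity(0.35, 0.55, PARTITIONS))  # separated by some partitions only
print(proximity(0.10, 0.90, PARTITIONS))  # always separated
```

A single partition can only answer "same cell" or "different cell". Averaging over several partitions with different boundaries yields a graded similarity, which is the soft-boundary behavior the principle describes.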
Method
Partition geometry involves recursively dividing input space into regions (cells/leaves) using thresholds or activation functions, then applying simple local rules within each region. The `geomlearn` library offers tools for impurity measurement, optimal split finding, and partition quality analysis.
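To make the impurity and split-search steps concrete, here is a minimal pure-Python sketch that assumes Gini impurity and an exhaustive search over midpoints of a single feature. The function names `gini` and `best_split` are hypothetical and do not reflect the `geomlearn` API, which the summary does not specify.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(xs, ys):
    """Return (threshold, weighted_impurity) for the best single-feature split.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (None, parent_impurity) when no split improves on the parent.
    """
    pairs = sorted(zip(xs, ys), key=lambda p: p[0])
    n = len(pairs)
    best = (None, gini(ys))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))  # the threshold falls in the gap between 3 and 10
```

Applying `best_split` recursively to the left and right subsets is what produces the nested cells of a tree. The local rule inside each leaf can then be as simple as a majority vote.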
In practice
- Use `geomlearn.ch05_partitions` for partition analysis.
- Evaluate partitions with the between/within ratio and the silhouette score.
- Diagnose partition health before trusting tree-based models.
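The summary does not say how `geomlearn` computes its between/within ratio, so the sketch below assumes one common reading: the between-cell sum of squares of a target value divided by the within-cell sum of squares. The function name and the data are hypothetical.

```python
def between_within_ratio(values, cells):
    """Between-cell sum of squares divided by within-cell sum of squares.

    values: one numeric target per sample.
    cells:  the cell (leaf) label assigned to each sample.
    """
    groups = {}
    for value, cell in zip(values, cells):
        groups.setdefault(cell, []).append(value)

    grand_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for members in groups.values():
        mean = sum(members) / len(members)
        between += len(members) * (mean - grand_mean) ** 2
        within += sum((v - mean) ** 2 for v in members)

    # A partition whose cells are perfectly homogeneous has no within spread.
    return float("inf") if within == 0 else between / within


values = [1, 2, 3, 11, 12, 13]
aligned = [0, 0, 0, 1, 1, 1]    # cells follow the two regimes in the data
scrambled = [0, 1, 0, 1, 0, 1]  # cells cut across the regimes

print(between_within_ratio(values, aligned))
print(between_within_ratio(values, scrambled))
```

Under this assumed definition, a high ratio means the cells separate the data into distinct, internally consistent groups. A ratio well below 1 suggests the boundaries do not correspond to real changes in behavior.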
Topics
- Partition Geometry
- Decision Trees
- Random Forests
- ReLU Networks
- Forest Proximity
Best for: AI Scientist, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Agus’s Substack.