You Have Been Doing Geometry All Along
Summary
This article argues that the effectiveness of a machine learning model depends less on the algorithm than on the geometric space it operates in, and that common data science tools embed geometric choices. It groups these choices into six families: comparison geometry (scaling, distance functions), projection geometry (regression, PCA), grouping geometry (k-NN, kernel smoothing), partition geometry (decision trees, random forests), latent geometry (embeddings, autoencoders), and observation geometry (k-NN graphs, diffusion maps). The author emphasizes that decisions such as scaling variables, using Mahalanobis distance, or interpreting a PCA scree plot are not mere preprocessing steps but geometric modeling choices. The article introduces the `geomlearn` Python library, with code examples that use PCA, leverage analysis, and the Variance Inflation Factor (VIF) as geometric diagnostics.
Key takeaway
For data scientists and machine learning engineers, the point is that preprocessing and modeling choices have geometric consequences. Treat scaling, distance metrics, and dimensionality reduction as geometric decisions rather than routine steps. This view helps you diagnose an underperforming model, choose methods more deliberately, and explain the rationale behind your pipeline.
Key insights
Data science workflows are fundamentally geometric, shaping how models perceive and learn structure.
Principles
- Scaling variables is a geometric modeling decision.
- Covariance describes the shape of the data's variation: its spread and orientation.
- Clustering methods embed geometric hypotheses.
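The first two principles can be illustrated with a minimal sketch in plain NumPy. It does not use `geomlearn`, whose API is not shown in this summary. The toy data (income in dollars, age in years) and the helper names `nearest_to_first` and `mahalanobis_to_first` are assumptions made up for illustration.

```python
import numpy as np

# Toy data, invented for illustration: columns are income (dollars), age (years).
X = np.array([
    [50_000.0, 25.0],  # row 0: query point
    [50_500.0, 60.0],  # row 1: close in income, far in age
    [54_000.0, 26.0],  # row 2: farther in income, close in age
    [90_000.0, 30.0],
    [20_000.0, 45.0],
    [70_000.0, 65.0],
])


def nearest_to_first(points):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(dists)) + 1


def mahalanobis_to_first(points):
    """Mahalanobis distance from row 0 to every row, using the sample covariance."""
    cov_inv = np.linalg.inv(np.cov(points, rowvar=False))
    diffs = points - points[0]
    # Quadratic form diffs[i] @ cov_inv @ diffs[i] for every row i.
    return np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))


# Raw units: income dominates the Euclidean distance.
raw_neighbor = nearest_to_first(X)

# Standardize each column to zero mean and unit variance, then compare again.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_neighbor = nearest_to_first(Z)

print("nearest neighbor of row 0, raw units:   ", raw_neighbor)
print("nearest neighbor of row 0, standardized:", scaled_neighbor)
print("Mahalanobis distances from row 0:", mahalanobis_to_first(X).round(3))
```

In this toy data the nearest neighbor of row 0 changes once the columns are standardized, so the choice of scale decides which points count as similar. The Mahalanobis distance folds the covariance into the metric, which is why it gives the same values if a column is re-expressed in different units.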
Method
The `geomlearn` library provides tools for geometric diagnostics, including PCA for dimensionality, leverage analysis for extreme observations, and VIF for multicollinearity, aiding in understanding data's underlying geometry.
In practice
- Use PCA scree plots to identify latent dimensionality.
- Apply leverage diagnostics to find influential data points.
- Calculate VIF scores to detect multicollinearity.
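The three diagnostics above can be sketched with plain NumPy under standard textbook definitions. This is not the `geomlearn` API, which this summary does not show. The synthetic data, the planted extreme row, and the helper name `vif` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors (illustrative only): x3 is nearly a copy of x1.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
X[0] = [8.0, -8.0, 8.0]  # plant one extreme observation in row 0

# PCA scree: share of total variance carried by each principal component.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)

# Leverage: diagonal of the hat matrix H = A (A^T A)^{-1} A^T, where A is the
# design matrix with an intercept column. With a reduced QR, H = Q Q^T.
A = np.column_stack([np.ones(n), X])
Q, _ = np.linalg.qr(A)
leverage = np.sum(Q**2, axis=1)


def vif(X):
    """VIF_j = 1 / (1 - R^2_j), regressing column j on the remaining columns."""
    scores = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)


print("explained variance ratio:", explained_ratio.round(4))
print("highest-leverage row:", int(np.argmax(leverage)))
print("VIF per column:", vif(X).round(1))
```

A near-zero last entry in the scree output suggests the three columns span roughly two dimensions. The planted row should stand out in the leverage values, and the near-duplicate columns should show large VIF scores while the independent column stays close to 1.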
Topics
- Geometry Discovery
- Geometric Data Analysis
- PCA
- Variance Inflation Factor
- Leverage Diagnostics
Best for: Data Scientist, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Agus’s Substack.