You Have Been Doing Geometry All Along

· Source: Agus’s Substack · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Mathematics & Computational Sciences · Depth: Intermediate, medium

Summary

This article argues that effective machine learning depends less on the algorithm than on the geometric space in which it operates, and that common data science tools embed geometric choices. It groups these choices into six families: comparison geometry (scaling, distance functions), projection geometry (regression, PCA), grouping geometry (k-NN, kernel smoothing), partition geometry (decision trees, random forests), latent geometry (embeddings, autoencoders), and observation geometry (k-NN graphs, diffusion maps). The author emphasizes that decisions such as scaling variables, using Mahalanobis distance, or interpreting a PCA scree plot are not mere preprocessing steps but geometric modeling choices. The article introduces the `geomlearn` Python library, with code examples that use PCA, leverage analysis, and the Variance Inflation Factor (VIF) as geometric diagnostics.
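To make the point about scaling and distance concrete, here is a minimal sketch in plain NumPy (not the `geomlearn` API, which this summary does not show). The toy data, column meanings, and helper names are all assumptions made for illustration. It shows that standardizing columns can change which observation counts as the nearest neighbor, and that Mahalanobis distance is unaffected by rescaling a column.

```python
import numpy as np

# Hypothetical toy data: column 0 is income in dollars, column 1 is age in years.
X = np.array([
    [50_000.0, 25.0],   # row 0: the query point
    [50_500.0, 60.0],   # row 1: close in income, far in age
    [53_000.0, 26.0],   # row 2: farther in income, close in age
    [90_000.0, 40.0],   # rows 3 and 4 widen the income range
    [20_000.0, 30.0],
])

def nearest_to_first(points):
    """Index of the row closest to row 0 under Euclidean distance."""
    d = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(d)) + 1

def mahalanobis(points, i, j):
    """Mahalanobis distance between rows i and j, using the sample covariance."""
    cov = np.cov(points, rowvar=False)
    delta = points[i] - points[j]
    return float(np.sqrt(delta @ np.linalg.solve(cov, delta)))

# Raw units: income dominates the distance because its numbers are larger.
raw_neighbor = nearest_to_first(X)

# Standardized units: each column has mean 0 and standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_neighbor = nearest_to_first(Z)

# Re-expressing income in thousands of dollars leaves Mahalanobis distance unchanged.
X_rescaled = X * np.array([0.001, 1.0])
d_original = mahalanobis(X, 0, 1)
d_rescaled = mahalanobis(X_rescaled, 0, 1)

print(raw_neighbor, scaled_neighbor)
print(d_original, d_rescaled)
```

In raw units the query's nearest neighbor is row 1; after standardization it is row 2. Nothing about the data changed, only the geometry used to compare it.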

Key takeaway

Data scientists and machine learning engineers building models should treat scaling, distance metrics, and dimensionality reduction as geometric decisions rather than routine preprocessing steps. This perspective helps diagnose underperforming models, guides method selection, and makes a pipeline's rationale easier to communicate, which can in turn lead to better model performance.

Key insights

Data science workflows are fundamentally geometric, shaping how models perceive and learn structure.

Method

The `geomlearn` library provides tools for geometric diagnostics: PCA to assess dimensionality, leverage analysis to flag extreme observations, and VIF to detect multicollinearity. Together they help reveal the geometry underlying a dataset.
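The three diagnostics named above can be sketched in plain NumPy. This is not the `geomlearn` API, which this summary does not reproduce; the function names, the synthetic data, and the choice to include an intercept are assumptions made for illustration.

```python
import numpy as np

# Synthetic data (an assumption): x3 is nearly a copy of x1, so the three
# columns effectively span a two-dimensional space.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def explained_variance_ratio(X):
    """Share of total variance along each principal axis (scree plot values)."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return var / var.sum()

def leverage(X):
    """Diagonal of the hat matrix for a linear model with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    Q, _ = np.linalg.qr(A)          # reduced QR, so the hat matrix is Q @ Q.T
    return np.sum(Q ** 2, axis=1)

def vif(X):
    """Variance Inflation Factor: 1 / (1 - R^2) from regressing each column on the rest."""
    n_rows, n_cols = X.shape
    out = np.empty(n_cols)
    for j in range(n_cols):
        y = X[:, j]
        A = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

evr = explained_variance_ratio(X)
h = leverage(X)
v = vif(X)

print("explained variance ratio:", evr)
print("max leverage:", h.max(), "mean leverage:", h.mean())
print("VIF per column:", v)
```

Read geometrically, a near-zero trailing explained-variance ratio means the point cloud is almost flat in one direction, high leverage marks observations far from the bulk of the design space, and a large VIF means one column lies almost inside the span of the others.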

Best for: Data Scientist, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Agus’s Substack.