The Kernel as the Normal Form
Summary
Five prominent machine learning methods (Support Vector Machine, Gaussian Process posterior mean, Kernel Ridge Regression, Nadaraya–Watson smoothing, and a single attention head) are presented as variations of a single "kernel machine." The article argues that, on tabular data, these methods do not embody distinct predictive ideas; they differ only in their underlying "geometry," which the kernel defines. The core mechanism is to fix a notion of similarity between data points, form a Gram matrix of pairwise similarities, and solve a linear system. For example, Kernel Ridge Regression (KRR) is algebraically identical to the Gaussian Process posterior mean, and both predict \$338k on California housing data. A single attention head is shown to be equivalent to Nadaraya–Watson smoothing, and both return \$215k. The Support Vector Machine, which uses hinge loss, predicts \$321k and retains 3,139 of 4,000 points as support vectors. Under this reframing, selecting a "method" is largely a choice of geometry, and the kernel is the critical determinant.
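The attention and Nadaraya–Watson equivalence can be checked numerically. The sketch below is an illustration under stated assumptions, not the article's code: it treats the training inputs as keys and the targets as values, uses no learned projections, and picks the exponential dot-product kernel exp(q·x/√d). The function names and the random data are hypothetical.

```python
import numpy as np

def attention_head(q, keys, values):
    """Scaled dot-product attention for one query, with no learned projections."""
    d = keys.shape[1]
    scores = keys @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())  # subtract the max for numerical stability
    weights /= weights.sum()                 # softmax over the keys
    return weights @ values

def exp_dot_kernel(q, x):
    """Assumed similarity: exp(q . x / sqrt(d))."""
    return np.exp(q @ x / np.sqrt(len(x)))

def nadaraya_watson(q, X, y, kernel):
    """Kernel-weighted average of the targets y."""
    k = np.array([kernel(q, x) for x in X])
    return k @ y / k.sum()

rng = np.random.default_rng(0)
keys = rng.normal(size=(50, 4))   # stand-in for training inputs
values = rng.normal(size=50)      # stand-in for training targets
q = rng.normal(size=4)            # query point

att = attention_head(q, keys, values)
nw = nadaraya_watson(q, keys, values, exp_dot_kernel)
print(att, nw)
```

The softmax normalization in the attention head is the same division by the sum of kernel values that Nadaraya–Watson performs, so the two outputs agree up to floating-point error.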
Key takeaway
Machine learning engineers selecting models for tabular data should recognize that many common methods are fundamentally kernel machines. The primary decision is how to define the data's geometry via the kernel, not which algorithm to try next. Experiment with kernel types and with parameters such as the bandwidth ℓ and the ridge λ, because these choices significantly change predictions. Framing model selection this way reduces it to one decision and lets the different methods be interpreted in common terms.
Key insights
Many common machine learning methods are variations of a single kernel machine, and the kernel, which sets the geometry, is the primary choice.
Principles
- A kernel defines data point similarity and geometry.
- A positive-semidefinite kernel yields a positive-semidefinite Gram matrix K, so K + λI is invertible for any λ > 0.
- Regularization penalizes function "roughness" as measured by the kernel.
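The principles above can be illustrated with a small numerical check. This is a minimal sketch on made-up one-dimensional data, assuming a Gaussian (RBF) kernel and taking the roughness penalty to be αᵀKα. A duplicated input makes the Gram matrix K singular, while K + λI stays invertible.

```python
import numpy as np

# Toy 1-D inputs; the last point duplicates the second, so K has two identical rows.
x = np.array([0.0, 1.0, 2.0, 3.0, 1.0])
y = np.array([1.0, 2.0, 0.5, 1.5, 2.2])
ell = 1.0   # kernel bandwidth (assumed value)
lam = 0.1   # ridge strength (assumed value)

# The kernel fixes the geometry: similarity decays with squared distance.
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2.0 * ell**2))

eig_K = np.linalg.eigvalsh(K)                              # all >= 0 up to round-off
eig_reg = np.linalg.eigvalsh(K + lam * np.eye(len(x)))     # all >= lam

# The regularized system is solvable even though K alone is singular.
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)
roughness = alpha @ K @ alpha   # penalty term measured by the kernel

print("smallest eigenvalue of K:        ", eig_K.min())
print("smallest eigenvalue of K + lam I:", eig_reg.min())
print("roughness alpha^T K alpha:       ", roughness)
```

Adding λI shifts every eigenvalue of K up by λ, which is why the ridge term guarantees a unique solution.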
Method
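Minimize squared error plus a roughness penalty. By the representer theorem, the minimizer is a weighted sum of kernel functions centered on the training points, so fitting reduces to solving a linear system for the coefficients: α = (K + λI)⁻¹y.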
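The step from the penalized objective to the linear system can be written out. The derivation below is a sketch that assumes the roughness penalty is the kernel-induced norm, which equals αᵀKα once the representer form is substituted. The symbols J, k_* and x_* are introduced here for illustration.

```latex
\begin{aligned}
f(x) &= \sum_{i=1}^{n} \alpha_i \, k(x, x_i), \qquad K_{ij} = k(x_i, x_j) \\
J(\alpha) &= \lVert y - K\alpha \rVert^2 + \lambda \, \alpha^\top K \alpha \\
\nabla_\alpha J &= -2K(y - K\alpha) + 2\lambda K\alpha
                 = 2K\big[(K + \lambda I)\alpha - y\big] \\
(K + \lambda I)\alpha = y \ &\Longrightarrow\ \nabla_\alpha J = 0,
  \qquad \alpha = (K + \lambda I)^{-1} y \\
f(x_*) &= k_*^\top \alpha, \qquad (k_*)_i = k(x_*, x_i)
\end{aligned}
```

Because K is symmetric, the gradient factors through K, so any α that solves (K + λI)α = y is a stationary point of the convex objective J.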
In practice
- Use the `rbf_gram` and `krr_alpha` functions to build the Gram matrix and solve for the coefficients.
- Explore how the kernel bandwidth ℓ and the ridge λ affect predictions.
- Compare the predictions of KRR, GP, NW, attention, and SVM.
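The steps above might look like the following sketch. The names `rbf_gram` and `krr_alpha` come from the article, but their signatures and bodies here are assumptions. The helper `gp_posterior_mean` and the synthetic data are hypothetical stand-ins (this is not the California housing setup). The GP noise variance is set equal to the ridge λ, and the GP mean is computed through a Cholesky factorization so that the comparison with KRR is not a tautology.

```python
import numpy as np

def rbf_gram(A, B, ell):
    """Gaussian (RBF) similarities between rows of A and rows of B (assumed signature)."""
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * ell**2))

def krr_alpha(K, y, lam):
    """Solve (K + lam I) alpha = y for the KRR coefficients (assumed signature)."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def gp_posterior_mean(K, K_star, y, noise_var):
    """Zero-mean GP posterior mean, computed via a Cholesky factorization."""
    L = np.linalg.cholesky(K + noise_var * np.eye(len(y)))
    z = np.linalg.solve(L, y)
    a = np.linalg.solve(L.T, z)
    return K_star @ a

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=80)
X_new = rng.uniform(-3, 3, size=(5, 2))

results = {}
for ell in (0.3, 1.0, 3.0):        # kernel bandwidth
    for lam in (1e-2, 1.0):        # ridge strength / GP noise variance
        K = rbf_gram(X, X, ell)
        K_star = rbf_gram(X_new, X, ell)
        krr_pred = K_star @ krr_alpha(K, y, lam)
        gp_pred = gp_posterior_mean(K, K_star, y, noise_var=lam)
        results[(ell, lam)] = (krr_pred, gp_pred)
        print(f"ell={ell:<4} lam={lam:<5} first prediction={krr_pred[0]: .3f}")
```

For each (ℓ, λ) pair the KRR and GP predictions coincide, while changing ℓ or λ moves both together. The choice of geometry changes the prediction; the choice of "method" does not.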
Topics
- Kernel Methods
- Machine Learning Geometry
- Kernel Ridge Regression
- Gaussian Processes
- Attention Mechanisms
- Support Vector Machines
Best for: Machine Learning Engineer, AI Scientist, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Agus’s Substack.