The Kernel as the Normal Form

2026-01-11 · Source: Agus’s Substack · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, medium

Summary

Five prominent machine learning methods (Support Vector Machine, Gaussian Process posterior mean, Kernel Ridge Regression, Nadaraya–Watson smoothing, and a single attention head) are presented as variations of a single "kernel machine." The article argues that, on tabular data, these methods differ in their underlying "geometry," defined by the kernel, rather than in any distinct predictive idea. The core mechanism is to fix a notion of similarity between data points, form a Gram matrix of pairwise similarities, and solve a linear system. For example, Kernel Ridge Regression (KRR) is algebraically identical to the Gaussian Process posterior mean, and both yield \$338k on California housing data. A single attention head is shown to be equivalent to Nadaraya–Watson smoothing, and both return \$215k. The Support Vector Machine, which uses hinge loss, produced \$321k, with 3,139 of 4,000 points serving as support vectors. This reframing suggests that selecting a "method" is largely a choice of geometry, with the kernel as the critical determinant.
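The attention and Nadaraya–Watson equivalence can be checked numerically. The following is a minimal sketch, not the article's code: it assumes a single query, identity query/key/value projections, random stand-in data, and the exponential dot-product kernel exp(q·x/√d). The function names `attention_head` and `nadaraya_watson` are illustrative.

```python
import numpy as np

def attention_head(q, keys, values):
    """Single attention head for one query: softmax(keys @ q / sqrt(d)) @ values."""
    d = q.shape[0]
    scores = keys @ q / np.sqrt(d)
    scores = scores - scores.max()      # shift for numerical stability; softmax is unchanged
    w = np.exp(scores)
    return (w / w.sum()) @ values

def nadaraya_watson(q, X, y, kernel):
    """Kernel-weighted average of y, with weights kernel(q, x_i)."""
    w = np.array([kernel(q, x) for x in X])
    return (w @ y) / w.sum()

def exp_dot_kernel(a, b):
    """Assumed kernel: exponential of the scaled dot product."""
    return np.exp(a @ b / np.sqrt(a.shape[0]))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))   # keys, i.e. training inputs
y = rng.normal(size=50)        # values, i.e. training targets
q = rng.normal(size=4)         # query, i.e. test point

attn = attention_head(q, X, y)
nw = nadaraya_watson(q, X, y, exp_dot_kernel)
print(np.allclose(attn, nw))
```

Under these assumptions the softmax weights are exactly the normalized kernel weights, so the two predictions agree up to floating-point error.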

Key takeaway

Machine learning engineers selecting models for tabular data should recognize that many common methods are fundamentally kernel machines. The primary decision is how to define the data's geometry via the kernel, not which algorithm to cycle through next. Experiment with kernel types and with parameters such as the bandwidth ℓ and the ridge λ, since these choices significantly affect predictions. This approach streamlines model selection and gives the methods a shared framework for interpretation.

Key insights

Many common machine learning methods are variations of a kernel machine, with geometry (the kernel) as the primary choice.

Principles

A kernel defines data point similarity and geometry.
A positive-semidefinite kernel guarantees that K + λI is invertible for any ridge λ > 0, even when K itself is singular.
Regularization penalizes function "roughness" as measured by the kernel.
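The role of positive semidefiniteness can be illustrated with a small numerical check. This is a sketch on random stand-in data with an assumed RBF kernel; the helper `rbf` and the values of `ell` and `lam` are illustrative choices. Duplicating a data point makes the Gram matrix singular, yet adding the ridge keeps every eigenvalue at or above λ.

```python
import numpy as np

def rbf(X, ell=1.0):
    """RBF Gram matrix exp(-||x_i - x_j||^2 / (2 ell^2)) over the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * ell**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
X[1] = X[0]                      # duplicate point: rows 0 and 1 of K coincide

K = rbf(X)
lam = 1e-2
eig_K = np.linalg.eigvalsh(K)                             # eigenvalues of K alone
eig_reg = np.linalg.eigvalsh(K + lam * np.eye(len(X)))    # eigenvalues after adding the ridge

print("smallest eigenvalue of K:        ", eig_K.min())
print("smallest eigenvalue of K + lam I:", eig_reg.min())
```

Because the eigenvalues of K + λI are those of K shifted up by λ, the smallest one stays near λ while the smallest eigenvalue of K sits at numerical zero.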

Method

Minimize squared error plus a roughness penalty, then apply the representer theorem to solve a linear system for coefficients α = (K + λI)⁻¹y.
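The step from the penalized objective to the linear system can be written out as follows. This is a standard derivation sketched under the assumption that the roughness penalty is the squared RKHS norm of f; the symbols H (the RKHS), k_* (the vector of similarities between a test point and the training points), and J are notation introduced here.

```latex
\begin{aligned}
&\min_{f \in \mathcal{H}} \; \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 + \lambda \,\lVert f \rVert_{\mathcal{H}}^2
&& \text{(squared error plus roughness penalty)} \\
&f(x) = \sum_{j=1}^{n} \alpha_j \, k(x, x_j)
&& \text{(representer theorem)} \\
&J(\alpha) = \lVert y - K\alpha \rVert^2 + \lambda \, \alpha^{\top} K \alpha
&& \text{(substitute, with } K_{ij} = k(x_i, x_j)\text{)} \\
&\nabla_{\alpha} J = -2K\,(y - K\alpha) + 2\lambda K \alpha = 0
\;\Longleftarrow\; (K + \lambda I)\,\alpha = y \\
&\alpha = (K + \lambda I)^{-1} y, \qquad f(x_*) = k_*^{\top} \alpha
\end{aligned}
```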

In practice

Use `rbf_gram` and `krr_alpha` functions.
Explore kernel bandwidth ℓ and ridge λ impact.
Compare KRR, GP, NW, Attention, SVM predictions.
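The first two steps can be sketched as below. The names `rbf_gram` and `krr_alpha` come from the summary, but their signatures and bodies here are assumptions, not the repository's implementation. The data are synthetic stand-ins rather than the California housing set, and the GP noise variance is assumed equal to the ridge λ.

```python
import numpy as np

def rbf_gram(A, B, ell):
    """Assumed signature: RBF similarities exp(-||a - b||^2 / (2 ell^2)) for rows of A and B."""
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * ell**2))

def krr_alpha(K, y, lam):
    """Assumed signature: solve (K + lam I) alpha = y without forming an explicit inverse."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

# Synthetic stand-in data (not the article's dataset)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)
X_test = rng.uniform(-3, 3, size=(5, 2))

# Sweep bandwidth and ridge to see how the geometry changes the prediction
for ell in (0.3, 1.0, 3.0):
    for lam in (1e-3, 1e-1):
        alpha = krr_alpha(rbf_gram(X, X, ell), y, lam)
        pred = rbf_gram(X_test, X, ell) @ alpha
        print(f"ell={ell:<4} lam={lam:<6} first prediction={pred[0]: .3f}")

# KRR versus GP posterior mean, computed by two different numerical routes
ell, lam = 1.0, 1e-1
K = rbf_gram(X, X, ell)
K_star = rbf_gram(X_test, X, ell)
krr_pred = K_star @ krr_alpha(K, y, lam)

L = np.linalg.cholesky(K + lam * np.eye(len(X)))          # GP route: Cholesky of K + sigma^2 I
gp_mean = K_star @ np.linalg.solve(L.T, np.linalg.solve(L, y))
print(np.allclose(krr_pred, gp_mean))
```

Solving the linear system directly, rather than inverting K + λI, is the usual choice for numerical stability. With the noise variance set to λ, the GP posterior mean and the KRR prediction are the same formula, so they agree up to floating-point error.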

Topics

Kernel Methods
Machine Learning Geometry
Kernel Ridge Regression
Gaussian Processes
Attention Mechanisms
Support Vector Machines

Code references

asudjianto-xml/Learned-Kernel

Best for: Machine Learning Engineer, AI Scientist, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Agus’s Substack.