Self-Attention as a Kernel Machine: The Geometry of Objects and Relations

Source: Agus’s Substack · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

Self-attention in Transformers is usually explained as a "soft lookup" based on query-key similarity. This article reinterprets it as a kernel machine. The core matrix M, built from the query and key projections (M = W_Q⊤ W_K), defines a generalized exponential kernel K(xᵢ, xⱼ) = exp(xᵢ⊤ M xⱼ / √d_k). Crucially, this kernel is not necessarily symmetric: K(xᵢ, xⱼ) can differ from K(xⱼ, xᵢ). The matrix M decomposes uniquely into a symmetric component S = (M + M⊤)/2 and an antisymmetric component A = (M − M⊤)/2. The symmetric part, S, captures the "geometry of objects": it measures reciprocal similarity and, when positive semidefinite, defines Mahalanobis distances, which connects attention to classical kernel theory and Reproducing Kernel Hilbert Spaces (RKHS). The antisymmetric part, A, captures the "geometry of relations": it encodes directional, non-reciprocal structure such as subject-verb relationships or temporal causality. Under this decomposition, standard attention by default allocates roughly half its representational capacity to directional structure, even when a task mainly requires symmetric relationships. This suggests that structured attention kernels could yield more efficient and more interpretable models.
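The decomposition described above can be checked numerically. The following is a minimal NumPy sketch, not the article's own code: the sizes `d_model` and `d_k`, the random projections, and the helper name `kernel` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4  # hypothetical sizes, chosen only for illustration

# Random stand-ins for learned query/key projections (assumption).
W_Q = rng.normal(size=(d_k, d_model))
W_K = rng.normal(size=(d_k, d_model))

# Core bilinear form: q_i . k_j = x_i^T (W_Q^T W_K) x_j
M = W_Q.T @ W_K

# Unique split into symmetric and antisymmetric parts.
S = 0.5 * (M + M.T)
A = 0.5 * (M - M.T)

def kernel(x_i, x_j, M, d_k):
    """Unnormalized attention score exp(x_i^T M x_j / sqrt(d_k))."""
    return np.exp(x_i @ M @ x_j / np.sqrt(d_k))

x_i = rng.normal(size=d_model)
x_j = rng.normal(size=d_model)

# The query-key dot product matches the bilinear form in M.
qk = (W_Q @ x_i) @ (W_K @ x_j)
bilinear = x_i @ M @ x_j

# The kernel is generally asymmetric; the gap in the exponent comes only from A.
k_ij = kernel(x_i, x_j, M, d_k)
k_ji = kernel(x_j, x_i, M, d_k)
exponent_gap = x_i @ M @ x_j - x_j @ M @ x_i   # equals 2 * x_i^T A x_j

# A contributes nothing to a token's score against itself.
self_directional = x_i @ A @ x_i               # zero up to rounding

print(k_ij, k_ji, exponent_gap, self_directional)
```

Because xᵢ⊤ A xᵢ vanishes for any antisymmetric A, the directional part only affects scores between distinct inputs, while S alone determines how a token scores against itself.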

Key takeaway

For AI Scientists and Research Scientists designing or optimizing Transformer models, understanding the symmetric and antisymmetric components of the attention kernel is critical. Standard attention learns both object similarity and directional relationships. If your task primarily involves reciprocal similarity, imposing a symmetry constraint on the attention kernel (M = S ⪰ 0) can improve performance, especially with less data, by reclaiming capacity otherwise spent on directional structure and by yielding more interpretable representations. Conversely, leverage the antisymmetric component for tasks that require explicit modeling of non-reciprocal structures such as causality or directed graphs.
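One simple way to obtain a symmetric positive semidefinite kernel matrix is to tie the query and key projections. This is an assumption made here for illustration; the source does not prescribe a particular parameterization. The sketch below builds S = W⊤W from a shared hypothetical projection `W` and checks the resulting Mahalanobis-style squared distance.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k = 8, 4  # hypothetical sizes

# Shared projection: setting W_Q = W_K = W gives M = W^T W (assumption).
W = rng.normal(size=(d_k, d_model))
S = W.T @ W  # symmetric and positive semidefinite by construction

# Eigenvalues of a PSD matrix are non-negative (up to rounding).
eigvals = np.linalg.eigvalsh(S)

def mahalanobis_sq(x, y, S):
    """Squared distance (x - y)^T S (x - y) induced by S."""
    diff = x - y
    return diff @ S @ diff

x = rng.normal(size=d_model)
y = rng.normal(size=d_model)

# The distance expands into the three bilinear terms used by the kernel.
lhs = mahalanobis_sq(x, y, S)
rhs = x @ S @ x + y @ S @ y - 2 * (x @ S @ y)

print(eigvals.min(), lhs, rhs)
```

Since `W` has rank at most `d_k`, S is only semidefinite when `d_k < d_model`, so the induced quantity is a pseudo-distance: two different inputs can sit at distance zero if their difference lies in the null space of `W`.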

Key insights

Self-attention is a kernel machine with symmetric and antisymmetric components capturing object geometry and directional relations.

Principles

Method

Reinterpret self-attention through bilinear forms to decompose the core matrix M into symmetric (S) and antisymmetric (A) components, revealing two distinct geometric learning modes.
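The method can be written out as a short derivation. The labels K_S and K_A for the two factors are introduced here for convenience and are not taken from the source; inputs are treated as column vectors.

```latex
\begin{aligned}
q_i^\top k_j &= (W_Q x_i)^\top (W_K x_j) = x_i^\top M x_j,
  \qquad M = W_Q^\top W_K, \\
M &= S + A,
  \qquad S = \tfrac{1}{2}\,(M + M^\top),
  \quad A = \tfrac{1}{2}\,(M - M^\top), \\
K(x_i, x_j) &= \exp\!\left(\frac{x_i^\top S x_j}{\sqrt{d_k}}\right)
               \exp\!\left(\frac{x_i^\top A x_j}{\sqrt{d_k}}\right)
             = K_S(x_i, x_j)\, K_A(x_i, x_j), \\
K_S(x_i, x_j) &= K_S(x_j, x_i),
  \qquad K_A(x_i, x_j)\, K_A(x_j, x_i) = 1 .
\end{aligned}
```

The last line follows from x_i^⊤ A x_j = −x_j^⊤ A x_i: swapping the arguments leaves the symmetric factor unchanged and inverts the directional one. For an unconstrained d × d matrix, the symmetric part has d(d+1)/2 free entries and the antisymmetric part d(d−1)/2, which is where the "roughly half" split comes from.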

Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Agus’s Substack.