Self-Attention as a Kernel Machine: The Geometry of Objects and Relations
Summary
Self-attention in Transformers is usually explained as a "soft lookup" based on query-key similarity. The article reinterprets it as a kernel machine: the matrix M = W_Q⊤W_K, formed from the query and key projections, defines a generalized exponential kernel K(xᵢ, xⱼ) = exp(xᵢ⊤Mxⱼ/√d_k). This kernel is not necessarily symmetric, so K(xᵢ, xⱼ) can differ from K(xⱼ, xᵢ). M decomposes uniquely into a symmetric component S = (M + M⊤)/2 and an antisymmetric component A = (M − M⊤)/2. The symmetric part captures the "geometry of objects": it measures reciprocal similarity and, when S is positive semidefinite, defines a Mahalanobis distance, which connects attention to classical kernel theory and Reproducing Kernel Hilbert Spaces (RKHS). The antisymmetric part captures the "geometry of relations": it encodes directional, non-reciprocal structure such as subject-verb relationships or temporal causality. By default, standard attention therefore devotes roughly half of its representational capacity to directional structure, even when a task mainly needs symmetric relationships, which suggests that structured attention kernels could yield more efficient and interpretable models.
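The decomposition in the summary can be written out term by term. The following is a worked sketch, assuming the column-vector convention q_i = W_Q x_i and k_j = W_K x_j (an assumption, chosen because it matches M = W_Q⊤W_K as stated above):

```latex
\begin{aligned}
M &= W_Q^\top W_K = S + A,
  \qquad S = \tfrac{1}{2}\,(M + M^\top),
  \quad A = \tfrac{1}{2}\,(M - M^\top), \\[4pt]
K(x_i, x_j) &= \exp\!\left(\frac{x_i^\top M x_j}{\sqrt{d_k}}\right)
  = \exp\!\left(\frac{x_i^\top S x_j}{\sqrt{d_k}}\right)
    \exp\!\left(\frac{x_i^\top A x_j}{\sqrt{d_k}}\right), \\[4pt]
\frac{K(x_i, x_j)}{K(x_j, x_i)} &= \exp\!\left(\frac{2\, x_i^\top A x_j}{\sqrt{d_k}}\right).
\end{aligned}
```

The last line follows because x_j⊤Sx_i = x_i⊤Sx_j while x_j⊤Ax_i = −x_i⊤Ax_j. The kernel is therefore symmetric for all inputs exactly when A = 0, and A never affects a token's score with itself, since x⊤Ax = 0.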
Key takeaway
For AI Scientists and Research Scientists designing or optimizing Transformer models, it pays to understand the symmetric and antisymmetric components of the attention kernel, because an unconstrained model learns both object similarity and directional relationships. If your task mainly involves reciprocal similarity, constraining the kernel to be symmetric and positive semidefinite (M = S ⪰ 0) can improve performance, especially with limited data, by reallocating capacity that would otherwise go to directional structure, and it can make the learned representations more interpretable. Conversely, rely on the antisymmetric component for tasks that require explicit modeling of non-reciprocal structure, such as causality or directed graphs.
Key insights
Self-attention is a kernel machine whose symmetric and antisymmetric components capture object geometry and directional relations, respectively.
Principles
- Attention kernels are not inherently symmetric.
- Any square matrix M decomposes into symmetric (S) and antisymmetric (A) parts.
- Standard attention allocates roughly equal capacity to S and A at initialization.
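One way to make the capacity claim concrete is to measure how the squared Frobenius norm of M splits between S and A under random initialization. The sketch below is an illustration only: it assumes i.i.d. Gaussian initialization of W_Q and W_K, uses hypothetical sizes d_model and d_k, and treats "capacity" as Frobenius energy and free-parameter count, which is an interpretation rather than the article's stated definition.

```python
import numpy as np


def antisymmetric_energy_fraction(d_model, d_k, trials=50, seed=0):
    """Average share of ||M||_F^2 carried by A = (M - M^T)/2.

    Assumes i.i.d. standard Gaussian W_Q and W_K (an illustrative choice,
    not necessarily the initialization used by any particular model).
    """
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(trials):
        W_Q = rng.normal(size=(d_k, d_model))
        W_K = rng.normal(size=(d_k, d_model))
        M = W_Q.T @ W_K
        A = 0.5 * (M - M.T)
        # S and A are orthogonal under the Frobenius inner product,
        # so ||M||^2 = ||S||^2 + ||A||^2 and this ratio lies in [0, 1].
        fractions.append(np.sum(A**2) / np.sum(M**2))
    return float(np.mean(fractions))


d_model, d_k = 64, 16  # hypothetical sizes for illustration
frac = antisymmetric_energy_fraction(d_model, d_k)

# Free parameters available to each part of a d_model x d_model matrix.
dof_sym = d_model * (d_model + 1) // 2
dof_anti = d_model * (d_model - 1) // 2

print(f"antisymmetric share of squared Frobenius norm: {frac:.3f}")
print(f"free parameters: symmetric={dof_sym}, antisymmetric={dof_anti}")
```

Under these assumptions the antisymmetric share should sit slightly below one half, mirroring the parameter counts d(d+1)/2 for S versus d(d−1)/2 for A.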
Method
Reinterpret self-attention through bilinear forms to decompose the core matrix M into symmetric (S) and antisymmetric (A) components, revealing two distinct geometric learning modes.
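The method can be sketched numerically in a few lines. This is a minimal illustration with random weights and hypothetical dimensions, assuming the column-vector convention q = W_Q x, k = W_K x so that the pre-softmax score is x_i⊤Mx_j/√d_k; the helper name `score` is introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4  # hypothetical sizes, chosen only for illustration

# Query/key projections and the bilinear-form matrix M = W_Q^T W_K.
W_Q = rng.normal(size=(d_k, d_model))
W_K = rng.normal(size=(d_k, d_model))
M = W_Q.T @ W_K

S = 0.5 * (M + M.T)  # symmetric part: reciprocal similarity
A = 0.5 * (M - M.T)  # antisymmetric part: directional structure


def score(x_i, x_j, mat):
    """Scaled bilinear form x_i^T mat x_j / sqrt(d_k) (a pre-softmax logit)."""
    return float(x_i @ mat @ x_j) / np.sqrt(d_k)


x_i = rng.normal(size=d_model)
x_j = rng.normal(size=d_model)

forward = score(x_i, x_j, M)   # how strongly i attends to j
backward = score(x_j, x_i, M)  # how strongly j attends to i
sym = score(x_i, x_j, S)       # shared by both directions
anti = score(x_i, x_j, A)      # flips sign when the direction is reversed

print("forward :", forward, "= sym + anti =", sym + anti)
print("backward:", backward, "= sym - anti =", sym - anti)
```

The two directions share the symmetric term and differ only in the sign of the antisymmetric term, which is what lets A encode non-reciprocal relations.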
In practice
- Constrain attention to symmetric kernels for tasks needing only reciprocal similarity.
- Utilize antisymmetric components for modeling directional relationships.
- Consider structured attention to recover wasted capacity.
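As a rough sketch of the first recommendation, one simple way to obtain a symmetric, positive semidefinite kernel is to tie the query and key projections, so that M = W⊤W. This tying is an assumption made here for illustration and is not necessarily the construction used in the original article; the function names and sizes are hypothetical.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attention_logits(X, W_Q, W_K):
    """Pre-softmax scores for tokens stored as rows of X.

    logits[i, j] = x_i^T W_Q^T W_K x_j / sqrt(d_k)
    """
    d_k = W_Q.shape[0]
    Q = X @ W_Q.T
    K = X @ W_K.T
    return (Q @ K.T) / np.sqrt(d_k)


rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 8, 4  # hypothetical sizes
X = rng.normal(size=(n_tokens, d_model))
W_Q = rng.normal(size=(d_k, d_model))
W_K = rng.normal(size=(d_k, d_model))

# Standard attention: independent projections, so M has both S and A parts.
standard = attention_logits(X, W_Q, W_K)

# Symmetric variant: tied projections give M = W_Q^T W_Q, which is
# symmetric and positive semidefinite by construction.
symmetric = attention_logits(X, W_Q, W_Q)

# Row-wise softmax normalizes each query separately, so the final
# attention weights are generally not symmetric even when the logits are.
weights = softmax(symmetric, axis=-1)
```

Tying the projections also halves the query/key parameter count for that head. Tasks that need directional structure would keep separate projections, or parameterize A explicitly.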
Topics
- Self-Attention Kernels
- Inductive Bias
- Geometric Decomposition
- Transformer Architecture
- Relational Learning
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Agus’s Substack.