Ordinary Least Squares is a Special Case of Transformer
Summary
A new study proves algebraically that Ordinary Least Squares (OLS) regression is a special case of a single-layer Linear Transformer under specific parameter configurations. Researchers Xiaojun Tan and Yuchen Zhao show that the attention mechanism's forward pass computes the OLS closed-form projection directly in one step, rather than approaching it through iterative approximation. This "OLS-Transformer" model reveals a decoupled slow and fast memory mechanism: weight matrices act as slow memory for long-term patterns, while attention scores function as fast memory for real-time contextual associations. The work also traces the evolution from this linear prototype to standard Transformers, highlighting how the transition from linear projection to Softmax attention raises the memory capacity associated with the Hopfield energy function from linear to exponential scale.
Key takeaway
For research scientists developing next-generation AI architectures, this work reframes the Transformer: not a "black-box" approximator but a statistical operator. Consider how this algebraic foundation for context-aware reasoning and memory capacity, rooted in a basic statistical operation such as OLS, can inform the design of more robust and interpretable models. Explore extensions to higher-order polynomial or exponential energy functions to optimize efficiency and memory density.
Key insights
OLS regression is a special case of a single-layer Linear Transformer, solvable in one forward pass.
Principles
- Transformers are statistical operators, not just approximators.
- Memory in Transformers decouples into slow (weights) and fast (attention scores).
- Softmax attention exponentially increases memory capacity over linear attention.
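As a toy illustration of the last principle, the sketch below stores three non-orthogonal keys in two dimensions and queries with one of them. It is not a capacity proof and is not taken from the paper: the keys, the inverse temperature `beta`, and the sum-normalization of the linear scores are all assumptions chosen for readability.

```python
import numpy as np

# Three stored keys in 2 dimensions (more patterns than dimensions,
# so they cannot all be orthogonal). Values chosen for illustration only.
keys = np.array([[1.0, 0.0],
                 [0.8, 0.6],
                 [0.0, 1.0]])
query = keys[1]                      # ask for the second stored pattern

scores = keys @ query                # dot-product similarities

# Linear readout: weights proportional to the raw scores.
linear_w = scores / scores.sum()

# Softmax readout: weights proportional to exp(beta * score).
beta = 20.0                          # assumed inverse temperature
z = np.exp(beta * (scores - scores.max()))   # subtract max for stability
softmax_w = z / z.sum()

print("linear weights :", np.round(linear_w, 3))
print("softmax weights:", np.round(softmax_w, 3))
```

Both readouts rank the correct key first, but the linear weights stay spread across all three keys, while the softmax weights concentrate almost entirely on the queried one. That sharper separation between overlapping patterns is the intuition behind the larger capacity of exponential energy functions.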
Method
The OLS solution $\hat{\mathbf{Y}}=\mathbf{X}(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{Y}$ is mapped to a Linear Transformer's forward pass by setting $\mathbf{W}_{\text{Q}}=\mathbf{W}_{\text{K}}=\mathbf{W}_{\text{V}}=\mathbf{L}$, $\mathbf{W}_{\text{FFN}}=\mathbf{I}$, and $\mathbf{W}_{\text{P}}=\mathbf{P}$, where $\mathbf{L}=\mathbf{V}\boldsymbol{\Lambda}^{-1/2}$ is obtained from the eigendecomposition of the empirical covariance matrix.
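The core identity behind this mapping can be checked numerically. The sketch below is a simplified reading, not the paper's full construction. It assumes the decomposed matrix is the unnormalized Gram matrix $\mathbf{X}^{\text{T}}\mathbf{X}=\mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{\text{T}}$, so that $\mathbf{L}\mathbf{L}^{\text{T}}=(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}$. It uses the targets $\mathbf{Y}$ directly as attention values and omits $\mathbf{W}_{\text{V}}$, $\mathbf{W}_{\text{FFN}}$, and $\mathbf{W}_{\text{P}}$. The data and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 4, 2                  # samples, features, targets (arbitrary)
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, m))

# Closed-form OLS projection: Y_hat = X (X^T X)^{-1} X^T Y
Y_ols = X @ np.linalg.solve(X.T @ X, X.T @ Y)

# Eigendecomposition X^T X = V diag(lam) V^T, then L = V diag(lam)^{-1/2}.
# Assumption: the Gram matrix is used without a 1/n normalization.
lam, V = np.linalg.eigh(X.T @ X)
L = V / np.sqrt(lam)                # divides column j of V by sqrt(lam[j])

# Linear (softmax-free) attention with shared query/key weights W_Q = W_K = L.
Q = X @ L
K = X @ L
scores = Q @ K.T                    # equals the hat matrix X (X^T X)^{-1} X^T
Y_attn = scores @ Y                 # simplification: Y used directly as values

print(np.allclose(Y_ols, Y_attn))
```

In this reading, `L` plays the role of slow memory (it is fixed once the covariance structure is known), while `scores` is the fast memory recomputed from the rows of `X` on each forward pass.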
In practice
- Design novel operators beyond linear attention.
- Balance computational efficiency with memory density.
- Enhance model interpretability through algebraic foundations.
Topics
- Ordinary Least Squares
- Linear Transformer
- Attention Mechanism
- Hopfield Networks
- Associative Memory
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.