Ordinary Least Squares is a Special Case of Transformer

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, long

Summary

A new study presents an algebraic proof that Ordinary Least Squares (OLS) regression is a special case of a single-layer Linear Transformer under a specific parameter configuration. Researchers Xiaojun Tan and Yuchen Zhao show that the attention mechanism's forward pass computes the OLS closed-form projection in one step rather than through iterative approximation. The resulting "OLS-Transformer" exposes a decoupled slow and fast memory mechanism: weight matrices act as slow memory for long-term patterns, while attention scores act as fast memory for real-time contextual associations. The work also traces the evolution from this linear prototype to standard Transformers, arguing that replacing the linear projection with Softmax attention raises the memory capacity of the associated Hopfield energy function from linear to exponential scale.

Key takeaway

For research scientists developing next-generation AI architectures, this work recasts the Transformer from a "black-box" approximator as a statistical operator. Consider how this algebraic account of context-aware reasoning and memory capacity, rooted in a basic statistical operation like OLS, can inform the design of more robust and interpretable models. Extensions to higher-order polynomial or exponential energy functions are a natural direction for trading off efficiency against memory density.

Key insights

OLS regression is a special case of a single-layer Linear Transformer: a single forward pass computes the closed-form OLS projection.

Principles

Transformers are statistical operators, not just approximators.
Memory in Transformers decouples into slow (weights) and fast (attention scores).
Softmax attention raises memory capacity from linear to exponential scale relative to linear attention.
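The capacity contrast in the last principle can be illustrated with a toy associative-memory experiment. The sketch below is not the paper's construction: the random ±1 patterns, the sizes, the inverse temperature `beta`, and all variable names are assumptions chosen for illustration. It stores many more patterns than a linear (Hebbian-style) readout is known to handle at this dimension, then tries to recover one pattern from a corrupted query using raw scores and using softmax-normalised scores.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_patterns, n_flips = 64, 500, 8   # illustrative sizes, not from the source
beta = 1.0                            # assumed softmax inverse temperature

# Stored memories: random +/-1 patterns, one per row.
patterns = rng.choice([-1.0, 1.0], size=(n_patterns, d))
target = patterns[0]

# Query: the first stored pattern with a few entries flipped.
query = target.copy()
query[:n_flips] *= -1.0

# Similarity of the query to every stored memory.
scores = patterns @ query

# Linear readout: memories weighted by their raw scores.
linear_out = np.sign(patterns.T @ scores)

# Softmax readout: memories weighted by normalised exp(beta * score).
weights = np.exp(beta * (scores - scores.max()))
weights /= weights.sum()
softmax_out = np.sign(patterns.T @ weights)

# Count entries that differ from the stored pattern after each readout.
linear_errors = int(np.sum(linear_out != target))
softmax_errors = int(np.sum(softmax_out != target))
print(linear_errors, softmax_errors)
```

With this many stored patterns the linear readout is swamped by crosstalk from the other memories, while the softmax weights concentrate almost entirely on the best-matching pattern. This is one way to see why the exponential form is associated with a much larger capacity.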

Method

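The OLS solution $\hat{\mathbf{Y}}=\mathbf{X}(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{Y}$ is mapped to a Linear Transformer's forward pass by setting $\mathbf{W}_{\text{Q}}=\mathbf{W}_{\text{K}}=\mathbf{W}_{\text{V}}=\mathbf{L}$, $\mathbf{W}_{\text{FFN}}=\mathbf{I}$, and $\mathbf{W}_{\text{P}}=\mathbf{P}$, where $\mathbf{L}=\mathbf{V}\boldsymbol{\Lambda}^{-1/2}$ is obtained from the decomposition of the empirical covariance matrix. Assuming $\mathbf{V}$ holds orthonormal eigenvectors and $\boldsymbol{\Lambda}$ the corresponding eigenvalues, $\mathbf{L}\mathbf{L}^{\text{T}}=\mathbf{V}\boldsymbol{\Lambda}^{-1}\mathbf{V}^{\text{T}}$ is the inverse of the decomposed matrix, which is how the query-key product can reproduce the $(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}$ factor.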
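The query-key part of this mapping can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation: it decomposes the unnormalised Gram matrix $\mathbf{X}^{\text{T}}\mathbf{X}$ (a $1/n$ factor in the covariance would only rescale $\mathbf{L}$), and it uses $\mathbf{Y}$ directly as the values, since this summary does not spell out how $\mathbf{W}_{\text{V}}$, $\mathbf{W}_{\text{FFN}}$, and $\mathbf{W}_{\text{P}}$ act on the targets. The data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 5, 2                  # samples, features, targets (arbitrary)
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, m))

# Eigendecomposition X^T X = V diag(lam) V^T.
# Assumption: unnormalised Gram matrix stands in for the empirical covariance.
lam, V = np.linalg.eigh(X.T @ X)
L = V @ np.diag(lam ** -0.5)        # L = V Lambda^{-1/2}, so L L^T = (X^T X)^{-1}

# Linear attention with W_Q = W_K = L and no softmax.
Q = X @ L
K = X @ L
attn_scores = Q @ K.T               # equals the hat matrix X (X^T X)^{-1} X^T

# Simplification: targets are used directly as the attention values.
Y_attn = attn_scores @ Y

# Reference: closed-form OLS fitted values.
Y_ols = X @ np.linalg.solve(X.T @ X, X.T @ Y)

max_abs_diff = float(np.max(np.abs(Y_attn - Y_ols)))
print(max_abs_diff)
```

The two results agree up to floating-point error, and `attn_scores` is idempotent, as expected of a projection matrix. The check relies on `X` having full column rank so that all eigenvalues in `lam` are strictly positive.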

In practice

Design novel operators beyond linear attention.
Balance computational efficiency with memory density.
Enhance model interpretability through algebraic foundations.

Topics

Ordinary Least Squares
Linear Transformer
Attention Mechanism
Hopfield Networks
Associative Memory

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.