[D] A mathematical proof from an anonymous Korean forum: The essence of Attention is fundamentally a d^2 problem, not n^2. (PDF included)

2026-03-05 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences, Emerging Technologies & Innovation · Depth: Advanced, medium

Summary

An anonymous paper, "The d^2 Pullback Theorem," claims that the intrinsic optimization landscape of the Attention mechanism is d^2-dimensional rather than n^2-dimensional (n being the sequence length and d the feature dimension), and argues that the O(n^2) bottleneck is an "illusion" caused by softmax normalization. The author proposes replacing softmax with a degree-2 polynomial kernel (x^2) in a "Centered Shifted-Quadratic (CSQ) Attention" model, asserting that this retains Euclidean matching, stabilizes training, and reduces both training and inference complexity to O(nd^3). Commenters acknowledge that the derivation of the d^2 dimensionality may be novel, but they are skeptical of the practical benefit: O(nd^3) undercuts O(n^2d) only when n exceeds d^2, which is not the typical regime for current model sizes. Critics also question whether polynomial attention is functionally equivalent to softmax-based attention, and suggest that the d^2 limit arises from the parameterization rather than from the choice of kernel. The paper's origin on an anonymous forum raises further concerns, although some commenters call for it to be judged purely on mathematical merit.
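The O(nd^3) figure follows from a standard kernel identity: (q·k)^2 equals the dot product of the flattened outer products of q and k, each of which has d^2 entries. The sketch below illustrates that identity for a generic, non-causal degree-2 polynomial attention with row-sum normalization. It is not the paper's CSQ Attention, whose centering and shifting steps are not described in this summary; the function names and the normalization are assumptions made for illustration.

```python
def quadratic_attention_naive(Q, K, V):
    """O(n^2 d): compute every score (q_i . k_j)^2, then normalize each row."""
    out = []
    for q in Q:
        weights = [sum(a * b for a, b in zip(q, k)) ** 2 for k in K]
        total = sum(weights)  # assumed nonzero for this illustration
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                    for c in range(len(V[0]))])
    return out


def phi(x):
    """Degree-2 feature map: flattened outer product x x^T (d^2 entries)."""
    return [a * b for a in x for b in x]


def quadratic_attention_linear(Q, K, V):
    """O(n d^3): build a d^2 x d_v summary of keys and values once,
    then read it out per query without forming the n x n score matrix."""
    d2 = len(K[0]) ** 2
    dv = len(V[0])
    S = [[0.0] * dv for _ in range(d2)]  # sum over j of phi(k_j) v_j^T
    z = [0.0] * d2                       # sum over j of phi(k_j)
    for k, v in zip(K, V):
        for i, f in enumerate(phi(k)):
            z[i] += f
            for c in range(dv):
                S[i][c] += f * v[c]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(f * zi for f, zi in zip(fq, z))
        out.append([sum(f * S[i][c] for i, f in enumerate(fq)) / denom
                    for c in range(dv)])
    return out
```

Both functions return the same values up to floating-point error. The second one never materializes the n x n score matrix, which is why its cost grows linearly in n but cubically in d.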

Key takeaway

A new "d^2 Pullback Theorem" claims to prove that Attention's true optimization landscape is d^2-dimensional, not n^2-dimensional, and argues that the O(n^2) bottleneck is a softmax-induced illusion. Its proposed CSQ Attention uses a degree-2 polynomial kernel and is claimed to reach O(nd^3) complexity for both training and inference while maintaining Euclidean matching. If the result holds, it offers a theoretical foundation for scalable Transformers, but any practical advantage over O(n^2d) depends on n significantly exceeding d^2, and the proof still requires expert validation.
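The break-even condition can be checked with simple arithmetic: n·d^3 is smaller than n^2·d exactly when n is larger than d^2. The sketch below compares the two leading terms only, ignoring constant factors, memory traffic, and hardware effects. Treating d as a per-head dimension of 64 is an assumption for illustration, and the function names are hypothetical.

```python
def attention_costs(n, d):
    """Leading-order operation counts, with constants ignored."""
    quadratic = n * n * d     # softmax attention, O(n^2 d)
    linearized = n * d ** 3   # degree-2 kernel attention, O(n d^3)
    return quadratic, linearized


def break_even_length(d):
    """Sequence length at which n^2 d equals n d^3, i.e. n = d^2."""
    return d * d


if __name__ == "__main__":
    d = 64  # assumed per-head dimension
    print("break-even n:", break_even_length(d))  # prints 4096
    for n in (2048, 32768):
        quadratic, linearized = attention_costs(n, d)
        print(n, quadratic, linearized)
```

Under these assumptions, a 2,048-token sequence favors the quadratic form by a factor of 2, while a 32,768-token sequence favors the O(nd^3) form by a factor of 8.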

Topics

Attention Mechanism
Computational Complexity
Transformer Architectures
Softmax Normalization
Polynomial Kernels

Best for: AI Scientist, Research Scientist, AI Engineer, AI Researcher, Machine Learning Engineer, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.