[D] A mathematical proof from an anonymous Korean forum: The essence of Attention is fundamentally a d^2 problem, not n^2. (PDF included)
Summary
An anonymous paper, "The d^2 Pullback Theorem," claims that the intrinsic optimization landscape of the Attention mechanism is d^2-dimensional rather than n^2-dimensional, and argues that the O(n^2) bottleneck is an "illusion" caused by softmax normalization. The author proposes replacing softmax with a degree-2 polynomial kernel (x^2) in a "Centered Shifted-Quadratic (CSQ) Attention" model, asserting that this retains Euclidean matching, stabilizes training, and reduces both training and inference complexity to O(nd^3). Commenters acknowledge that the derivation of the d^2 dimensionality may be novel, but they are skeptical that O(nd^3) offers a practical advantage over O(n^2d) at current model sizes. Critics also question whether polynomial attention is functionally equivalent to softmax-based attention, and suggest that the d^2 limit arises from the parameterization rather than from the choice of kernel. The paper's origin on an anonymous forum raises further concerns, although some commenters call for it to be judged purely on its mathematical merit.
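The O(nd^3) figure is consistent with a standard property of degree-2 polynomial kernels: (q·k)^2 equals the inner product of the flattened outer products q⊗q and k⊗k, so the n x n score matrix never has to be formed. The sketch below illustrates that factorization only. It is not the paper's CSQ Attention, whose centering and shift terms are not described in this summary. It assumes n is the sequence length and d the width of the queries, keys, and values, uses plain row-sum normalization, is non-causal, and all function names are hypothetical.

```python
import numpy as np

def quadratic_attention_naive(Q, K, V):
    """O(n^2 d): build the full n x n matrix of squared dot products."""
    scores = (Q @ K.T) ** 2                       # (n, n), entries are >= 0
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ V

def quadratic_features(X):
    """Map each row x in R^d to the flattened outer product of x with itself, in R^(d^2)."""
    n, d = X.shape
    return np.einsum("ni,nj->nij", X, X).reshape(n, d * d)

def quadratic_attention_linear(Q, K, V):
    """O(n d^3) when V has d columns: no n x n matrix is ever formed."""
    phi_q = quadratic_features(Q)                 # (n, d^2)
    phi_k = quadratic_features(K)                 # (n, d^2)
    kv = phi_k.T @ V                              # (d^2, d) summary of keys and values
    z = phi_k.sum(axis=0)                         # (d^2,) summary for the normaliser
    return (phi_q @ kv) / (phi_q @ z)[:, None]

# Illustrative check on random data (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
n, d = 32, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
same = np.allclose(quadratic_attention_naive(Q, K, V),
                   quadratic_attention_linear(Q, K, V))
print(same)
```

The two functions should agree up to floating-point tolerance, since they compute the same sums in a different order. The second trades the n x n intermediate for d^2-wide feature vectors, which is where the extra powers of d come from.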
Key takeaway
A new "d^2 Pullback Theorem" claims to prove that Attention's true optimization landscape is d^2-dimensional, not n^2-dimensional, and argues that the O(n^2) bottleneck is a softmax-induced illusion. The proposed CSQ Attention replaces softmax with a degree-2 polynomial kernel and is claimed to run in O(nd^3) for both training and inference while maintaining Euclidean matching. If the result holds, it offers a theoretical foundation for scalable Transformers, but O(nd^3) only beats O(n^2d) when n exceeds d^2, and the proof still needs independent expert validation.
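The crossover condition follows from dividing the two cost terms: (n·d^3) / (n^2·d) = d^2 / n, so the polynomial form is cheaper only once n is larger than d^2. A small sketch of that arithmetic, ignoring constant factors and memory traffic; the value d = 64 is an illustrative assumption, not a figure from the paper or the discussion.

```python
def cost_ratio(n, d):
    """Ratio of the n*d^3 term to the n^2*d term; algebraically d^2 / n."""
    return (n * d**3) / (n**2 * d)

d = 64  # assumed width, chosen only for illustration
for n in (1024, 4096, 65536):
    # ratio > 1: the n^2*d term is smaller; ratio < 1: the n*d^3 term is smaller
    print(n, cost_ratio(n, d))
```

With d = 64 the break-even point is n = d^2 = 4096. Below that length the quadratic-in-n form has the smaller operation count, which matches the skepticism raised in the discussion about current model sizes.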
Topics
- Attention Mechanism
- Computational Complexity
- Transformer Architectures
- Softmax Normalization
- Polynomial Kernels
Best for: AI Scientist, Research Scientist, AI Engineer, AI Researcher, Machine Learning Engineer, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.