Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective

2025-12-31 · Source: JMLR · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A theoretical study by Jiao, Lai, Wang, and Yan, published in 2026, shows that Transformer models can overcome the "curse of dimensionality" when approximating Hölder continuous function classes. The authors construct Transformers consisting of one self-attention layer, with a single head and softmax activation, combined with multiple feedforward layers. To reach approximation accuracy $\epsilon$ with ReLU and floor as the feedforward activation functions, the construction needs $\mathcal{O}\left(\log\frac{1}{\epsilon}\right)$ layers with widths not exceeding $\mathcal{O}\left(\frac{1}{\epsilon^{2/\beta}}\log\frac{1}{\epsilon}\right)$, where $\beta$ is the Hölder smoothness exponent; the exponent of $1/\epsilon$ in the width bound does not involve the input dimension. The study also shows that other activation functions can reduce the feedforward width to a constant. The construction relies on the Kolmogorov-Arnold Superposition Theorem, which yields a more intuitive proof than prior Transformer approximation work, and it introduces a translation technique for carrying existing feedforward network approximation results over to Transformers.
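To get a feel for the stated bounds, the sketch below evaluates the depth and width scalings for a few values of $\epsilon$. It is only an illustration: the hidden constants in the $\mathcal{O}(\cdot)$ terms are set to 1, the function names are hypothetical, and the comparison rate $\epsilon^{-\mathrm{dim}/\beta}$ is the textbook curse-of-dimensionality scaling for Hölder classes, not a figure taken from this summary.

```python
import math


def ffn_depth_scale(eps):
    """log(1/eps): the depth scaling, with the hidden constant assumed to be 1."""
    return math.log(1.0 / eps)


def ffn_width_scale(eps, beta):
    """eps^(-2/beta) * log(1/eps): the width scaling, hidden constant assumed 1."""
    return eps ** (-2.0 / beta) * math.log(1.0 / eps)


def cursed_scale(eps, beta, dim):
    """eps^(-dim/beta): illustrative dimension-dependent rate for comparison only."""
    return eps ** (-dim / beta)


if __name__ == "__main__":
    beta = 1.0  # assumed Hölder exponent (Lipschitz case)
    dim = 16    # assumed input dimension, used only by the comparison rate
    for eps in (0.5, 0.1, 0.01):
        print(
            f"eps={eps}: depth~{ffn_depth_scale(eps):.2f}, "
            f"width~{ffn_width_scale(eps, beta):.2f}, "
            f"cursed~{cursed_scale(eps, beta, dim):.3e}"
        )
```

Note that `ffn_width_scale` takes no dimension argument at all, which is the point of the result: the exponent stays at $2/\beta$ however large the input is, whereas the comparison rate grows exponentially with `dim`.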

Key takeaway

For AI researchers focused on the theoretical underpinnings of neural networks, this study gives explicit depth and width bounds for Transformer approximation. Consider these architectural and activation function choices when designing models for high-dimensional function approximation, since they offer a path to mitigating the curse of dimensionality. The work suggests that even a simple configuration, one single-head self-attention layer plus feedforward layers, can achieve robust approximation guarantees.

Key insights

Transformers can overcome the curse of dimensionality, demonstrating strong expressive capabilities for function approximation.

Principles

Transformers can approximate Hölder continuous functions.
Kolmogorov-Arnold Superposition Theorem aids Transformer analysis.
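
For reference, the classical form of the Kolmogorov-Arnold Superposition Theorem states that every continuous function $f:[0,1]^d\to\mathbb{R}$ can be written using only continuous univariate functions $\Phi_q$ and $\phi_{q,p}$ and addition. The exact variant used in the paper may differ; this is the standard statement:

$$
f(x_1,\dots,x_d)=\sum_{q=0}^{2d}\Phi_q\left(\sum_{p=1}^{d}\phi_{q,p}(x_p)\right).
$$

The appeal for approximation theory is that a $d$-variate problem reduces to approximating one-dimensional functions, which is where dimension-independent exponents can come from.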

Method

The method constructs Transformers with one self-attention layer and multiple feedforward layers, using specific activation functions (ReLU, floor, or others) to achieve approximation accuracy $\epsilon$ with controlled layer depths and widths.
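
The shape of this architecture can be sketched in a few lines of NumPy. This is a toy forward pass under stated assumptions, not the authors' construction: the weights are random rather than the carefully chosen ones in the proof, tokens are stored as rows, residual connections are assumed, ReLU and floor are simply alternated across layers, and all names (`toy_transformer`, `init_params`, and so on) are hypothetical.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def single_head_attention(X, Wq, Wk, Wv):
    """One softmax self-attention head with an assumed residual connection."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T, axis=-1)  # shape (n, n), each row sums to 1
    return X + weights @ V


def feedforward(X, W1, b1, W2, b2, activation):
    """Token-wise two-layer feedforward block with an assumed residual connection."""
    return X + activation(X @ W1 + b1) @ W2 + b2


def init_params(d, width, num_ffn, seed=0):
    """Random placeholder weights; the paper's construction picks them explicitly."""
    rng = np.random.default_rng(seed)
    attn = tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
    ffn = [
        (
            rng.normal(scale=0.1, size=(d, width)),
            np.zeros(width),
            rng.normal(scale=0.1, size=(width, d)),
            np.zeros(d),
        )
        for _ in range(num_ffn)
    ]
    return attn, ffn


def toy_transformer(X, attn, ffn):
    """One attention layer followed by several feedforward layers."""
    H = single_head_attention(X, *attn)
    for i, (W1, b1, W2, b2) in enumerate(ffn):
        # Alternating ReLU and floor is an illustrative choice only.
        activation = relu if i % 2 == 0 else np.floor
        H = feedforward(H, W1, b1, W2, b2, activation)
    return H


n, d = 4, 3  # assumed sequence length and token dimension
X = np.random.default_rng(1).random((n, d))  # input in [0, 1]^(n x d)
attn, ffn = init_params(d, width=8, num_ffn=3)
Y = toy_transformer(X, attn, ffn)
print(Y.shape)
```

In the paper's setting, the number of feedforward layers and their width would be chosen as functions of $\epsilon$ and $\beta$; here `num_ffn` and `width` are arbitrary small values so the example runs instantly.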

In practice

Start from a single self-attention layer with one head; the construction needs no more for its approximation guarantee.
Consider ReLU and floor activations for specific approximation bounds.

Topics

Transformers
Curse of Dimensionality
Approximation Theory
Kolmogorov-Arnold Theorem
Neural Network Expressivity

Best for: AI Researcher, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by JMLR.