Learning to Recall with Transformers Beyond Orthogonal Embeddings
Summary
A new analysis investigates the storage capacity of single-layer Transformers, moving beyond idealized assumptions of infinite data or orthogonal embeddings. This research, submitted on March 16, 2026, focuses on models trained with empirical gradient descent on finite datasets using non-orthogonal (random) embeddings. The study specifically examines a token-retrieval task where the Transformer must identify an informative token within a sequence of length $L$ and map it to a label. By tracking the "early phase" of gradient descent, the authors derive explicit formulas for the model's storage capacity, revealing a multiplicative dependence on sample size $N$, embedding dimension $d$, and sequence length $L$. Numerical validations support these scalings, which are further complemented by a lower bound for the underlying statistical problem, indicating this multiplicative scaling is intrinsic under non-orthogonal embeddings.
Key takeaway
For AI researchers and machine learning engineers designing or evaluating Transformer architectures, the key point is that storage capacity under non-orthogonal embeddings scales multiplicatively with sample size, embedding dimension, and sequence length. Because the three parameters interact, tuning them together, rather than one at a time, can significantly influence a model's ability to store and retrieve knowledge. Consider these interdependencies when configuring training datasets and model dimensions to maximize factual recall.
Key insights
Transformer storage capacity scales multiplicatively with sample size, embedding dimension, and sequence length under random embeddings.
Principles
- Non-orthogonal embeddings are intrinsic to realistic Transformer training.
- Storage capacity depends on $N$, $d$, and $L$ multiplicatively.
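To make the non-orthogonality concrete, here is a minimal sketch, assuming Gaussian embeddings normalized to unit length (the summary only says "random", so this choice and the helper names are illustrative, not taken from the paper). It measures the average absolute inner product between distinct embeddings, which is small but nonzero and shrinks roughly like $1/\sqrt{d}$ as the dimension grows.

```python
import math
import random


def random_unit_embeddings(num_tokens, d, seed=0):
    """Draw num_tokens Gaussian vectors in dimension d and normalize each one.

    Assumption: Gaussian entries; the source only says the embeddings are random.
    """
    rng = random.Random(seed)
    embeddings = []
    for _ in range(num_tokens):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        embeddings.append([x / norm for x in v])
    return embeddings


def mean_abs_overlap(embeddings):
    """Average |<e_i, e_j>| over all pairs of distinct embeddings."""
    total, count = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            total += abs(dot)
            count += 1
    return total / count


if __name__ == "__main__":
    for d in (16, 64, 256):
        emb = random_unit_embeddings(num_tokens=50, d=d)
        # Print the measured overlap next to the 1/sqrt(d) reference scale.
        print(d, round(mean_abs_overlap(emb), 4), round(1 / math.sqrt(d), 4))
```

Orthogonal embeddings would give an overlap of exactly zero; the residual overlap here is the interference between stored items that an orthogonal-embedding analysis ignores.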
Method
Analyzed a single-layer Transformer with random embeddings trained via gradient descent on a token-retrieval task, tracking the "early phase" to derive capacity formulas.
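The token-retrieval task can be pictured with a toy data generator. This is a hypothetical instantiation, not the paper's exact setup: it assumes each sequence holds exactly one informative token at a uniformly random position, that all other positions are noise tokens, and that each informative token has one fixed label. The function name and all parameters are illustrative.

```python
import random


def make_retrieval_dataset(N, L, num_informative, num_noise, num_labels, seed=0):
    """Build N sequences of length L, each with one informative token.

    Assumptions (not from the source): token ids 0..num_informative-1 are
    informative, the next num_noise ids are noise, and labels are assigned
    to informative tokens uniformly at random.
    """
    rng = random.Random(seed)
    label_of = {tok: rng.randrange(num_labels) for tok in range(num_informative)}
    noise_tokens = list(range(num_informative, num_informative + num_noise))

    data = []
    for _ in range(N):
        # Start from pure noise, then plant one informative token.
        seq = [rng.choice(noise_tokens) for _ in range(L)]
        position = rng.randrange(L)
        informative = rng.randrange(num_informative)
        seq[position] = informative
        data.append((seq, label_of[informative]))
    return data, label_of


if __name__ == "__main__":
    data, label_of = make_retrieval_dataset(
        N=4, L=6, num_informative=10, num_noise=20, num_labels=3
    )
    for seq, label in data:
        print(seq, "->", label)
```

A model solving this task has to both locate the informative token among the $L$ positions and recall its label, which is why $N$, $d$, and $L$ all enter the capacity.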
In practice
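- Consider sample size $N$, embedding dimension $d$, and sequence length $L$ jointly when estimating how much knowledge a Transformer can store.
- Random (non-orthogonal) embeddings change how capacity scales, so do not assume the orthogonal-embedding idealization when sizing a model.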
Topics
- Transformers
- Large Language Models
- Storage Capacity
- Gradient Descent
- Non-Orthogonal Embeddings
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.