Learning to Recall with Transformers Beyond Orthogonal Embeddings

2026-03-19 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

A new analysis investigates the storage capacity of single-layer Transformers, moving beyond idealized assumptions of infinite data or orthogonal embeddings. This research, submitted on March 16, 2026, focuses on models trained with empirical gradient descent on finite datasets using non-orthogonal (random) embeddings. The study specifically examines a token-retrieval task where the Transformer must identify an informative token within a sequence of length $L$ and map it to a label. By tracking the "early phase" of gradient descent, the authors derive explicit formulas for the model's storage capacity, revealing a multiplicative dependence on sample size $N$, embedding dimension $d$, and sequence length $L$. Numerical validations support these scalings, which are further complemented by a lower bound for the underlying statistical problem, indicating this multiplicative scaling is intrinsic under non-orthogonal embeddings.

Key takeaway

For AI researchers and machine learning engineers designing or evaluating Transformer architectures, the central point is that storage capacity scales multiplicatively with sample size, embedding dimension, and sequence length under non-orthogonal embeddings. This suggests that these parameters jointly, rather than individually, determine how much a model can store and retrieve, and that results derived under orthogonal embeddings or infinite data may not transfer directly. Consider these interdependencies when configuring training datasets and model dimensions for factual recall.

Key insights

Transformer storage capacity scales multiplicatively with sample size, embedding dimension, and sequence length under random embeddings.

Principles

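Orthogonal embeddings are an idealization; the analysis assumes non-orthogonal (random) embeddings instead.
Storage capacity depends multiplicatively on $N$, $d$, and $L$, and the lower bound indicates this scaling is intrinsic under non-orthogonal embeddings.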
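
To make the first principle concrete, the following is a minimal sketch (not taken from the paper) of why random embeddings are non-orthogonal: independently drawn unit vectors in dimension $d$ have small but nonzero pairwise inner products, and that overlap shrinks as $d$ grows. The function names, the Gaussian sampling scheme, and the token counts are illustrative assumptions.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a Gaussian vector in R^d and normalize it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_abs_overlap(num_tokens, d, seed=0):
    """Average |<e_i, e_j>| over all distinct pairs of random embeddings.

    Orthogonal embeddings would give exactly 0; random ones do not.
    """
    rng = random.Random(seed)
    embeddings = [random_unit_vector(d, rng) for _ in range(num_tokens)]
    total, pairs = 0.0, 0
    for i in range(num_tokens):
        for j in range(i + 1, num_tokens):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            total += abs(dot)
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    for d in (16, 64, 256):
        print(d, round(mean_abs_overlap(30, d), 4))
```

The overlap never reaches zero at finite $d$, which is the kind of cross-talk between stored items that an orthogonal-embedding analysis ignores.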

Method

Analyzed a single-layer Transformer with random embeddings trained via gradient descent on a token-retrieval task, tracking the "early phase" to derive capacity formulas.
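
As a rough illustration of the token-retrieval setup, the sketch below generates a toy dataset in which each sequence of length $L$ contains exactly one informative token whose identity determines the label. The vocabulary split into informative and noise tokens, the uniform sampling, and all names (`make_retrieval_dataset`, `n_informative`, `n_noise`, `n_labels`) are assumptions made for this example; the paper's exact data model may differ.

```python
import random


def make_retrieval_dataset(n_samples, seq_len, n_informative, n_noise,
                           n_labels, seed=0):
    """Build (sequence, label) pairs for a toy token-retrieval task.

    Token ids 0..n_informative-1 are informative; ids from n_informative
    upward are noise. This split is an assumption for illustration.
    """
    rng = random.Random(seed)
    # Fixed mapping from each informative token to a label.
    label_of = [rng.randrange(n_labels) for _ in range(n_informative)]
    data = []
    for _ in range(n_samples):
        # Fill the sequence with noise tokens.
        seq = [n_informative + rng.randrange(n_noise) for _ in range(seq_len)]
        # Overwrite one random position with an informative token.
        pos = rng.randrange(seq_len)
        tok = rng.randrange(n_informative)
        seq[pos] = tok
        data.append((seq, label_of[tok]))
    return data, label_of


if __name__ == "__main__":
    # N = 5 samples, L = 8 tokens per sequence (hypothetical sizes).
    data, label_of = make_retrieval_dataset(5, 8, 10, 50, 4)
    for seq, label in data:
        print(seq, "->", label)
```

A model solving this task has to locate the single informative token among the $L - 1$ noise tokens and then recall its label, which is why sequence length enters the capacity alongside $N$ and $d$.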

In practice

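Consider $N$, $d$, and $L$ jointly, not one at a time, when sizing a Transformer for knowledge storage.
Expect random (non-orthogonal) embeddings to change how capacity scales relative to orthogonal-embedding analyses.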

Topics

Transformers
Large Language Models
Storage Capacity
Gradient Descent
Non-Orthogonal Embeddings

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.