How Does Attention Help? Insights from Random Matrices on Signal Recovery from Sequence Models

· Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This research applies random matrix theory (RMT) to analyze how attention-based pooling affects signal recovery in high-dimensional sequence models. The study constructs sample covariance matrices from pooled sequence representations, where token embeddings are drawn from a fixed two-class Gaussian mixture table with positional correlations. Working in a high-dimensional regime where $d,V,N\to\infty$ with $d/V\to\delta$ and $d/N\to\gamma$, the authors derive exact characterizations of the limiting eigenvalue distribution, outlier eigenvalues, and eigenvector alignment with the hidden signal. The bulk spectrum follows a non-Marchenko–Pastur law, $\kappa(\mathrm{MP}_{\delta}\boxtimes\mathrm{MP}_{\gamma})$, reflecting the finite vocabulary structure. Signal recovery undergoes two successive BBP-type phase transitions, characterized by scalars $\delta,\gamma,\alpha={\bm{w}}^{\top}{\bm{R}}{\bm{w}}$ and $\kappa=\|{\bm{w}}\|^{2}$. The analysis demonstrates that optimal attention weights maximizing the signal-to-noise ratio $\alpha/\kappa$ are given by the normalized top eigenvector of the positional correlation matrix ${\bm{R}}$. Parameter-free causal self-attention with $\tau/d$ score scaling yields deterministic harmonic weights that improve signal recovery over mean pooling when early tokens carry more signal. Extensive simulations confirm the theoretical predictions.
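To make the role of $\alpha/\kappa$ concrete, the following minimal NumPy sketch compares three pooling weight vectors on a toy positional correlation matrix. Everything here is an illustrative assumption rather than the paper's setup: the matrix `R` (rank-one decaying signal plus a small ridge), the sequence length `T`, and the reading of "harmonic weights" as the pooled weights produced by uniform causal attention followed by mean pooling, i.e. $w_j = \frac{1}{T}\sum_{i=j}^{T} 1/i$.

```python
import numpy as np

def snr(w, R):
    """Return alpha / kappa, with alpha = w^T R w and kappa = ||w||^2."""
    w = np.asarray(w, dtype=float)
    return float(w @ R @ w) / float(w @ w)

T = 16  # sequence length (illustrative choice)

# Hypothetical positional correlation matrix: early tokens carry more signal.
s = 0.8 ** np.arange(T)
R = np.outer(s, s) + 0.05 * np.eye(T)  # symmetric positive definite

# Mean pooling: every position gets weight 1/T.
mean_w = np.full(T, 1.0 / T)

# Assumed "harmonic" weights: token j is attended to by every position i >= j
# with weight 1/i, and the T outputs are then averaged.
inv = 1.0 / np.arange(1, T + 1)
harmonic_w = np.cumsum(inv[::-1])[::-1] / T

# Top eigenvector of R (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(R)
top_w = eigvecs[:, -1]

results = {
    "mean": snr(mean_w, R),
    "harmonic": snr(harmonic_w, R),
    "top_eigenvector": snr(top_w, R),
}
for name, value in results.items():
    print(f"{name:>16s}: alpha/kappa = {value:.4f}")
```

On this toy `R`, the harmonic weights score well above mean pooling and the top eigenvector scores highest, which matches the qualitative ordering described in the summary. The gap depends entirely on the assumed decay profile `s`.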

Key takeaway

For AI Scientists and Research Scientists developing or optimizing attention-based models, the spectral properties of pooled representations determine when a hidden signal can be detected and recovered. The choice of attention weights ${\bm{w}}$ enters through the ratio $\alpha/\kappa$: raising it lowers the detection thresholds, and it is maximized by the normalized top eigenvector of the positional correlation matrix ${\bm{R}}$. Tailoring attention to the sequence's positional structure, rather than pooling uniformly, can therefore improve signal recovery in the high-dimensional, finite-vocabulary setting the paper analyzes.
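The optimality of the top eigenvector is an instance of the standard Rayleigh quotient bound. The short derivation below assumes only that ${\bm{R}}$ is symmetric; the symbols $\lambda_{\max}$ and ${\bm{u}}_{1}$ are notation introduced here, not taken from the paper.

```latex
\frac{\alpha}{\kappa}
  = \frac{{\bm{w}}^{\top}{\bm{R}}{\bm{w}}}{\|{\bm{w}}\|^{2}}
  \le \lambda_{\max}({\bm{R}})
  \quad \text{for all } {\bm{w}} \neq {\bm{0}},
\qquad
\text{with equality when } {\bm{w}} = {\bm{u}}_{1},
\quad {\bm{R}}{\bm{u}}_{1} = \lambda_{\max}({\bm{R}})\,{\bm{u}}_{1},
\quad \|{\bm{u}}_{1}\| = 1.
```

Because the ratio is invariant to rescaling ${\bm{w}}$, the normalization of ${\bm{u}}_{1}$ fixes $\kappa = 1$ without changing the attained value $\alpha/\kappa = \lambda_{\max}({\bm{R}})$.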

Key insights

Random matrix theory reveals how attention weights modulate signal recovery and detectability in high-dimensional sequence models.

Method

The study uses random matrix theory to analyze spectral properties of sample covariance matrices from pooled sequence representations, deriving exact limiting eigenvalue distributions and BBP-type phase transitions for signal recovery.
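For readers unfamiliar with BBP-type transitions, the sketch below simulates the classical single-spike version, not the paper's pooled-sequence model. It assumes a population covariance $I + \theta\,{\bm{v}}{\bm{v}}^{\top}$ with aspect ratio $c = p/n$; in that textbook setting the top sample eigenvalue detaches from the Marchenko–Pastur bulk edge $(1+\sqrt{c})^{2}$ only when $\theta > \sqrt{c}$, and then converges to $(1+\theta)(1+c/\theta)$. The function names and the values of `p`, `n`, and `theta` are illustrative choices.

```python
import numpy as np

def top_eigenvalue(theta, p, n, rng):
    """Largest sample-covariance eigenvalue for population covariance I + theta * e1 e1^T."""
    X = rng.standard_normal((n, p))
    X[:, 0] *= np.sqrt(1.0 + theta)  # variance 1 + theta along the spike direction e1
    S = X.T @ X / n                  # p x p sample covariance
    return float(np.linalg.eigvalsh(S)[-1])

def bbp_prediction(theta, c):
    """Classical BBP limit of the top eigenvalue for aspect ratio c = p / n."""
    if theta > np.sqrt(c):
        return (1.0 + theta) * (1.0 + c / theta)  # outlier separated from the bulk
    return (1.0 + np.sqrt(c)) ** 2                # stuck at the bulk edge

rng = np.random.default_rng(0)
p, n = 200, 800
c = p / n  # 0.25, so the detection threshold is sqrt(c) = 0.5

top_weak = top_eigenvalue(0.2, p, n, rng)    # below threshold
top_strong = top_eigenvalue(2.0, p, n, rng)  # above threshold

print(f"weak spike:   simulated {top_weak:.3f}, predicted {bbp_prediction(0.2, c):.3f}")
print(f"strong spike: simulated {top_strong:.3f}, predicted {bbp_prediction(2.0, c):.3f}")
```

At these finite sizes the simulated values only approximate the limits. The paper's setting differs in that the bulk is not Marchenko–Pastur and two successive transitions appear, so this sketch illustrates the mechanism only.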

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.