Large-Step Training Dynamics of a Two-Factor Linear Transformer Model

2026-05-21 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This research analyzes the large-step training dynamics of a simplified two-factor linear transformer model and shows that a finite learning rate can change the training attractors themselves, not merely the speed of convergence. The study reduces gradient descent to a two-dimensional product map with an effective step-size parameter, μ. For 0 < μ < 2, an explicit invariant Chebyshev ellipse separates forward-invariant regions and carries chaotic off-balance dynamics, while the balanced scalar attractors can be transversely attracting. Key findings include specific learning-rate thresholds for monotone convergence (μ ≤ 2√2 - 2), catapult convergence (2√2 - 2 < μ ≤ 1), and the emergence of period-two cycles (1 < μ < √5 - 1), bounded chaos, or divergence (μ > 2). The work also extends to mini-batch gradient descent, which acts as random switching between maps, so that atypical batches can push iterates across the full-batch separatrix.
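
The thresholds quoted above can be collected into a small lookup. This is only a sketch: the function name `regime` and the label strings are illustrative, the boundaries are copied from the summary, and the label for the range between √5 - 1 and 2 is an assumption, since the summary does not say exactly where bounded chaos begins.

```python
import math

# Thresholds as quoted in the summary.
MONOTONE_MAX = 2 * math.sqrt(2) - 2   # about 0.828
CONVERGE_MAX = 1.0
PERIOD_TWO_MAX = math.sqrt(5) - 1     # about 1.236
BOUNDED_MAX = 2.0


def regime(mu):
    """Map an effective step size mu to the regime named in the summary."""
    if mu <= 0:
        return "outside the summarized range"
    if mu <= MONOTONE_MAX:
        return "monotone convergence"
    if mu <= CONVERGE_MAX:
        return "catapult convergence"
    if mu < PERIOD_TWO_MAX:
        return "period-two cycle"
    if mu <= BOUNDED_MAX:
        # Assumption: the summary gives no finer boundaries in this range.
        return "bounded, beyond period two (possibly chaotic)"
    return "divergence"


for mu in (0.5, 0.9, 1.1, 1.5, 2.5):
    print(mu, regime(mu))
```

Such a lookup is only as good as the estimate of μ that feeds it, and the summary does not say how μ should be measured in a full-scale model.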

Key takeaway

For machine learning engineers optimizing transformer models: a large learning rate is not simply a faster version of gradient flow; it can change the outcome of training. If the loss oscillates or fails to converge, the effective step size μ may have crossed a stability threshold, so that training settles on a periodic or chaotic attractor instead of a single zero-error solution. Consider decaying the learning rate after warmup until μ falls below 1, and keep the attention factors balanced to maintain stability.

Key insights

Large learning rates can shift transformer training from convergence to cycles or chaos, not just accelerate it.

Principles

Finite-step GD can change attractors, not just accelerate convergence.
Factor imbalance reduces stable learning rate thresholds.
Mini-batching acts as random switching between dynamical maps.
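
The second principle can be checked directly on the product map Φμ(a,b)=(a-(ab-μ)b, b-(ab-μ)a) stated under Method. The following is a local linearization worked out here for illustration, not necessarily the paper's own argument; the symbols r, J, and v are introduced only for this sketch, and the last equivalence uses ab = μ (a zero-error point), where a² + b² = 2μ + (a - b)².

```latex
\begin{align*}
r &= ab - \mu, \qquad
\Phi_\mu(a,b) = (a - r b,\; b - r a), \\
J\big|_{r=0} &=
\begin{pmatrix} 1 - b^2 & -ab \\ -ab & 1 - a^2 \end{pmatrix}
= I - v v^{\top}, \qquad v = (b,\, a)^{\top}, \\
\operatorname{spec}\bigl(J\big|_{r=0}\bigr) &= \{\, 1,\; 1 - (a^2 + b^2) \,\}, \\
\bigl|1 - (a^2 + b^2)\bigr| < 1
&\iff a^2 + b^2 < 2
\iff \mu < 1 - \tfrac{1}{2}(a - b)^2 .
\end{align*}
```

The eigenvalue 1 corresponds to sliding along the curve of zero-error points ab = μ. In the balanced case a = b the bound reduces to μ < 1, and any imbalance a ≠ b tightens it.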

Method

Gradient descent on a one-prompt linear self-attention objective reduces to a two-dimensional product map Φ_μ(a, b) = (a - (ab - μ)b, b - (ab - μ)a), where μ is the effective step size.
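
The map is short enough to iterate directly. The sketch below is a minimal illustration, not the paper's code: the helper names `phi` and `iterate`, the balanced starting point (0.3, 0.3), and the two values of μ are choices made here.

```python
def phi(a, b, mu):
    """One step of the product map: both factors move along the residual ab - mu."""
    r = a * b - mu
    return a - r * b, b - r * a


def iterate(a, b, mu, steps):
    """Apply phi repeatedly and return the list of visited (a, b) points."""
    traj = [(a, b)]
    for _ in range(steps):
        a, b = phi(a, b, mu)
        traj.append((a, b))
    return traj


# Balanced start (a == b) with mu = 0.5: the product ab settles at mu.
a_end, b_end = iterate(0.3, 0.3, mu=0.5, steps=200)[-1]
final_product = a_end * b_end

# Same start with mu = 1.1, inside the period-two range quoted in the summary:
# the product ab ends up alternating between two values instead of settling.
products = [a * b for a, b in iterate(0.3, 0.3, mu=1.1, steps=500)]

print("mu=0.5 final product:", final_product)
print("mu=1.1 last three products:", products[-3:])
```

Starting with a = b keeps the iterates balanced, so this only probes the scalar dynamics; an unequal start is needed to see the off-balance behavior the summary describes.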

In practice

Use warmup to keep μ below the instability thresholds.
Implement qk-layernorm or weight-tying to balance factors.
Monitor mode-wise products, not just average loss.
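
The last point can be illustrated with a toy set of modes. The sketch below assumes, purely for illustration, that the modes decouple and that each one follows the product map with its own effective step size; the values in `mus`, the helper names, and the balanced start are hypothetical and not taken from the paper.

```python
def step(a, b, mu):
    """One product-map update for a single mode."""
    r = a * b - mu
    return a - r * b, b - r * a


def run_modes(mus, a0=0.3, b0=0.3, steps=400):
    """Iterate each mode independently; return per-mode residual histories."""
    histories = []
    for mu in mus:
        a, b = a0, b0
        residuals = []
        for _ in range(steps):
            a, b = step(a, b, mu)
            residuals.append(a * b - mu)  # distance of the product from its target
        histories.append(residuals)
    return histories


# Hypothetical per-mode effective step sizes: two below 1, one above 1.
mus = [0.3, 0.5, 1.1]
histories = run_modes(mus)

# Per-mode view: largest |residual| over the last 10 steps of each mode.
tail_residual = [max(abs(r) for r in h[-10:]) for h in histories]

# Aggregate view: mean squared residual across modes at the final step.
mean_loss = sum(h[-1] ** 2 for h in histories) / len(histories)

print("per-mode tail residuals:", tail_residual)
print("mean squared residual:", mean_loss)
```

In this toy run the two modes with μ below 1 reach zero residual while the third keeps oscillating. The averaged loss reports only a single nonzero number, whereas the per-mode residuals identify which mode is responsible.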

Topics

Transformer Training
Gradient Descent Dynamics
Learning Rate Stability
Chaotic Systems
Mini-batch SGD
In-Context Learning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.