Large-Step Training Dynamics of a Two-Factor Linear Transformer Model
Summary
This research analyzes the large-step training dynamics of a simplified two-factor linear transformer model and shows that a finite learning rate can change the attractors of training, not merely the speed of convergence. The study reduces gradient descent to a two-dimensional product map with an effective step-size parameter μ. For 0 < μ < 2, an explicit invariant Chebyshev ellipse separates forward-invariant regions; the ellipse carries off-balanced chaotic dynamics, while balanced scalar attractors can be transversely attracting. The key findings are learning-rate thresholds: monotone convergence for μ ≤ 2√2 - 2, catapult convergence for 2√2 - 2 < μ ≤ 1, period-two cycles for 1 < μ < √5 - 1, and beyond that bounded chaos or divergence (μ > 2). The work also extends to mini-batch gradient descent, which acts as random switching between maps, so that atypical batches can push iterates across the full-batch separatrix.
Key takeaway
For machine learning engineers optimizing transformer models: a large learning rate is not just a faster version of gradient flow; it can change the outcome of training. If the loss oscillates or fails to converge, the effective step size μ may have crossed a stability threshold, so that training settles on a periodic or chaotic attractor instead of a single zero-error solution. Consider decaying the learning rate after warmup so that μ falls below 1, which promotes convergence, and keep the attention factors balanced to maintain stability.
Key insights
Large learning rates can shift transformer training from convergence to cycles or chaos, not just accelerate it.
Principles
- Finite-step GD can change attractors, not just accelerate convergence.
- Factor imbalance lowers the learning-rate threshold for stability.
- Mini-batching acts as random switching between dynamical maps.
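The mini-batch principle can be illustrated with a toy simulation. This sketch assumes that each batch induces a map of the same product form with its own effective step size, drawn uniformly at random from a small list. That modeling choice, the name `switched_trajectory`, and the numeric values are assumptions for illustration, not the paper's construction.

```python
import random


def step(a, b, mu):
    """One application of the product map (a, b) -> (a - (ab - mu) b, b - (ab - mu) a)."""
    r = a * b - mu
    return a - r * b, b - r * a


def switched_trajectory(a, b, batch_mus, n_steps, seed=0):
    """Apply a randomly chosen per-batch map at every step; return the products a*b."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    products = []
    for _ in range(n_steps):
        mu_t = rng.choice(batch_mus)  # hypothetical: one mu per sampled batch
        a, b = step(a, b, mu_t)
        products.append(a * b)
    return products


if __name__ == "__main__":
    # Balanced start and two assumed per-batch step sizes.
    tail = switched_trajectory(0.1, 0.1, [0.3, 0.7], 1000)[-5:]
    print([round(p, 3) for p in tail])
```

When every entry of `batch_mus` is equal, the simulation reduces to iterating a single deterministic map. Spreading the values apart gives a crude picture of how batch-to-batch variation keeps the product from settling on one value.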
Method
Gradient descent on a one-prompt linear self-attention objective reduces to a two-dimensional product map Φμ(a,b)=(a-(ab-μ)b, b-(ab-μ)a), where μ is the effective step-size.
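A minimal sketch of iterating this map in plain Python, to make the regimes in the summary concrete. The function names `step` and `product_trajectory`, the balanced starting point, and the step counts are illustrative assumptions, not taken from the paper. Note that any point with ab = μ is a fixed point of the map, since the residual ab - μ vanishes there.

```python
def step(a, b, mu):
    """One application of the product map Phi_mu from the text."""
    r = a * b - mu  # residual of the product against mu
    return a - r * b, b - r * a


def product_trajectory(a, b, mu, n_steps):
    """Iterate the map n_steps times and record the product a*b after each step."""
    products = []
    for _ in range(n_steps):
        a, b = step(a, b, mu)
        products.append(a * b)
    return products


if __name__ == "__main__":
    # Balanced start a = b = 0.1 (an arbitrary choice for illustration).
    for mu in (0.5, 1.1):
        tail = product_trajectory(0.1, 0.1, mu, 2000)[-4:]
        print(mu, [round(p, 4) for p in tail])
```

With these assumed settings, μ = 0.5 should settle on the product ab = μ, while μ = 1.1, which lies inside the period-two range quoted in the summary, should end up alternating between two product values.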
In practice
- Use warmup to keep μ below the instability thresholds.
- Apply qk-layernorm or weight tying to keep the factors balanced.
- Monitor mode-wise products, not just average loss.
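As one way to act on the last item, the sketch below classifies the tail of a recorded sequence of per-mode products as settled, period-two, or neither. The function name `classify_tail`, the tolerance, and the labels are hypothetical; how the products are extracted from a real model is left out.

```python
def classify_tail(products, tol=1e-6):
    """Label the last four recorded products of one mode.

    Returns "settled" if consecutive values agree, "period-two" if values
    repeat every second step but not every step, and "other" otherwise.
    """
    if len(products) < 4:
        raise ValueError("need at least 4 recorded products")
    p0, p1, p2, p3 = products[-4:]
    repeats_every_step = abs(p3 - p2) < tol and abs(p2 - p1) < tol
    repeats_every_two = abs(p3 - p1) < tol and abs(p2 - p0) < tol
    if repeats_every_step:
        return "settled"
    if repeats_every_two:
        return "period-two"
    return "other"


if __name__ == "__main__":
    # Hand-written sequences standing in for logged per-mode products.
    print(classify_tail([0.2, 0.4, 0.5, 0.5, 0.5, 0.5]))   # settled
    print(classify_tail([0.9, 0.73, 1.37, 0.73, 1.37]))    # period-two
```

Running a check like this per mode, rather than on a single averaged loss curve, makes it easier to see which mode is oscillating.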
Topics
- Transformer Training
- Gradient Descent Dynamics
- Learning Rate Stability
- Chaotic Systems
- Mini-batch SGD
- In-Context Learning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.