Taming Curvature: Architecture Warm-Up for Stable Transformer Training
Summary
A research paper published on 2026-06-15 addresses the instability of billion-parameter Transformer training, which often shows up as transient loss spikes or outright divergence. The authors introduce a fast online estimator for the largest eigenvalue of the preconditioned Hessian (the preconditioned curvature), built on warm-started power iteration with Hessian-vector products. They support the estimator with both theoretical analysis and experiments, and report that it makes accurate per-iteration curvature tracking feasible at billion-parameter scale. Using this tool, they find that training instabilities correlate with surges in preconditioned curvature, and that this curvature grows with network depth. Based on these findings, the paper proposes "architecture warm-up," a technique that progressively increases network depth during training to keep the preconditioned Hessian under control. In the reported experiments, the estimator tracks curvature effectively, and architecture warm-up reduces instabilities relative to current state-of-the-art stabilization methods without slowing convergence.
Key takeaway
Machine learning engineers training large-scale Transformers should consider architecture warm-up to mitigate common training instabilities. Growing network depth progressively keeps the preconditioned Hessian under control and helps prevent the loss spikes and divergence that waste significant compute. According to the paper, this stabilization does not cost convergence speed, which makes training billion-parameter models more reliable and efficient.
Key insights
Controlling preconditioned curvature through architecture warm-up stabilizes billion-parameter Transformer training, and a fast online estimator makes tracking that curvature efficient at this scale.
Principles
- Training instabilities coincide with curvature surges.
- Curvature grows with Transformer network depth.
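- Edge of Stability theory links optimization stability to curvature: training becomes unstable when the largest (preconditioned) Hessian eigenvalue is too large relative to the learning rate.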
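The link between curvature and stability in the last principle can be made concrete with the standard local quadratic argument. The derivation below is a textbook sketch for plain preconditioned gradient descent, not the paper's own analysis; the symbols \eta (learning rate), P (preconditioner), H (Hessian), and e_t (error along one eigendirection) are notation assumed here for illustration. With momentum or adaptive optimizers the constant on the right-hand side changes, but the dependence on the largest preconditioned eigenvalue is the same.

```latex
\begin{align*}
\theta_{t+1} &= \theta_t - \eta\, P^{-1} \nabla L(\theta_t),
  \qquad L(\theta) \approx \tfrac{1}{2}\,(\theta - \theta^\ast)^\top H\,(\theta - \theta^\ast) \\
e_{t+1} &= (1 - \eta\lambda)\, e_t
  \qquad \text{along an eigendirection of } P^{-1}H \text{ with eigenvalue } \lambda \\
|1 - \eta\lambda| < 1 \;&\Longleftrightarrow\; 0 < \lambda < \frac{2}{\eta}
  \qquad \Longrightarrow \qquad \lambda_{\max}\!\left(P^{-1}H\right) < \frac{2}{\eta}
\end{align*}
```

Under this simplified model, a surge in the largest preconditioned eigenvalue past the threshold makes the error along that direction grow from step to step, which is consistent with the loss spikes described above.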
Method
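The paper proposes architecture warm-up: progressively growing network depth to control the preconditioned Hessian and stabilize Transformer training. Curvature is monitored with a fast online estimator of the largest preconditioned Hessian eigenvalue, based on warm-started power iteration with Hessian-vector products.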
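To illustrate how a warm-started power iteration over Hessian-vector products might look, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`hvp`, `top_curvature`, `make_grad`), the diagonal preconditioner, and the toy quadratic loss are all assumptions made for illustration, and the Hessian-vector product uses central differences of the gradient as a stand-in for the autodiff version a real training loop would use.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def hvp(grad_fn, theta, v, eps=1e-4):
    """Hessian-vector product by central differences of the gradient.

    Stand-in for an autodiff HVP; exact up to rounding for quadratic losses.
    """
    g_plus = grad_fn([t + eps * x for t, x in zip(theta, v)])
    g_minus = grad_fn([t - eps * x for t, x in zip(theta, v)])
    return [(a - b) / (2 * eps) for a, b in zip(g_plus, g_minus)]

def top_curvature(grad_fn, theta, precond_diag, v, iters=1):
    """Power iteration on the symmetric matrix P^{-1/2} H P^{-1/2}.

    This matrix has the same eigenvalues as P^{-1} H. `v` is the starting
    vector; passing the vector returned by the previous call is the warm start.
    Returns (eigenvalue estimate, unit vector to reuse on the next call).
    """
    inv_sqrt = [1.0 / math.sqrt(p) for p in precond_diag]
    v = normalize(v)
    lam = 0.0
    for _ in range(iters):
        u = [s * x for s, x in zip(inv_sqrt, v)]
        w = [s * h for s, h in zip(inv_sqrt, hvp(grad_fn, theta, u))]
        lam = sum(a * b for a, b in zip(v, w))  # Rayleigh quotient (v has unit norm)
        v = normalize(w)
    return lam, v

def make_grad(h):
    """Gradient of the toy loss 0.5 * sum(h_i * theta_i^2), whose Hessian is diag(h)."""
    return lambda theta: [hi * t for hi, t in zip(h, theta)]

theta = [0.5, -0.3, 0.8]
precond = [1.0, 1.0, 5.0]  # hypothetical diagonal preconditioner P

# Cold start: many iterations from an arbitrary vector.
lam, v = top_curvature(make_grad([1.0, 4.0, 10.0]), theta, precond,
                       [1.0, 1.0, 1.0], iters=30)

# Next "training step": the Hessian has drifted slightly.
# Warm start from the previous vector and run a single iteration.
lam_next, v = top_curvature(make_grad([1.0, 4.2, 10.0]), theta, precond, v, iters=1)
```

In this toy setup the raw Hessian's largest eigenvalue is 10, but after preconditioning the sharpest direction is the second coordinate, so `lam` comes out at approximately 4.0 and `lam_next` at approximately 4.2. The second call needs only one Hessian-vector product because the previous eigenvector is already a good starting point, which is the property that makes per-iteration tracking affordable.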
In practice
- Track curvature during large Transformer training.
- Apply architecture warm-up for stability.
- Reduce compute waste from training divergence.
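As a rough picture of how these practices could fit into a training loop, the sketch below pairs a depth schedule with a curvature check. The summary does not specify the paper's actual growth schedule or how new layers are initialized, so the linear schedule, the function names (`active_depth`, `curvature_too_high`), and the default threshold of 2 (the plain gradient-descent stability limit) are all assumptions for illustration; the right constant depends on the optimizer and its momentum settings.

```python
def active_depth(step, warmup_steps, start_depth, full_depth):
    """Number of layers to train at `step`, growing linearly to full depth.

    Hypothetical schedule: integer arithmetic keeps the result exact.
    """
    if step >= warmup_steps:
        return full_depth
    return start_depth + (full_depth - start_depth) * step // warmup_steps

def curvature_too_high(lam, lr, threshold=2.0):
    """Flag a step whose learning rate times top preconditioned eigenvalue
    exceeds `threshold` (2.0 is the limit for plain gradient descent)."""
    return lr * lam > threshold

# Example: a 24-layer model warmed up from 4 layers over 1000 steps.
depths = [active_depth(s, 1000, 4, 24) for s in (0, 250, 500, 1000, 5000)]
flags = [curvature_too_high(lam, lr=1e-3) for lam in (1500.0, 2500.0)]
```

A training loop could call `active_depth` to decide how many layers to run at each step and log the curvature estimate alongside the loss, so that a flagged step can be inspected before it turns into a divergence.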
Topics
- Transformers
- Training Stability
- Curvature Estimation
- Hessian Eigenvalue
- Architecture Warm-Up
- Edge of Stability Theory
Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.