Taming Curvature: Architecture Warm-Up for Stable Transformer Training

2026-06-15 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A research paper published on 2026-06-15 addresses the instability of billion-parameter Transformer training, which often shows up as transient loss spikes or outright divergence. The authors introduce a fast online estimator for the curvature, defined as the largest eigenvalue of the preconditioned Hessian, based on a warm-started power iteration over Hessian-vector products. They support the estimator with both theoretical guarantees and experiments, showing that accurate per-iteration curvature tracking is feasible at billion-parameter scale. With this tool, they find that training instabilities coincide with surges in preconditioned curvature, and that curvature grows with network depth. Based on these findings, the paper proposes "architecture warm-up," a technique that progressively increases network depth during training to keep the preconditioned Hessian under control. In the reported experiments, the approach reduces instabilities relative to current state-of-the-art stabilization methods without slowing convergence.

Key takeaway

Machine learning engineers training large-scale Transformers should consider architecture warm-up to mitigate training instabilities. Growing network depth progressively keeps the preconditioned Hessian under control, which helps prevent the loss spikes and divergence that waste compute. According to the paper, this stabilization does not slow convergence, so training billion-parameter models becomes more reliable at no cost in speed.

Key insights

Architecture warm-up controls preconditioned curvature and stabilizes billion-parameter Transformer training; a fast online estimator makes per-iteration curvature tracking feasible at that scale.

Principles

Training instabilities coincide with curvature surges.
Curvature grows with Transformer network depth.
Edge of Stability theory links optimization stability to the largest Hessian eigenvalue relative to the learning rate.
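
As a reminder of why the largest eigenvalue is the quantity to watch, the classical stability condition for preconditioned gradient descent on a local quadratic model can be written as below. This is a textbook sketch, not the paper's exact statement: the symbols \eta (learning rate), P (a positive-definite preconditioner), and H (the Hessian) are notation assumed here, and optimizers with momentum change the constant on the right-hand side.

```latex
w_{t+1} = w_t - \eta\, P^{-1} \nabla L(w_t),
\qquad
\text{no divergence on the quadratic model}
\iff
\lambda_{\max}\!\left(P^{-1} H\right) \le \frac{2}{\eta}
```

A surge in the left-hand quantity at a fixed learning rate therefore pushes training past the threshold, which is consistent with the loss spikes the summary describes.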

Method

Architecture warm-up: progressively grow network depth to control the preconditioned Hessian and stabilize Transformer training. Curvature is monitored with a fast online estimator of the largest preconditioned Hessian eigenvalue.
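
The estimator idea can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a diagonal preconditioner (`p_diag`, a hypothetical stand-in for an adaptive optimizer's scaling), approximates Hessian-vector products by finite differences of the gradient instead of automatic differentiation, and uses a small quadratic loss so the exact answer is available for comparison. All function names are illustrative.

```python
import numpy as np

def hvp(grad_fn, w, v, eps=1e-5):
    """Hessian-vector product via central finite differences of the gradient."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

def top_preconditioned_eig(grad_fn, w, p_diag, v0, n_iters=2):
    """A few power-iteration steps on P^{-1/2} H P^{-1/2}, started from v0.

    p_diag holds the positive diagonal entries of an assumed preconditioner P.
    Returns (eigenvalue estimate, unit vector to reuse as the next warm start).
    """
    s = 1.0 / np.sqrt(p_diag)
    v = v0 / np.linalg.norm(v0)
    lam = 0.0
    for _ in range(n_iters):
        u = s * hvp(grad_fn, w, s * v)   # apply P^{-1/2} H P^{-1/2} to v
        lam = float(v @ u)               # Rayleigh quotient of the current v
        norm = np.linalg.norm(u)
        if norm == 0.0:
            break
        v = u / norm
    return lam, v

def demo(n_steps=200, seed=0):
    """Track curvature along a toy preconditioned gradient-descent run."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((10, 10)))
    eigs = np.concatenate([np.linspace(0.1, 1.0, 9), [5.0]])
    A = (q * eigs) @ q.T                      # Hessian of the loss 0.5 * w^T A w
    p_diag = rng.uniform(0.5, 2.0, size=10)   # hypothetical preconditioner
    grad_fn = lambda w: A @ w

    w = rng.standard_normal(10)
    v = rng.standard_normal(10)               # random start, used only once
    lam = 0.0
    for _ in range(n_steps):
        # Warm start: reuse the previous step's vector instead of a fresh one.
        lam, v = top_preconditioned_eig(grad_fn, w, p_diag, v)
        w = w - 0.1 * grad_fn(w) / p_diag     # one preconditioned update
    s = 1.0 / np.sqrt(p_diag)
    exact = float(np.linalg.eigvalsh(s[:, None] * A * s[None, :]).max())
    return lam, exact
```

Because the vector is carried over between training steps, each step needs only a couple of Hessian-vector products once the estimate has locked on. Note that plain power iteration converges to the eigenvalue of largest magnitude, so this sketch tracks the top eigenvalue only when that eigenvalue is positive and dominant, as it is in the toy problem.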

In practice

Track curvature during large Transformer training.
Apply architecture warm-up for stability.
Reduce compute waste from training divergence.
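
One way to picture architecture warm-up is as a schedule over the number of active residual blocks. The sketch below is a guess at the general shape, not the paper's procedure: the linear schedule, the parameter names (`warmup_steps`, `min_depth`, `max_depth`), and the choice to skip inactive blocks so they act as the identity are all assumptions made for illustration, and the summary does not say how the authors schedule or initialize new layers.

```python
import numpy as np

def active_depth(step, warmup_steps, min_depth, max_depth):
    """Hypothetical linear schedule: number of blocks active at `step`."""
    if step >= warmup_steps:
        return max_depth
    frac = step / warmup_steps
    return min_depth + int(frac * (max_depth - min_depth))

def forward(x, blocks, depth):
    """Toy residual stack that applies only the first `depth` blocks.

    Skipped blocks contribute nothing, so the residual path carries x through
    unchanged (an illustrative choice, not taken from the paper).
    """
    for weight in blocks[:depth]:
        x = x + np.tanh(x @ weight)
    return x

# Usage: grow from 2 to 12 blocks over the first 1000 steps.
rng = np.random.default_rng(0)
blocks = [0.1 * rng.standard_normal((8, 8)) for _ in range(12)]
x = rng.standard_normal((4, 8))
for step in (0, 500, 1000):
    depth = active_depth(step, warmup_steps=1000, min_depth=2, max_depth=12)
    y = forward(x, blocks, depth)
```

In a real training run the schedule would more plausibly be driven by the tracked curvature than by a fixed step count, but the summary does not specify which the authors use.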

Topics

Transformers
Training Stability
Curvature Estimation
Hessian Eigenvalue
Architecture Warm-Up
Edge of Stability Theory

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.