Spectra: Rethinking Optimizers for LLMs Under Spectral Anisotropy
Summary
Spectra is an optimizer for Large Language Models (LLMs) that targets the highly anisotropic gradient signals seen during training. It builds on the observation that recurrent linguistic structures concentrate gradient energy in a small "spike" subspace, about 1.5% of directions, which dominates optimizer statistics and suppresses learning of context-specific, long-tail information. Spectra tracks this dominant low-rank subspace with cached, warm-started power iteration and applies low-rank spectral shaping to it, suppressing the spike without amplifying the noise-sensitive spectral tail. On LLaMA3-8B trained on 50B tokens, Spectra reaches the same target loss 30% faster than AdamW, reduces optimizer-state memory by 49.25%, and improves average downstream accuracy by 1.62%. Compared to Muon, its optimizer processing time is 5.1x faster and its average accuracy is 0.66% higher.
Key takeaway
For AI engineers and research scientists optimizing LLM training, Spectra may accelerate convergence and improve model performance. Its targeted spectral shaping addresses gradient anisotropy with less optimizer state than AdamW and less optimizer processing time than Muon: in the reported LLaMA3-8B experiments it reached the target loss 30% faster than AdamW, cut optimizer-state memory by 49.25%, and ran the optimizer step 5.1x faster than Muon. Consider evaluating Spectra when training large-scale LLMs on diverse linguistic data, where faster convergence and better downstream accuracy matter most.
Key insights
LLM gradients exhibit a "spike-tail" anisotropy where dominant linguistic structures suppress long-tail semantic learning.
Principles
- Gradient anisotropy is consistent across LLM scales and training stages.
- Spike-dominated second-moment accumulation contracts tail updates.
- Smaller singular directions encode sparser semantics and higher relative variance.
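One way to read the second principle, assuming an Adam-style elementwise update (the notation and the spike/tail split below are illustrative and are not taken from the paper; bias correction is omitted):

```latex
\begin{aligned}
g_t &= g_t^{\mathrm{spike}} + g_t^{\mathrm{tail}},
  \qquad \|g_t^{\mathrm{spike}}\| \gg \|g_t^{\mathrm{tail}}\| \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t \odot g_t \\
\Delta\theta_t &= -\eta\, \frac{m_t}{\sqrt{v_t} + \epsilon}
\end{aligned}
```

On coordinates where the spike dominates, $v_t$ is set almost entirely by $g_t^{\mathrm{spike}}$. The tail part of $m_t$ is then divided by a denominator of spike magnitude, so its effective step is scaled by roughly $|g^{\mathrm{tail}}| / |g^{\mathrm{spike}}|$ rather than being of order $\eta$.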
Method
Spectra tracks the low-rank spike subspace via warm-started power iteration, then applies singular-value shrinking to the spike while leaving the tail unchanged, avoiding dense second-order statistics.
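The method described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `track_spike_subspace` and `shrink_spike`, the `cap` parameter, and the choice of clipping spike singular values at a fixed cap are all hypothetical, since the summary does not specify the exact shrinking rule.

```python
import numpy as np


def track_spike_subspace(grad, basis, steps=1):
    """Refine a cached orthonormal basis (n x k) toward grad's top-k right singular subspace."""
    for _ in range(steps):
        # Block power iteration on grad.T @ grad, warm-started from the cached basis.
        basis, _ = np.linalg.qr(grad.T @ (grad @ basis))
    return basis


def shrink_spike(grad, basis, cap):
    """Clip singular values inside the tracked subspace at `cap` (assumed rule)."""
    coords = grad @ basis  # (m x k) slice of grad seen through the subspace
    u, s, vt = np.linalg.svd(coords, full_matrices=False)
    excess = np.maximum(s - cap, 0.0)  # how far each spike value exceeds the cap
    # Subtract only the excess; everything orthogonal to the basis is untouched.
    return grad - (u * excess) @ vt @ basis.T


# Synthetic gradient: two large "spike" singular values over a flat tail.
rng = np.random.default_rng(0)
m, n, k = 64, 48, 2
left, _ = np.linalg.qr(rng.standard_normal((m, n)))
right, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = np.concatenate([[100.0, 80.0], np.linspace(1.0, 0.5, n - 2)])
grad = (left * sigma) @ right.T

basis, _ = np.linalg.qr(rng.standard_normal((n, k)))  # cold start
for _ in range(6):  # one cached step per simulated optimizer step
    basis = track_spike_subspace(grad, basis, steps=1)

shaped = shrink_spike(grad, basis, cap=2.0)
shaped_sigma = np.linalg.svd(shaped, compute_uv=False)
```

Because the basis is carried over between calls, each optimizer step pays for only one pair of thin matrix products and a QR on an n x k matrix, plus an SVD of an m x k matrix, rather than a full decomposition of the gradient. In this toy setup the two spike singular values are pulled down to the cap while the tail values are left as they were.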
In practice
- Use a rank ratio of 1.5% for spike compression.
- Employ a single cached power-iteration step for efficiency.
- Expect better learning-rate robustness than with AdamW.
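To make the 1.5% rank ratio concrete, the helper below turns it into a spike rank for a given weight matrix. It is a sketch only: the name `spike_rank` is hypothetical, and applying the ratio to the smaller matrix dimension with a floor of one direction is an assumption, not something the summary states.

```python
def spike_rank(shape, rank_ratio=0.015):
    """Number of tracked spike directions for a weight matrix of the given shape."""
    rows, cols = shape
    # Assumption: the ratio is taken over min(rows, cols), the maximum possible rank.
    return max(1, round(rank_ratio * min(rows, cols)))


square_rank = spike_rank((4096, 4096))
wide_rank = spike_rank((4096, 14336))
tiny_rank = spike_rank((32, 32))
```

Under this assumption a 4096-wide projection tracks a few dozen directions, which is why a single cached power-iteration step per update stays cheap relative to the matrix products already in the training step.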
Topics
- LLM Optimizers
- Gradient Anisotropy
- Spectral Shaping
- Low-Rank Approximation
- Optimization Efficiency
Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher, AI Scientist, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.