Spectra: Rethinking Optimizers for LLMs Under Spectral Anisotropy

· Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

Spectra is a novel optimizer designed for Large Language Models (LLMs) that addresses the issue of highly anisotropic gradient signals during training. It identifies that recurrent linguistic structures concentrate gradient energy into a small "spike" subspace, comprising about 1.5% of directions, which dominates optimizer statistics and suppresses learning of context-specific, long-tail information. Spectra mitigates this by tracking the dominant low-rank spike subspace using cached, warm-started power iteration and applying low-rank spectral shaping. This approach suppresses the spike without amplifying noise-sensitive spectral tails, leading to improved training efficiency and performance. On LLaMA3-8B trained on 50B tokens, Spectra achieves the same target loss 30% faster than AdamW, reduces optimizer-state memory by 49.25%, and improves average downstream accuracy by 1.62%. Compared to Muon, Spectra is 5.1x faster in optimizer processing time and improves average accuracy by 0.66%.

Key takeaway

For AI engineers and research scientists optimizing LLM training, Spectra offers a targeted fix for gradient anisotropy: it shapes only the low-rank spike subspace rather than maintaining dense second-order statistics. In the reported LLaMA3-8B experiments it reaches the same target loss 30% faster than AdamW while cutting optimizer-state memory by 49.25%, and it spends 5.1x less optimizer processing time than Muon. Consider evaluating Spectra when training large-scale LLMs where optimizer memory and convergence speed are the bottleneck.

Key insights

LLM gradients exhibit a "spike-tail" anisotropy where dominant linguistic structures suppress long-tail semantic learning.
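
One rough way to look for this spike-tail pattern is to measure how much of a gradient matrix's squared singular-value mass sits in its top few directions. The sketch below is only an illustrative diagnostic, not the paper's procedure: the function name `spike_energy_fraction`, the use of a full SVD, and rounding the direction count up with `ceil` are all assumptions, and the 1.5% default simply mirrors the figure quoted in the summary.

```python
import math
import numpy as np

def spike_energy_fraction(grad, top_frac=0.015):
    """Fraction of squared singular-value energy held by the top directions.

    grad     : 2-D gradient matrix (hypothetical input for this sketch)
    top_frac : share of directions treated as the "spike" (assumed 1.5%)
    """
    # Singular values are returned in descending order.
    sing = np.linalg.svd(grad, compute_uv=False)
    # Number of spike directions, at least one (rounding up is an assumption).
    k = max(1, math.ceil(top_frac * sing.size))
    energy = sing ** 2
    return float(energy[:k].sum() / energy.sum())

# Toy matrix: one dominant direction (singular value 10) and 99 unit ones.
toy_grad = np.diag(np.array([10.0] + [1.0] * 99))
toy_fraction = spike_energy_fraction(toy_grad)
```

In this toy case the top 2 of 100 directions hold roughly half of the total energy, which is the kind of concentration the insight describes. A full SVD is used here only for clarity; it would be too expensive to run on every step of real training.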

Method

Spectra tracks the low-rank spike subspace via warm-started power iteration, then applies singular-value shrinking to the spike while leaving the tail unchanged, avoiding dense second-order statistics.
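
The two steps in this description can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the uniform `shrink` factor, the QR-based orthogonalization, and the single iteration per step are all hypothetical choices made for the example. The basis `v_cache` plays the role of the cached subspace that warm-starts the next step.

```python
import numpy as np

def track_spike_subspace(grad, v_cache, n_iters=1):
    """Warm-started block power iteration on grad^T grad.

    grad    : (m, n) gradient matrix
    v_cache : (n, k) orthonormal basis cached from the previous step
    Returns an updated (n, k) orthonormal basis for the spike subspace.
    """
    v = v_cache
    for _ in range(n_iters):
        # One power step followed by re-orthogonalization via QR.
        v, _ = np.linalg.qr(grad.T @ (grad @ v))
    return v

def shape_spike(grad, v, shrink=0.1):
    """Scale the component of grad inside span(v) by `shrink`.

    The component orthogonal to span(v), i.e. the tail, is left unchanged.
    A uniform factor is an assumption made for this sketch.
    """
    spike_part = (grad @ v) @ v.T  # projection of grad onto the spike subspace
    return grad - (1.0 - shrink) * spike_part

# Toy gradient with two dominant singular values and a flat tail.
rng = np.random.default_rng(0)
m, n, k = 8, 6, 2
u0, _ = np.linalg.qr(rng.standard_normal((m, n)))
v0, _ = np.linalg.qr(rng.standard_normal((n, n)))
sing = np.array([100.0, 50.0, 1.0, 0.8, 0.5, 0.2])
grad = (u0 * sing) @ v0.T

# Cold start once, then reuse the cached basis on every later step.
v_cache, _ = np.linalg.qr(rng.standard_normal((n, k)))
for _ in range(5):  # in real training, grad would drift slowly between steps
    v_cache = track_spike_subspace(grad, v_cache, n_iters=1)

shaped = shape_spike(grad, v_cache, shrink=0.1)
shaped_sing = np.linalg.svd(shaped, compute_uv=False)
```

In this toy setup the two spike singular values drop from 100 and 50 to about 10 and 5, while the tail values are untouched. The only state carried between steps is the (n, k) basis, which is the sense in which a low-rank tracker avoids dense second-order statistics.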

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.