StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
Summary
StoSignSGD is an optimization algorithm designed to fix the non-convergence of SignSGD, particularly on the non-smooth objectives that are common in modern machine learning, including large language model (LLM) training. It injects structural stochasticity into the sign operator while keeping the update step unbiased. On the theory side, the authors report a sharp convergence rate that matches the lower bound in convex optimization, and complexity bounds for non-convex non-smooth optimization that improve on prior results by dimensional factors. Empirically, StoSignSGD is reported to be stable and efficient, especially in low-precision FP8 pretraining, where AdamW fails and StoSignSGD yields a 1.44x to 2.14x speedup. It also provides substantial gains over both AdamW and SignSGD when fine-tuning 7B LLMs on mathematical reasoning tasks. The authors additionally develop a sign conversion framework, which they use to analyze the algorithm's core components.
Key takeaway
For AI engineers training large language models, especially in low-precision settings or on non-smooth objectives, StoSignSGD offers a robust and efficient alternative to AdamW and SignSGD. Consider evaluating it in your training pipelines in two cases: FP8 pretraining, where it is reported to provide significant speedups and stability, and fine-tuning on tasks such as mathematical reasoning, where it is reported to improve performance.
Key insights
StoSignSGD resolves SignSGD's non-convergence on non-smooth objectives by injecting unbiased structural stochasticity.
Principles
- Unbiased stochasticity improves sign-based optimization.
- Non-smooth objectives require specialized optimizers.
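To make the first principle concrete, here is a minimal sketch of one well-known way a plain sign update can go wrong under gradient noise: the average of sign(g) can point in the opposite direction from the average gradient. The noise distribution and the helper names (`sample_gradient`, `estimate_means`) are hypothetical, chosen only for illustration. The summary does not say that this is the specific failure mode the paper analyzes.

```python
import random


def sample_gradient(rng):
    """Hypothetical noisy 1-D gradient: -1 with prob. 0.9, +11 with prob. 0.1.

    Its true mean is 0.9 * (-1) + 0.1 * 11 = +0.2, i.e. positive.
    """
    return -1.0 if rng.random() < 0.9 else 11.0


def sign(x):
    """Deterministic sign: +1, 0, or -1."""
    return (x > 0) - (x < 0)


def estimate_means(n_samples, seed=0):
    """Return (average gradient, average sign of gradient) over n_samples draws."""
    rng = random.Random(seed)
    grads = [sample_gradient(rng) for _ in range(n_samples)]
    mean_grad = sum(grads) / n_samples
    mean_sign = sum(sign(g) for g in grads) / n_samples
    return mean_grad, mean_sign


if __name__ == "__main__":
    mean_grad, mean_sign = estimate_means(100_000)
    # The average gradient is positive, but the average sign is negative,
    # so a plain sign update would move the wrong way on average.
    print(f"mean gradient: {mean_grad:+.3f}")
    print(f"mean sign:     {mean_sign:+.3f}")
```

In this toy setting the deterministic sign is a biased direction estimate, which is the kind of problem an unbiased randomized sign is meant to avoid.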
Method
StoSignSGD injects structural stochasticity into the sign operator used in SignSGD-style updates, constructed so that the resulting update step stays unbiased.
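The summary does not spell out the paper's exact construction, so the following is only a sketch of one common randomized sign operator, not the authors' algorithm. It assumes each gradient coordinate is clipped to a known range [-bound, bound] and returns +1 with probability (1 + g/bound)/2, which makes the expected output equal to g/bound. The names `stochastic_sign`, `sto_sign_step`, and the `bound` parameter are hypothetical.

```python
import random


def stochastic_sign(g, bound, rng):
    """Randomized sign: returns +1.0 or -1.0 with E[result] = clip(g) / bound.

    Assumption (not from the source): g is clipped to [-bound, bound].
    """
    clipped = max(-bound, min(bound, g))
    p_plus = 0.5 * (1.0 + clipped / bound)  # probability of returning +1
    return 1.0 if rng.random() < p_plus else -1.0


def sto_sign_step(params, grads, lr, bound, rng):
    """One illustrative update: move each coordinate by lr in a randomized sign direction."""
    return [p - lr * stochastic_sign(g, bound, rng) for p, g in zip(params, grads)]


def mean_stochastic_sign(g, bound, n_samples, seed=0):
    """Empirical mean of the randomized sign, to check unbiasedness numerically."""
    rng = random.Random(seed)
    total = sum(stochastic_sign(g, bound, rng) for _ in range(n_samples))
    return total / n_samples


if __name__ == "__main__":
    # The empirical mean should be close to g / bound = 0.3.
    print(mean_stochastic_sign(0.3, 1.0, 200_000))
    rng = random.Random(1)
    print(sto_sign_step([1.0, -2.0], [0.5, -0.5], lr=0.1, bound=1.0, rng=rng))
```

Each individual update still has magnitude `lr` per coordinate, as in SignSGD, but averaged over the randomness it follows the scaled gradient rather than its sign.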
In practice
- Use StoSignSGD for low-precision FP8 LLM pretraining.
- Apply StoSignSGD to fine-tune 7B LLMs on math tasks.
Topics
- StoSignSGD
- SignSGD
- Large Language Models
- Non-smooth Optimization
- Low-Precision Training
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.