StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences) · Depth: Expert, quick

Summary

StoSignSGD is a new optimization algorithm designed to address the non-convergence of SignSGD, particularly on the non-smooth objectives common in modern machine learning, such as those arising in large language models (LLMs). It injects structural stochasticity into the sign operator while keeping the update step unbiased. Theoretically, StoSignSGD achieves a sharp convergence rate that matches the lower bound in convex optimization, and it improves complexity bounds by dimensional factors in non-convex non-smooth optimization. Empirically, it trains stably and efficiently, notably in low-precision FP8 pretraining, where AdamW fails and StoSignSGD yields a 1.44x to 2.14x speedup. StoSignSGD also delivers substantial performance gains over AdamW and SignSGD when fine-tuning 7B LLMs on mathematical reasoning tasks. The authors additionally developed a sign conversion framework to analyze the algorithm's core components.

Key takeaway

For AI engineers training large language models, especially in low-precision settings or with non-smooth objectives, StoSignSGD offers a robust and efficient alternative to AdamW and SignSGD. Consider integrating it into your training pipelines, particularly for FP8 pretraining, where it provides significant speedups and stability, or for fine-tuning on tasks such as mathematical reasoning, where it improves performance.

Key insights

StoSignSGD resolves SignSGD's non-convergence on non-smooth objectives by injecting unbiased structural stochasticity.

Method

StoSignSGD injects structural stochasticity into the sign operator while maintaining an unbiased update step, then applies this to gradient-based optimization.
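The summary does not give the operator's formula, so the following is only a minimal sketch of one standard way to make a sign-valued update unbiased, not the authors' algorithm. It assumes a hypothetical magnitude bound `bound` with |g_i| <= `bound`, and draws each coordinate as +1 with probability (1 + g_i / bound) / 2, so that `bound * s` equals the gradient in expectation. The names `stochastic_sign`, `bound`, and `n_samples` are illustrative.

```python
import numpy as np

def stochastic_sign(g, bound, rng, n_samples=1):
    """Draw sign vectors s in {-1, +1} with E[bound * s] = g.

    Illustrative assumption: every |g_i| <= bound. Values outside that
    range are clipped, which would reintroduce bias.
    """
    g = np.asarray(g, dtype=float)
    # Probability of emitting +1 for each coordinate.
    p_plus = 0.5 * (1.0 + np.clip(g / bound, -1.0, 1.0))
    u = rng.random((n_samples,) + g.shape)
    return np.where(u < p_plus, 1.0, -1.0)

rng = np.random.default_rng(0)
g = np.array([0.3, -0.1, 0.0, 0.8])  # toy gradient
bound = 1.0

# Averaging many draws recovers the gradient (up to sampling noise),
# whereas the deterministic sign discards all magnitude information.
samples = stochastic_sign(g, bound, rng, n_samples=200_000)
estimate = bound * samples.mean(axis=0)
deterministic = np.sign(g)

# One hypothetical sign-based descent step on parameters x.
x = np.zeros_like(g)
lr = 0.01
x = x - lr * bound * stochastic_sign(g, bound, rng)[0]
```

Each individual update still has only sign-valued coordinates, which is what keeps it cheap to store and communicate; the randomness is what restores unbiasedness on average. How StoSignSGD itself structures that randomness is described in the original paper.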

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.