StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models

2026-04-16 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, quick

Summary

StoSignSGD is a new optimization algorithm designed to address the non-convergence issues of SignSGD, particularly on non-smooth objectives common in modern machine learning, such as those found in large language models (LLMs). It injects structural stochasticity into the sign operator while maintaining an unbiased update step. Theoretically, StoSignSGD achieves a sharp convergence rate matching the lower bound in convex optimization and improves complexity bounds by dimensional factors in non-convex non-smooth optimization. Empirically, it demonstrates robust stability and superior efficiency, especially in low-precision FP8 pretraining where AdamW fails, yielding a 1.44x to 2.14x speedup. Additionally, StoSignSGD provides substantial performance gains when fine-tuning 7B LLMs on mathematical reasoning tasks compared to AdamW and SignSGD. The authors also developed a sign conversion framework to analyze its core components.

Key takeaway

For AI engineers training large language models, especially in low-precision settings or on non-smooth objectives, StoSignSGD offers a robust and efficient alternative to AdamW and SignSGD. Consider it for FP8 pretraining, where the reported results show speedups and stable training in a setting where AdamW fails, and for fine-tuning on tasks such as mathematical reasoning, where it outperforms both baselines.

Key insights

StoSignSGD resolves SignSGD's non-convergence on non-smooth objectives by injecting unbiased structural stochasticity.
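The summary does not spell out why the plain sign operator is problematic, so the following is only a generic illustration of one well-known issue: the sign of a stochastic gradient is a biased estimate of the true gradient direction. The two-point distribution below is hypothetical and is not taken from the paper; it shows a case where the mean gradient is positive while the mean sign is negative.

```python
# Hypothetical scalar stochastic gradient: (value, probability) pairs.
# These numbers are illustrative only, not from the paper.
outcomes = [(-1.0, 0.75), (5.0, 0.25)]


def sign(x: float) -> float:
    """Deterministic sign: +1, -1, or 0."""
    return float((x > 0) - (x < 0))


# Exact expectations under the two-point distribution.
mean_gradient = sum(g * p for g, p in outcomes)   # true descent signal
mean_sign = sum(sign(g) * p for g, p in outcomes)  # what SignSGD follows on average

print(mean_gradient)  # 0.5
print(mean_sign)      # -0.5
```

Here a sign-based step would, on average, move against the true gradient, which is the kind of bias an unbiased stochastic sign is meant to avoid.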

Principles

Unbiased stochasticity improves sign-based optimization.
Non-smooth objectives require specialized optimizers.

Method

StoSignSGD injects structural stochasticity into the sign operator used in SignSGD's update, while keeping the resulting update step unbiased.
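As a rough sketch of what an unbiased randomized sign can look like, the code below uses a standard construction: for a gradient entry g with |g| <= B, return +1 with probability (1 + g/B)/2 and -1 otherwise, so that B times the result has expectation g. This is an assumption for illustration; the summary does not give the paper's exact operator, and the names `stochastic_sign`, `sto_sign_step`, and `bound` are hypothetical.

```python
import random


def stochastic_sign(g: float, bound: float, rng: random.Random) -> float:
    """Randomized sign with E[bound * result] == g, assuming |g| <= bound.

    Illustrative construction only; not confirmed to be the paper's operator.
    """
    if abs(g) > bound:
        raise ValueError("requires |g| <= bound")
    p_plus = 0.5 * (1.0 + g / bound)  # probability of returning +1
    return 1.0 if rng.random() < p_plus else -1.0


def sto_sign_step(x, grad, bound, lr, rng):
    """One coordinate-wise update using the randomized sign (hypothetical helper)."""
    return [
        xi - lr * bound * stochastic_sign(gi, bound, rng)
        for xi, gi in zip(x, grad)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    g, bound, n = 0.3, 1.0, 100_000
    # Empirical mean of bound * stochastic_sign(g) should be close to g.
    estimate = sum(bound * stochastic_sign(g, bound, rng) for _ in range(n)) / n
    print(f"target {g}, empirical mean {estimate:.3f}")
```

Each coordinate still moves by a fixed magnitude of lr * bound, as in SignSGD, but the direction is random in a way that averages out to the true gradient.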

In practice

Use StoSignSGD for low-precision FP8 LLM pretraining.
Apply StoSignSGD to fine-tune 7B LLMs on math tasks.

Topics

StoSignSGD
SignSGD
Large Language Models
Non-smooth Optimization
Low-Precision Training

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.