Mixed-Precision Communication-Avoiding SGD for Generalized Linear Models on GPUs

2026-06-18 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

Mixed-Precision Communication-Avoiding SGD (CA-SGD) for Generalized Linear Models on GPUs addresses the communication bottleneck in distributed SGD by amortizing communication over s iterations. The method replaces s consecutive AllReduces with a single AllReduce of an sb×sb Gram matrix, and uses modern GPUs' matrix hardware and reduced-precision formats such as BF16 to accelerate the Gram GEMM and shrink traffic. A finite-precision analysis decomposes the local rounding error across nine independent precision choices, which leads to "Recipe C". This recipe stores the input matrix and margin vector in low precision (BF16), computes the Gram matrix with high-precision accumulation (FP32), communicates it in high precision (FP32), and performs the inner recurrence and weight updates in high precision (FP32). On NERSC Perlmutter A100 GPUs, "Recipe C" matched FP32 SGD loss within 0.5% on logistic, linear, and Poisson problems and achieved a 5.2–6.2× speedup over FP32 SGD on the epsilon, SUSY, HIGGS, synth, and Poisson-synth datasets. A variant, "Recipe D", reached up to 6.8× speedup but with a larger relative loss gap.
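
The communication trade-off described above comes down to simple arithmetic. The sketch below is illustrative only: it assumes b is the mini-batch size, that the Gram matrix is sent as a dense sb×sb array, and it uses hypothetical values s = 8 and b = 32 that are not taken from the paper. The function names are likewise made up for this example.

```python
def gram_allreduce_bytes(s: int, b: int, bytes_per_elem: int = 4) -> int:
    """Payload of one AllReduce of a dense (s*b) x (s*b) Gram matrix.

    bytes_per_elem = 4 corresponds to FP32, 2 to BF16.
    """
    n = s * b
    return n * n * bytes_per_elem


def allreduce_calls(iterations: int, s: int) -> int:
    """Number of AllReduce calls when one call covers s iterations."""
    return -(-iterations // s)  # ceiling division


s, b = 8, 32  # hypothetical values, chosen only for illustration
print(gram_allreduce_bytes(s, b))                          # 262144 bytes in FP32
print(allreduce_calls(1000, 1), allreduce_calls(1000, s))  # 1000 125
```

Fewer, larger messages pay the per-call latency once per s iterations, at the cost of a payload that grows quadratically in sb, which is why s cannot be increased without limit.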

Key takeaway

Machine learning engineers training distributed Generalized Linear Models on NVIDIA A100 GPUs can cut training time with mixed-precision Communication-Avoiding SGD (CA-SGD). Implement "Recipe C": store inputs and margins in BF16, and keep FP32 for Gram GEMM accumulation, the Gram AllReduce, the inner recurrence, and the weight updates. In the reported experiments this matched FP32 SGD loss within 0.5% at a 5.2–6.2× speedup. Consider "Recipe D" for higher speedups (up to 6.8×) on large-n datasets, but verify its accuracy on your specific problem, since its relative loss gap is larger.

Key insights

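Mixed-precision CA-SGD accelerates distributed GLM training on GPUs in two ways: it cuts communication to one Gram-matrix AllReduce per s iterations, and it speeds up the Gram GEMM by feeding BF16 inputs to the GPU's matrix hardware.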

Principles

Communication, not computation, limits distributed SGD.
Amortize communication over multiple iterations.
Precision choices must align with error sensitivities.
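
The second principle can be illustrated with a minimal single-process sketch. It assumes the standard s-step unrolling of mini-batch SGD for a GLM; this summary does not spell out the paper's exact recurrence, which may differ. All function names here are hypothetical, and equal-sized batches are assumed. The point is that once the Gram matrix G = Y Yᵀ and the initial margins Y w₀ are known, s updates can be replayed without touching the data again, so a distributed run would only need to reduce those quantities once per s iterations.

```python
import math
import random


def squared_residual(z, t):
    return z - t  # linear regression: margin minus target


def logistic_residual(z, t):
    return 1.0 / (1.0 + math.exp(-z)) - t  # logistic regression, t in {0, 1}


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Compute A^T r."""
    return [sum(A[i][k] * r[i] for i in range(len(A))) for k in range(len(A[0]))]


def plain_sgd(batches, targets, w, lr, residual=squared_residual):
    """One gradient step per mini-batch, each using the latest weights."""
    for X, y in zip(batches, targets):
        r = [residual(z, t) for z, t in zip(matvec(X, w), y)]
        w = [wk - lr * gk for wk, gk in zip(w, rmatvec(X, r))]
    return w


def s_step_sgd(batches, targets, w, lr, residual=squared_residual):
    """Same s steps, driven by one Gram matrix and the margins at w_0."""
    Y = [row for X in batches for row in X]           # stacked batches, (s*b) x d
    t = [ti for y in targets for ti in y]
    G = [[sum(p * q for p, q in zip(ri, rj)) for rj in Y] for ri in Y]
    v = matvec(Y, w)                                  # margins at the starting weights
    b = len(batches[0])
    r = [0.0] * len(Y)
    for j in range(len(batches)):                     # inner recurrence, no data access
        done = j * b                                  # rows belonging to earlier batches
        for p in range(done, done + b):
            z = v[p] - lr * sum(G[p][q] * r[q] for q in range(done))
            r[p] = residual(z, t[p])
    return [wk - lr * gk for wk, gk in zip(w, rmatvec(Y, r))]


random.seed(0)
s, b, d = 4, 2, 3
batches = [[[random.uniform(-1, 1) for _ in range(d)] for _ in range(b)] for _ in range(s)]
targets = [[float(random.randint(0, 1)) for _ in range(b)] for _ in range(s)]
w0 = [0.0] * d
w_a = plain_sgd(batches, targets, w0, 0.1, logistic_residual)
w_b = s_step_sgd(batches, targets, w0, 0.1, logistic_residual)
print(max(abs(x - y) for x, y in zip(w_a, w_b)))      # difference is rounding-level
```

In exact arithmetic the two routines return the same weights; the s-step version simply moves all data-dependent work into G and the initial margins.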

Method

The proposed "Recipe C" for mixed-precision CA-SGD involves BF16 storage for inputs/margins, BF16-input/FP32-accumulate for Gram GEMM, and FP32 for inner correction, residuals, master weights, and Gram AllReduce.
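
The Gram GEMM step of this recipe can be emulated on a CPU to see what "BF16-input/FP32-accumulate" means numerically. The following is a sketch under stated assumptions, not the paper's kernel: `to_bf16`, `to_f32`, and `gram_bf16_in_fp32_acc` are hypothetical helpers, rounding is round-to-nearest-even, special values (NaN, infinity) are not handled, and the summation order is a plain left-to-right loop, whereas GPU matrix hardware may sum in a different order.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to BF16: keep the top 16 bits of the FP32 pattern, nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round-to-nearest-even on dropped bits
    bits &= 0xFFFF0000                    # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def gram_bf16_in_fp32_acc(Y):
    """Gram matrix Y Y^T with BF16-rounded inputs and an FP32 accumulator."""
    Yq = [[to_bf16(v) for v in row] for row in Y]   # low-precision storage
    n = len(Yq)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            acc = 0.0
            for p, q in zip(Yq[i], Yq[j]):
                acc = to_f32(acc + p * q)           # round after every add
            G[i][j] = G[j][i] = acc                 # Gram matrices are symmetric
    return G


print(to_bf16(1.0 + 2.0 ** -8))                     # 1.0 (tie rounds to even)
print(gram_bf16_in_fp32_acc([[1.0, 2.0], [3.0, 4.0]]))
# [[5.0, 11.0], [11.0, 25.0]]
```

The product of two BF16 values fits exactly in FP32, so in this emulation the only rounding inside the dot product comes from the additions, which is where the accumulator precision matters.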

In practice

Use BF16 for input/margin storage on A100 GPUs.
Employ FP32 accumulation for Gram GEMM with BF16 inputs.
Keep inner correction and master weights in FP32.
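
A small experiment shows why the last item matters. This is an illustration with made-up numbers, not a result from the paper: a weight starting at 1.0 receives 100 updates of size 0.001, once with the weight stored in BF16 and once in FP32. The helpers `to_bf16`, `to_f32`, and `run_updates` are hypothetical and emulate rounding on the CPU.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to BF16 (top 16 bits of FP32, nearest-even); finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def run_updates(w0: float, step: float, n: int, cast) -> float:
    """Apply w <- cast(w - step) n times, storing w in the format given by cast."""
    w = cast(w0)
    for _ in range(n):
        w = cast(w - step)
    return w


w_bf16 = run_updates(1.0, 1e-3, 100, to_bf16)
w_fp32 = run_updates(1.0, 1e-3, 100, to_f32)
print(w_bf16)   # 1.0: every update is rounded away
print(w_fp32)   # close to 0.9
```

Just below 1.0 the spacing of BF16 values is 2⁻⁸ ≈ 0.0039, so a step of 0.001 is less than half a spacing and the stored weight never moves. An FP32 master copy absorbs the same updates with only a small rounding error.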

Topics

Communication-Avoiding SGD
Mixed Precision Training
Generalized Linear Models
NVIDIA A100 GPUs
Distributed Optimization
Finite-Precision Analysis

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.