Mixed-Precision Communication-Avoiding SGD for Generalized Linear Models on GPUs
Summary
Mixed-precision communication-avoiding SGD (CA-SGD) targets the communication bottleneck in distributed SGD for generalized linear models by amortizing communication over s iterations. It replaces s consecutive AllReduces with a single AllReduce of an sb×sb Gram matrix, and it uses the matrix hardware and reduced-precision formats such as BF16 on modern GPUs to accelerate the Gram GEMM and shrink traffic. A finite-precision analysis decomposes the local rounding error across nine independent precision choices and leads to "Recipe C": store the input matrix and margin vector in BF16, compute the Gram matrix from BF16 inputs with FP32 accumulation, communicate it in FP32, and run the inner recurrence and weight updates in FP32. On NERSC Perlmutter A100 GPUs, Recipe C matched the FP32 SGD loss to within 0.5% on logistic, linear, and Poisson problems and ran 5.2–6.2× faster than FP32 SGD on the epsilon, SUSY, HIGGS, synth, and Poisson-synth datasets. A variant, "Recipe D", reached up to 6.8× speedup but with a larger relative loss gap.
Key takeaway
For Machine Learning Engineers optimizing distributed Generalized Linear Models on NVIDIA A100 GPUs, adopting mixed-precision Communication-Avoiding SGD (CA-SGD) can yield substantial speedups. You should implement "Recipe C" by storing inputs in BF16 and using FP32 accumulation for critical operations like Gram GEMM. This approach matches FP32 SGD accuracy within 0.5% while achieving 5.2–6.2× speedup, significantly reducing training time. Consider "Recipe D" for higher speedups on large-n datasets, but verify its accuracy for your specific problem.
Key insights
Mixed-precision CA-SGD accelerates distributed GLM training on GPUs in two ways: it replaces s per-iteration AllReduces with one Gram-matrix AllReduce, and it runs the Gram GEMM on BF16 inputs with FP32 accumulation.
Principles
- Communication, not computation, limits distributed SGD.
- Amortize communication over multiple iterations.
- Precision choices must align with error sensitivities.
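The amortization principle can be illustrated with a toy cost model. This is a rough sketch under a simplified alpha-beta assumption (a latency term that grows with log2 of the rank count plus a per-word transfer term); the function names, the `words_per_step` payload of the baseline, and the parameter values are hypothetical and are not taken from the paper.

```python
import math

def allreduce_time(words, ranks, alpha, beta):
    """Hypothetical alpha-beta estimate: log2(ranks) latency hops plus a per-word cost."""
    return alpha * math.log2(ranks) + beta * words

def sgd_comm_time(s, words_per_step, ranks, alpha, beta):
    """s iterations of plain distributed SGD: one AllReduce per iteration."""
    return s * allreduce_time(words_per_step, ranks, alpha, beta)

def ca_sgd_comm_time(s, b, ranks, alpha, beta):
    """s iterations of CA-SGD: a single AllReduce of the (s*b) x (s*b) Gram matrix."""
    return allreduce_time((s * b) ** 2, ranks, alpha, beta)

# Assumed machine parameters, for illustration only.
params = dict(ranks=64, alpha=1e-5, beta=1e-9)
small_batch = (ca_sgd_comm_time(8, 16, **params), sgd_comm_time(8, 16, **params))
large_batch = (ca_sgd_comm_time(8, 1024, **params), sgd_comm_time(8, 1024, **params))
```

Under these assumptions the latency term is paid once instead of s times, while the Gram payload grows quadratically in sb, so the benefit depends on s, b, and the size of each word on the wire.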
Method
The proposed "Recipe C" for mixed-precision CA-SGD involves BF16 storage for inputs/margins, BF16-input/FP32-accumulate for Gram GEMM, and FP32 for inner correction, residuals, master weights, and Gram AllReduce.
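The recurrence that these precision choices attach to can be sketched in NumPy. This is a minimal single-process sketch, not the authors' implementation: it assumes logistic regression with 0/1 labels, equal-size batches, and float64 throughout so the algebra can be checked. The names `sgd_steps` and `ca_sgd_steps` are hypothetical, and the comments only mark where Recipe C's precisions would apply.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_steps(X, y, w, batches, lr):
    """Reference mini-batch SGD: one gradient per batch (one AllReduce each when distributed)."""
    w = w.copy()
    for idx in batches:
        r = sigmoid(X[idx] @ w) - y[idx]            # residual of this batch
        w -= (lr / len(idx)) * (X[idx].T @ r)
    return w

def ca_sgd_steps(X, y, w, batches, lr):
    """s-step reformulation: one Gram matrix serves all s batches."""
    b = len(batches[0])                             # assumes every batch has b rows
    rows = np.concatenate(batches)
    Y = X[rows]               # (s*b, d) stacked batches; stored in BF16 under Recipe C
    G = Y @ Y.T               # (s*b, s*b) Gram matrix; BF16 inputs, FP32 accumulation, then the single AllReduce
    v = Y @ w                 # (s*b,) margins at the starting weights
    r = np.zeros(len(rows))
    for j in range(len(batches)):                   # inner recurrence; FP32 under Recipe C
        cur = slice(j * b, (j + 1) * b)
        prev = slice(0, j * b)
        # Margin of batch j after the j earlier updates, without touching w.
        margin = v[cur] - (lr / b) * (G[cur, prev] @ r[prev])
        r[cur] = sigmoid(margin) - y[rows[cur]]
    return w - (lr / b) * (Y.T @ r)                 # apply all s updates to the master weights

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 5))
y = (rng.random(64) < 0.5).astype(float)
w0 = rng.standard_normal(5)
batches = [np.arange(i * 4, (i + 1) * 4) for i in range(3)]   # s = 3, b = 4
gap = np.max(np.abs(sgd_steps(X, y, w0, batches, 0.1) - ca_sgd_steps(X, y, w0, batches, 0.1)))
```

In exact arithmetic the two functions return the same weights, so any gap observed in a mixed-precision run comes from rounding alone.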
In practice
- Use BF16 for input/margin storage on A100 GPUs.
- Employ FP32 accumulation for Gram GEMM with BF16 inputs.
- Keep inner correction and master weights in FP32.
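The effect of these choices can be emulated on a CPU. The sketch below is an illustration under stated assumptions, not a GPU implementation: `to_bf16` is a hypothetical helper that rounds float32 values to the nearest BF16 value (finite inputs only) and keeps them in float32 containers, and NumPy's float32 matmul stands in for a GEMM with FP32 accumulation.

```python
import numpy as np

def to_bf16(x):
    """Round float32 values to the nearest BF16 value (ties to even), returned as float32.
    Emulation for finite inputs only; NaN/Inf handling is omitted."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    round_bias = ((bits >> np.uint32(16)) & np.uint32(1)) + np.uint32(0x7FFF)
    return ((bits + round_bias) & np.uint32(0xFFFF0000)).view(np.float32)

def gram_bf16_in_fp32_acc(Y):
    """Gram matrix from BF16-rounded inputs with float32 accumulation."""
    Yb = to_bf16(Y)           # storage precision: BF16 values
    return Yb @ Yb.T          # each product of two BF16 values is exact in FP32; sums stay in FP32

rng = np.random.default_rng(0)
Y = rng.standard_normal((16, 256)).astype(np.float32)
G = gram_bf16_in_fp32_acc(Y)
ref = Y.astype(np.float64) @ Y.astype(np.float64).T
rel_err = np.linalg.norm(G - ref) / np.linalg.norm(ref)

# Why master weights stay in FP32: a small update survives in FP32 but is lost in BF16.
w = np.array([1.0], dtype=np.float32)
step = np.array([1e-3], dtype=np.float32)
w_fp32 = w + step             # retains the update
w_bf16 = to_bf16(w + step)    # rounds back to 1.0, because BF16 spacing near 1 is 2**-7
```

Here the only loss in the Gram matrix comes from rounding the inputs, which is the kind of sensitivity difference the precision analysis is meant to separate.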
Topics
- Communication-Avoiding SGD
- Mixed Precision Training
- Generalized Linear Models
- NVIDIA A100 GPUs
- Distributed Optimization
- Finite-Precision Analysis
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.