Mixed-Precision Communication-Avoiding SGD for Generalized Linear Models on GPUs

Source: stat.ML updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

Mixed-precision communication-avoiding SGD (CA-SGD) for generalized linear models on GPUs targets the communication bottleneck in distributed SGD by amortizing communication over s iterations. It replaces s consecutive AllReduces with a single AllReduce of an sb×sb Gram matrix (b here is presumably the mini-batch size), and uses the matrix hardware and reduced-precision formats such as BF16 on modern GPUs to accelerate the Gram GEMM and shrink traffic. A finite-precision analysis decomposes the local rounding error into nine independent precision choices and leads to "Recipe C": store the input matrix and margin vector in BF16, compute the Gram matrix from BF16 inputs with FP32 accumulation, communicate it in FP32, and perform the inner recurrence and weight updates in FP32. On NERSC Perlmutter A100 GPUs, Recipe C matched the FP32 SGD loss within 0.5% on logistic, linear, and Poisson problems, with a reported 5.1–6.8× speedup over FP32 SGD on the epsilon, SUSY, HIGGS, synth, and Poisson-synth datasets. A variant, "Recipe D", reached up to 6.8× speedup but with a larger relative loss gap.
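
To make the s-step idea concrete, here is a minimal single-process sketch in NumPy, not the paper's implementation. It assumes a logistic model with labels in {0, 1}, equal-size batches, and features partitioned across ranks (the layout under which each plain SGD step needs an AllReduce to form the batch margins). The AllReduce is simulated by summing per-rank partial products. The function names, the choice to reduce the margin vector alongside the Gram matrix, and all parameters are illustrative assumptions. In FP64 the unrolled recurrence should reproduce plain SGD up to rounding.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_reference(X, y, w, batches, eta):
    """Plain mini-batch SGD: one margin computation per iteration
    (one AllReduce per iteration when features are split across ranks)."""
    w = w.copy()
    for idx in batches:
        Xb = X[idx]
        r = sigmoid(Xb @ w) - y[idx]            # residual of the GLM
        w -= (eta / len(idx)) * (Xb.T @ r)
    return w

def ca_sgd(X, y, w, batches, eta, s, n_ranks=4):
    """s-step variant: one simulated reduction per group of s batches."""
    w = w.copy()
    col_blocks = np.array_split(np.arange(X.shape[1]), n_ranks)
    for k in range(0, len(batches), s):
        group = batches[k:k + s]
        rows = np.concatenate(group)
        b = len(group[0])                        # assumes equal batch sizes
        Y = X[rows]                              # (s*b) x d stacked batches
        # Each simulated rank owns a column block; the sums stand in for
        # a single AllReduce of the Gram matrix and the stacked margins.
        G = sum(Y[:, c] @ Y[:, c].T for c in col_blocks)   # (s*b) x (s*b)
        v = sum(Y[:, c] @ w[c] for c in col_blocks)        # margins at w_k
        r = np.zeros(len(rows))
        for j in range(len(group)):
            lo, hi = j * b, (j + 1) * b
            # Correct the stale margins with the updates from earlier batches.
            z = v[lo:hi] - (eta / b) * (G[lo:hi, :lo] @ r[:lo])
            r[lo:hi] = sigmoid(z) - y[rows[lo:hi]]
        w -= (eta / b) * (Y.T @ r)               # local to each rank's columns
    return w

rng = np.random.default_rng(0)
n, d, b, s, eta = 512, 40, 8, 4, 0.5
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = (rng.random(n) < 0.5).astype(float)
batches = list(rng.permutation(n).reshape(-1, b))
w0 = np.zeros(d)

w_ref = sgd_reference(X, y, w0, batches, eta)
w_ca = ca_sgd(X, y, w0, batches, eta, s)
max_diff = float(np.max(np.abs(w_ref - w_ca)))
print("max |w_sgd - w_ca_sgd| =", max_diff)
```

With b = 8 and s = 4 the Gram matrix is 32×32, and the 64 SGD steps need 16 simulated reductions instead of 64.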

Key takeaway

For machine learning engineers training distributed generalized linear models on NVIDIA A100 GPUs, mixed-precision CA-SGD can yield substantial speedups. Recipe C is the configuration to start with: store the inputs and margins in BF16, use FP32 accumulation in the Gram GEMM, and keep the inner recurrence, weight updates, and Gram AllReduce in FP32. In the reported experiments this matched the FP32 SGD loss within 0.5% at a 5.2–6.2× speedup. Consider Recipe D for higher speedups on large-n datasets, but verify its accuracy on your own problem, since its relative loss gap is larger.

Key insights

Mixed-precision CA-SGD accelerates distributed GLM training on GPUs on two fronts: it communicates once per s iterations instead of once per iteration, and it runs the Gram GEMM on BF16 inputs with FP32 accumulation.

Method

Recipe C uses BF16 storage for the input matrix and margin vector, BF16 inputs with FP32 accumulation for the Gram GEMM, and FP32 for the inner correction, residuals, master weights, and Gram AllReduce.
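
This precision assignment can be tried out on a CPU with a rough simulation. The sketch below assumes a logistic model, emulates BF16 by rounding float32 values to the nearest BF16 value, and uses a float32 matmul as a stand-in for a BF16-input, FP32-accumulate GEMM. The names to_bf16, ca_sgd_block, and train, and the synthetic data, are hypothetical choices and not taken from the paper. It only illustrates where each precision sits and makes no claim to reproduce the reported loss gaps or speedups.

```python
import numpy as np

def to_bf16(x):
    """Round to the nearest BF16 value (ties to even); result held in float32."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ca_sgd_block(Y, y_blk, w, b, eta, work):
    """One s-step block; `work` is the dtype of the FP32 (or FP64) parts."""
    G = Y @ Y.T                  # Recipe C path: BF16 values, float32 accumulate
    v = Y @ w                    # stacked margins at the block's starting weights
    if work is np.float32:
        v = to_bf16(v)           # margin vector stored in BF16 (assumed mapping)
    # G would be AllReduced here in FP32; a single process needs no reduction.
    r = np.zeros(Y.shape[0], dtype=work)
    scale = work(eta / b)
    for lo in range(0, Y.shape[0], b):
        hi = lo + b
        z = v[lo:hi] - scale * (G[lo:hi, :lo] @ r[:lo])   # inner correction
        r[lo:hi] = sigmoid(z) - y_blk[lo:hi]              # residuals
    return w - scale * (Y.T @ r)                          # master-weight update

def train(X, y, blocks, b, eta, recipe_c):
    work = np.float32 if recipe_c else np.float64
    Xs = to_bf16(X) if recipe_c else X        # inputs stored in BF16
    w = np.zeros(X.shape[1], dtype=work)      # master weights in FP32 (or FP64)
    for rows in blocks:
        w = ca_sgd_block(Xs[rows], y[rows].astype(work), w, b, eta, work)
    return w.astype(np.float64)

def logistic_loss(X, y, w):
    z = X @ w
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

rng = np.random.default_rng(0)
n, d, b, s, eta = 4096, 64, 32, 4, 0.1
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d) / np.sqrt(d)
y = (rng.random(n) < sigmoid(X @ w_true)).astype(np.float64)
blocks = rng.permutation(n).reshape(-1, s * b)   # each row: s batches of b samples

loss_ref = logistic_loss(X, y, train(X, y, blocks, b, eta, recipe_c=False))
loss_c = logistic_loss(X, y, train(X, y, blocks, b, eta, recipe_c=True))
rel_gap = abs(loss_c - loss_ref) / loss_ref
print(f"FP64 loss {loss_ref:.5f}, Recipe C sketch loss {loss_c:.5f}, gap {rel_gap:.2e}")
```

The printed gap compares the final training loss against an all-FP64 run on the same batches. On real hardware the GEMM would run on the GPU's matrix units and the Gram matrix would pass through an FP32 AllReduce.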

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.