Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization
Summary
This work generalizes prior efforts in distributed-memory optimization by introducing HybridSGD, a 2D parallel Stochastic Gradient Descent method. HybridSGD combines 1D s-step SGD with 1D Federated SGD with Averaging (FedAvg) and offers a continuous performance tradeoff between these two baseline algorithms. The authors provide theoretical analysis of the convergence, computation, communication, and memory tradeoffs. The method is implemented in C++ and MPI and evaluated on a Cray EX supercomputing system. On binary classification tasks using logistic regression on LIBSVM datasets, HybridSGD achieved better convergence than FedAvg at similar processor scales, with speedups of 5.3x over s-step SGD and up to 121x over FedAvg.
Key takeaway
For Machine Learning Engineers and AI Architects optimizing distributed SGD on supercomputing systems, HybridSGD targets the communication bottleneck. Consider this 2D parallel method, especially for convex optimization tasks such as logistic regression, where the reported speedups are 5.3x over s-step SGD and up to 121x over FedAvg. The 2D processor grid gives finer control over the communication-computation tradeoff, which helps scalability on modern distributed-memory hardware.
Key insights
HybridSGD combines 1D s-step SGD and FedAvg via a 2D processor grid for communication-efficient distributed optimization.
Principles
- Communication cost limits distributed SGD scalability.
- 2D processor grids enable continuous performance tradeoffs.
- Balancing convergence and runtime requires hyperparameter tuning.
Method
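HybridSGD runs FedAvg (row-wise) and s-step SGD (column-wise) on a 2D processor grid: each FedAvg iteration makes calls to s-step SGD, subject to the condition s <= tau. Here s is the number of SGD iterations grouped into one s-step call, and tau is the number of local iterations in one FedAvg round.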
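The nesting of s-step calls inside FedAvg rounds can be sketched as a communication schedule. This is a minimal illustration, not the authors' implementation: the names `hybrid_schedule` and `count_events`, the event labels, and the assumption that each s-step call ends in exactly one reduction and each FedAvg round ends in exactly one model average are hypothetical readings of the summary above.

```python
def hybrid_schedule(total_iters, s, tau):
    """Return the communication events seen by one processor, in order.

    Assumptions (not taken from the paper's code): every s-step call
    advances at most `s` SGD iterations and ends with one reduction, and
    every FedAvg round of at most `tau` iterations ends with one average.
    """
    if not (1 <= s <= tau):
        raise ValueError("expected 1 <= s <= tau")
    events = []
    done = 0
    while done < total_iters:
        # One FedAvg round: at most tau local iterations.
        round_len = min(tau, total_iters - done)
        local = 0
        while local < round_len:
            # One s-step call: at most s iterations per reduction.
            step = min(s, round_len - local)
            local += step
            events.append(("sstep_reduce", done + local))
        done += round_len
        events.append(("fedavg_average", done))
    return events


def count_events(events):
    """Count how many times each event kind occurs."""
    counts = {}
    for kind, _ in events:
        counts[kind] = counts.get(kind, 0) + 1
    return counts


if __name__ == "__main__":
    for kind, iteration in hybrid_schedule(total_iters=8, s=2, tau=4):
        print(f"after iteration {iteration}: {kind}")
```

Under these assumptions, raising s lowers the number of s-step reductions and raising tau lowers the number of model averages, which is the pair of knobs the method exposes. The sketch counts events only; it says nothing about message sizes or convergence.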
In practice
- Implement 2D data partitioning for sparse matrices.
- Tune 's' and 'tau' for optimal communication-computation balance.
- Use C++ with MPI and Intel oneAPI for high-performance distributed SGD.
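The 2D data partitioning step can be illustrated with a small sketch that assigns the nonzeros of a sparse matrix to a processor grid. The summary does not say how the authors partition the data, so the choices here are assumptions: contiguous blocks of nearly equal size, row-major mapping of ranks to grid coordinates, and coordinate (row, col, value) storage. The helper names `block_bounds`, `grid_coords`, and `local_block` are hypothetical.

```python
def block_bounds(n_items, n_parts, part):
    """Half-open [start, stop) range of block `part` when n_items are
    split into n_parts contiguous blocks whose sizes differ by at most 1."""
    base, extra = divmod(n_items, n_parts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop


def grid_coords(rank, p_cols):
    """Map a linear rank to (grid_row, grid_col) in row-major order."""
    return divmod(rank, p_cols)


def local_block(triples, shape, grid, rank):
    """Return the nonzeros owned by `rank`, re-indexed to local coordinates.

    triples: iterable of (row, col, value) for a sparse matrix.
    shape:   (n_samples, n_features) of the global matrix.
    grid:    (p_rows, p_cols) processor grid.
    """
    (m, n), (p_rows, p_cols) = shape, grid
    gr, gc = grid_coords(rank, p_cols)
    r0, r1 = block_bounds(m, p_rows, gr)
    c0, c1 = block_bounds(n, p_cols, gc)
    return [(i - r0, j - c0, v) for i, j, v in triples
            if r0 <= i < r1 and c0 <= j < c1]


if __name__ == "__main__":
    nonzeros = [(0, 0, 1.0), (1, 3, 2.0), (3, 1, 3.0), (4, 4, 4.0)]
    for rank in range(4):
        print(rank, local_block(nonzeros, shape=(5, 5), grid=(2, 2), rank=rank))
```

Equal-sized blocks are the simplest choice but can leave processors with very different nonzero counts on skewed sparse data; a production code might balance by nonzeros instead.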
Topics
- Distributed Optimization
- Stochastic Gradient Descent
- Communication Efficiency
- Parallel Computing
- Logistic Regression
- Supercomputing Systems
Best for: Research Scientist, MLOps Engineer, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.