Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

2024-12-02 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

This work generalizes prior efforts in distributed-memory optimization by introducing HybridSGD, a 2D parallel Stochastic Gradient Descent method. HybridSGD integrates 1D s-step SGD and 1D Federated SGD with Averaging (FedAvg) to achieve a continuous performance tradeoff between these two baseline algorithms. The authors provide theoretical analysis covering convergence, computation, communication, and memory tradeoffs. The method is implemented in C++ and MPI and evaluated on a Cray EX supercomputing system, solving binary classification tasks with logistic regression on LIBSVM datasets. In those experiments, HybridSGD converged better than FedAvg at similar processor scales and delivered speedups of 5.3x over s-step SGD and up to 121x over FedAvg.

Key takeaway

For Machine Learning Engineers and AI Architects optimizing distributed SGD on supercomputing systems, HybridSGD targets the communication bottleneck directly. Consider this 2D parallel method for convex optimization tasks such as logistic regression, where the reported speedups are 5.3x over s-step SGD and up to 121x over FedAvg. Its 2D processor grid gives finer control over the communication-computation tradeoff than either 1D baseline, which helps scalability on modern distributed-memory hardware.

Key insights

HybridSGD combines 1D s-step SGD and FedAvg via a 2D processor grid for communication-efficient distributed optimization.
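As a reference point for the FedAvg baseline named above, here is a minimal single-process sketch of FedAvg for logistic regression. It is not the paper's implementation: the function names, the single-sample updates, the learning rate, and the toy data layout are all assumptions made for illustration, and the "workers" are simulated as a plain Python loop rather than MPI ranks.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def local_sgd(w, shard, tau, lr, rng):
    """Run tau single-sample SGD steps on the logistic loss (labels in {0, 1})."""
    w = list(w)  # work on a private copy, as each worker would
    for _ in range(tau):
        x, y = rng.choice(shard)
        err = sigmoid(dot(w, x)) - y  # gradient of the loss w.r.t. the logit
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def fedavg(shards, dim, rounds, tau, lr, seed=0):
    """Hypothetical FedAvg driver: tau local steps per worker, then average."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(rounds):
        local_models = [local_sgd(w, shard, tau, lr, rng) for shard in shards]
        # Model averaging: the only communication step in a FedAvg round.
        w = [sum(col) / len(local_models) for col in zip(*local_models)]
    return w


def mean_loss(w, data):
    total = 0.0
    for x, y in data:
        p = min(max(sigmoid(dot(w, x)), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)
```

A usage sketch: with two shards such as `[([1.0, 2.0], 1), ([1.0, -2.0], 0)]` and `[([1.0, 1.5], 1), ([1.0, -1.5], 0)]`, calling `fedavg(shards, dim=2, rounds=20, tau=5, lr=0.5)` returns a weight vector whose mean loss can be compared against the all-zeros starting point. A larger `tau` means fewer averaging steps per local iteration, which is the knob HybridSGD trades off against the s-step dimension.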

Principles

Communication cost limits distributed SGD scalability.
2D processor grids enable continuous performance tradeoffs.
Balancing convergence and runtime requires hyperparameter tuning.

Method

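HybridSGD integrates FedAvg (row-wise) and s-step SGD (column-wise) on a 2D processor grid. Each FedAvg iteration performs its local updates as a sequence of s-step SGD calls, under the condition s <= tau, where s is the number of SGD iterations batched into one s-step call and tau is the number of local iterations between FedAvg averaging steps.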
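The nesting described above can be sketched as a schedule of communication events. This is an illustrative model only, under two assumptions not confirmed by the summary: ranks are laid out row-major on the grid, and each s-step call and each averaging step costs exactly one collective. The names `grid_coords`, `hybrid_schedule`, and `communication_rounds` are hypothetical.

```python
def grid_coords(rank, p_rows, p_cols):
    """Map a flat rank to (row, col) on a row-major p_rows x p_cols grid."""
    if not 0 <= rank < p_rows * p_cols:
        raise ValueError("rank outside the processor grid")
    return divmod(rank, p_cols)


def hybrid_schedule(outer_rounds, tau, s):
    """List the communication events of the nested loop, in order.

    Each outer (FedAvg) round runs tau local iterations, issued as s-step
    calls of at most s iterations each, followed by one model averaging.
    """
    if not 1 <= s <= tau:
        raise ValueError("requires 1 <= s <= tau")
    events = []
    for _ in range(outer_rounds):
        done = 0
        while done < tau:
            steps = min(s, tau - done)  # last call may be shorter than s
            events.append(("s_step", steps))
            done += steps
        events.append(("average", 0))
    return events


def communication_rounds(outer_rounds, tau, s):
    """Return (number of s-step collectives, number of averaging collectives)."""
    events = hybrid_schedule(outer_rounds, tau, s)
    s_calls = sum(1 for kind, _ in events if kind == "s_step")
    averages = sum(1 for kind, _ in events if kind == "average")
    return s_calls, averages
```

Under this model, `communication_rounds(10, 8, 4)` counts 20 s-step collectives and 10 averaging collectives, while setting `s` equal to `tau` leaves one s-step call per averaging round. The sketch counts events only; it says nothing about message sizes, which also change with `s`.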

In practice

Implement 2D data partitioning for sparse matrices.
Tune 's' and 'tau' for optimal communication-computation balance.
Use C++ with MPI and Intel oneAPI for high-performance distributed SGD.
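One way to picture the 2D data partitioning step is a contiguous block split of a sparse matrix stored as (row, column, value) triples. The summary does not state which partitioning scheme the authors use, so the balanced contiguous blocks below are an assumption, and `block_bounds` and `partition_2d` are hypothetical helper names.

```python
from bisect import bisect_right


def block_bounds(n, parts):
    """Split range(n) into `parts` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(n, parts)
    bounds = [0]
    for i in range(parts):
        bounds.append(bounds[-1] + base + (1 if i < extra else 0))
    return bounds


def partition_2d(triples, n_rows, n_cols, p_rows, p_cols):
    """Assign each nonzero to its (grid_row, grid_col) block with local indices."""
    row_bounds = block_bounds(n_rows, p_rows)
    col_bounds = block_bounds(n_cols, p_cols)
    blocks = {(i, j): [] for i in range(p_rows) for j in range(p_cols)}
    for r, c, v in triples:
        if not (0 <= r < n_rows and 0 <= c < n_cols):
            raise ValueError("nonzero outside the matrix")
        # Index of the chunk whose half-open range [start, end) contains r or c.
        i = bisect_right(row_bounds, r) - 1
        j = bisect_right(col_bounds, c) - 1
        blocks[(i, j)].append((r - row_bounds[i], c - col_bounds[j], v))
    return blocks
```

For a 4 x 4 matrix on a 2 x 2 grid, the nonzero at global position (1, 3) lands in block (0, 1) at local position (1, 1). Real sparse datasets often have skewed nonzero distributions, so a production partitioner would likely balance by nonzero count rather than by index range; this sketch does not attempt that.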

Topics

Distributed Optimization
Stochastic Gradient Descent
Communication Efficiency
Parallel Computing
Logistic Regression
Supercomputing Systems

Best for: Research Scientist, MLOps Engineer, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.