Training Time Prediction for Mixed Precision-based Distributed Training
Summary
A precision-aware distributed training time predictor has been developed to improve resource allocation and cost estimation for deep learning workloads. Existing prediction methods rely on static model computation graphs and therefore fail to account for the variation introduced by different floating-point precision settings, including mixed precision. This oversight can cause substantial prediction errors, with mean absolute percentage error (MAPE) reaching 147.85%. The proposed predictor addresses the limitation by incorporating precision awareness and achieves a MAPE of 9.8% across various precision configurations. The gap matters because floating-point precision alone can change training time by approximately 2.4x.
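For readers unfamiliar with the metric, the sketch below computes MAPE and shows why a predictor that ignores precision can score badly. The timings are hypothetical values chosen for illustration, not results from the article, and `mape` is a helper written for this example.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and equal in length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical measured training times (seconds) for one model
# under three different precision settings.
measured = [100.0, 60.0, 42.0]

# A precision-unaware predictor sees the same computation graph each time,
# so it returns the same estimate for all three runs.
static_estimate = [100.0, 100.0, 100.0]

static_mape = mape(measured, static_estimate)
print(f"MAPE of the static estimate: {static_mape:.2f}%")
```

Because each error is divided by the measured time, overestimating a fast low-precision run is penalized heavily, and MAPE has no upper bound of 100%. That is how figures such as 147.85% can arise.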
Key takeaway
MLOps engineers optimizing resource allocation and job scheduling should integrate precision-aware training time predictors into their workflows. Relying on static model computation graphs without considering floating-point precision can lead to significant cost overruns and inefficient resource utilization, because precision settings can alter training times by up to 2.4x.
Key insights
Precision settings significantly impact distributed deep learning training times, necessitating precision-aware prediction models.
Principles
- Precision variations cause ~2.4x training time differences.
- Static computation graphs yield high prediction errors.
Method
The proposed method introduces a precision-aware distributed training time predictor, designed to capture the impact of floating-point precision settings, including mixed precision, on training duration.
In practice
- Use precision-aware predictors for cost estimation.
- Integrate precision into job scheduling algorithms.
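The two recommendations above can be sketched as a cost estimate that is keyed on the precision setting as well as the model. This is a minimal illustration under stated assumptions, not the article's predictor: `Job`, `PREDICTED_HOURS`, `estimate_cost`, the model name, and all numbers are hypothetical, with the two durations chosen only so that their ratio is 2.4x.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    precision: str  # hypothetical labels, e.g. "fp32" or "amp"
    gpus: int


# Hypothetical predicted training times in hours. In a real system these
# would come from a precision-aware predictor rather than a fixed table.
PREDICTED_HOURS = {
    ("resnet", "fp32"): 12.0,
    ("resnet", "amp"): 5.0,
}


def estimate_cost(job: Job, price_per_gpu_hour: float) -> float:
    """Estimate job cost from a prediction keyed on (model, precision)."""
    hours = PREDICTED_HOURS[(job.name, job.precision)]
    return hours * job.gpus * price_per_gpu_hour


fp32_cost = estimate_cost(Job("resnet", "fp32", gpus=8), price_per_gpu_hour=2.0)
amp_cost = estimate_cost(Job("resnet", "amp", gpus=8), price_per_gpu_hour=2.0)
print(f"fp32: ${fp32_cost:.2f}, amp: ${amp_cost:.2f}")
```

A scheduler that looked up the prediction by model name alone would quote the same cost for both jobs, so one of the two estimates would be off by the full 2.4x factor.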
Topics
- Training Time Prediction
- Distributed Training
- Mixed Precision
- Deep Learning
- Resource Allocation
Best for: MLOps Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.