Training Time Prediction for Mixed Precision-based Distributed Training

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A new precision-aware distributed training time predictor improves resource allocation and cost estimation for deep learning workloads. Existing prediction methods rely on static model computation graphs, so they miss the variation introduced by different floating-point precision settings, including mixed precision. This oversight can produce large prediction errors, with mean absolute percentage error (MAPE) as high as 147.85%. The proposed predictor accounts for precision and achieves 9.8% MAPE across the precision configurations evaluated. This matters because floating-point precision settings alone can change training time by approximately 2.4x.
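To make the error metric concrete, here is a minimal sketch of how MAPE is computed. The function name `mape` and the timing numbers are hypothetical, chosen only to show how a predictor that ignores precision can score badly; they are not results from the article.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    actual and predicted are equal-length sequences of training times.
    Actual values must be non-zero, since each error is divided by them.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical measured times (seconds) for one model under two precision settings.
measured = [100.0, 50.0]          # e.g. full precision, mixed precision

# A precision-blind predictor sees the same computation graph for both runs,
# so it returns the same estimate for each.
precision_blind = [100.0, 100.0]

print(mape(measured, precision_blind))  # 50.0
```

The second run is overestimated by 100% and the first is exact, which averages to 50% MAPE. The same mechanism, a single estimate applied to runs whose real durations differ, is what drives the large errors reported for graph-only predictors.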

Key takeaway

MLOps engineers optimizing resource allocation and job scheduling should integrate precision-aware training time predictors into their workflows. Predictions based on static model computation graphs that ignore floating-point precision can lead to significant cost overruns and inefficient resource utilization, since precision settings can change training time by approximately 2.4x.

Key insights

Precision settings significantly impact distributed deep learning training times, necessitating precision-aware prediction models.

Principles

Method

The proposed method is a precision-aware distributed training time predictor that captures how floating-point precision settings, including mixed precision, affect training duration.
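The summary does not describe how the predictor works internally, so the following is only an illustrative sketch of what "precision awareness" could mean in the simplest case, not the authors' method. It assumes a hypothetical baseline estimate derived from the computation graph and scales it by a per-precision factor fitted from past measurements. All names (`fit_precision_factors`, `predict_time`, the `"fp32"` and `"mixed"` labels) and all numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def fit_precision_factors(samples):
    """Fit one scaling factor per precision setting.

    samples: iterable of (precision, graph_estimate_s, measured_s) tuples,
    where graph_estimate_s is a precision-blind estimate in seconds.
    Returns a dict mapping precision -> mean(measured / estimate).
    """
    ratios = defaultdict(list)
    for precision, graph_estimate_s, measured_s in samples:
        ratios[precision].append(measured_s / graph_estimate_s)
    return {precision: mean(values) for precision, values in ratios.items()}


def predict_time(graph_estimate_s, precision, factors):
    """Scale a precision-blind estimate by the fitted factor for this precision."""
    return graph_estimate_s * factors[precision]


# Hypothetical history: the graph-only estimate is accurate for fp32
# but twice too high for mixed precision.
history = [
    ("fp32", 100.0, 100.0),
    ("fp32", 200.0, 200.0),
    ("mixed", 100.0, 50.0),
    ("mixed", 200.0, 100.0),
]

factors = fit_precision_factors(history)
print(predict_time(300.0, "mixed", factors))  # 150.0
```

A single multiplicative factor per precision is a deliberate simplification. Real mixed-precision speedups depend on the operator mix, the hardware, and communication overhead in distributed runs, which is presumably why a dedicated predictor is needed.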

Best for: MLOps Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.