SSH-Net: A Deep Neural Network for Predicting Failure Time Distribution Functions under Competing Risks with Application to GPU Data

Source: stat.ML updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Hardware Reliability) · Depth: Expert, extended

Summary

SSH-Net is a deep neural network (DNN) that predicts failure time distribution functions under competing risks, aimed at applications where covariates are organized hierarchically. The network architecture mirrors the data structure: each covariate group feeds a dedicated sub-network, so different groups influence the failure prediction through separate paths. SSH-Net outputs cause-specific hazard functions and is trained with a penalized log-likelihood loss. In simulation studies, SSH-Net consistently outperformed Neural Fine-Gray (NFG) and DeepHit on Brier score, AUC, and RMSE. The model was then applied to the Titan GPU failure time data, which covers 19,319 GPU units with 1,127 off-the-bus (OTB) and 3,093 double-bit error (DBE) failures over nearly seven years. There it predicted cumulative incidence functions (CIFs) and identified cage and cabinet location as important factors in GPU reliability.
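
For readers less familiar with competing-risks notation, the quantities named in the summary are related as follows. These are the standard textbook definitions, not formulas copied from the paper; the symbols $T$ (failure time), $D$ (failure cause, $D \in \{1,\dots,K\}$), and $x$ (covariates) are notation assumed here for illustration.

```latex
% Cause-specific hazard for cause k
\lambda_k(t \mid x) = \lim_{\Delta t \to 0}
  \frac{P(t \le T < t + \Delta t,\; D = k \mid T \ge t,\; x)}{\Delta t}

% Overall survival function (no failure of any cause by time t)
S(t \mid x) = \exp\!\left( -\sum_{k=1}^{K} \int_0^t \lambda_k(u \mid x)\, du \right)

% Cumulative incidence function (CIF) for cause k
F_k(t \mid x) = P(T \le t,\; D = k \mid x)
             = \int_0^t \lambda_k(u \mid x)\, S(u \mid x)\, du
```

A model that outputs all $K$ cause-specific hazards therefore determines every CIF, which is why hazards are a convenient network output.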

Key takeaway

Machine learning engineers and reliability engineers who model time-to-event data with competing risks, especially in systems with hierarchical covariates, should consider SSH-Net. In the reported simulation studies, its structured network design and penalized log-likelihood loss gave better prediction accuracy (lower Brier score, higher AUC) than NFG and DeepHit. The approach helps explain and predict failure distributions, such as GPU failures influenced by physical location, which supports better-informed maintenance and design decisions.

Key insights

SSH-Net improves competing-risks survival analysis by matching the neural network structure to the hierarchy in the data, which improves both prediction and interpretability.
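
One way to picture a network structured around covariate groups is sketched below in NumPy. This is an illustrative forward pass only, not the paper's architecture: the group names (`location`, `usage`), layer widths, the concatenation step, and the softplus output are all assumptions made for the example, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes, rng):
    """Random weights for a small fully connected network (untrained)."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, params):
    """Forward pass with tanh on hidden layers and a linear last layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

def softplus(z):
    """Smooth map to positive values, used so that hazards are positive."""
    return np.logaddexp(0.0, z)

K, J = 2, 4                               # number of causes, time intervals
groups = {"location": 3, "usage": 2}      # hypothetical covariate groups

# One dedicated sub-network per covariate group (assumed layout).
subnets = {g: init_mlp([d, 8, 4], rng) for g, d in groups.items()}
# Shared head that combines the group representations (assumed layout).
head = init_mlp([4 * len(groups), 16, K * J], rng)

def predict_hazards(x_by_group):
    """Return hazards of shape (n_units, K, J): one value per cause and interval."""
    feats = [mlp(x_by_group[g], subnets[g]) for g in groups]
    z = mlp(np.concatenate(feats, axis=1), head)
    return softplus(z).reshape(-1, K, J)

# Usage: five units with random covariates in each group.
x = {"location": rng.normal(size=(5, 3)), "usage": rng.normal(size=(5, 2))}
hazards = predict_hazards(x)
```

Keeping each group in its own sub-network means the contribution of, say, location covariates can be inspected separately from the rest, which is the interpretability argument in a nutshell.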

Principles

Method

SSH-Net models each cause-specific hazard function as piecewise constant over time intervals. Hierarchical covariate groups are processed by separate sub-networks, and the model is trained with a penalized log-likelihood loss that includes a smoothness term. Failure time distributions are then obtained from the predicted hazards.
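
The link between piecewise-constant hazards, CIFs, and a penalized likelihood can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: hazards are shared across units rather than produced per unit by a network, the function names (`exposure`, `cif`, `penalized_nll`) are hypothetical, and the smoothness term is assumed to be a squared difference of adjacent log-hazards because the summary does not give its exact form.

```python
import numpy as np

def exposure(t, cuts):
    """Time each unit spends in interval [cuts[j], cuts[j+1]) up to time t."""
    t = np.atleast_1d(np.asarray(t, dtype=float))[:, None]
    cuts = np.asarray(cuts, dtype=float)
    lo, hi = cuts[:-1][None, :], cuts[1:][None, :]
    return np.clip(t - lo, 0.0, hi - lo)              # shape (n, J)

def cif(t, hazards, cuts):
    """Cumulative incidence per cause at time(s) t.

    hazards: array (K, J) of positive piecewise-constant cause-specific hazards.
    Returns an array of shape (n, K).
    """
    hazards = np.asarray(hazards, dtype=float)
    total = hazards.sum(axis=0)                       # all-cause hazard, (J,)
    cum_in = exposure(t, cuts) * total                # hazard accrued in each interval
    cum_before = np.cumsum(cum_in, axis=1) - cum_in   # hazard accrued before each interval
    # Probability that a failure of any cause happens inside interval j, before t.
    p_fail = np.exp(-cum_before) * (1.0 - np.exp(-cum_in))
    # Within an interval, cause k accounts for the fraction hazards[k, j] / total[j].
    return p_fail @ (hazards / total).T

def penalized_nll(times, causes, hazards, cuts, lam=1.0):
    """Mean negative log-likelihood plus an assumed smoothness penalty.

    causes: 0 for censored units, 1..K for the observed failure cause.
    """
    times = np.asarray(times, dtype=float)
    causes = np.asarray(causes, dtype=int)
    hazards = np.asarray(hazards, dtype=float)
    cuts = np.asarray(cuts, dtype=float)
    cum = exposure(times, cuts) @ hazards.sum(axis=0)  # total cumulative hazard per unit
    j = np.clip(np.searchsorted(cuts, times, side="right") - 1, 0, len(cuts) - 2)
    event = causes > 0
    log_h = np.log(hazards[causes[event] - 1, j[event]])
    loglik = log_h.sum() - cum.sum()
    # Assumed penalty: discourage jumps between adjacent intervals.
    penalty = np.sum(np.diff(np.log(hazards), axis=1) ** 2)
    return -loglik / len(times) + lam * penalty

# Usage: two causes, four unit-length intervals, constant hazards.
cuts = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
hazards = np.array([[0.1, 0.1, 0.1, 0.1],
                    [0.3, 0.3, 0.3, 0.3]])
F = cif([1.0, 2.5], hazards, cuts)
loss = penalized_nll([0.5, 1.5], [1, 0], hazards, cuts, lam=2.0)
```

With constant hazards the CIF reduces to the closed form $(\lambda_k / \lambda)(1 - e^{-\lambda t})$, where $\lambda$ is the sum of the cause-specific hazards, which gives a quick sanity check for the interval bookkeeping.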

Topics

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.