SSH-Net: A Deep Neural Network for Predicting Failure Time Distribution Functions under Competing Risks with Application to GPU Data
Summary
SSH-Net is a deep neural network (DNN) designed to predict failure time distribution functions under competing risks, specifically addressing challenges posed by complex application scenarios and hierarchical data structures. The approach aligns the network architecture with the structure of the data, so that different covariate groups influence failure prediction through dedicated sub-networks. SSH-Net outputs cause-specific hazard functions and uses a penalized log-likelihood as its loss function. In simulation studies, SSH-Net consistently outperformed Neural Fine-Gray (NFG) and DeepHit on Brier score, AUC, and RMSE. The model was then applied to the Titan GPU failure time data, comprising 19,319 GPU units with 1,127 OTB and 3,093 DBE failures over nearly seven years, where it predicted Cumulative Incidence Functions (CIFs) and identified factors such as cage and cabinet location that influence GPU reliability.
Key takeaway
Machine learning engineers and reliability engineers who model time-to-event data with competing risks, especially in systems with hierarchical covariates, should consider SSH-Net. In the reported simulation studies, its structured network design and penalized log-likelihood loss delivered better prediction accuracy (lower Brier score, higher AUC) than NFG and DeepHit. The approach helps characterize and predict failure distributions, such as GPU failures influenced by physical location, which supports more informed maintenance and design decisions.
Key insights
SSH-Net improves competing-risks survival analysis by aligning the neural network structure with the hierarchy in the data, which benefits both prediction and interpretability.
Principles
- The neural network structure should mirror the hierarchy in the data, which makes the model easier to tune.
- A penalized log-likelihood loss improves prediction accuracy in continuous time.
- Dedicated sub-networks let each covariate group influence predictions separately.
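The principles above can be illustrated with a minimal forward-pass sketch. It is not the authors' architecture: the group names (`location`, `usage`), the layer sizes, the `tanh` and softplus activations, and all function names are assumptions chosen for illustration. Each covariate group passes through its own small sub-network, the outputs are concatenated, and a shared head emits one positive hazard value per cause and time segment.

```python
import math
import random

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def softplus(z):
    # Numerically stable log(1 + e^z); keeps the predicted hazards positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def init_layer(rng, n_out, n_in):
    """Random weights and zero biases (untrained, for illustration only)."""
    weights = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_model(rng, group_sizes, hidden, n_causes, n_segments):
    # One dedicated sub-network per covariate group (hypothetical layout).
    subnets = {name: init_layer(rng, hidden, size)
               for name, size in group_sizes.items()}
    # Shared head: maps the concatenated group features to all hazards.
    head = init_layer(rng, n_causes * n_segments, hidden * len(group_sizes))
    return {"subnets": subnets, "head": head, "shape": (n_causes, n_segments)}

def forward(model, covariates):
    """Return hazards[k][j]: hazard of cause k on time segment j."""
    merged = []
    for name, (w, b) in model["subnets"].items():
        merged += dense(covariates[name], w, b, math.tanh)
    w, b = model["head"]
    flat = dense(merged, w, b, softplus)
    n_causes, n_segments = model["shape"]
    return [flat[k * n_segments:(k + 1) * n_segments] for k in range(n_causes)]

# Hypothetical example: 3 location covariates, 2 usage covariates,
# 2 competing failure causes, 5 time segments.
rng = random.Random(0)
model = build_model(rng, {"location": 3, "usage": 2},
                    hidden=4, n_causes=2, n_segments=5)
hazards = forward(model, {"location": [1.0, 0.0, 2.0], "usage": [0.3, -1.2]})
```

Keeping one sub-network per group means a group's contribution can be inspected or resized without touching the others, which is one plausible reading of why the structure eases tuning.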
Method
SSH-Net models each cause-specific hazard function as a piecewise-constant function of time. It processes hierarchical covariates through separate sub-networks and is trained with a penalized log-likelihood loss that includes a smoothness term, from which it predicts failure time distributions.
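As a rough sketch of how piecewise-constant cause-specific hazards turn into CIFs and a penalized loss, the code below uses the standard competing-risks identities: overall survival is S(t) = exp(-sum_k H_k(t)) and CIF_k(t) is the integral of h_k(u) S(u) from 0 to t. The summary does not give the exact smoothness term, so the squared difference between adjacent segments used here is an assumption, as are the function names, the time grid, and the example numbers.

```python
import math
from bisect import bisect_right

def cumulative_hazard(cuts, rates, t):
    """Integrate a piecewise-constant hazard from 0 to t.
    cuts = [0, t_1, ..., t_M]; rates[j] applies on [cuts[j], cuts[j+1]).
    Times beyond the last cut are truncated at the end of the grid."""
    total = 0.0
    for j, rate in enumerate(rates):
        lo, hi = cuts[j], cuts[j + 1]
        if t <= lo:
            break
        total += rate * (min(t, hi) - lo)
    return total

def hazard_at(cuts, rates, t):
    """Hazard value on the segment containing t."""
    j = min(bisect_right(cuts, t) - 1, len(rates) - 1)
    return rates[j]

def cif(cuts, hazards, cause, t):
    """Cumulative incidence of `cause` at time t, in closed form per segment."""
    total, surv = 0.0, 1.0  # surv tracks overall survival at segment start
    for j in range(len(cuts) - 1):
        lo, hi = cuts[j], min(t, cuts[j + 1])
        if hi <= lo:
            break
        all_rate = sum(h[j] for h in hazards)
        if all_rate > 0.0:
            frac = 1.0 - math.exp(-all_rate * (hi - lo))  # P(fail in segment)
            total += hazards[cause][j] / all_rate * surv * frac
            surv *= 1.0 - frac
    return total

def penalized_nll(cuts, hazards, data, lam):
    """Mean negative log-likelihood plus an assumed smoothness penalty.
    data: list of (time, cause); cause is None for a censored unit."""
    nll = 0.0
    for t, cause in data:
        nll += sum(cumulative_hazard(cuts, h, t) for h in hazards)
        if cause is not None:
            nll -= math.log(hazard_at(cuts, hazards[cause], t))
    # Assumed penalty: squared jumps between adjacent hazard segments.
    penalty = sum((h[j + 1] - h[j]) ** 2
                  for h in hazards for j in range(len(h) - 1))
    return nll / len(data) + lam * penalty

# Hypothetical example: two causes, two unit-length segments.
cuts = [0.0, 1.0, 2.0]
hazards = [[0.2, 0.4], [0.1, 0.3]]
data = [(0.5, 0), (1.5, None)]  # one failure from cause 0, one censored unit
loss = penalized_nll(cuts, hazards, data, lam=1.0)
cif_at_2 = [cif(cuts, hazards, k, 2.0) for k in range(2)]
```

A useful sanity check on any such implementation is that the CIFs of all causes plus the overall survival probability sum to one at every time point.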
In practice
- Predict GPU failure types (DBE, OTB) using hierarchical covariates.
- Analyze relative risk among failure types in complex engineering systems.
Topics
- Competing Risks
- Deep Neural Networks
- Survival Analysis
- GPU Reliability
- Failure Time Prediction
- Hierarchical Covariates
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.