Evaluation-Strategy Gap in Fault Diagnosis of Deep Learning Programs
Summary
A study of fault diagnosis in Deep Learning (DL) programs identifies a significant "evaluation-strategy gap" in how diagnostic techniques are assessed. The researchers measured this gap using DynFault, a corpus of 5,542 fault-injected training traces from 38 real-world DL programs. Balanced accuracy of existing fault diagnosis techniques dropped by 0.190 when moving from within-program cross-validation to program-held-out evaluation, in which the test programs are entirely unseen during training. The study attributes this decline to program-level structure in the features. The analysis further showed that curvature features remain effective for detecting instability in unseen programs, whereas optimizer and activation features mainly help on programs already present in the training data, which limits their generalizability.
Key takeaway
MLOps engineers evaluating or deploying deep learning fault diagnosis tools should adopt program-held-out evaluation to assess real-world performance accurately. Relying solely on within-program metrics overestimates diagnostic capability on unseen programs. Prioritize solutions that demonstrate generalizability, such as those using curvature features for instability detection, and critically re-evaluate the cross-program utility of optimizer and activation features before investing in logging them.
Key insights
The evaluation strategy significantly impacts DL fault diagnosis performance, revealing a gap in technique generalizability.
Principles
- Fault diagnosis performance is overestimated by within-program evaluation.
- Program-level feature structure can hinder diagnostic generalization.
- Runtime feature effectiveness varies significantly across evaluation strategies.
Method
The study used DynFault, a corpus of 5,542 fault-injected training traces from 38 DL programs, comparing within-program and program-held-out evaluation strategies across three diagnostic tasks and two feature sets.
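The difference between the two evaluation strategies can be sketched in a few lines of standard-library Python. This is an illustration only, not the study's code: the function names, the fold count, and the toy program IDs are assumptions, and the within-program variant assumes each program's traces are dealt evenly across folds. scikit-learn's GroupKFold provides a comparable program-level split.

```python
import random
from collections import defaultdict


def within_program_folds(program_ids, k, seed=0):
    """Deal each program's traces across all k folds (hypothetical helper).

    Every program then appears on both the training and the test side.
    """
    rng = random.Random(seed)
    by_program = defaultdict(list)
    for i, prog in enumerate(program_ids):
        by_program[prog].append(i)
    folds = [[] for _ in range(k)]
    for prog in sorted(by_program):
        idx = by_program[prog]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def program_held_out_folds(program_ids, k, seed=0):
    """Assign whole programs to folds (hypothetical helper).

    All traces of a program land in the same fold, so test programs
    are never seen during training.
    """
    rng = random.Random(seed)
    programs = sorted(set(program_ids))
    rng.shuffle(programs)
    fold_of = {prog: pos % k for pos, prog in enumerate(programs)}
    folds = [[] for _ in range(k)]
    for i, prog in enumerate(program_ids):
        folds[fold_of[prog]].append(i)
    return folds


def train_test_pairs(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for t, test in enumerate(folds):
        train = [i for f, fold in enumerate(folds) if f != t for i in fold]
        yield train, test


# Toy stand-in for a trace corpus: 6 programs with 4 traces each.
ids = [f"prog{p}" for p in range(6) for _ in range(4)]
for train, test in train_test_pairs(program_held_out_folds(ids, k=3)):
    held_out = sorted({ids[i] for i in test})
    print("held-out programs:", held_out)
```

Running the same diagnostic model through both splitters and comparing the scores exposes any reliance on program-specific structure, because only the held-out variant removes that structure from the training side.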
In practice
- Prioritize program-held-out evaluation for new DL fault diagnosis techniques.
- Integrate curvature features for robust instability detection.
- Re-evaluate optimizer/activation features for cross-program utility.
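One way to act on the items above is to report balanced accuracy under both strategies, together with their difference, for each feature group under consideration. The sketch below uses the standard definition of balanced accuracy (the mean of per-class recall) and assumes the study uses the same one. The function names and toy labels are hypothetical, and the numbers are not results from the study.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def evaluation_gap(within, held_out):
    """Balanced-accuracy drop from within-program to program-held-out.

    Each argument is a (y_true, y_pred) pair collected under one strategy.
    A positive value means the within-program score was the optimistic one.
    """
    return balanced_accuracy(*within) - balanced_accuracy(*held_out)


# Toy labels for illustration only.
y_true = ["stable", "stable", "unstable", "unstable"]
within_pred = ["stable", "stable", "unstable", "stable"]
held_out_pred = ["stable", "unstable", "unstable", "stable"]

gap = evaluation_gap((y_true, within_pred), (y_true, held_out_pred))
print(f"evaluation-strategy gap: {gap:.3f}")
```

Computing this gap once per feature group (for example curvature only, then optimizer and activation only) shows which features keep their value on unseen programs. scikit-learn's balanced_accuracy_score can replace the helper where that library is available.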
Topics
- Deep Learning Fault Diagnosis
- Evaluation Strategy Gap
- DynFault Corpus
- Curvature Features
- Training Instability
- Model Generalization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.