Evaluation-Strategy Gap in Fault Diagnosis of Deep Learning Programs

2026-06-25 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

A study of fault diagnosis in Deep Learning (DL) programs during training reveals a significant "evaluation-strategy gap" in existing techniques. These techniques are often assessed via within-program cross-validation, yet their balanced accuracy is 0.190 lower on previously unseen programs than under within-program evaluation. Researchers investigated the gap using DynFault, a corpus of 5,542 fault-injected training traces from 38 real-world DL programs. The analysis indicates that the discrepancy stems from program-level feature structure. Specifically, curvature features proved effective for detecting instability in unseen programs, whereas optimizer and activation features helped only for programs included in the training set.

Key takeaway

If you are a Machine Learning Engineer diagnosing DL program failures, expect metrics from within-program testing to overstate real-world performance on new codebases. Prioritize diagnostic techniques that use curvature features, as these are effective for instability detection in previously unseen programs. Optimizer and activation features, by contrast, help primarily on programs already included in your training data.

Key insights

Fault-diagnosis techniques for DL programs lose accuracy on unseen code, a gap driven by program-level feature structure.

Principles

Within-program evaluation overestimates performance.
Program-level feature structure impacts diagnosis.
Feature utility varies for unseen programs.
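
To make the first principle concrete, here is a minimal sketch of how balanced accuracy (the mean of per-class recall) and the gap between two evaluation protocols can be computed. The label names and toy predictions are illustrative assumptions, not data from the study.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of size."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical toy labels: the same ground truth scored under two protocols.
y_true = ["ok", "ok", "ok", "ok", "unstable", "unstable"]
within_pred = ["ok", "ok", "ok", "ok", "unstable", "unstable"]  # within-program
cross_pred = ["ok", "ok", "ok", "unstable", "unstable", "ok"]   # unseen program

within_score = balanced_accuracy(y_true, within_pred)
cross_score = balanced_accuracy(y_true, cross_pred)

# The evaluation-strategy gap is the difference between the two scores.
gap = within_score - cross_score
print(f"within={within_score:.3f} cross={cross_score:.3f} gap={gap:.3f}")
```

Because each class contributes equally, balanced accuracy does not reward a diagnoser for simply predicting the majority class, which matters when faulty traces are rarer than healthy ones.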

Method

The study used DynFault, a corpus of 5,542 fault-injected training traces from 38 DL programs, to quantify the balanced accuracy gap.
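
The two evaluation strategies differ in how traces are split. The sketch below contrasts a trace-level split, where one program can appear on both sides, with a leave-program-out split, where every trace of the held-out program is unseen. The record layout (`program`, `trace_id`) and the miniature corpus are hypothetical and do not reflect DynFault's actual format.

```python
import random
from collections import defaultdict

def trace_level_split(traces, test_size, seed=0):
    """Within-program style: shuffle traces, so a program can land in both sets."""
    rng = random.Random(seed)
    shuffled = traces[:]
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

def leave_program_out_folds(traces):
    """Cross-program style: each fold holds out every trace of one program."""
    by_program = defaultdict(list)
    for trace in traces:
        by_program[trace["program"]].append(trace)
    for held_out in sorted(by_program):
        train = [t for p, ts in by_program.items() if p != held_out for t in ts]
        yield held_out, train, by_program[held_out]

# Hypothetical miniature corpus: 4 programs with 5 traces each.
traces = [{"program": f"prog_{p}", "trace_id": i}
          for p in range(4) for i in range(5)]

# Trace-level split: with only 4 test traces, no program can be entirely
# held out, so every test program also appears in the training set.
train, test = trace_level_split(traces, test_size=4)
shared = {t["program"] for t in train} & {t["program"] for t in test}
print("programs on both sides:", sorted(shared))

# Leave-program-out: the held-out program never appears in training.
for held_out, fold_train, fold_test in leave_program_out_folds(traces):
    assert held_out not in {t["program"] for t in fold_train}
    print(held_out, len(fold_train), len(fold_test))
```

A diagnoser scored with the first split can exploit program-specific regularities; scoring it with the second removes that advantage, which is the setting in which the reported gap appears.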

In practice

Prioritize curvature features for new DL programs.
Reserve optimizer and activation features for programs already in the training set.
Account for program-level feature differences.

Topics

Deep Learning Fault Diagnosis
Program Instability Detection
Curvature Features
Optimizer Features
Evaluation-Strategy Gap
DynFault Corpus

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.