TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

2026-06-18 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

TS-Fault is a benchmark for time series forecasting (TSF) models that evaluates robustness against structured faults rather than clean-data accuracy alone. It organizes real-world failures into four modes along two orthogonal axes: observation-level vs. mechanism-level, and univariate vs. multivariate. Faults are injected into the most prediction-critical windows, which are identified by a unified importance score. An evaluation of 21 models across 6 datasets, 4 modes, and 5 difficulty levels yielded three key findings. First, clean-data accuracy often anti-correlates with robustness. Second, clean-data rankings are preserved under observation-level faults but reshuffled under mechanism-level faults. Third, all catastrophic failures occurred under mechanism-level faults, with foundation models combining high clean-data accuracy with significant fragility. The benchmark's code is publicly available.

Key takeaway

For MLOps engineers deploying time series forecasting models, relying solely on clean-data accuracy leaderboards is risky. Integrating TS-Fault's structured fault evaluation into model selection can reveal hidden fragilities, especially under mechanism-level faults and in foundation models, before a deployed system fails catastrophically under a structured event.

Key insights

TS-Fault benchmarks time series forecasters against structured, parameterized faults to reveal deployment fragility.

Principles

Clean-data accuracy can anti-correlate with robustness.
Mechanism-level faults reshuffle model rankings.
Foundation models combine high clean-data accuracy with significant fragility.
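
One way to check the first two principles on your own leaderboard is a rank correlation between clean-data error and faulted error. The sketch below is illustrative only: the model names and error values are invented for the example and are not results from TS-Fault, and the helper functions `ranks` and `spearman` are hypothetical names written for this note.

```python
# Illustrative sketch: all model names and error values below are made up.
# They are NOT results reported by TS-Fault.

def ranks(values):
    """Return 1-based ranks (1 = smallest value); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out

def spearman(x, y):
    """Spearman rank correlation for tie-free sequences of equal length."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical MSE per model on clean data and under one mechanism-level fault.
clean_mse = {"model_a": 0.30, "model_b": 0.35, "model_c": 0.42, "model_d": 0.50}
fault_mse = {"model_a": 0.95, "model_b": 0.80, "model_c": 0.60, "model_d": 0.58}

names = sorted(clean_mse)
clean = [clean_mse[m] for m in names]
faulted = [fault_mse[m] for m in names]

# Relative degradation: how much worse each model gets under the fault.
degradation = [(f - c) / c for c, f in zip(clean, faulted)]

# Near +1: the clean ranking is preserved; near -1: it is reversed.
rank_agreement = spearman(clean, faulted)

# Negative: models with lower clean error degrade more under the fault.
error_vs_degradation = spearman(clean, degradation)

print(f"rank agreement (clean vs. faulted): {rank_agreement:+.2f}")
print(f"clean error vs. degradation:        {error_vs_degradation:+.2f}")
```

With real leaderboards that contain ties, a library routine such as `scipy.stats.spearmanr` would be a safer choice than the tie-free helper used here.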

Method

TS-Fault injects parameterized faults into prediction-critical windows, identified by a four-component importance score, across four distinct fault modes (Time-Warped Shock, Dependency-Fracture Shock, Regime-Transition Missingness, Cascading Sensor-to-System Failure).
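
The general idea of targeting the most prediction-critical window can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation: window variance stands in for the four-component importance score (whose components are not described here), the fault is a generic additive shock rather than any of the four named modes, and every function name is hypothetical.

```python
import statistics

def window_scores(series, width):
    """Score each window by its variance.

    Assumption: variance is only a stand-in for an importance score.
    """
    return [statistics.pvariance(series[s:s + width])
            for s in range(len(series) - width + 1)]

def most_critical_window(series, width):
    """Return (start, end) of the highest-scoring window, end exclusive."""
    scores = window_scores(series, width)
    start = max(range(len(scores)), key=scores.__getitem__)
    return start, start + width

def inject_shock(series, width, severity):
    """Return a copy of the series with an additive shock in the top window.

    The shock size is severity times the series standard deviation, so
    severity acts as a simple difficulty parameter (0 means no fault).
    """
    start, end = most_critical_window(series, width)
    scale = severity * statistics.pstdev(series)
    faulted = list(series)
    for t in range(start, end):
        faulted[t] += scale
    return faulted, (start, end)

# Toy series with a volatile stretch in the middle.
series = [1.0, 1.1, 0.9, 1.0, 5.0, 0.5, 4.8, 1.0, 1.1, 0.9]
faulted, (start, end) = inject_shock(series, width=3, severity=2.0)
print(f"fault injected into window [{start}, {end})")
```

Parameterizing the fault by a single `severity` value makes it straightforward to sweep several difficulty levels over the same window.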

In practice

Evaluate models under Time-Warped Shock (Mode I).
Test Dependency-Fracture Shock (Mode II) for multivariate models.
Assess the impact of Regime-Transition Missingness (Mode III).
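
A per-mode evaluation of this kind amounts to sweeping a fault over difficulty levels and recording each model's error. The harness below is a minimal sketch, assuming two toy baseline forecasters and a hypothetical `shock_tail` fault; it does not use the TS-Fault code or reproduce any of its modes, and level 0 is added here as a clean baseline next to five difficulty levels.

```python
def last_value(history, horizon):
    """Naive forecaster: repeat the last observed value."""
    return [history[-1]] * horizon

def mean_of_last(history, horizon, k=8):
    """Naive forecaster: repeat the mean of the last k observations."""
    tail = history[-k:]
    return [sum(tail) / len(tail)] * horizon

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def shock_tail(history, level, width=2, step=0.5):
    """Hypothetical fault: add level * step to the last `width` points."""
    out = list(history)
    for t in range(len(out) - width, len(out)):
        out[t] += level * step
    return out

def robustness_curve(forecaster, history, target, levels=range(0, 6)):
    """Map difficulty level to MSE; level 0 is the clean baseline."""
    horizon = len(target)
    return {lvl: mse(forecaster(shock_tail(history, lvl), horizon), target)
            for lvl in levels}

# Toy data: a flat series, so both forecasters are perfect on clean input.
history = [1.0] * 16
target = [1.0] * 4

curves = {name: robustness_curve(model, history, target)
          for name, model in [("last_value", last_value),
                              ("mean_of_last", mean_of_last)]}
for name, curve in curves.items():
    print(name, {lvl: round(err, 3) for lvl, err in curve.items()})
```

In this toy setup the two forecasters tie on clean data but diverge as the level increases, which is the kind of gap a clean-data leaderboard alone would not show.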

Topics

Time Series Forecasting
Model Robustness
Benchmarking
Fault Injection
Foundation Models
MLOps

Code references

Ray-zyy/TS-Fault

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.