Model selection with proper scoring rules on data sets of time series

2026-06-24 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

A study investigates model selection for probabilistic models on data sets of time series, using proper scoring rules. It shows that three common summary statistics (mean score, median score, and mean rank) can lead to conflicting decisions because score distributions are skewed. The criteria converge as the test set length $n_{te}$ grows, but for short test sets only the mean score reliably identifies the true model. The paper illustrates this on intermittent time series, including the M5 competition dataset, comparing Poisson and negative binomial distributions. Mean rank decisions are sensitive to $n_{te}$ and to high quantile levels (e.g., $\text{QS}_{0.975}$, $\text{QS}_{0.995}$) and often select a misspecified model, whereas the mean scaled score remains robust across varying $n_{te}$ and scaling factors.

Key takeaway

For data scientists evaluating probabilistic time series models, prefer the mean scaled score over the mean rank, especially with short test sets or with quantile scores at high levels such as $\text{QS}_{0.975}$ or $\text{QS}_{0.995}$. Conflicting results often stem from skewed score distributions, under which the mean rank can misidentify the best model. Validate your model selection by checking results across at least two different scaling factors to ensure robustness.

Key insights

Skewed score distributions cause conflicting model selection outcomes, making mean scaled score more reliable than mean rank for time series.

Principles

The expected score under a proper scoring rule is minimized by the true distribution.
Score distributions become more skewed at high quantile levels and with short test sets.
Mean rank can select misspecified models with short test sets.
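
These principles can be written compactly. The notation below is a standard textbook form and an assumption here; the paper's own symbols and scaling constants may differ. A negatively oriented scoring rule $S$ is proper if, for the true distribution $G$ and any forecast distribution $F$,

$$\mathbb{E}_{Y \sim G}\left[S(G, Y)\right] \le \mathbb{E}_{Y \sim G}\left[S(F, Y)\right].$$

The quantile score at level $\alpha$ for a predicted quantile $q$ and an observation $y$ is commonly defined, up to a constant factor, as

$$\text{QS}_{\alpha}(q, y) = \left(\mathbb{1}\{y \le q\} - \alpha\right)(q - y).$$

For $\alpha$ close to 1, observations at or below $q$ are weighted by $1 - \alpha$ and rare exceedances by $\alpha$, so most scores are small and a few are large. This is one way to see why scores at high quantile levels tend to be right-skewed, particularly when only a few test observations are averaged.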

Method

The paper compares mean score, median score, and mean rank for aggregating scores across multiple time series, analyzing their convergence and sensitivity to test set length ($n_{te}$) and scaling factors.
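
A minimal sketch of the three aggregation criteria, using only the Python standard library. The function names, the model labels, and the toy scores are hypothetical and not taken from the paper; the numbers are chosen only to show how one large score on a single series can make the criteria disagree.

```python
from statistics import mean, median


def rank_row(scores):
    """Ranks for one series (1 = lowest score = best); ties share the average rank."""
    ordered = sorted(scores)
    ranks = []
    for s in scores:
        first = ordered.index(s) + 1      # first rank position holding this value
        count = ordered.count(s)          # number of tied entries
        ranks.append(first + (count - 1) / 2)
    return ranks


def aggregate(scores_by_model):
    """scores_by_model maps a model name to its per-series scores (lower is better)."""
    models = list(scores_by_model)
    n_series = len(scores_by_model[models[0]])
    rank_sums = {m: 0.0 for m in models}
    for i in range(n_series):
        row = [scores_by_model[m][i] for m in models]
        for m, r in zip(models, rank_row(row)):
            rank_sums[m] += r
    return {
        m: {
            "mean_score": mean(scores_by_model[m]),
            "median_score": median(scores_by_model[m]),
            "mean_rank": rank_sums[m] / n_series,
        }
        for m in models
    }


def select(summary, criterion):
    """Pick the model with the lowest value of the given criterion."""
    return min(summary, key=lambda m: summary[m][criterion])


# Hypothetical scores: model_a is slightly better on four series, much worse on one.
toy_scores = {
    "model_a": [1.0, 1.0, 1.0, 1.0, 10.0],
    "model_b": [2.0, 2.0, 2.0, 2.0, 2.0],
}
summary = aggregate(toy_scores)
choices = {c: select(summary, c) for c in ("mean_score", "median_score", "mean_rank")}
print(choices)
```

In this toy case the mean score favors `model_b`, while the median score and the mean rank favor `model_a`, because ranks and medians ignore how large the single bad score is.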

In practice

Validate model selection using multiple scaling factors.
Exercise caution with mean ranks for high quantile scores.
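
One way to run the scaling-factor check is sketched below. This is an illustration under stated assumptions, not the paper's procedure: the two scalings (in-sample mean level and in-sample mean absolute difference), the helper names, and all numbers are hypothetical.

```python
from statistics import mean


def mean_scaled_score(scores, scales):
    """Average of score / scale over series; series with a non-positive scale are skipped."""
    return mean([s / c for s, c in zip(scores, scales) if c > 0])


def selection_by_scaling(scores_by_model, scalings):
    """Map each scaling name to the model with the lowest mean scaled score."""
    picks = {}
    for name, scales in scalings.items():
        picks[name] = min(
            scores_by_model,
            key=lambda m: mean_scaled_score(scores_by_model[m], scales),
        )
    return picks


# Hypothetical training data for three intermittent series.
train_series = [[0, 2, 0, 1], [5, 3, 4, 6], [0, 0, 1, 0]]

# Hypothetical per-series test scores for two candidate models (lower is better).
scores_by_model = {
    "poisson": [0.40, 1.10, 0.20],
    "negbin": [0.35, 1.15, 0.18],
}

# Two assumed per-series scaling factors computed from the training data.
scalings = {
    "mean_level": [mean(y) for y in train_series],
    "mean_abs_diff": [
        mean([abs(b - a) for a, b in zip(y, y[1:])]) for y in train_series
    ],
}

picks = selection_by_scaling(scores_by_model, scalings)
consistent = len(set(picks.values())) == 1
print(picks, "consistent:", consistent)
```

If `consistent` is false, the selection depends on the choice of scaling and deserves a closer look before a model is declared the winner.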

Topics

Time Series Model Selection
Probabilistic Forecasting
Proper Scoring Rules
Mean Scaled Score
Mean Rank
M5 Competition Dataset

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, Data Scientist, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.