Benchmarking Physics-Informed Time-Series Models for Operational Global Station Weather Forecasting

· Source: stat.ML updates on arXiv.org · Field: Science & Research — Environmental Science & Earth Systems, Mathematics & Computational Sciences · Depth: Expert, long

Summary

Researchers have introduced WEATHER-5K, a large-scale dataset for Global Station Weather Forecasting (GSWF) and general time-series benchmarking. It targets the main limitations of existing public meteorological datasets: small size, limited temporal coverage, and too few variables. WEATHER-5K contains observations from 5,672 weather stations worldwide, covering the 10 years from 2014 to 2023 at one-hour intervals, and includes multiple weather elements. The data were collected from the National Centers for Environmental Information (NCEI) Integrated Surface Database and post-processed, with gaps filled by linear interpolation and ERA5 reanalysis. The dataset and benchmark implementation are publicly available as a basis for evaluating time-series forecasting models.

Key takeaway

Machine Learning Engineers developing global weather or general time-series forecasting models should add WEATHER-5K to their evaluation pipeline. Its 10 years of hourly data from 5,672 stations worldwide give a broad benchmark for testing how well models generalize across locations and capture complex patterns. It avoids the main weaknesses of earlier, smaller datasets, which matters when models are meant for operational forecasting services.
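
As a starting point for such a pipeline, the sketch below holds out the most recent part of one station's hourly series and scores a persistence ("last known value") baseline on it. It is an illustration only: the function names, the holdout size, and the use of MAE and MSE are assumptions made here, since this summary does not describe the benchmark's actual splits or metrics.

```python
def chronological_split(values, n_test):
    """Hold out the last n_test points; time-ordered data is never shuffled."""
    if not 0 < n_test < len(values):
        raise ValueError("n_test must be between 1 and len(values) - 1")
    return values[:-n_test], values[-n_test:]


def persistence_scores(history, future, horizon):
    """Return (MAE, MSE) of predicting each held-out value with the value
    observed `horizon` steps earlier (hypothetical baseline, not from the paper)."""
    if not 1 <= horizon <= len(history):
        raise ValueError("horizon must be between 1 and len(history)")
    full = list(history) + list(future)
    offset = len(history)
    errors = [full[offset + i] - full[offset + i - horizon]
              for i in range(len(future))]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse
```

Any forecasting model evaluated on the same held-out span can then be compared against these baseline scores; a model that cannot beat persistence at a given lead time is not yet adding value there.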

Key insights

The WEATHER-5K dataset provides a comprehensive, large-scale benchmark for global station weather and general time-series forecasting, addressing prior data limitations.

Method

WEATHER-5K was built by selecting 5,672 operational, hourly-reporting stations from the NCEI Integrated Surface Database (ISD) for 2014-2023. Missing hourly values were first estimated from the nearest 30-minute observations; any remaining gaps were filled with linear interpolation and ERA5 reanalysis.
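
The gap-filling steps described above can be sketched in plain Python as follows. This is one reading of the summary, not the authors' implementation: it interprets "nearest 30-minute observations" as the closest report within 30 minutes of each hour, applies linear interpolation only to short interior gaps, and falls back to a reanalysis value otherwise. The function name `fill_hourly`, the `max_interp_gap` threshold, and the `era5` lookup dictionary are all hypothetical.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def fill_hourly(obs, era5, start, end,
                tolerance=timedelta(minutes=30), max_interp_gap=3):
    """obs: list of (datetime, float) station reports, possibly off the hour.
    era5: dict mapping hourly datetimes to reanalysis values (assumed input).
    Returns one value per hour from start to end inclusive."""
    obs = sorted(obs)
    times = [t for t, _ in obs]
    n_hours = (end - start) // timedelta(hours=1) + 1
    grid = [start + timedelta(hours=i) for i in range(n_hours)]

    # Step 1: take the nearest report if it lies within the tolerance.
    values = []
    for g in grid:
        i = bisect_left(times, g)
        candidates = obs[max(i - 1, 0): i + 1]
        best = min(candidates, key=lambda tv: abs(tv[0] - g), default=None)
        ok = best is not None and abs(best[0] - g) <= tolerance
        values.append(best[1] if ok else None)

    # Step 2: linear interpolation across short gaps between known hours.
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left - 1
        if 0 < gap <= max_interp_gap:
            step = (values[right] - values[left]) / (right - left)
            for k in range(left + 1, right):
                values[k] = values[left] + step * (k - left)

    # Step 3: fall back to the reanalysis value for whatever is still missing.
    return [v if v is not None else era5.get(g) for g, v in zip(grid, values)]
```

Capping the interpolated gap length is a design choice made for this sketch: a straight line is a reasonable estimate over a few hours but not over a multi-day outage, where a reanalysis value is likely closer to the truth.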

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.