Introducing ARFBench: A time series question-answering benchmark based on real incidents

Source: Machine Learning Blog | ML@CMU | Carnegie Mellon University · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Cloud Computing & IT Infrastructure) · Depth: Advanced, medium

Summary

Datadog AI Research, Carnegie Mellon University, and Amazon AI Research introduced the Anomaly Reasoning Framework Benchmark (ARFBench), a time series question-answering (TSQA) benchmark published on April 27, 2026. Built from 63 real internal incidents and 142 time series at Datadog, ARFBench contains 750 QA pairs that test compositional reasoning across three tiers of difficulty, using real production data and expert annotations. In initial evaluations, leading LLMs, VLMs, and time series foundation models (TSFMs) all struggle: the best existing model, GPT-5 evaluated as a VLM, reaches 62.7% accuracy and 51.9% F1. The researchers also developed Toto-1.0-QA-Experimental, a hybrid TSFM-VLM model that combines Datadog's Toto with Qwen3-VL 32B. It reaches 63.9% accuracy and 48.9% F1, slightly ahead of GPT-5 on accuracy but behind it on F1, and it outperforms the other models on anomaly identification tasks. The study also highlights human-AI complementarity: a model-expert oracle reaches 87.2% accuracy and 82.8% F1, suggesting a new superhuman frontier for incident response.
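
To make the oracle figure concrete, the following is a minimal sketch of how accuracy, F1, and a model-expert oracle could be computed on multiple-choice answers. It rests on assumptions that this summary does not confirm: that F1 is macro-averaged over answer labels, and that the oracle counts a question as correct whenever either the model or the expert answered it correctly. The function names and the toy answers are hypothetical and are not ARFBench data.

```python
from typing import List, Sequence


def accuracy(preds: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions whose predicted choice matches the gold choice."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def macro_f1(preds: Sequence[str], gold: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over the labels present in gold.

    Assumption: the benchmark's F1 variant is not stated in the summary;
    macro averaging is used here only for illustration.
    """
    scores = []
    for label in sorted(set(gold)):
        tp = sum(p == label and g == label for p, g in zip(preds, gold))
        fp = sum(p == label and g != label for p, g in zip(preds, gold))
        fn = sum(p != label and g == label for p, g in zip(preds, gold))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def oracle_answers(
    model: Sequence[str], expert: Sequence[str], gold: Sequence[str]
) -> List[str]:
    """Assumed oracle: use the gold answer if either party found it,
    otherwise fall back to the model's (wrong) answer."""
    return [g if g in (m, e) else m for m, e, g in zip(model, expert, gold)]


# Toy, made-up answers for six questions (not taken from the benchmark).
gold = ["A", "B", "C", "A", "B", "C"]
model = ["A", "B", "A", "B", "B", "A"]
expert = ["A", "C", "C", "A", "C", "A"]

oracle = oracle_answers(model, expert, gold)
print("model accuracy:", accuracy(model, gold))
print("expert accuracy:", accuracy(expert, gold))
print("oracle accuracy:", accuracy(oracle, gold))
print("oracle macro-F1:", macro_f1(oracle, gold))
```

In this toy case the model and the expert each answer half the questions correctly but mostly on different questions, so the oracle scores well above either one. That gap is what complementarity means here: an oracle can only beat both parties when their errors do not fully overlap.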

Key takeaway

For AI engineers building models for incident response, ARFBench offers a real-world benchmark for validating and improving time series question-answering capabilities. Explore hybrid TSFM-VLM architectures such as Toto-1.0-QA-Experimental, which demonstrates superior performance in anomaly identification and offers efficiency gains. Also consider designing systems that exploit human-AI complementarity, since combining expert knowledge with model strengths can achieve significantly higher accuracy in diagnosing system failures.

Key insights

ARFBench shows that current AI models struggle with real-world time series anomaly reasoning, while hybrid models and human-AI collaboration show promise.

Method

ARFBench uses an LLM pipeline to generate multiple-choice QA pairs from real Datadog incident timelines and time series; the generated pairs are then manually verified. The time series are enriched with captions and multivariate groupings to provide meaningful context.
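
As an illustration of what such a record might look like, here is a minimal sketch of a multiple-choice QA pair that carries captioned, grouped time series and a manual-verification flag. Every class name, field name, and value below is hypothetical: the summary does not describe the actual ARFBench schema, so treat this as one plausible shape rather than the benchmark's format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TimeSeriesContext:
    """One series plus the context the summary says is attached to it."""
    name: str                   # hypothetical metric name
    caption: str                # human-readable description of the series
    group: str                  # multivariate grouping the series belongs to
    values: Tuple[float, ...]   # observed values, oldest first


@dataclass(frozen=True)
class QAPair:
    """A multiple-choice question over one or more series (assumed schema)."""
    question: str
    choices: Tuple[str, ...]
    answer_index: int           # index into `choices`
    tier: int                   # difficulty tier, 1 to 3 per the summary
    series: Tuple[TimeSeriesContext, ...]
    verified: bool = False      # set once a human has checked the pair

    def __post_init__(self) -> None:
        if not 0 <= self.answer_index < len(self.choices):
            raise ValueError("answer_index must point at one of the choices")
        if self.tier not in (1, 2, 3):
            raise ValueError("tier must be 1, 2, or 3")


def render_prompt(qa: QAPair) -> str:
    """Format the captions, the question, and lettered choices as text."""
    lines = [f"[{s.group}] {s.name}: {s.caption}" for s in qa.series]
    lines.append(qa.question)
    lines += [f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(qa.choices)]
    return "\n".join(lines)


# Made-up example record; not taken from the benchmark.
latency = TimeSeriesContext(
    name="checkout.latency.p99",
    caption="p99 latency of a hypothetical checkout service, in ms",
    group="checkout-service",
    values=(120.0, 118.0, 950.0, 940.0),
)
example = QAPair(
    question="Which interval contains the anomaly?",
    choices=("First half", "Second half", "No anomaly"),
    answer_index=1,
    tier=1,
    series=(latency,),
    verified=True,
)
print(render_prompt(example))
```

Keeping the caption and group next to the raw values means a text-only model, a VLM given a rendered plot, and a TSFM given the numeric values can all be evaluated from the same record.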

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning Blog | ML@CMU | Carnegie Mellon University.