Introducing ARFBench: A time series question-answering benchmark based on real incidents

2026-04-27 · Source: Machine Learning Blog | ML@CMU | Carnegie Mellon University · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

Datadog AI Research, Carnegie Mellon University, and Amazon AI Research introduced the Anomaly Reasoning Framework Benchmark (ARFBench), a time series question-answering (TSQA) benchmark published on April 27, 2026. Built from 63 real internal incidents and 142 time series at Datadog, ARFBench contains 750 QA pairs that test compositional reasoning across three tiers of difficulty, using real production data and expert annotations. Initial evaluations show that leading LLMs, VLMs, and time series foundation models (TSFMs) all struggle: the best existing model, GPT-5 (VLM), reaches 62.7% accuracy and 51.9% F1. The researchers also developed Toto-1.0-QA-Experimental, a hybrid TSFM-VLM model combining Datadog's Toto with Qwen3-VL 32B, which reaches 63.9% accuracy and 48.9% F1 and outperforms other models on anomaly identification tasks. The study also highlights human-AI complementarity: a model-expert oracle achieves 87.2% accuracy and 82.8% F1, suggesting a new superhuman frontier for incident response.
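
The model-expert oracle figure is easier to interpret with a small worked example. The sketch below assumes the oracle counts a question as correct when either the model or the expert answers it correctly; the summary does not spell out the exact definition, so treat this as one plausible reading. The answers and predictions are invented toy data, not ARFBench results.

```python
def accuracy(predictions, answers):
    """Fraction of questions answered correctly."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def oracle_accuracy(model_preds, expert_preds, answers):
    """Assumed oracle: a question is correct if the model OR the expert is right."""
    hits = sum(
        (m == a) or (e == a)
        for m, e, a in zip(model_preds, expert_preds, answers)
    )
    return hits / len(answers)


# Toy multiple-choice data, invented for illustration only.
answers = ["A", "B", "C", "D", "A"]
model_preds = ["A", "B", "D", "D", "B"]
expert_preds = ["A", "C", "C", "A", "B"]

print(accuracy(model_preds, answers))                       # 0.6
print(accuracy(expert_preds, answers))                      # 0.4
print(oracle_accuracy(model_preds, expert_preds, answers))  # 0.8
```

In this toy case the oracle beats both participants because their errors fall on different questions, which is the sense in which model and expert are complementary.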

Key takeaway

For AI engineers building models for incident response, ARFBench offers a benchmark grounded in real incidents for validating and improving time series question-answering capabilities. Hybrid TSFM-VLM architectures such as Toto-1.0-QA-Experimental are worth exploring: they perform best on anomaly identification and offer efficiency gains. Also consider designing systems around human-AI complementarity, since combining expert knowledge with model strengths can yield significantly higher accuracy in diagnosing system failures.

Key insights

ARFBench shows that current AI models struggle with real-world time series anomaly reasoning, while hybrid models and human-AI collaboration show promise.

Principles

Real-world data improves benchmark relevance.
Hybrid models can outperform unimodal approaches.
Human-AI collaboration enhances performance.

Method

ARFBench uses an LLM pipeline to generate multiple-choice QA pairs from real Datadog incident timelines and time series, which are then manually verified. It enriches time series with captions and multivariate groupings to provide meaningful context.
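
A minimal sketch of how such a record might be represented follows. Every class, field, and function name here is hypothetical and chosen for illustration; the actual ARFBench schema is not described in this summary. The sketch pairs a multiple-choice question with captioned, grouped time series and keeps only candidates that passed manual verification.

```python
from dataclasses import dataclass, field


@dataclass
class SeriesContext:
    """One time series plus the context attached to it (hypothetical fields)."""
    values: list[float]
    caption: str  # short description of the metric
    group: str    # label of the multivariate grouping it belongs to


@dataclass
class QAPair:
    """A multiple-choice question over one or more series (hypothetical fields)."""
    question: str
    choices: list[str]
    answer_index: int
    series: list[SeriesContext] = field(default_factory=list)
    verified: bool = False  # set to True after manual review


def keep_verified(candidates):
    """Keep only manually verified pairs whose answer index is valid."""
    return [
        qa for qa in candidates
        if qa.verified and 0 <= qa.answer_index < len(qa.choices)
    ]


# Invented example records, not taken from the benchmark.
cpu = SeriesContext(values=[0.2, 0.3, 0.9], caption="CPU utilization", group="host-a")
draft = QAPair("Is there an anomaly?", ["Yes", "No"], 0, [cpu])
checked = QAPair("Is there an anomaly?", ["Yes", "No"], 0, [cpu], verified=True)

print(len(keep_verified([draft, checked])))  # 1
```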

In practice

Evaluate models on ARFBench for TSQA.
Consider hybrid TSFM-VLM architectures.
Integrate human experts with AI for incident response.
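
The first step above can be sketched as a small scoring routine. This is a sketch under stated assumptions rather than the benchmark's official scorer: it assumes predictions and gold answers are choice labels, and it uses macro-averaged F1, since the summary does not say how F1 is averaged. The function name `evaluate` and the toy labels are illustrative only.

```python
from collections import Counter


def evaluate(predictions, answers):
    """Return (accuracy, macro-averaged F1) for multiple-choice predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, gold in zip(predictions, answers):
        if pred == gold:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted label was wrong
            fn[gold] += 1  # gold label was missed

    labels = sorted(set(answers) | set(predictions))
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)

    acc = sum(tp.values()) / len(answers)
    return acc, sum(f1_scores) / len(f1_scores)


# Toy labels, invented for illustration only.
answers = ["A", "A", "B", "B"]
predictions = ["A", "B", "B", "B"]

acc, f1 = evaluate(predictions, answers)
print(acc)           # 0.75
print(round(f1, 3))  # 0.733
```

Reporting both numbers matters when answer classes are imbalanced, because accuracy alone can look strong for a model that mostly predicts the majority choice.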

Topics

ARFBench
Time Series Question Answering
Incident Response
Hybrid AI Models
Observability Metrics

Code references

DataDog/ARFBench

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning Blog | ML@CMU | Carnegie Mellon University.