Introducing ARFBench: A time series question-answering benchmark based on real incidents
Summary
Datadog AI Research, Carnegie Mellon University, and Amazon AI Research introduced the Anomaly Reasoning Framework Benchmark (ARFBench), a time series question-answering (TSQA) benchmark published on April 27, 2026. Built from 63 real internal Datadog incidents and 142 time series, ARFBench contains 750 QA pairs that test compositional reasoning across three difficulty tiers, using real production data and expert annotations. In initial evaluations, leading LLMs, VLMs, and time series foundation models (TSFMs) struggle: the best existing model, GPT-5 (VLM), reaches 62.7% accuracy and 51.9% F1. The researchers also developed Toto-1.0-QA-Experimental, a hybrid TSFM-VLM model combining Datadog's Toto with Qwen3-VL 32B, which reached 63.9% accuracy and 48.9% F1 and outperformed other models on anomaly identification tasks. The study also highlights human-AI complementarity: a model-expert oracle reaches 87.2% accuracy and 82.8% F1, suggesting a new superhuman frontier for incident response.
Key takeaway
For AI engineers developing models for incident response, ARFBench offers a benchmark grounded in real production incidents for validating and improving time series question-answering capabilities. Explore hybrid TSFM-VLM architectures like Toto-1.0-QA-Experimental, which outperformed other models in anomaly identification and offer efficiency gains. Also consider designing systems that exploit human-AI complementarity: the model-expert oracle reached 87.2% accuracy, well above the 63.9% of the best single model, which indicates that experts and models tend to fail on different questions when diagnosing system failures.
Key insights
ARFBench shows that current AI models struggle with real-world time series anomaly reasoning, while hybrid models and human-AI collaboration show promise.
Principles
- Real-world data improves benchmark relevance.
- Hybrid models can outperform unimodal approaches.
- Human-AI collaboration enhances performance.
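The complementarity principle can be made concrete with a small sketch. It assumes, and this is an assumption rather than something the summary states, that a model-expert oracle counts a question as answered correctly when either the model or the expert picks the right choice. The function name `complementarity` and the answer labels are hypothetical.

```python
from typing import Dict, Sequence


def complementarity(
    gold: Sequence[str],
    model_pred: Sequence[str],
    expert_pred: Sequence[str],
) -> Dict[str, float]:
    """Split questions by who answered correctly and report oracle accuracy.

    Assumption: the oracle is credited whenever the model OR the expert
    is correct. The source does not spell out its exact definition.
    """
    if not gold or not (len(gold) == len(model_pred) == len(expert_pred)):
        raise ValueError("inputs must be non-empty and of equal length")

    counts = {"both": 0, "model_only": 0, "expert_only": 0, "neither": 0}
    for g, m, e in zip(gold, model_pred, expert_pred):
        model_ok, expert_ok = (m == g), (e == g)
        if model_ok and expert_ok:
            counts["both"] += 1
        elif model_ok:
            counts["model_only"] += 1
        elif expert_ok:
            counts["expert_only"] += 1
        else:
            counts["neither"] += 1

    n = len(gold)
    shares = {key: value / n for key, value in counts.items()}
    # Under the OR assumption, the oracle misses only the "neither" bucket.
    shares["oracle_accuracy"] = 1.0 - shares["neither"]
    return shares
```

Large `model_only` and `expert_only` shares are what would make an oracle score sit well above either party alone; if both buckets were near zero, combining the two would add little.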
Method
ARFBench uses an LLM pipeline to generate multiple-choice QA pairs from real Datadog incident timelines and time series, which are then manually verified. It enriches time series with captions and multivariate groupings to provide meaningful context.
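To make the shape of such a benchmark item concrete, here is a minimal sketch of how one multiple-choice QA pair with its captioned time series might be represented and rendered as a text prompt. All class names, field names, and the prompt layout are hypothetical illustrations; they are not ARFBench's actual schema, and a VLM evaluation would pass plots rather than text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TimeSeries:
    name: str            # metric name (hypothetical field)
    caption: str         # short description that gives the metric context
    values: List[float]  # observed values, oldest first


@dataclass
class QAItem:
    question: str
    choices: Dict[str, str]  # label -> answer text, e.g. {"A": "Yes"}
    answer: str              # label of the correct choice
    tier: int                # difficulty tier (the source describes three)
    series: List[TimeSeries] = field(default_factory=list)  # multivariate group
    verified: bool = False   # set once a human has checked the item

    def __post_init__(self) -> None:
        # Reject items whose answer key does not match any listed choice.
        if self.answer not in self.choices:
            raise ValueError(f"answer {self.answer!r} is not one of the choices")


def render_prompt(item: QAItem) -> str:
    """Render captions, question, and labelled choices as plain text."""
    lines = [f"Series {ts.name}: {ts.caption}" for ts in item.series]
    lines.append(f"Question: {item.question}")
    for label in sorted(item.choices):
        lines.append(f"{label}. {item.choices[label]}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)
```

Keeping `verified` on the item mirrors the manual verification step described above: generated items would only enter an evaluation set once a reviewer has flipped the flag.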
In practice
- Evaluate models on ARFBench for TSQA.
- Consider hybrid TSFM-VLM architectures.
- Integrate human experts with AI for incident response.
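As a starting point for the first item, the sketch below scores a model's multiple-choice answers with accuracy and F1. The summary reports both metrics but does not say how F1 is averaged, so macro-averaging over answer labels is an assumption here, and the function names are hypothetical.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of questions where the predicted label equals the gold label."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over labels that occur in gold.

    Assumption: macro-averaging. The source does not specify the scheme.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")

    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is positive because
        # every label taken from gold has TP + FN >= 1.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)
```

Reporting both numbers matters when answer labels are imbalanced: a model can gain accuracy by favoring the most common answer while its macro F1 drops, which is one plausible reading of a model leading on accuracy but trailing on F1.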
Topics
- ARFBench
- Time Series Question Answering
- Incident Response
- Hybrid AI Models
- Observability Metrics
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning Blog | ML@CMU | Carnegie Mellon University.