Introducing ARFBench: A time series question-answering benchmark based on real incidents

2026-05-18 · Source: ΑΙhub · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Advanced, long

Summary

Datadog has introduced ARFBench, a Time Series Question-Answering (TSQA) benchmark designed to evaluate AI models on incident response tasks using real-world telemetry. Derived from 63 internal Datadog incidents and 142 time series, ARFBench comprises 750 question-answer pairs across several difficulty tiers, with expert annotations and multimodal context that includes time series captions and multivariate groupings. Initial evaluations show that leading LLMs, VLMs, and Time Series Foundation Models (TSFMs) significantly underperform human experts and a model-expert oracle (87.2% accuracy); GPT-5, for example, reaches 62.7% accuracy. However, a new hybrid TSFM-VLM model, Toto-1.0-QA-Experimental, which combines Datadog's Toto TSFM with Qwen3-VL 32B, achieved 63.9% accuracy, showing promising performance with fewer parameters and superior results in anomaly identification tasks.
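
To make the accuracy and oracle figures above more concrete, here is a minimal sketch of how such numbers could be computed from graded answers. It rests on an assumption the summary does not confirm: that the model-expert oracle credits a question whenever either the model or the human expert answered it correctly. The GradedAnswer record, the function names, and the toy grades are all hypothetical and are not taken from the ARFBench code or data.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GradedAnswer:
    """One graded question (hypothetical record, not the ARFBench schema)."""
    question_id: str
    model_correct: bool
    expert_correct: bool


def accuracy(flags: Iterable[bool]) -> float:
    """Return the fraction of True values; raise on empty input."""
    flags = list(flags)
    if not flags:
        raise ValueError("no graded answers")
    return sum(flags) / len(flags)


def oracle_accuracy(rows: list[GradedAnswer]) -> float:
    """Assumed oracle rule: credit a question if the model OR the expert was right."""
    return accuracy(r.model_correct or r.expert_correct for r in rows)


# Made-up toy grades, purely illustrative (not benchmark results).
rows = [
    GradedAnswer("q1", True, True),
    GradedAnswer("q2", False, True),
    GradedAnswer("q3", True, False),
    GradedAnswer("q4", False, False),
]
model_acc = accuracy(r.model_correct for r in rows)
expert_acc = accuracy(r.expert_correct for r in rows)
oracle_acc = oracle_accuracy(rows)
```

Under this assumed rule the oracle can never score below the better of its two members, which is why it reads naturally as an upper bound on what model-plus-expert collaboration could reach.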

Key takeaway

Research scientists developing AI models for incident response should prioritize evaluating their models against ARFBench to gauge real-world applicability. The benchmark shows that current frontier models have significant room for improvement, particularly in compositional reasoning and in handling complex, multimodal context. Hybrid TSFM-VLM architectures, as demonstrated by Toto-1.0-QA-Experimental, are a promising direction for better performance and efficiency, especially on critical anomaly identification tasks.

Key insights

ARFBench, a new TSQA benchmark, reveals that existing AI models struggle with real-world incident data, while hybrid TSFM-VLM models show promise.

Principles

Real-world data improves benchmark relevance.
Hybrid models can outperform unimodal approaches.
Human-AI collaboration enhances incident resolution.

Method

ARFBench's QA pairs are generated from real incident time series and timelines by an LLM pipeline and then manually verified. Each time series is enriched with captions and multivariate groupings for context.
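
As a rough illustration of the enrichment described above, the sketch below shows one way a QA item could carry a caption and a multivariate grouping, and how that context might be flattened into a text prompt. Everything here is an assumption: the QAItem fields, the multiple-choice layout, the tier label, and the build_prompt helper are hypothetical and do not reflect the actual ARFBench schema or pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class QAItem:
    """Hypothetical QA record; field names are illustrative only."""
    question: str
    choices: list[str]
    answer_index: int
    tier: str  # difficulty tier label (assumed naming)
    caption: str = ""  # free-text description of the series
    # Multivariate grouping: related series keyed by metric name.
    series_group: dict[str, list[float]] = field(default_factory=dict)


def build_prompt(item: QAItem, max_points: int = 8) -> str:
    """Flatten caption, grouped series, and choices into one text prompt."""
    lines = []
    if item.caption:
        lines.append(f"Caption: {item.caption}")
    for name, values in item.series_group.items():
        shown = ", ".join(f"{v:g}" for v in values[:max_points])
        suffix = ", ..." if len(values) > max_points else ""
        lines.append(f"Series {name}: {shown}{suffix}")
    lines.append(f"Question: {item.question}")
    for i, choice in enumerate(item.choices):
        lines.append(f"{chr(ord('A') + i)}. {choice}")
    return "\n".join(lines)


# Made-up example item for a hypothetical service.
item = QAItem(
    question="Which series shows the anomaly first?",
    choices=["latency", "error_rate"],
    answer_index=1,
    tier="easy",
    caption="Request latency and error rate for a hypothetical service.",
    series_group={"latency": [10.0, 11.0, 50.0], "error_rate": [0.0, 0.4, 0.5]},
)
prompt = build_prompt(item)
```

A text-only prompt like this would suit an LLM baseline; a VLM or TSFM would more plausibly consume a rendered plot or the raw numeric series instead.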

In practice

Evaluate models on ARFBench for TSQA tasks.
Consider hybrid TSFM-VLM architectures.
Combine human expertise with AI for incident response.
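
For the first item in the list above, a minimal evaluation loop might look like the following. It assumes multiple-choice questions scored by exact match on the chosen option, with accuracy reported per difficulty tier and overall. The item layout, the tier names, and the always_first baseline are hypothetical stand-ins; the DataDog/ARFBench repository is the reference for the real data format and scoring.

```python
from collections import defaultdict
from typing import Callable, Sequence

# Hypothetical item layout: (prompt, choices, answer_index, tier).
Item = tuple[str, Sequence[str], int, str]


def evaluate(
    items: Sequence[Item],
    predict: Callable[[str, Sequence[str]], int],
) -> dict[str, float]:
    """Return accuracy per tier plus an 'overall' entry."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for prompt, choices, answer_index, tier in items:
        hit = predict(prompt, choices) == answer_index
        for key in (tier, "overall"):
            total[key] += 1
            correct[key] += int(hit)
    return {key: correct[key] / total[key] for key in total}


def always_first(prompt: str, choices: Sequence[str]) -> int:
    """Trivial baseline standing in for a real model call."""
    return 0


# Made-up toy items, purely illustrative.
items: list[Item] = [
    ("Did latency spike?", ["yes", "no"], 0, "easy"),
    ("Did errors recover?", ["yes", "no"], 1, "easy"),
    ("Which metric led the incident?", ["cpu", "memory"], 0, "hard"),
]
scores = evaluate(items, always_first)
```

Keeping predict as a plain callable lets the same loop score an LLM, a VLM, or a hybrid TSFM-VLM model, provided each is wrapped to return a choice index.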

Topics

ARFBench
Time Series Question-Answering
Incident Response
Hybrid AI Models
Observability Metrics

Code references

DataDog/ARFBench

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by ΑΙhub.