Introducing ARFBench: A time series question-answering benchmark based on real incidents
Summary
Datadog has introduced ARFBench, a Time Series Question-Answering (TSQA) benchmark that evaluates AI models on incident response tasks using real-world telemetry. Derived from 63 internal Datadog incidents and 142 time series, ARFBench contains 750 question-answer pairs across difficulty tiers, with expert annotations and multimodal context that includes time series captions and multivariate groupings. Initial evaluations show that leading LLMs, VLMs, and Time Series Foundation Models (TSFMs) significantly underperform both human experts and a model-expert oracle (87.2% accuracy); GPT-5, for example, reaches 62.7% accuracy. A new hybrid TSFM-VLM model, Toto-1.0-QA-Experimental, which combines Datadog's Toto TSFM with Qwen3-VL 32B, achieved 63.9% accuracy, a promising result with fewer parameters, along with superior results on anomaly identification tasks.
Key takeaway
Research scientists developing AI models for incident response should evaluate those models against ARFBench to gauge their real-world applicability. The benchmark shows that current frontier models have significant room for improvement, particularly in compositional reasoning and in handling complex, multimodal context. Hybrid TSFM-VLM architectures, as demonstrated by Toto-1.0-QA-Experimental, are a promising direction for better performance and efficiency, especially on critical anomaly identification tasks.
Key insights
ARFBench, a new TSQA benchmark, reveals that existing AI models struggle with real-world incident data, while hybrid models show promise.
Principles
- Real-world data improves benchmark relevance.
- Hybrid models can outperform unimodal approaches.
- Human-AI collaboration enhances incident resolution.
Method
ARFBench's QA pairs are generated from real incident time series and timelines by an LLM pipeline and then manually verified. The time series are enriched with captions and multivariate groupings for context.
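To make the method above concrete, here is a minimal Python sketch of what one benchmark record and the manual-verification filter might look like. The class name, field names, tier labels, and the multiple-choice format are all assumptions for illustration; they are not taken from the ARFBench release.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    """One hypothetical benchmark record; every field name is illustrative."""
    question: str
    choices: list[str]        # multiple-choice format is an assumption
    answer: str
    tier: str                 # difficulty tier; the labels used are assumed
    series_ids: list[str]     # time series the question refers to
    caption: str = ""         # text caption attached to the series
    group: str = ""           # multivariate grouping the series belongs to
    verified: bool = False    # set to True once a human has reviewed the pair


def keep_verified(candidates: list[QAPair]) -> list[QAPair]:
    """Keep only LLM-generated candidates that passed manual review
    and whose answer is actually one of the listed choices."""
    return [qa for qa in candidates if qa.verified and qa.answer in qa.choices]
```

The filter mirrors the two-stage idea in the description: candidates come out of an automated pipeline, and only those a reviewer has approved are kept.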
In practice
- Evaluate models on ARFBench for TSQA tasks.
- Consider hybrid TSFM-VLM architectures.
- Combine human expertise with AI for incident response.
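As a rough illustration of the first and third points, the Python sketch below scores predictions per difficulty tier and computes a simple model-expert oracle. The record keys (`tier`, `answer`, `prediction`, `expert`) are hypothetical, and the oracle definition used here (a question counts as correct if either the model or the expert answers it correctly) is an assumption that may differ from how ARFBench defines its oracle.

```python
from collections import defaultdict


def accuracy_by_tier(records):
    """Return {tier: accuracy} for dicts with 'tier', 'answer', 'prediction'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["tier"]] += 1
        correct[r["tier"]] += int(r["prediction"] == r["answer"])
    return {tier: correct[tier] / total[tier] for tier in total}


def oracle_accuracy(records):
    """Fraction of questions where the model OR the expert is right
    (assumed oracle definition; see the note above)."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(
        r["prediction"] == r["answer"] or r["expert"] == r["answer"]
        for r in records
    )
    return hits / len(records)
```

Reporting accuracy per tier rather than as a single number makes it easier to see whether a model fails mainly on the harder questions, and comparing plain accuracy with the oracle gives a sense of how much a human-AI pairing could add.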
Topics
- ARFBench
- Time Series Question-Answering
- Incident Response
- Hybrid AI Models
- Observability Metrics
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AIhub.