ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

2026-05-28 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Intermediate, short

Summary

Artificial Analysis and IBM Software Innovation Lab launched ITBench-AA, the first benchmark for agentic enterprise IT tasks, on May 27, 2026. It focuses on Site Reliability Engineering (SRE) and evaluates frontier models on 59 SRE tasks: 40 public and 19 held-out scenarios. In each task, the model diagnoses a live Kubernetes incident by analyzing logs, traces, metrics, and application topology through shell access to a sandboxed file system, then identifies the minimal set of independent root-cause Kubernetes entities. All frontier models score below 50%: Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%. Among open-weight models, GLM-5.1 (Reasoning) reaches 40%. The results show that more turns do not translate into higher accuracy, and that the scoring penalizes false positives.

Key takeaway

For MLOps Engineers and AI Scientists building agentic solutions for Site Reliability Engineering, this benchmark shows where current frontier models fall short. Prioritize models that identify root causes with high precision, because over-investigation and false positives significantly reduce scores. Consider evaluating open-weight models such as Gemma 4 31B or GLM-5.1, which offer strong performance at a lower cost per task than the leading proprietary models, so that your deployment balances accuracy against operating cost.

Key insights

Frontier models struggle with agentic enterprise IT tasks, scoring below 50% on the new ITBench-AA SRE benchmark.

Principles

Longer agent trajectories do not guarantee higher accuracy.
Scoring by precision at full recall penalizes false-positive root causes.
Open-weight models offer competitive performance-cost ratios.

Method

Models run in the open-source Stirrup harness, with shell access to a sandboxed file system, to diagnose Kubernetes incidents. Each model submits a set of root-cause entities, and submissions are scored by precision at full recall, averaged over 59 tasks with 3 repeats each.
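
To make the scoring idea concrete, here is a minimal Python sketch of one plausible reading of "precision at full recall". The summary does not give the exact formula, so this is an assumption, not ITBench-AA's actual implementation: a submission earns its precision only when it contains every true root cause, and earns zero otherwise. The function names and the entity names (such as deploy/checkout) are hypothetical.

```python
from statistics import mean


def precision_at_full_recall(predicted, truth):
    """Assumed metric: precision if all true root causes were found, else 0.0."""
    predicted, truth = set(predicted), set(truth)
    # An empty submission, or one that misses any true root cause, scores zero.
    if not predicted or not truth <= predicted:
        return 0.0
    # Every extra (false-positive) entity lowers the score.
    return len(truth & predicted) / len(predicted)


def benchmark_score(runs):
    """Average the metric over (predicted, truth) pairs, one per task repeat."""
    return mean(precision_at_full_recall(p, t) for p, t in runs)


# Hypothetical runs for illustration only; not real benchmark data.
runs = [
    (["deploy/checkout"], ["deploy/checkout"]),              # exact match
    (["deploy/checkout", "svc/cart"], ["deploy/checkout"]),  # one false positive
    (["svc/cart"], ["deploy/checkout"]),                     # root cause missed
]
print(benchmark_score(runs))
```

Under this reading, a model that lists extra suspects to be safe is penalized for each one, which is consistent with the finding that over-investigation lowers scores.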

In practice

Evaluate agentic models on Kubernetes incident response.
Prioritize precision in root cause identification.
Consider open-weight models for cost-effective SRE automation.

Topics

ITBench-AA
Site Reliability Engineering
Agentic AI
Kubernetes Incident Response
LLM Benchmarking
Open-weight Models

Code references

itbench-hub/ITBench

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.