ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
Summary
On May 27, 2026, Artificial Analysis and IBM Software Innovation Lab launched ITBench-AA, the first benchmark for agentic enterprise IT tasks, with a focus on Site Reliability Engineering (SRE). It evaluates frontier models on 59 SRE tasks (40 public and 19 held-out scenarios). In each task the model diagnoses a live Kubernetes incident by analyzing logs, traces, metrics, and application topology through shell access to a sandboxed file system, then identifies the minimal set of independent root-cause Kubernetes entities. All frontier models score below 50%: Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%. Open-weight models such as GLM-5.1 (Reasoning) reached 40%. The results also show that more turns do not translate into higher accuracy, and that the scoring penalizes false positives.
Key takeaway
For MLOps Engineers and AI Scientists building agentic solutions for Site Reliability Engineering, this benchmark shows where current frontier models fall short. Prioritize models that identify root causes with high precision, because over-investigation and false positives significantly reduce scores. Consider evaluating open-weight models such as Gemma 4 31B or GLM-5.1, which offer strong performance at a lower cost per task than leading proprietary models, so that your deployment balances accuracy against operational expenditure.
Key insights
Frontier models struggle with agentic enterprise IT tasks, scoring below 50% on the new ITBench-AA SRE benchmark.
Principles
- Longer agent trajectories do not guarantee higher accuracy.
- Scoring by precision at full recall penalizes false-positive root causes.
- Open-weight models offer competitive performance-cost ratios.
Method
Models run in the open-source Stirrup harness, which gives them shell access to a sandboxed file system for diagnosing Kubernetes incidents. Each model submits its root-cause entities, and its score is the average precision at full recall across 59 tasks with 3 repeats.
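To make the scoring idea concrete, here is a minimal sketch of one plausible reading of "precision at full recall". This summary does not give the exact ITBench-AA formula, so the rule below is an assumption: a run earns its precision only if it names every true root cause, and 0 otherwise. The function names and the entity strings are hypothetical and are not part of the benchmark or the Stirrup harness.

```python
from statistics import mean


def precision_at_full_recall(submitted, truth):
    """Assumed rule: precision if every true root cause was submitted, else 0.0."""
    submitted, truth = set(submitted), set(truth)
    # A run that misses any true root cause (recall < 1) earns nothing.
    if not submitted or not truth <= submitted:
        return 0.0
    # With full recall, every extra (false-positive) entity lowers precision.
    return len(truth) / len(submitted)


def benchmark_score(runs):
    """Average per-run scores; `runs` holds (submitted, truth) pairs,
    one pair per task repeat."""
    return mean(precision_at_full_recall(s, t) for s, t in runs)


# Hypothetical entity names, for illustration only.
runs = [
    (["deployment/checkout"], ["deployment/checkout"]),                  # exact match
    (["deployment/checkout", "service/cart"], ["deployment/checkout"]),  # one false positive
    (["service/cart"], ["deployment/checkout"]),                         # root cause missed
]
print(benchmark_score(runs))
```

Under this assumed rule, an agent that pads its answer with extra suspects scores lower than one that submits only the true root cause, which is consistent with the observation that over-investigation and false positives hurt results.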
In practice
- Evaluate agentic models on Kubernetes incident response.
- Prioritize precision in root cause identification.
- Consider open-weight models for cost-effective SRE automation.
Topics
- ITBench-AA
- Site Reliability Engineering
- Agentic AI
- Kubernetes Incident Response
- LLM Benchmarking
- Open-weight Models
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, MLOps Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.