IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

Source: Hugging Face - Blog · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Data Science & Analytics) · Depth: Advanced, long

Summary

IBM Research and UC Berkeley studied why agentic LLM systems fail at real-world IT automation tasks such as incident triage and Kubernetes actions. They applied MAST (Multi-Agent System Failure Taxonomy) to 310 ITBench SRE traces produced by Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B. Frontier models such as Gemini-3-Flash show fewer, isolated failure modes (2.6 per trace), often tied to incorrect verification. Large open models such as GPT-OSS-120B instead suffer cascading failures (5.3 per trace) driven by reasoning mismatches and context poisoning. Kimi-K2 struggles with premature termination and unawareness of termination conditions, along with significant action-reasoning mismatch. The study argues that success rate alone is an insufficient metric and advocates detailed failure diagnostics that point to targeted engineering interventions.

Key takeaway

AI Architects and Machine Learning Engineers building agents for enterprise IT workflows need to know which specific failure modes their systems hit, not just how often they succeed. A diagnostic taxonomy such as MAST can identify "fatal" flaws like incorrect verification (FM-3.3) or loss of conversation history (FM-1.4). Each diagnosis points to a targeted fix, such as externalizing verification for Gemini-3-Flash or implementing state machines for Kimi-K2, which makes the resulting agentic systems more robust and reliable.
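The two interventions named above, externalized verification and a state machine around termination, can be combined in one small guard. The summary gives no implementation, so the following is only a minimal sketch under stated assumptions: `Phase`, `TerminationGuard`, `record_remediation`, and `may_terminate` are hypothetical names, and the `verify` callable stands in for whatever independent check (for example, re-querying cluster health) a real deployment would use.

```python
from enum import Enum, auto
from typing import Callable


class Phase(Enum):
    """Hypothetical phases of an SRE agent run (not taken from the source)."""
    INVESTIGATING = auto()
    REMEDIATING = auto()
    VERIFIED = auto()


class TerminationGuard:
    """Lets the agent stop only after an external check confirms the fix.

    `verify` is an assumption: any callable that inspects the real system
    state instead of trusting the model's own claim of success.
    """

    def __init__(self, verify: Callable[[], bool]) -> None:
        self._verify = verify
        self.phase = Phase.INVESTIGATING

    def record_remediation(self) -> None:
        # Called by the orchestrator once the agent has applied a change.
        self.phase = Phase.REMEDIATING

    def may_terminate(self) -> bool:
        # Block premature termination: nothing has been remediated yet.
        if self.phase is Phase.INVESTIGATING:
            return False
        # Externalized verification: the check runs outside the model.
        if self._verify():
            self.phase = Phase.VERIFIED
            return True
        return False


if __name__ == "__main__":
    guard = TerminationGuard(verify=lambda: True)
    print(guard.may_terminate())   # agent tries to stop before acting
    guard.record_remediation()
    print(guard.may_terminate())   # agent tries to stop after a verified fix
```

The design choice is that the model never decides on its own that it is done: the orchestrator owns the phase transitions, and the verification result comes from outside the model's reasoning.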

Key insights

MAST diagnoses agentic LLM failures beyond success rates, revealing distinct failure signatures across model classes.

Method

MAST converts unstructured execution logs into structured "failure vectors" that record which of 14 failure patterns occur in each trace. The patterns fall into three categories (System Design, Inter-Agent Misalignment, and Task Verification), which enables fine-grained failure analysis.
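One way to picture a failure vector is as a fixed-order list of 0/1 flags, one per failure mode, from which a per-trace average like the 2.6 and 5.3 figures in the summary can be computed. The sketch below is illustrative only and is not MAST's actual tooling: `FailureVector`, `to_failure_vector`, and `mean_modes_per_trace` are hypothetical names, the annotations are made up for the example, and `MODE_IDS` lists just the two mode IDs named in this summary where a full vector would have 14 entries.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumption: only the two mode IDs named in this summary are listed here.
# A complete MAST vector would carry one entry for each of the 14 patterns.
MODE_IDS: tuple[str, ...] = ("FM-1.4", "FM-3.3")


@dataclass(frozen=True)
class FailureVector:
    """Binary flags, aligned with MODE_IDS, for one execution trace."""
    trace_id: str
    flags: tuple[int, ...]

    def count(self) -> int:
        # Number of distinct failure modes observed in this trace.
        return sum(self.flags)


def to_failure_vector(
    trace_id: str,
    observed_modes: set[str],
    mode_ids: tuple[str, ...] = MODE_IDS,
) -> FailureVector:
    """Turn a set of annotated mode IDs into a fixed-order 0/1 vector."""
    unknown = observed_modes - set(mode_ids)
    if unknown:
        raise ValueError(f"unknown failure modes: {sorted(unknown)}")
    flags = tuple(int(mode in observed_modes) for mode in mode_ids)
    return FailureVector(trace_id, flags)


def mean_modes_per_trace(vectors: list[FailureVector]) -> float:
    """Average number of failure modes per trace (0.0 for an empty list)."""
    if not vectors:
        return 0.0
    return sum(v.count() for v in vectors) / len(vectors)


if __name__ == "__main__":
    # Made-up annotations for two traces, for illustration only.
    vectors = [
        to_failure_vector("trace-a", {"FM-3.3"}),
        to_failure_vector("trace-b", {"FM-1.4", "FM-3.3"}),
    ]
    print(mean_modes_per_trace(vectors))
```

Keeping the flag order fixed is what makes vectors comparable across traces and models, so per-mode frequencies and co-occurrence patterns can be tallied column by column.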

Best for: AI Architect, Machine Learning Engineer, AI Scientist, MLOps Engineer, AI Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.