A Multi-Dataset Benchmark for Evaluating LLM Agents in Microservice Failure Diagnosis

2025-06-05 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

Kanglin Yin introduces two large-scale, competition-validated datasets, AIOps2025 and RCA100, for benchmarking LLM agents on microservice failure diagnosis. Both evaluate the agent's systematic reasoning process, not just its final answer. AIOps2025 comprises 400 expert-labeled failure cases from a HipsterShop system, with 11.9 GB of multimodal observability data (Metrics, Logs, Traces) and key-evidence coverage labels. It powered the 2025 CCF AIOps Challenge, which drew 561 teams. RCA100 complements it with 103 fault events from an OpenTelemetry Demo Store running on Alibaba Cloud ACK, with 3.4 GB of data across six modalities (Metrics, Logs, Traces, Events, Alerts, Topology) and causal-chain coverage labels spanning 661 evidence checkpoints. It was used in the Tianchi 2025 AIOps Track, which attracted 5,532 teams. Both benchmarks score agents on three dimensions: Localization, Identification, and Reason. Together the two competitions exercised the benchmarks with 6,093 participating teams.

Key takeaway

AI engineers developing LLM agents for microservice root cause analysis should move beyond simple final-answer matching. An evaluation strategy should include reasoning-process metrics: Localization, Identification, and evidence-grounded Reason. Use benchmarks such as AIOps2025 and RCA100 to check how well an agent handles multimodal data and complex causal chains, so that the diagnoses it produces in production are transparent and justifiable.

Key insights

LLM agent evaluation for microservice diagnosis requires assessing reasoning processes, not just final answers, using multimodal, expert-validated benchmarks.

Principles

LLM agent evaluation must assess reasoning processes.
Multimodal data fusion is critical for diagnosis.
Benchmarks need hierarchical fault coverage.

Method

Evaluate LLM agents under a reasoning-process paradigm with three pillars: Localization, Identification, and Reason. Ground-truth labeling combines multimodal anomaly detectors, independent expert review, and senior-expert adjudication, and the labels are validated by re-injecting the faults.
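
The summary does not give the benchmarks' scoring formulas, so the following is only a minimal sketch of what three-pillar scoring could look like. Everything in it is an assumption made for illustration: the GroundTruth and Diagnosis types, the case-insensitive exact match for Localization and Identification, and the substring-based evidence coverage for Reason. The official AIOps2025 and RCA100 scorers may work quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class GroundTruth:
    """Hypothetical label for one failure case (not the benchmarks' real schema)."""
    component: str                                      # faulty component, for Localization
    fault_type: str                                     # fault category, for Identification
    evidence: list[str] = field(default_factory=list)   # key-evidence phrases, for Reason


@dataclass
class Diagnosis:
    """Hypothetical agent output for one failure case."""
    component: str
    fault_type: str
    reasoning: str                                      # free-text reasoning trace


def _same(a: str, b: str) -> bool:
    # Assumed matching rule: case-insensitive, whitespace-trimmed equality.
    return a.strip().lower() == b.strip().lower()


def score(truth: GroundTruth, answer: Diagnosis) -> dict[str, float]:
    """Score one case on the three pillars; all rules here are illustrative."""
    text = answer.reasoning.lower()
    # Reason: fraction of labeled evidence phrases mentioned in the reasoning trace.
    hits = sum(1 for item in truth.evidence if item.lower() in text)
    reason = hits / len(truth.evidence) if truth.evidence else 0.0
    return {
        "localization": float(_same(answer.component, truth.component)),
        "identification": float(_same(answer.fault_type, truth.fault_type)),
        "reason": reason,
    }


if __name__ == "__main__":
    truth = GroundTruth("cartservice", "cpu stress", ["cpu usage", "latency", "timeout"])
    answer = Diagnosis(
        "CartService",
        "network delay",
        "cartservice shows high CPU usage and rising latency on checkout traces",
    )
    print(score(truth, answer))
```

In this example the agent names the right component but the wrong fault type, and its reasoning mentions two of the three labeled evidence phrases. A final-answer-only metric would mark the case as simply wrong, while the per-pillar scores show where the diagnosis holds up and where it breaks down.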

In practice

Use AIOps2025 for key-evidence coverage.
Apply RCA100 for causal-chain reasoning.
Standardize observability data for agents.
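
As one way to read "standardize observability data", the sketch below maps raw metric, log, and trace records onto a single record shape and merges them into one time-ordered timeline for an agent to consume. The Record type and the raw field names ("ts", "timestamp", "service", "service_name") are hypothetical; neither dataset's actual file layout is described in this summary.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any

# Assumed raw field names; real exports will differ per tool and dataset.
_TIME_KEYS = ("timestamp", "ts")
_SERVICE_KEYS = ("service", "service_name")


@dataclass(frozen=True)
class Record:
    """Hypothetical common shape for one observability record."""
    timestamp: datetime       # always timezone-aware UTC
    modality: str             # e.g. "metric", "log", "trace"
    service: str
    body: dict[str, Any]      # remaining modality-specific fields


def _to_utc(value: Any) -> datetime:
    # Accept epoch seconds or an ISO 8601 string; naive strings are assumed UTC.
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    parsed = datetime.fromisoformat(str(value))
    if parsed.tzinfo is None:
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def normalize(raw: dict[str, Any], modality: str) -> Record:
    """Map one raw record onto the common shape."""
    time_value = next((raw[k] for k in _TIME_KEYS if k in raw), None)
    if time_value is None:
        raise ValueError("record has no timestamp field")
    service = next((raw[k] for k in _SERVICE_KEYS if k in raw), "unknown")
    skip = set(_TIME_KEYS) | set(_SERVICE_KEYS)
    body = {k: v for k, v in raw.items() if k not in skip}
    return Record(_to_utc(time_value), modality, str(service), body)


def timeline(records: list[Record]) -> list[Record]:
    """Merge records from all modalities into one time-ordered list."""
    return sorted(records, key=lambda r: r.timestamp)


if __name__ == "__main__":
    log = normalize(
        {"timestamp": "2025-06-05T10:00:05+00:00", "service": "frontend", "message": "timeout"},
        "log",
    )
    metric = normalize({"ts": 1749117600, "service_name": "cartservice", "cpu": 0.93}, "metric")
    for record in timeline([log, metric]):
        print(record.timestamp.isoformat(), record.modality, record.service, record.body)
```

Putting every modality on the same UTC clock and service key is what lets an agent line up a metric spike with the log lines and trace spans around it, which matters for both key-evidence coverage and causal-chain reasoning.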

Topics

LLM Agents
Microservice Diagnosis
AIOps Benchmarks
Root Cause Analysis
Multimodal Observability
Reasoning Process Evaluation

Best for: Research Scientist, AI Scientist, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.