A Multi-Dataset Benchmark for Evaluating LLM Agents in Microservice Failure Diagnosis
Summary
Kanglin Yin introduces two large-scale, competition-validated datasets, AIOps2025 and RCA100, for benchmarking LLM agents on microservice failure diagnosis. Both evaluate an agent's reasoning process rather than only its final answer. AIOps2025 comprises 400 expert-labeled failure cases from a HipsterShop system, with 11.9 GB of multimodal observability data (Metrics, Logs, Traces) and key-evidence coverage labels; it powered the 2025 CCF AIOps Challenge, which drew 561 teams. RCA100 provides 103 fault events from an OpenTelemetry Demo Store running on Alibaba Cloud ACK, with 3.4 GB of data across six modalities (Metrics, Logs, Traces, Events, Alerts, Topology) and causal-chain coverage labels backed by 661 evidence checkpoints; it was used in the Tianchi 2025 AIOps Track, which drew 5,532 teams. Both benchmarks score agents along three dimensions (Localization, Identification, and Reason), and together they have been exercised by more than 6,000 competing teams.
Key takeaway
AI engineers building LLM agents for microservice root cause analysis should move beyond simple final-answer matching. Evaluate the reasoning process as well, using metrics such as Localization, Identification, and evidence-grounded Reason. Benchmarks such as AIOps2025 and RCA100 can validate how well an agent handles multimodal data and complex causal chains, which helps ensure its diagnoses are transparent and justifiable in production.
Key insights
LLM agent evaluation for microservice diagnosis requires assessing reasoning processes, not just final answers, using multimodal, expert-validated benchmarks.
Principles
- LLM agent evaluation must assess reasoning processes.
- Multimodal data fusion is critical for diagnosis.
- Benchmarks need hierarchical fault coverage.
Method
Evaluate LLM agents with a reasoning-process paradigm that scores three dimensions: Localization, Identification, and Reason. Ground-truth labeling combines multimodal anomaly detectors, independent expert review, and senior-expert adjudication, with faults re-injected for validation.
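To make the three scores concrete, here is a minimal sketch of how one diagnosis might be scored on Localization, Identification, and Reason. It is not the benchmarks' official scorer: the `GroundTruth` and `Diagnosis` types, the exact-match rule, and the keyword-based evidence check are all assumptions made for illustration, and the example service name and fault type are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GroundTruth:
    """Hypothetical label for one failure case (not the datasets' real schema)."""
    component: str                      # faulty service, pod, or node
    fault_type: str                     # injected fault category
    evidence: list[str] = field(default_factory=list)  # key-evidence keywords


@dataclass
class Diagnosis:
    """Hypothetical agent output for one failure case."""
    component: str
    fault_type: str
    reasoning: str                      # free-text explanation from the agent


def score(diag: Diagnosis, truth: GroundTruth) -> dict[str, float]:
    """Score one diagnosis on three dimensions, each in [0, 1].

    Assumption: Localization and Identification are case-insensitive exact
    matches, and Reason is the fraction of evidence keywords that appear
    in the agent's explanation.
    """
    localization = float(
        diag.component.strip().lower() == truth.component.strip().lower()
    )
    identification = float(
        diag.fault_type.strip().lower() == truth.fault_type.strip().lower()
    )
    text = diag.reasoning.lower()
    hits = sum(1 for keyword in truth.evidence if keyword.lower() in text)
    reason = hits / len(truth.evidence) if truth.evidence else 0.0
    return {
        "localization": localization,
        "identification": identification,
        "reason": reason,
    }


# Example with made-up values: right component and fault, 2 of 3 evidence items.
truth = GroundTruth(
    component="checkoutservice",
    fault_type="cpu_stress",
    evidence=["cpu usage", "latency", "timeout"],
)
diag = Diagnosis(
    component="CheckoutService",
    fault_type="cpu_stress",
    reasoning="CPU usage spiked and latency rose on checkout.",
)
result = score(diag, truth)
```

Keeping the three scores separate, rather than collapsing them into one number, shows whether an agent reached the right answer with or without supporting evidence. A real scorer would likely need something more robust than substring matching for the Reason dimension.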
In practice
- Use AIOps2025 for key-evidence coverage.
- Apply RCA100 for causal-chain reasoning.
- Standardize observability data for agents.
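One way to read "standardize observability data" is to map each modality into a single record shape before handing it to an agent. The sketch below assumes hypothetical input field names (`timestamp`, `time`, `start_us`, and so on); neither dataset's actual file layout is described here, so treat every key as a placeholder.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    """Hypothetical common shape for metrics, logs, and traces."""
    ts: datetime      # timezone-aware UTC timestamp
    modality: str     # "metric", "log", or "trace"
    service: str
    body: str         # short human-readable payload


def from_metric(row: dict) -> Record:
    # Assumes epoch seconds in row["timestamp"].
    ts = datetime.fromtimestamp(row["timestamp"], tz=timezone.utc)
    return Record(ts, "metric", row["service"], f'{row["name"]}={row["value"]}')


def from_log(row: dict) -> Record:
    # Assumes an ISO 8601 string in row["time"]; naive values are treated as UTC.
    ts = datetime.fromisoformat(row["time"])
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    return Record(ts.astimezone(timezone.utc), "log", row["service"], row["message"])


def from_trace(span: dict) -> Record:
    # Assumes epoch microseconds in span["start_us"].
    ts = datetime.fromtimestamp(span["start_us"] / 1_000_000, tz=timezone.utc)
    body = f'{span["operation"]} took {span["duration_ms"]} ms'
    return Record(ts, "trace", span["service"], body)


def timeline(records: list[Record]) -> list[Record]:
    """Merge records from all modalities into one time-ordered list."""
    return sorted(records, key=lambda r: r.ts)


# Example with made-up rows.
records = [
    from_log({"time": "2023-11-14T22:13:21+00:00", "service": "cart",
              "message": "connection refused"}),
    from_metric({"timestamp": 1700000000, "service": "cart",
                 "name": "cpu_usage", "value": 0.93}),
    from_trace({"start_us": 1700000000500000, "service": "checkout",
                "operation": "PlaceOrder", "duration_ms": 840}),
]
merged = timeline(records)
```

A single time-ordered list lets an agent correlate a metric spike with the log lines and slow spans around it without modality-specific parsing in the prompt.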
Topics
- LLM Agents
- Microservice Diagnosis
- AIOps Benchmarks
- Root Cause Analysis
- Multimodal Observability
- Reasoning Process Evaluation
Best for: Research Scientist, AI Scientist, MLOps Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.