OpenRCA 2.0: From Outcome Labels to Causal Process Supervision
Summary
OpenRCA 2.0 is presented as the first cross-system root cause analysis (RCA) benchmark with step-wise causal annotations for LLM agents. It comprises 500 instances. The benchmark addresses a gap in existing datasets, which typically label only the root cause and leave the propagation path from cause to symptom unannotated. To produce the annotations, the authors developed the PAVE protocol, which uses known fault injection interventions to reconstruct causal propagation paths through forward verification. An evaluation of 11 frontier LLMs on OpenRCA 2.0 found that exact root-cause set recovery succeeds in only 20.7% of cases. Agents identify at least one correct root-cause service in 76.0% of cases, but ground it in a verified causal path to the symptom in only 61.5%. The paper terms this failure mode "ungrounded diagnosis" and notes that outcome-only evaluation hides it.
Key takeaway
AI engineers who develop or evaluate LLM agents for root cause analysis should move beyond outcome-only metrics. Incorporate step-wise causal ground truth into evaluation to catch "ungrounded diagnosis" failures, where an agent names a correct service but does not ground it in a verified causal path to the symptom. Prioritize models that demonstrate strong causal path grounding, because trustworthy and actionable RCA output depends on it.
Key insights
Robust LLM-based root cause analysis requires causal process supervision and step-wise path validation, not just outcome labels.
Principles
- Root cause analysis holistically tests LLM agentic capabilities.
- Outcome-only RCA evaluation hides critical "ungrounded diagnosis" failures.
- Forward verification from cause to effect reconstructs causal propagation paths.
Method
The PAVE protocol uses fault injection to reconstruct causal propagation paths via forward verification, enabling step-wise causal annotations for RCA benchmarks.
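The idea of forward verification can be illustrated with a minimal sketch. This is not the PAVE implementation; it assumes a hypothetical propagation graph (`propagation_edges`, mapping a service to the services its failure can affect), a known injected service, and a set of services with observed anomalies. Starting from the injected fault, it walks forward and keeps only hops whose downstream service actually shows an anomaly.

```python
from collections import deque


def forward_verify(propagation_edges, injected, anomalous, symptom):
    """Return a cause-to-symptom path in which every service is anomalous.

    All argument names and the data layout are illustrative assumptions,
    not part of the PAVE protocol as published.
    Returns None when no fully verified path exists.
    """
    if injected not in anomalous:
        return None  # the injected fault left no observable evidence

    parent = {injected: None}  # doubles as the visited set
    queue = deque([injected])
    while queue:
        node = queue.popleft()
        if node == symptom:
            # Walk parent links back to the injected service.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for downstream in propagation_edges.get(node, []):
            # Only follow hops whose effect is actually observed.
            if downstream in anomalous and downstream not in parent:
                parent[downstream] = node
                queue.append(downstream)
    return None


# Hypothetical example: a fault injected into "db" surfaces at "frontend".
edges = {
    "db": ["orders", "inventory"],
    "orders": ["checkout"],
    "inventory": ["checkout"],
    "checkout": ["frontend"],
}
anomalous = {"db", "orders", "checkout", "frontend"}
path = forward_verify(edges, "db", anomalous, "frontend")
```

Because the search starts at the known intervention rather than at the symptom, a candidate path is discarded as soon as one hop lacks supporting evidence.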
In practice
- Evaluate LLM agents using step-wise causal annotations.
- Implement forward verification for RCA path reconstruction.
- Assess LLMs for "ungrounded diagnosis" beyond outcome metrics.
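The checks above could be combined in an evaluation harness along the following lines. This is a sketch under stated assumptions, not the benchmark's scoring code: the record fields (`root_causes`, `path`, `symptom`, `verified_edges`) are hypothetical, and "grounded" is approximated as a predicted path that starts at a true root cause, ends at the symptom, and uses only verified hops.

```python
def score(prediction, truth):
    """Score one RCA prediction against step-wise ground truth.

    The record layout is an illustrative assumption, not the
    OpenRCA 2.0 schema.
    """
    pred_roots = set(prediction["root_causes"])
    true_roots = set(truth["root_causes"])
    verified_edges = set(truth["verified_edges"])

    path = prediction["path"]
    hops = list(zip(path, path[1:]))  # consecutive (cause, effect) pairs

    exact = pred_roots == true_roots   # exact root-cause set recovery
    hit = bool(pred_roots & true_roots)  # at least one correct service
    grounded = (
        bool(path)
        and path[0] in true_roots
        and path[-1] == truth["symptom"]
        and all(hop in verified_edges for hop in hops)
    )
    return {
        "exact": exact,
        "hit": hit,
        "grounded": grounded,
        # Correct service, but no verified path behind it.
        "ungrounded": hit and not grounded,
    }


# Hypothetical instance: the agent names the right service ("db")
# but explains it through a hop that was never verified.
truth = {
    "root_causes": ["db"],
    "symptom": "frontend",
    "verified_edges": [
        ("db", "orders"),
        ("orders", "checkout"),
        ("checkout", "frontend"),
    ],
}
prediction = {"root_causes": ["db"], "path": ["db", "cache", "frontend"]}
result = score(prediction, truth)
```

An outcome-only metric would count this prediction as a success on both `exact` and `hit`; only the path check exposes it as an ungrounded diagnosis.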
Topics
- Root Cause Analysis
- LLM Agents
- Causal Inference
- Benchmarking
- PAVE Protocol
- Fault Injection
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.