STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices
Summary
STAR, a Stage-attributed Triage and Repair framework, improves the reliability of LLM-based root cause analysis (RCA) agents in microservices by addressing error propagation across the diagnostic workflow. It breaks the RCA workflow into four explicit stages: Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR). STAR treats agent failures as stage-localizable bugs and handles them in four steps: stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. It was evaluated on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows (mABC, RCAgent) and three foundation models (GPT-5, Qwen3-Max, Gemini-2.5-Pro). Across these settings, STAR consistently improved root cause localization and fault type classification. It accurately identifies faulty stages and repairs most incorrect traces within one or two replay rounds, and the results indicate that both the routing and the counterfactual evaluation mechanisms contribute to these gains.
Key takeaway
AI Architects and AI Engineers designing or deploying LLM-based RCA systems should consider integrating a stage-attributed repair framework like STAR. In the reported evaluation, this approach improved root cause localization and fault type classification by pinpointing and correcting errors at specific workflow stages rather than treating them as monolithic failures. Implementing structured stages and replay mechanisms can lead to more robust, debuggable, and self-repairing agentic RCA systems, which may in turn reduce incident resolution times and operational costs.
Key insights
STAR improves LLM-based RCA agent reliability by localizing and repairing errors within specific diagnostic stages.
Principles
- Decompose complex reasoning into structured, auditable stages.
- Match repair intensity to error severity and contamination.
- Validate stage attribution through counterfactual replay.
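The second principle, matching repair intensity to error severity and contamination, can be illustrated with a small routing function. This is a minimal sketch, not STAR's actual routing rule: the per-stage audit scores in [0, 1], the `accept_at` and `slow_below` thresholds, the `budget` unit, and the `Route` type are all assumptions made for illustration.

```python
from dataclasses import dataclass

STAGES = ("EP", "HS", "AS", "DR")  # the four stages named in the summary


@dataclass
class Route:
    path: str       # "accept", "fast", or "slow"
    suspects: list  # stages to examine, earliest first


def route(audit_scores, budget, accept_at=0.8, slow_below=0.4):
    """Pick a repair path from per-stage audit scores (hypothetical thresholds)."""
    suspects = [s for s in STAGES if audit_scores[s] < accept_at]
    if not suspects:
        return Route("accept", [])  # every stage passed the audit
    # Severe: at least one stage scored very low.
    severe = min(audit_scores[s] for s in suspects) < slow_below
    # Contaminated: more than one stage looks unreliable.
    contaminated = len(suspects) > 1
    if (severe or contaminated) and budget >= 2:
        return Route("slow", suspects)  # examine every suspect stage
    return Route("fast", suspects[:1])  # cheap path: earliest suspect only
```

With these assumed thresholds, a trace with one mildly doubtful stage takes the fast path, while a trace with a very low score or several doubtful stages takes the slow path if the budget allows it.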
Method
STAR audits RCA traces, routes based on reliability scores, identifies the earliest decisive faulty stage via counterfactual evaluation, applies stage-specific patches, and replays downstream reasoning to correct errors.
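The localization and repair steps described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the `runners`, `patches`, and `accept` callables, the dictionary-based trace, and the example evidence strings are all hypothetical. The idea shown is to try a patched output at each stage in order, replay the downstream stages, and treat the earliest stage whose patch yields an accepted trace as the decisive one.

```python
STAGES = ["EP", "HS", "AS", "DR"]


def replay(trace, start, runners):
    """Re-run every stage from `start` onward, reusing earlier outputs."""
    out = dict(trace)
    for stage in STAGES[STAGES.index(start):]:
        out[stage] = runners[stage](out)
    return out


def localize_and_repair(trace, runners, patches, accept):
    """Return (earliest decisive stage, repaired trace), or (None, trace)."""
    for stage in STAGES:
        if stage not in patches:
            continue
        candidate = dict(trace)
        candidate[stage] = patches[stage](trace)  # counterfactual stage output
        nxt = STAGES.index(stage) + 1
        if nxt < len(STAGES):
            candidate = replay(candidate, STAGES[nxt], runners)
        if accept(candidate):  # hypothetical evaluator of the replayed trace
            return stage, candidate
    return None, trace


# Toy workflow: the evidence stage misses a database alert.
runners = {
    "EP": lambda t: ["cpu_ok"],
    "HS": lambda t: ["db"] if "db_latency_high" in t["EP"] else ["network"],
    "AS": lambda t: t["HS"][0],
    "DR": lambda t: {"root_cause": t["AS"]},
}
trace = replay({}, "EP", runners)

patches = {"EP": lambda t: t["EP"] + ["db_latency_high"]}
accept = lambda t: t["DR"]["root_cause"] == "db"
stage, fixed = localize_and_repair(trace, runners, patches, accept)
```

In this toy run the original trace blames the network, and patching the evidence stage followed by one replay of the downstream stages changes the reported root cause. In a real system, `accept` would be replaced by whatever candidate evaluation is available at repair time, since ground truth is not.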
In practice
- Implement RCA workflows with explicit EP, HS, AS, DR stages.
- Use LangGraph for node-level execution and controllable replay.
- Employ LLM-as-a-Judge for evaluating stage localization accuracy.
Topics
- STAR Framework
- LLM-based RCA Agents
- Microservice Root Cause Analysis
- Stage-attributed Repair
- Fast/Slow Routing
Best for: AI Architect, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.