STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices
Summary
STAR, a Stage-attributed Triage and Repair framework, enhances the reliability of LLM-based root cause analysis (RCA) agents in microservice AIOps. These agents often suffer from error propagation during evidence collection, hypothesis formulation, or causal analysis. STAR addresses this by explicitly decomposing the RCA workflow into four stages: Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR). Built on LangGraph, STAR implements stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. Evaluated on a public benchmark and a real-world production dataset with two RCA agent workflows and three foundation models, STAR consistently improved root cause localization and fault type classification compared to strong baselines. It accurately identifies faulty stages and repairs most incorrect traces within one or two replay rounds.
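One way to picture the four-stage decomposition is as a trace keyed by stage, so that each intermediate artifact can later be inspected on its own. The sketch below is illustrative only: the `Stage` enum, the `Trace` class, and the sample payloads are hypothetical names and data, not taken from the STAR implementation, and it uses plain Python rather than LangGraph.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # The four stages named in the summary, in workflow order.
    EP = "Evidence Package"
    HS = "Hypothesis Set"
    AS = "Analysis Structure"
    DR = "Decision Report"


@dataclass
class Trace:
    """One RCA run: each stage's output is stored under its stage key."""
    outputs: dict = field(default_factory=dict)

    def record(self, stage, output):
        self.outputs[stage] = output

    def completed(self):
        # Stages that have produced an output so far, in workflow order.
        return [s for s in Stage if s in self.outputs]


# Hypothetical partial run: evidence collected, hypotheses formed.
trace = Trace()
trace.record(Stage.EP, {"logs": ["checkout: timeout"]})
trace.record(Stage.HS, ["db latency", "network partition"])
print([s.name for s in trace.completed()])
```

Keeping every stage output addressable is what makes per-stage auditing and repair possible later; an agent that only retains its final answer offers nothing to localize against.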
Key takeaway
For AI architects designing or deploying LLM-based RCA agents in microservices, a stage-attributed triage and repair framework like STAR is worth considering. In the reported evaluation, localizing and repairing errors at specific workflow stages, rather than treating them as end-to-end failures, improved both root cause localization and fault type classification. The expected payoff is agents that are easier to debug and able to repair many of their own incorrect traces, which may in turn shorten incident resolution.
Key insights
Decomposing LLM-based RCA into stages enables precise error localization and targeted repair, improving reliability.
Principles
- Stage-localizable reasoning bugs are more manageable than monolithic errors.
- Explicitly modeling failure location enhances agent reliability and debuggability.
Method
STAR decomposes RCA into four stages (EP, HS, AS, DR), audits each stage, uses budget-aware Fast/Slow Routing, localizes the decisive stage through counterfactual candidate evaluation, and repairs it with stage-specific patch-and-replay.
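Stage-wise auditing and budget-aware routing could be combined roughly as follows. This is a minimal sketch under stated assumptions: the summary does not describe how STAR scores stages or spends its budget, so the `audit` scores, the `threshold`, and the rule "go slow only when a stage looks suspicious and budget remains" are all hypothetical.

```python
STAGES = ["EP", "HS", "AS", "DR"]


def audit(trace):
    """Hypothetical auditor: read a precomputed score in [0, 1] per stage.

    A real auditor would inspect each stage's output; the toy trace
    simply carries the score with it.
    """
    return {stage: trace[stage]["score"] for stage in STAGES}


def route(audit_scores, budget_left, threshold=0.7):
    """Return (path, suspicious_stages).

    'fast' accepts the trace as-is; 'slow' sends it on to stage
    localization and repair. Both the threshold and the budget rule
    are assumptions made for illustration.
    """
    suspicious = [s for s in STAGES if audit_scores[s] < threshold]
    if not suspicious:
        return "fast", []
    if budget_left <= 0:
        # No budget for deeper analysis: accept, but keep the flags.
        return "fast", suspicious
    return "slow", suspicious


# Hypothetical trace whose Hypothesis Set stage audits poorly.
toy_trace = {
    "EP": {"score": 0.9},
    "HS": {"score": 0.4},
    "AS": {"score": 0.8},
    "DR": {"score": 0.75},
}
path, flagged = route(audit(toy_trace), budget_left=2)
print(path, flagged)
```

Returning the flagged stages alongside the path lets a later localization step start from the auditor's suspicions instead of re-examining every stage.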
In practice
- Implement stage-wise auditing in agent workflows.
- Utilize counterfactual evaluation for error localization.
- Apply patch-and-replay for targeted repairs.
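The last two practices can be sketched together on a toy pipeline. Everything here is an assumption made for illustration: the stage functions, the candidate outputs, and the `score` function (which checks a known answer and stands in for whatever outcome signal a real system uses) are hypothetical, and the summary does not specify how STAR generates or ranks candidates. The idea shown is to swap in a candidate output for one stage at a time, replay only the downstream stages, and treat the stage whose swap improves the outcome most as the decisive one.

```python
STAGES = ["EP", "HS", "AS", "DR"]


def replay_after(outputs, stage, value, stage_fns):
    """Patch one stage's output, then re-run only the stages after it."""
    idx = STAGES.index(stage)
    out = {s: outputs[s] for s in STAGES[:idx]}
    out[stage] = value
    for later in STAGES[idx + 1:]:
        out[later] = stage_fns[later](out)
    return out


def localize(outputs, candidates, stage_fns, score):
    """Return the stage whose counterfactual swap helps most, plus all gains."""
    base = score(outputs)
    gains = {}
    for stage, candidate in candidates.items():
        counterfactual = replay_after(outputs, stage, candidate, stage_fns)
        gains[stage] = score(counterfactual) - base
    # On ties, max() keeps the earliest stage (dicts preserve insertion order).
    return max(gains, key=gains.get), gains


# Toy agent: HS has a bug that drops services with many errors.
stage_fns = {
    "HS": lambda o: [s for s, n in o["EP"].items() if n < 5],
    "AS": lambda o: max(o["HS"], key=lambda s: o["EP"][s]),
    "DR": lambda o: {"root_cause": o["AS"]},
}
faulty = {
    "EP": {"cart": 3, "db": 9},
    "HS": ["cart"],
    "AS": "cart",
    "DR": {"root_cause": "cart"},
}
# Hypothetical candidate replacements, one per stage.
candidates = {
    "EP": {"cart": 3, "db": 9},
    "HS": ["cart", "db"],
    "AS": "cart",
    "DR": {"root_cause": "cart"},
}


def score(outputs):
    # Stand-in outcome signal: the toy case has a known answer, "db".
    return 1.0 if outputs["DR"]["root_cause"] == "db" else 0.0


decisive, gains = localize(faulty, candidates, stage_fns, score)
repaired = replay_after(faulty, decisive, candidates[decisive], stage_fns)
print(decisive, repaired["DR"])
```

Because the repair reuses the same `replay_after` helper, only the decisive stage is patched and only its downstream stages are recomputed; the evidence collected in earlier stages is left untouched.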
Topics
- STAR Framework
- RCA Agents
- Microservices AIOps
- Stage-attributed Repair
- LangGraph
Best for: Research Scientist, AI Architect, AI Scientist, MLOps Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.