STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

STAR (Stage-attributed Triage and Repair) is a framework that improves the reliability of LLM-based root cause analysis (RCA) agents in microservices by addressing error propagation. It decomposes the RCA workflow into four explicit stages: Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR). STAR treats agent failures as stage-localizable bugs and handles them in four steps: stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. The framework was evaluated on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows (mABC, RCAgent) and three foundation models (GPT-5, Qwen3-Max, Gemini-2.5-Pro). Across these settings, STAR consistently improved root cause localization and fault type classification. It accurately identifies faulty stages and repairs most incorrect traces within one or two replay rounds, and both its routing and its evaluation mechanisms are reported to contribute significantly to these results.
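
To make the staged view concrete, here is a minimal Python sketch of a four-stage trace and a budget-aware Fast/Slow routing rule. Only the stage names come from the summary. The `Trace` class, the `route` function, the 0.7 threshold, the integer budget, and the reading of "fast" as accepting the trace while "slow" sends it to deeper evaluation are all illustrative assumptions, not the paper's definitions.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """The four RCA stages named in the summary, in workflow order."""
    EP = 0  # Evidence Package
    HS = 1  # Hypothesis Set
    AS = 2  # Analysis Structure
    DR = 3  # Decision Report


@dataclass
class Trace:
    """One RCA run (hypothetical layout): stage artifacts plus audit scores."""
    artifacts: dict[Stage, str]
    audit_scores: dict[Stage, float] = field(default_factory=dict)


def route(trace: Trace, threshold: float = 0.7, budget: int = 1) -> str:
    """Assumed routing rule, not the paper's.

    Returns "slow" (deeper evaluation) when budget remains and the weakest
    stage score falls below the threshold; otherwise returns "fast".
    """
    if budget <= 0:
        return "fast"  # no budget left for the expensive path
    weakest = min(trace.audit_scores.values(), default=1.0)
    return "slow" if weakest < threshold else "fast"
```

In this sketch a single low-scoring stage is enough to trigger the slow path, which reflects the idea that an early error can propagate to every later stage.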

Key takeaway

For AI architects and AI engineers designing or deploying LLM-based RCA systems, a stage-attributed repair framework like STAR is worth considering. In the reported evaluation, it improves diagnostic accuracy by pinpointing and correcting errors at specific workflow stages rather than treating a failed diagnosis as a monolithic failure. Structured stages and replay mechanisms can make agentic RCA systems more robust, debuggable, and self-repairing, which may in turn reduce incident resolution times and operational costs.

Key insights

STAR improves LLM-based RCA agent reliability by localizing and repairing errors within specific diagnostic stages.

Method

STAR audits RCA traces, routes based on reliability scores, identifies the earliest decisive faulty stage via counterfactual evaluation, applies stage-specific patches, and replays downstream reasoning to correct errors.
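
The localize, patch, and replay steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the `replay` and `localize_and_repair` functions, the per-stage `candidates` lists, and the `accept` callback (a stand-in for whatever evaluator judges a replayed report) are all hypothetical names.

```python
from typing import Callable, Optional

STAGES = ["EP", "HS", "AS", "DR"]  # workflow order from the summary

# Hypothetical: each stage function maps the previous artifact to its own.
StageFn = Callable[[str], str]


def replay(pipeline: dict[str, StageFn], artifacts: dict[str, str],
           start: str) -> dict[str, str]:
    """Recompute every stage strictly after `start`; keep earlier artifacts."""
    out = dict(artifacts)
    idx = STAGES.index(start)
    for prev, cur in zip(STAGES[idx:], STAGES[idx + 1:]):
        out[cur] = pipeline[cur](out[prev])
    return out


def localize_and_repair(
    pipeline: dict[str, StageFn],
    artifacts: dict[str, str],
    candidates: dict[str, list[str]],
    accept: Callable[[str], bool],
) -> tuple[Optional[str], dict[str, str]]:
    """Counterfactual search, earliest stage first.

    For each stage, swap in a candidate artifact, replay downstream stages,
    and return the first (stage, repaired trace) whose Decision Report
    passes `accept`. Returns (None, original trace) if nothing passes.
    """
    for stage in STAGES:
        for patch in candidates.get(stage, []):
            patched = dict(artifacts)
            patched[stage] = patch
            repaired = replay(pipeline, patched, stage)
            if accept(repaired["DR"]):
                return stage, repaired
    return None, artifacts
```

Scanning stages in workflow order means the first stage whose patch flips the outcome is treated as the earliest decisive one, and only the stages after it are recomputed.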

Best for: AI Architect, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.