STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Cloud Computing & IT Infrastructure) · Depth: Expert, quick

Summary

STAR, a Stage-attributed Triage and Repair framework, enhances the reliability of LLM-based root cause analysis (RCA) agents in microservice AIOps. These agents often suffer from error propagation during evidence collection, hypothesis formulation, or causal analysis. STAR addresses this by explicitly decomposing the RCA workflow into four stages: Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR). Built on LangGraph, STAR implements stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. Evaluated on a public benchmark and a real-world production dataset with two RCA agent workflows and three foundation models, STAR consistently improved root cause localization and fault type classification compared to strong baselines. It accurately identifies faulty stages and repairs most incorrect traces within one or two replay rounds.

Key takeaway

For AI architects designing or deploying LLM-based RCA agents in microservices, consider integrating a stage-attributed triage and repair framework like STAR. Localizing and repairing errors at specific workflow stages, rather than treating them as end-to-end failures, improved root cause localization and fault type classification in the reported evaluation. This can make agents easier to debug and partly self-repairing, which may in turn shorten incident resolution.

Key insights

Decomposing LLM-based RCA into stages enables precise error localization and targeted repair, improving reliability.

Principles

Method

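STAR decomposes RCA into four stages (Evidence Package, Hypothesis Set, Analysis Structure, Decision Report), audits each stage, uses budget-aware Fast/Slow Routing, localizes the decisive stage through counterfactual candidate evaluation, and repairs that stage with stage-specific patch-and-replay.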
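The method can be pictured with a small sketch. This is a toy illustration in plain Python, not STAR's implementation and not LangGraph code. The functions `patch_and_replay` and `triage_and_repair`, the `audit` callback, the `max_replays` budget, and the rule "try the most downstream stage first" are all assumptions made for the example. Only the four stage names come from the summary.

```python
from typing import Callable, Dict, Optional, Tuple

# Stage names from the summary: Evidence Package, Hypothesis Set,
# Analysis Structure, Decision Report.
STAGES = ["EP", "HS", "AS", "DR"]

Trace = Dict[str, str]
StageFn = Callable[[Trace], str]


def patch_and_replay(trace: Trace, stage: str, stage_fns: Dict[str, StageFn]) -> Trace:
    """Regenerate `stage` from its recorded upstream outputs, then rerun later stages."""
    candidate = dict(trace)  # leave the original trace untouched
    for name in STAGES[STAGES.index(stage):]:
        candidate[name] = stage_fns[name](candidate)
    return candidate


def triage_and_repair(
    trace: Trace,
    stage_fns: Dict[str, StageFn],
    audit: Callable[[Trace], bool],
    max_replays: int = 4,  # hypothetical replay budget
) -> Tuple[str, Optional[str], Trace]:
    # Fast route (assumed): a trace that passes the audit needs no replay.
    if audit(trace):
        return "fast", None, trace
    # Slow route (assumed): counterfactually patch one stage at a time,
    # most downstream first, so the first passing patch names the latest
    # stage whose regeneration alone fixes the outcome.
    for replays, stage in enumerate(reversed(STAGES), start=1):
        if replays > max_replays:
            break
        candidate = patch_and_replay(trace, stage, stage_fns)
        if audit(candidate):
            return "slow", stage, candidate
    return "slow", None, trace


# Toy deterministic stages standing in for LLM calls (hypothetical).
stage_fns: Dict[str, StageFn] = {
    "EP": lambda t: "cpu_spike service=cart",
    "HS": lambda t: t["EP"].split("service=")[1],
    "AS": lambda t: f"{t['HS']} <- cpu_spike",
    "DR": lambda t: f"root_cause={t['AS'].split(' ')[0]}",
}


def audit(t: Trace) -> bool:
    """Toy check: the report must name the service seen in the evidence."""
    return t["DR"] == "root_cause=" + t["EP"].split("service=")[1]


# A recorded trace whose hypothesis stage went wrong and propagated.
bad_trace: Trace = {
    "EP": "cpu_spike service=cart",
    "HS": "payment",
    "AS": "payment <- cpu_spike",
    "DR": "root_cause=payment",
}

route, faulty_stage, repaired = triage_and_repair(bad_trace, stage_fns, audit)
print(route, faulty_stage, repaired["DR"])
```

In this toy run, regenerating only DR or only AS leaves the wrong hypothesis in place, so the audit keeps failing until HS is patched and its downstream stages are replayed. A real agent would replace the lambdas with model calls and the audit with a stage-wise auditor, and its routing and localization rules may differ from the ones assumed here.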

Best for: Research Scientist, AI Architect, AI Scientist, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.