STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

2026-05-15 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Cloud Computing & IT Infrastructure · Depth: Expert, quick

Summary

STAR, a Stage-attributed Triage and Repair framework, enhances the reliability of LLM-based root cause analysis (RCA) agents in microservice AIOps. These agents often suffer from error propagation during evidence collection, hypothesis formulation, or causal analysis. STAR addresses this by explicitly decomposing the RCA workflow into four stages: Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR). Built on LangGraph, STAR implements stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. Evaluated on a public benchmark and a real-world production dataset with two RCA agent workflows and three foundation models, STAR consistently improved root cause localization and fault type classification compared to strong baselines. It accurately identifies faulty stages and repairs most incorrect traces within one or two replay rounds.

Key takeaway

AI architects designing or deploying LLM-based RCA agents for microservices should consider a stage-attributed triage and repair framework such as STAR. Localizing and repairing errors at specific workflow stages, rather than treating them as end-to-end failures, improved root cause localization and fault type classification in the reported evaluation. Agents built this way should be easier to debug and can repair many of their own incorrect traces, which may in turn shorten incident resolution.

Key insights

Decomposing LLM-based RCA into stages enables precise error localization and targeted repair, improving reliability.

Principles

Stage-localizable reasoning bugs are more manageable than monolithic errors.
Explicitly modeling failure location enhances agent reliability and debuggability.

Method

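STAR decomposes RCA into four stages (Evidence Package, Hypothesis Set, Analysis Structure, Decision Report), audits each stage, and applies budget-aware Fast/Slow Routing. It localizes the decisive faulty stage through counterfactual candidate evaluation, then repairs it with stage-specific patch-and-replay.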
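The staged workflow with per-stage auditing might be sketched as follows. This is a minimal illustration in plain Python, not STAR's LangGraph implementation: only the stage names EP, HS, AS, and DR come from the summary. The functions `run_pipeline`, `audit`, and `route`, the toy stages and auditors, and the routing rule itself are assumptions made for the example.

```python
from typing import Callable, Dict, List

# Stage names from the summary: Evidence Package, Hypothesis Set,
# Analysis Structure, Decision Report.
STAGES: List[str] = ["EP", "HS", "AS", "DR"]

StageFn = Callable[[dict, Dict[str, object]], object]
Auditor = Callable[[object], bool]


def run_pipeline(stage_fns: Dict[str, StageFn], incident: dict) -> Dict[str, object]:
    """Run the four stages in order; each stage sees all earlier outputs."""
    outputs: Dict[str, object] = {}
    for name in STAGES:
        outputs[name] = stage_fns[name](incident, outputs)
    return outputs


def audit(outputs: Dict[str, object], auditors: Dict[str, Auditor]) -> List[str]:
    """Return the stages whose output fails its (hypothetical) auditor."""
    return [s for s in STAGES if not auditors[s](outputs[s])]


def route(flagged: List[str], budget: int) -> str:
    """Assumed routing rule: take the slow path only if something is
    flagged and budget remains. The real policy may differ."""
    return "slow" if flagged and budget > 0 else "fast"


# Toy stages (hypothetical): pick error lines, propose services, count, decide.
toy_stages: Dict[str, StageFn] = {
    "EP": lambda inc, out: [l for l in inc["logs"] if "ERROR" in l],
    "HS": lambda inc, out: sorted({l.split()[0] for l in out["EP"]}),
    "AS": lambda inc, out: {h: sum(l.startswith(h) for l in out["EP"])
                            for h in out["HS"]},
    "DR": lambda inc, out: max(out["AS"], key=out["AS"].get) if out["AS"] else None,
}
# Toy auditors (hypothetical): each only checks that its stage produced something.
toy_auditors: Dict[str, Auditor] = {
    "EP": lambda ep: len(ep) > 0,
    "HS": lambda hs: len(hs) > 0,
    "AS": lambda scores: len(scores) > 0,
    "DR": lambda dr: dr is not None,
}

incident = {"logs": ["cart ERROR timeout", "db ERROR pool exhausted",
                     "db ERROR pool exhausted", "web INFO ok"]}
outputs = run_pipeline(toy_stages, incident)
flagged = audit(outputs, toy_auditors)
decision = route(flagged, budget=3)
```

Keeping every stage output in one dictionary keyed by stage name is what makes later localization possible: an auditor can point at a specific intermediate artifact instead of only the final report.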

In practice

Implement stage-wise auditing in agent workflows.
Utilize counterfactual evaluation for error localization.
Apply patch-and-replay for targeted repairs.
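One way to combine counterfactual evaluation with patch-and-replay is sketched below. It is an assumption-laden illustration, not the paper's algorithm: a candidate replacement is substituted for one stage's output, the downstream stages are replayed, and the first stage whose patch yields an accepted decision is treated as the decisive one. The names `replay_from`, `localize_and_repair`, `candidates`, `accept`, and the `max_replays` budget are all hypothetical, and `accept` stands in for whatever auditor or verifier a real system would use.

```python
from typing import Callable, Dict, List, Optional, Tuple

STAGES: List[str] = ["EP", "HS", "AS", "DR"]
StageFn = Callable[[dict, Dict[str, object]], object]


def replay_from(stage_fns: Dict[str, StageFn], incident: dict,
                outputs: Dict[str, object], stage: str,
                patched_value: object) -> Dict[str, object]:
    """Keep upstream outputs, substitute the patch, recompute downstream stages."""
    idx = STAGES.index(stage)
    new = {s: outputs[s] for s in STAGES[:idx]}
    new[stage] = patched_value
    for name in STAGES[idx + 1:]:
        new[name] = stage_fns[name](incident, new)
    return new


def localize_and_repair(stage_fns: Dict[str, StageFn], incident: dict,
                        outputs: Dict[str, object],
                        candidates: Dict[str, List[object]],
                        accept: Callable[[object], bool],
                        max_replays: int = 2,
                        ) -> Tuple[Optional[str], Dict[str, object]]:
    """Try candidate patches stage by stage within a replay budget.

    Returns (decisive_stage, repaired_outputs), or (None, original outputs)
    if no candidate within the budget produces an accepted decision.
    """
    replays = 0
    for stage in STAGES:
        for cand in candidates.get(stage, []):
            if replays >= max_replays:
                return None, outputs
            replays += 1
            repaired = replay_from(stage_fns, incident, outputs, stage, cand)
            if accept(repaired["DR"]):
                return stage, repaired
    return None, outputs


# Toy downstream stages (hypothetical); EP is never recomputed in this example.
stage_fns: Dict[str, StageFn] = {
    "HS": lambda inc, out: sorted({l.split()[0] for l in out["EP"]}),
    "AS": lambda inc, out: {h: sum(l.startswith(h) for l in out["EP"])
                            for h in out["HS"]},
    "DR": lambda inc, out: max(out["AS"], key=out["AS"].get),
}

# A faulty trace: the hypothesis set missed the "db" service.
faulty = {
    "EP": ["cart ERROR timeout", "db ERROR pool", "db ERROR pool"],
    "HS": ["cart"],
    "AS": {"cart": 1},
    "DR": "cart",
}
candidates = {"HS": [["cart", "db"]]}   # hypothetical alternative hypothesis set
accept = lambda dr: dr == "db"          # toy oracle standing in for an auditor

stage, repaired = localize_and_repair(stage_fns, {}, faulty, candidates, accept)
```

In this toy run the patch to HS flips the decision, so HS is reported as the decisive stage and only AS and DR are recomputed; the evidence package is reused unchanged.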

Topics

STAR Framework
RCA Agents
Microservices AIOps
Stage-attributed Repair
LangGraph

Best for: Research Scientist, AI Architect, AI Scientist, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.