LATS-RCA: Language Agent Tree Search for Root Cause Analysis in Microservices
Summary
LATS-RCA is an LLM-based multi-agent framework for automated root cause analysis (RCA) in microservice-based systems (MSS). It formulates RCA as a reflection-guided, tree-structured search built on the Language Agent Tree Search (LATS) algorithm, in which multiple LLM-driven agents iteratively analyze execution logs and performance metrics. Reflection scores derived from intermediate diagnostic states guide the search toward the most likely root cause. Evaluated on the open-source Light-OAuth2 (LO2) system, LATS-RCA achieved 91.3% diagnostic accuracy at a cost of 53.1 API calls, 156K tokens, and 9.1 minutes per investigation. In a more complex production microservice environment (Prod), accuracy ranged from 60% to 70% at a cost of 75 API calls, 220K tokens, and 13 minutes per investigation, which highlights real-world challenges such as polyglot tech stacks and varied logging practices.
Key takeaway
For MLOps engineers managing complex microservice environments, LATS-RCA offers a promising way to automate root cause analysis. Consider adopting tree-structured search with LLM-driven reflection to improve diagnostic accuracy over linear reasoning, especially when the evidence is ambiguous. Expect higher computational costs, and plan for robust log and metric normalization pipelines in polyglot production systems.
Key insights
LATS-RCA uses LLM-driven multi-agent tree search with reflection to diagnose microservice anomalies.
Principles
- RCA benefits from multi-path exploration of hypotheses.
- Reflection scores guide diagnostic search toward likely causes.
- Cross-modal handoff prevents circular reasoning.
Method
LATS-RCA employs a log agent and a metric agent coordinated by a supervisor. Each agent runs a reflection-guided Monte Carlo Tree Search (MCTS) in four steps: it selects a node by its UCT (Upper Confidence bounds applied to Trees) score, expands candidate actions, scores them via reflection, and backpropagates the resulting rewards.
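The select, expand, score, and backpropagate loop can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` class, the `expand` and `reflect` callbacks, the exploration constant `c = 1.4`, and the toy hypotheses and scores are all assumptions made for illustration. In LATS-RCA the expansion and reflection steps would be LLM calls; here they are stubbed with lookup tables so the example runs on its own.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)
class Node:
    state: str                               # hypothetical diagnostic state
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0                       # sum of backpropagated rewards


def uct(node: Node, c: float = 1.4) -> float:
    """Standard UCT: mean reward plus an exploration bonus."""
    if node.visits == 0:
        return math.inf
    mean = node.value / node.visits
    return mean + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Add the reward to the node and every ancestor up to the root."""
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent


def search(root_state: str,
           expand: Callable[[str], List[str]],
           reflect: Callable[[str], float],
           iterations: int = 20) -> Node:
    root = Node(root_state)
    for _ in range(iterations):
        # Selection: descend by highest UCT score until reaching a leaf.
        leaf = root
        while leaf.children:
            leaf = max(leaf.children, key=uct)
        # Expansion: ask for candidate next steps (an LLM call in practice).
        candidates = expand(leaf.state)
        if not candidates:
            # Terminal state: re-score it so the search still makes progress.
            backpropagate(leaf, reflect(leaf.state))
            continue
        # Scoring and backpropagation for every new candidate.
        for state in candidates:
            child = Node(state, parent=leaf)
            leaf.children.append(child)
            backpropagate(child, reflect(state))
    return root


def best_path(root: Node) -> List[str]:
    """Follow the child with the highest mean reward at each level."""
    path, node = [root.state], root
    while node.children:
        node = max(node.children, key=lambda n: n.value / n.visits)
        path.append(node.state)
    return path


# Toy stand-ins for LLM expansion and reflection (illustrative values only).
TOY_ACTIONS = {
    "anomaly": ["check logs", "check metrics"],
    "check logs": ["db timeout", "auth error"],
    "check metrics": ["cpu spike"],
}
TOY_SCORES = {"check logs": 0.6, "check metrics": 0.3,
              "db timeout": 0.9, "auth error": 0.4, "cpu spike": 0.1}

root = search("anomaly",
              lambda s: TOY_ACTIONS.get(s, []),
              lambda s: TOY_SCORES.get(s, 0.0))
print(best_path(root))
```

Reading the result by mean reward rather than by visit count is a design choice of this sketch; either is common in MCTS variants, and the source does not say which one LATS-RCA uses.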
In practice
- Normalize heterogeneous logs and metrics for LLM input.
- Account for multi-factor root causes in production systems.
- Implement cross-modal information transfer between agents.
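As an illustration of the first point, the sketch below maps two differently shaped log lines onto one common record before they would be passed to an LLM. The two input formats, the field names (`ts`, `level`, `service`, `msg`), the level aliases, and the sample lines are all hypothetical; the source does not describe the log formats used in LO2 or Prod.

```python
import json
import re
from typing import Dict, Optional

# Assumed plain-text layout: "<timestamp> <LEVEL> [<service>] <message>"
PLAIN = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Za-z]+)\s+\[(?P<service>[^\]]+)\]\s+(?P<msg>.*)$"
)

# Assumed aliases that different services might use for the same severity.
LEVELS = {
    "debug": "DEBUG", "info": "INFO",
    "warn": "WARNING", "warning": "WARNING",
    "err": "ERROR", "error": "ERROR",
    "fatal": "CRITICAL", "critical": "CRITICAL",
}


def normalize(line: str) -> Optional[Dict[str, str]]:
    """Return a common record, or None if the line cannot be parsed."""
    line = line.strip()
    if not line:
        return None
    if line.startswith("{"):
        # JSON-style log line with possibly different key names.
        try:
            raw = json.loads(line)
        except json.JSONDecodeError:
            return None
        record = {
            "ts": str(raw.get("timestamp") or raw.get("time") or ""),
            "level": str(raw.get("level") or raw.get("severity") or ""),
            "service": str(raw.get("service") or ""),
            "msg": str(raw.get("message") or raw.get("msg") or ""),
        }
    else:
        match = PLAIN.match(line)
        if match is None:
            return None
        record = match.groupdict()
    # Map severity aliases onto one vocabulary; unknown levels are upper-cased.
    record["level"] = LEVELS.get(record["level"].lower(), record["level"].upper())
    return record


json_line = ('{"time": "2024-01-01T00:00:00Z", "severity": "warn", '
             '"service": "auth", "message": "token expired"}')
text_line = "2024-01-01T00:00:01Z ERR [gateway] upstream timeout"

for sample in (json_line, text_line):
    print(normalize(sample))
```

Returning `None` for unparseable lines keeps the decision about what to do with them (drop, count, or forward verbatim) in the caller, which is one reasonable option when logging practices vary across services.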
Topics
- Root Cause Analysis
- Microservices
- Large Language Models
- Multi-Agent Systems
- Language Agent Tree Search
- Observability
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.