LATS-RCA: Language Agent Tree Search for Root Cause Analysis in Microservices

2026-06-16 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

LATS-RCA is an LLM-based multi-agent framework for automated root cause analysis (RCA) in microservice-based systems (MSS). It formulates RCA as a reflection-guided, tree-structured search built on the Language Agent Tree Search algorithm, in which multiple LLM-driven agents iteratively analyze execution logs and performance metrics. Reflection scores derived from intermediate diagnostic states guide the search toward the most likely root cause. On the open-source Light-OAuth2 (LO2) system, LATS-RCA achieved 91.3% diagnostic accuracy at a cost of 53.1 API calls, 156K tokens, and 9.1 minutes per investigation. In a more complex production microservice environment (Prod), accuracy ranged from 60% to 70%, with costs rising to 75 API calls, 220K tokens, and 13 minutes, which highlights real-world challenges such as polyglot tech stacks and varied logging practices.

Key takeaway

For MLOps engineers managing complex microservice environments, LATS-RCA offers a promising way to automate root cause analysis. Consider adopting tree-structured search with LLM-driven reflection to improve diagnostic accuracy beyond linear reasoning, especially when the evidence is ambiguous. Expect higher computational costs, and plan for robust log and metric normalization pipelines in polyglot production systems.

Key insights

LATS-RCA uses LLM-driven multi-agent tree search with reflection to diagnose microservice anomalies.

Principles

RCA benefits from multi-path exploration of hypotheses.
Reflection scores guide diagnostic search toward likely causes.
Cross-modal handoff between the log and metric agents prevents circular reasoning.

Method

LATS-RCA employs log and metric agents coordinated by a supervisor. Each agent runs a reflection-guided Monte Carlo Tree Search (MCTS): it selects nodes by UCT score, expands candidate actions, scores them via reflection, and backpropagates the resulting rewards.
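
The select, expand, score, and backpropagate loop described above can be illustrated with a minimal sketch. This is not the paper's implementation: the Node class, the propose and reflect callables (stand-ins for LLM calls), the toy diagnostic states, the exploration constant, and the final pick by visit count are all assumptions made for illustration. The only standard piece is the UCT formula, mean reward plus c * sqrt(ln(parent visits) / visits).

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    state: str                          # hypothetical: text label of a diagnostic state
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0


def uct(node: Node, c: float = math.sqrt(2)) -> float:
    """Standard UCT: exploitation term plus exploration bonus."""
    if node.visits == 0:
        return math.inf                 # unvisited nodes are tried first
    mean = node.total_reward / node.visits
    return mean + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def select(root: Node) -> Node:
    """Walk down the tree, always taking the child with the highest UCT."""
    node = root
    while node.children:
        node = max(node.children, key=uct)
    return node


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Add the reward to the node and every ancestor up to the root."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


def search(root_state: str,
           propose: Callable[[str], List[str]],
           reflect: Callable[[str], float],
           iterations: int = 20) -> str:
    root = Node(state=root_state)
    for _ in range(iterations):
        leaf = select(root)
        # Expansion: one child per candidate action (assumed to come from an LLM).
        new_children = [Node(state=s, parent=leaf) for s in propose(leaf.state)]
        leaf.children.extend(new_children)
        # Scoring: the reflection score of each scored node is used as its reward.
        for node in new_children or [leaf]:
            backpropagate(node, reflect(node.state))
    # Assumption: report the state at the end of the most-visited path.
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
    return node.state


# Toy stand-ins for the LLM-backed proposer and reflection scorer (hypothetical).
TOY_ACTIONS = {
    "anomaly": ["check logs", "check metrics"],
    "check logs": ["token-service timeout"],
    "check metrics": ["cpu saturation"],
}
TOY_SCORES = {"check logs": 0.6, "check metrics": 0.3,
              "token-service timeout": 0.9, "cpu saturation": 0.2}


def toy_propose(state: str) -> List[str]:
    return TOY_ACTIONS.get(state, [])


def toy_reflect(state: str) -> float:
    return TOY_SCORES.get(state, 0.0)
```

Calling search("anomaly", toy_propose, toy_reflect) runs the loop on the toy tree. In a real system the two callables would wrap prompts to a model, and each call would add to the API-call and token costs reported in the summary.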

In practice

Normalize heterogeneous logs and metrics for LLM input.
Account for multi-factor root causes in production systems.
Implement cross-modal information transfer between agents.
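
As a rough illustration of the first point above, the sketch below maps two log layouts onto one common record before anything is handed to an LLM. Both layouts, the field names (time, svc, severity, msg), the LogRecord schema, and the level aliases are hypothetical; the source does not describe the normalization pipeline, so treat this as one possible shape rather than the authors' approach.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical common schema shared by all services."""
    timestamp: str
    service: str
    level: str
    message: str


# Hypothetical plain-text layout: "<timestamp> [<LEVEL>] <service>: <message>"
PLAIN_LINE = re.compile(
    r"^(?P<ts>\S+) \[(?P<level>\w+)\] (?P<service>[\w.-]+): (?P<msg>.*)$"
)

# Map service-specific level names onto one vocabulary (assumed aliases).
LEVEL_ALIASES = {"WARN": "WARNING", "ERR": "ERROR", "FATAL": "CRITICAL"}


def _build(ts: str, service: str, level: str, message: str) -> LogRecord:
    level = level.upper()
    return LogRecord(ts, service, LEVEL_ALIASES.get(level, level), message)


def normalize(line: str) -> Optional[LogRecord]:
    """Return a LogRecord, or None if the line matches neither layout."""
    line = line.strip()
    if line.startswith("{"):
        # Hypothetical JSON layout with keys time, svc, severity, msg.
        try:
            raw = json.loads(line)
        except json.JSONDecodeError:
            return None
        if not isinstance(raw, dict):
            return None
        return _build(str(raw.get("time", "")), str(raw.get("svc", "")),
                      str(raw.get("severity", "")), str(raw.get("msg", "")))
    match = PLAIN_LINE.match(line)
    if match is None:
        return None
    return _build(match["ts"], match["service"], match["level"], match["msg"])


def to_prompt_line(record: LogRecord) -> str:
    """Render a record as one compact line suitable for an LLM prompt."""
    return f"{record.timestamp} {record.level} {record.service} {record.message}"
```

Returning None for unparseable lines, instead of raising, lets a caller count and report how much of a service's output falls outside the known layouts, which is a useful signal in a polyglot system.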

Topics

Root Cause Analysis
Microservices
Large Language Models
Multi-Agent Systems
Language Agent Tree Search
Observability

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.