Agentic Diagnosis for LLM Training at Scale

2026-03-09 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, long

Summary

AMD's MaxText-Slurm, an open-source launch and observability system for MaxText LLM training on AMD Instinct GPU clusters, now includes "agentic diagnostic skills." These skills let AI agents such as Cursor or Claude Code execute structured runbooks autonomously, working from an observed symptom to its root cause. Unlike chatbots, these agents have tool access (shell, file system, Prometheus HTTP queries): they read logs, query metrics, interpret the results, and chain steps until they reach an actionable conclusion. The system relies on a unified Prometheus time-series database (TSDB) that collects GPU, host, network, and training metrics. Five case studies show the agent performing one-prompt performance profiling, diagnosing RDMA degradation, identifying heartbeat false positives, uncovering a subtle checkpoint-restore leak, and distinguishing system-driven from model-behavior-driven throughput declines, each within minutes.

Key takeaway

For AI/ML engineers managing large-scale LLM training on AMD Instinct clusters, integrating MaxText-Slurm's agentic diagnosis can significantly reduce debugging time. Set up the AI agent inside your training environment so it can use its autonomous log triage, performance profiling, and TSDB analysis capabilities. This approach quickly identifies root causes, whether system-level issues such as RDMA degradation or training-specific problems such as MoE load-balance shifts, freeing your team to focus on optimization rather than manual troubleshooting.

Key insights

AI agents with tool access can autonomously diagnose complex LLM training issues using structured diagnostic skills and unified telemetry.

Principles

Unified telemetry is critical for cross-domain hypothesis testing.
Structured runbooks enable systematic, repeatable diagnostics.
Agentic diagnosis can rule out system causes to pinpoint model behavior.
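
As a rough illustration of why a single TSDB helps with cross-domain hypothesis testing, the sketch below builds two range queries over the same time window against the Prometheus HTTP API (`/api/v1/query_range`). The server address, the metric names, and the timestamps are placeholders chosen for illustration and are not taken from MaxText-Slurm.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url, promql, start, end, step="15s"):
    """Return a Prometheus /api/v1/query_range URL for one PromQL expression."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Assumed values: a local Prometheus server and example Unix timestamps.
BASE_URL = "http://localhost:9090"
WINDOW = {"start": 1767225600, "end": 1767229200}

# Hypothetical metric names standing in for a training signal and a network signal.
METRICS = ("training_tokens_per_second", "rdma_tx_bytes_total")

# One URL per signal, all sharing the same window so the series can be compared.
urls = {name: build_range_query_url(BASE_URL, name, **WINDOW) for name in METRICS}

for name, url in urls.items():
    print(name, url)
```

Because both queries share the same window and step, a drop in the training signal can be lined up against the network signal sample by sample.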

Method

The diagnostic pipeline uses `job-log-triage` to classify failures, `tsdb-diagnosis` for system-level root causes, and `performance-analysis` for compute-level profiling, with automatic handoff between skills.
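
The handoff described above can be pictured as a small routing step. This is only a sketch: the three skill names come from the summary, while the failure categories and the routing table are hypothetical and do not reflect how MaxText-Slurm actually implements the handoff.

```python
# Hypothetical mapping from a triage category to the follow-up skill.
HANDOFF = {
    "system": "tsdb-diagnosis",
    "performance": "performance-analysis",
}


def plan_skills(triage_category):
    """Return the ordered skills to run: triage first, then any follow-up."""
    plan = ["job-log-triage"]
    follow_up = HANDOFF.get(triage_category)
    if follow_up is not None:
        plan.append(follow_up)
    return plan


print(plan_skills("system"))
print(plan_skills("performance"))
print(plan_skills("user-config"))  # no follow-up skill in this sketch
```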

In practice

Run AI agents inside containers on cluster login nodes for secure access.
Bind-mount the repo directory into the container so the agent can read job output.
Use `Ctrl+P, Ctrl+Q` to detach from containers without stopping them.
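
A minimal sketch of the container setup, assuming Docker is the container runtime (the summary does not name one). The image name, host path, mount point, and container name are all placeholders. The helper only assembles the command line; it does not run it.

```python
import shlex


def container_command(image, repo_dir, mount_point="/workspace", name="diag-agent"):
    """Build a `docker run` command that bind-mounts the repo into the container."""
    return [
        "docker", "run",
        "-it",                              # interactive TTY, needed for key-sequence detach
        "--name", name,                     # placeholder container name
        "-v", f"{repo_dir}:{mount_point}",  # bind-mount: host repo -> container path
        "-w", mount_point,                  # start in the mounted repo
        image,
        "bash",
    ]


# Placeholder image and path; substitute your own.
cmd = container_command("my-agent-image:latest", "/home/user/maxtext-slurm")
print(shlex.join(cmd))
```

With a container started this way, `Ctrl+P, Ctrl+Q` is Docker's default detach sequence, and `docker attach diag-agent` reconnects to it later.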

Topics

Agentic Diagnosis
LLM Training
Observability Stack
Prometheus TSDB
AMD Instinct GPUs

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.