Agentic Diagnosis for LLM Training at Scale
Summary
AMD's MaxText-Slurm, an open-source launch and observability system for MaxText LLM training on AMD Instinct GPU clusters, now includes "agentic diagnostic skills." These skills let AI agents such as Cursor or Claude Code execute structured runbooks autonomously, working from symptom to root cause. Unlike chatbots, these agents have tool access (shell, file system, Prometheus HTTP queries): they read logs, query metrics, interpret the results, and chain steps until they reach an actionable conclusion. The system relies on a unified Prometheus time-series database (TSDB) that collects GPU, host, network, and training metrics. Five case studies show the agent performing one-prompt performance profiling, diagnosing RDMA degradation, identifying heartbeat false positives, uncovering subtle checkpoint-restore leaks, and distinguishing system-driven from model-behavior-driven throughput declines, all within minutes.
Key takeaway
For AI/ML engineers managing large-scale LLM training on AMD Instinct clusters, integrating MaxText-Slurm's agentic diagnosis can significantly reduce debugging time. Set up the AI agent inside your training environment so it can use the autonomous log triage, performance profiling, and TSDB analysis skills. The agent can then quickly identify root causes, whether system-level issues such as RDMA degradation or training-specific problems such as MoE load-balance shifts, freeing your team to focus on optimization rather than manual troubleshooting.
Key insights
AI agents with tool access can autonomously diagnose complex LLM training issues using structured diagnostic skills and unified telemetry.
Principles
- Unified telemetry is critical for cross-domain hypothesis testing.
- Structured runbooks enable systematic, repeatable diagnostics.
- Agentic diagnosis can rule out system causes to pinpoint model behavior.
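The first principle above can be illustrated with a small sketch of how an agent might pull two metrics from the same Prometheus TSDB over the same time window. It uses the standard Prometheus HTTP API endpoint `/api/v1/query_range`; the base URL and the metric names in the usage note are assumptions for illustration and are not taken from MaxText-Slurm.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def range_query_url(base_url, promql, start, end, step="15s"):
    """Build a Prometheus range-query URL (start/end are Unix timestamps)."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def parse_matrix(payload):
    """Turn a range-query response into {label_json: [(timestamp, value), ...]}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    series = {}
    for item in payload["data"]["result"]:
        # Serialize the label set so each series gets a stable dictionary key.
        label = json.dumps(item["metric"], sort_keys=True)
        # Prometheus returns sample values as strings; convert them to floats.
        series[label] = [(float(ts), float(v)) for ts, v in item["values"]]
    return series


def fetch_series(base_url, promql, start, end, step="15s"):
    """Run one range query against a live Prometheus server."""
    with urlopen(range_query_url(base_url, promql, start, end, step), timeout=10) as resp:
        return parse_matrix(json.load(resp))
```

With a reachable server, calling `fetch_series` twice with the same `start` and `end`, once for a network metric and once for a training metric (for example, hypothetical names such as `rdma_tx_bytes` and `training_step_seconds`), yields time-aligned series that can be compared directly. That alignment is what makes cross-domain hypothesis testing practical.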
Method
The diagnostic pipeline uses `job-log-triage` to classify failures, `tsdb-diagnosis` for system-level root causes, and `performance-analysis` for compute-level profiling, with automatic handoff between skills.
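The handoff between skills can be pictured as a small routing loop. The sketch below is only a conceptual model: the skill names come from the article, but the `Finding` type, the routing rule, and the `ctx` dictionary are hypothetical and do not reflect how MaxText-Slurm actually implements its skills.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Finding:
    skill: str                 # skill that produced this finding
    verdict: str               # short human-readable conclusion
    next_skill: Optional[str]  # skill to hand off to, or None to stop


def job_log_triage(ctx: dict) -> Finding:
    # Hypothetical rule: crashes and hangs go to system-level diagnosis,
    # everything else goes to compute-level profiling.
    if ctx.get("failure") == "hang_or_crash":
        return Finding("job-log-triage", "failure looks system-level", "tsdb-diagnosis")
    return Finding("job-log-triage", "job is alive but slow", "performance-analysis")


def tsdb_diagnosis(ctx: dict) -> Finding:
    return Finding("tsdb-diagnosis", "system metrics inspected", None)


def performance_analysis(ctx: dict) -> Finding:
    return Finding("performance-analysis", "compute profile inspected", None)


SKILLS: Dict[str, Callable[[dict], Finding]] = {
    "job-log-triage": job_log_triage,
    "tsdb-diagnosis": tsdb_diagnosis,
    "performance-analysis": performance_analysis,
}


def run_pipeline(ctx: dict, start: str = "job-log-triage", max_steps: int = 5) -> List[Finding]:
    """Follow handoffs from skill to skill until one reports no next skill."""
    findings: List[Finding] = []
    name: Optional[str] = start
    # max_steps guards against an accidental handoff cycle.
    while name is not None and len(findings) < max_steps:
        finding = SKILLS[name](ctx)
        findings.append(finding)
        name = finding.next_skill
    return findings
```

Calling `run_pipeline({"failure": "hang_or_crash"})` returns a trail of findings that starts with triage and ends with the TSDB skill, which mirrors the classify-then-diagnose flow described above.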
In practice
- Run AI agents inside containers on cluster login nodes for secure access.
- Bind-mount repo directories for job output accessibility.
- Use `Ctrl+P, Ctrl+Q` to detach from containers without stopping them.
Topics
- Agentic Diagnosis
- LLM Training
- Observability Stack
- Prometheus TSDB
- AMD Instinct GPUs
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.