Systematic debugging for AI agents: Introducing the AgentRx framework

2026-03-12 · Source: Microsoft Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, medium

Summary

Microsoft Research has open-sourced AgentRx, an automated, domain-agnostic framework designed to diagnose AI agent failures by pinpointing the "critical failure step" in complex agent trajectories. Debugging AI agents is challenging due to their long-horizon, probabilistic, and multi-agent nature, which often obscures root causes. AgentRx addresses this by synthesizing guarded, executable constraints from tool schemas and domain policies, then evaluating them step-by-step to log evidence-backed violations. The framework improves failure localization by +23.6% and root-cause attribution by +22.9% over prompting baselines. Alongside AgentRx, a benchmark dataset of 115 manually annotated failed trajectories across τ-bench, Flash, and Magentic-One, and a nine-category failure taxonomy, have also been released to foster more transparent and resilient agentic systems.

Key takeaway

For AI Architects and Research Scientists building autonomous AI systems, AgentRx offers a critical tool for improving agent reliability and transparency. Your teams should integrate AgentRx into development workflows to systematically diagnose failures, moving beyond trial-and-error prompting. This enables more robust agentic engineering and helps ensure agents are auditable and dependable for real-world deployment.

Key insights

AgentRx automates AI agent failure diagnosis by identifying the first unrecoverable error using structured constraint validation.

Principles

Validate agent execution like a system trace.
Identify the first unrecoverable error for root cause.
Use grounded taxonomies for failure categorization.

Method

AgentRx normalizes logs, synthesizes executable constraints from tool schemas and policies, evaluates them step-by-step to create an auditable log, and uses an LLM judge with a taxonomy to identify the critical failure step.

In practice

Use AgentRx to diagnose agentic workflows.
Contribute to the library of failure constraints.
Apply the nine-category failure taxonomy.

Topics

AI Agent Debugging
Failure Localization
Root Cause Analysis
Agent Benchmarking
Multi-agent Systems

Best for: AI Architect, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Research.