Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs
Summary
Phoenix is a multi-agent LLM system for safe, end-to-end GitHub issue resolution, covering every step from triage to pull-request creation. It combines seven layered safety controls with a baseline-aware test evaluation strategy. Six specialized agents (Planner, Reproducer, Coder, Tester, Failure Analyst, and PR Agent) are orchestrated by a label-based GitHub webhook state machine. On a 24-instance slice of SWE-bench Lite, Phoenix achieved a 75% oracle resolution rate with no pass-to-pass regressions on successful runs, averaging 170 seconds. A complementary pilot on 42 real issues across 14 repositories showed 100% correctness preservation, with a mean resolution time of 122 seconds for hard-tier issues. However, manual inspection revealed that approximately half of the generated pull requests placed code at incorrect paths, a limitation attributed to the Planner's localization.
Key takeaway
MLOps engineers deploying LLM agents for automated code modification should prioritize correctness preservation and robust safety mechanisms over raw resolution rates. Implement layered safety controls, such as content sanitization and token refresh, derived from observed deployment failures. Evaluate with baseline-aware testing so that changes are judged accurately in repositories with pre-existing CI failures and newly introduced regressions are caught.
Key insights
Phoenix prioritizes correctness-first GitHub issue resolution using a multi-agent LLM system with layered safety controls.
Principles
- Prioritize correctness preservation in autonomous code modification.
- Decompose complex tasks into specialized, narrowly scoped agents.
- Derive safety mechanisms from observed deployment failure modes.
Method
A six-agent pipeline (Planner, Reproducer, Coder, Tester, Failure Analyst, PR Agent) is orchestrated by a label-based GitHub state machine, employing baseline-aware test evaluation.
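As a rough illustration of label-based orchestration, the sketch below maps the label currently on an issue to the label that should trigger the next agent. The label names, the ordering of the agents, and the rule that a failed test run hands off to the Failure Analyst and then back to the Coder are all assumptions made for this example; the summary names the six agents but does not spell out their transitions.

```python
from typing import Optional

# Hypothetical label names and transitions; the source does not list the real ones.
NEXT_ON_SUCCESS = {
    "phoenix:plan": "phoenix:reproduce",        # Planner -> Reproducer
    "phoenix:reproduce": "phoenix:code",        # Reproducer -> Coder
    "phoenix:code": "phoenix:test",             # Coder -> Tester
    "phoenix:test": "phoenix:pr",               # Tester -> PR Agent
    "phoenix:analyze-failure": "phoenix:code",  # Failure Analyst -> Coder (retry)
    "phoenix:pr": None,                         # PR Agent is the terminal stage
}

# Only a failed test run has an automated follow-up in this sketch.
NEXT_ON_FAILURE = {"phoenix:test": "phoenix:analyze-failure"}


def next_label(current: str, succeeded: bool) -> Optional[str]:
    """Return the label to apply next, or None to stop the pipeline."""
    if current not in NEXT_ON_SUCCESS:
        raise ValueError(f"unknown stage label: {current}")
    if succeeded:
        return NEXT_ON_SUCCESS[current]
    # None here means no automated recovery: leave the issue for a human.
    return NEXT_ON_FAILURE.get(current)
```

A webhook handler would call next_label when an agent finishes, swap the labels on the issue, and let the resulting label event trigger the next agent. Keeping the state in labels makes the pipeline's progress visible on the issue itself.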
In practice
- Implement baseline-aware testing for repositories with pre-existing CI failures.
- Sanitize large issue bodies so that the web application firewall (WAF) in front of the LLM API gateway does not block them.
- Proactively refresh GitHub App tokens for long-running operations.
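Two of the practices above can be sketched in a few lines. The first pair of functions compares test results recorded before and after a patch, so that tests already failing on the base commit are not blamed on the change. The last function decides when a token should be renewed ahead of its expiry. All function names, the dictionary format for test results, and the five-minute margin are assumptions for illustration, not Phoenix's actual implementation.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def new_regressions(baseline: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Tests that passed on the base commit but fail or are missing after the patch."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not after.get(name, False)
    )


def newly_fixed(baseline: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Tests that pass after the patch but did not pass (or exist) before it."""
    return sorted(
        name for name, passed in after.items()
        if passed and not baseline.get(name, False)
    )


def needs_refresh(
    expires_at: datetime,
    now: datetime,
    margin: timedelta = timedelta(minutes=5),  # assumed safety margin
) -> bool:
    """True once the token is within `margin` of its expiry time."""
    return now >= expires_at - margin


# Example: test_b was already failing before the patch, so only test_c counts.
baseline = {"test_a": True, "test_b": False, "test_c": True}
after = {"test_a": True, "test_b": False, "test_c": False}
print(new_regressions(baseline, after))  # ['test_c']

expiry = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(needs_refresh(expiry, expiry - timedelta(minutes=3)))  # True
```

A patch would be accepted under this scheme only when new_regressions returns an empty list, regardless of how many tests were already red on the base commit.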
Topics
- Multi-Agent LLMs
- GitHub Automation
- Automated Program Repair
- AI Safety
- SWE-bench
- MLOps Deployment
Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.