Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs

2026-06-19 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

Phoenix is a multi-agent LLM system for safe, end-to-end GitHub issue resolution, covering every step from triage to pull-request creation. It combines seven layered safety controls with a baseline-aware test evaluation strategy. Six specialized agents (Planner, Reproducer, Coder, Tester, Failure Analyst, and PR Agent) are orchestrated by a label-based GitHub webhook state machine. On a 24-instance slice of SWE-bench Lite, Phoenix achieved a 75% oracle resolution rate with no pass-to-pass regressions on successful runs, averaging 170 seconds. A complementary pilot on 42 real issues across 14 repositories showed 100% correctness preservation, with a mean resolution time of 122 seconds for hard-tier issues. However, manual inspection revealed that approximately half of the generated pull requests placed code at incorrect paths, a limitation attributed to the Planner's localization step.

Key takeaway

If you are an MLOps engineer deploying LLM agents for automated code modification, prioritize correctness preservation and robust safety mechanisms over raw resolution rate. Implement layered safety controls, such as content sanitization and token refresh, and derive them from observed deployment failures. Evaluate with baseline-aware testing, so that changes are judged accurately in repositories with pre-existing CI failures and you can confirm that no new regressions are introduced.
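
To make the idea of baseline-aware testing concrete, here is a minimal sketch in Python. It is not Phoenix's implementation: the names `BaselineVerdict` and `evaluate_against_baseline` are hypothetical, and it assumes test results are available as simple `{test_id: passed}` maps from one run before the patch and one run after it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineVerdict:
    """Hypothetical result type; not taken from the Phoenix paper."""
    regressions: frozenset   # passed before the patch, fail after it
    fixed: frozenset         # failed before the patch, pass after it
    preexisting: frozenset   # failed before the patch and still fail

    @property
    def safe(self) -> bool:
        # Only newly broken tests block the change; old failures do not.
        return not self.regressions


def evaluate_against_baseline(baseline: dict, patched: dict) -> BaselineVerdict:
    """Compare two {test_id: passed} maps from runs before and after a patch."""
    regressions, fixed, preexisting = set(), set(), set()
    for test_id, passed_before in baseline.items():
        # Assumption: a test that disappears after the patch counts as failing.
        passed_after = patched.get(test_id, False)
        if passed_before and not passed_after:
            regressions.add(test_id)
        elif not passed_before and passed_after:
            fixed.add(test_id)
        elif not passed_before and not passed_after:
            preexisting.add(test_id)
    return BaselineVerdict(
        frozenset(regressions), frozenset(fixed), frozenset(preexisting)
    )
```

Under these assumptions, a repository whose CI was already red can still accept a patch, as long as the set of regressions is empty.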

Key insights

Phoenix prioritizes correctness-first GitHub issue resolution using a multi-agent LLM system with layered safety controls.

Principles

Prioritize correctness preservation in autonomous code modification.
Decompose complex tasks into specialized, narrowly scoped agents.
Derive safety mechanisms from observed deployment failure modes.

Method

A six-agent pipeline (Planner, Reproducer, Coder, Tester, Failure Analyst, PR Agent) is orchestrated by a label-based GitHub state machine, employing baseline-aware test evaluation.
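
A label-based state machine of this kind could be sketched as follows. This is an illustration under stated assumptions, not the paper's design: the `phoenix:*` label names, the transition table, the retry budget, and the `phoenix:needs-human` halt label are all hypothetical. The only external detail relied on is that GitHub's `issues` webhook payload for a `labeled` action carries a top-level `label` object with a `name` field.

```python
from typing import Optional

# Hypothetical label names; the source does not list the real ones.
LABEL_TO_AGENT = {
    "phoenix:plan": "Planner",
    "phoenix:reproduce": "Reproducer",
    "phoenix:code": "Coder",
    "phoenix:test": "Tester",
    "phoenix:analyze": "Failure Analyst",
    "phoenix:pr": "PR Agent",
}

# Assumed transitions: (current label, agent outcome) -> next label.
TRANSITIONS = {
    ("phoenix:plan", "ok"): "phoenix:reproduce",
    ("phoenix:reproduce", "ok"): "phoenix:code",
    ("phoenix:code", "ok"): "phoenix:test",
    ("phoenix:test", "ok"): "phoenix:pr",
    ("phoenix:test", "fail"): "phoenix:analyze",
    ("phoenix:analyze", "ok"): "phoenix:code",
}
HALT_LABEL = "phoenix:needs-human"  # hypothetical escape hatch


def agent_for_event(payload: dict) -> Optional[str]:
    """Map a GitHub `issues` webhook payload to the agent that should run."""
    if payload.get("action") != "labeled":
        return None
    name = payload.get("label", {}).get("name")
    return LABEL_TO_AGENT.get(name)


def next_label(current: str, outcome: str, retries: int = 0,
               max_retries: int = 2) -> str:
    """Return the label to apply once the current agent has finished."""
    # Assumed retry budget: stop looping back to the Coder after max_retries.
    if current == "phoenix:analyze" and retries >= max_retries:
        return HALT_LABEL
    # Any unknown state or outcome falls through to a human, never forward.
    return TRANSITIONS.get((current, outcome), HALT_LABEL)
```

Keeping the state in issue labels means each webhook delivery can be handled statelessly: the handler reads the label, runs one agent, and writes the next label.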

In practice

Implement baseline-aware testing for repositories with pre-existing CI failures.
Sanitize large issue bodies to avoid LLM API gateway WAF filtering.
Proactively refresh GitHub App tokens for long-running operations.
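
The sanitization and token-refresh items above can be sketched as follows. Both helpers are hypothetical and rest on assumptions: the source does not say what triggers the WAF, so the cleaning rules here (dropping control characters and capping length) are only a plausible starting point, and the token fetcher is injected by the caller rather than calling any real API. The five-minute margin is chosen with GitHub App installation tokens in mind, which expire after one hour.

```python
import re
import time
from typing import Callable, Optional, Tuple

# Control characters other than tab, newline, and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_MARKER = "\n[... truncated ...]"


def sanitize_issue_body(body: str, max_chars: int = 8000) -> str:
    """Strip control characters and cap length before sending to an LLM API."""
    cleaned = _CONTROL_CHARS.sub("", body)
    if len(cleaned) <= max_chars:
        return cleaned
    # Keep the head of the issue and mark the cut explicitly.
    return cleaned[: max(0, max_chars - len(_MARKER))] + _MARKER


class InstallationTokenCache:
    """Refresh a token shortly before it expires rather than after a failure."""

    def __init__(self, fetch: Callable[[], Tuple[str, float]],
                 margin_s: float = 300.0,
                 clock: Callable[[], float] = time.time):
        # `fetch` is supplied by the caller and returns (token, expires_at_epoch).
        self._fetch = fetch
        self._margin_s = margin_s
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Refresh when no token is cached or expiry is within the margin.
        if self._token is None or self._clock() >= self._expires_at - self._margin_s:
            self._token, self._expires_at = self._fetch()
        return self._token
```

Calling `cache.get()` immediately before each GitHub request, instead of once at startup, is what keeps a long-running job from failing midway on an expired token.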

Topics

Multi-Agent LLMs
GitHub Automation
Automated Program Repair
AI Safety
SWE-bench
MLOps Deployment

Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.