AgentForge: Execution-Grounded Multi-Agent LLM Framework for Autonomous Software Engineering
Summary
AgentForge is a multi-agent framework for autonomous software engineering that enforces execution-grounded verification for every code change. It addresses the limitations of large language models (LLMs) on real-world software tasks by coordinating five agents (Planner, Coder, Tester, Debugger, and Critic) through shared memory and a mandatory Docker sandbox. The framework formalizes LLM-based software engineering as an iterative decision process over repository states and treats execution feedback as a stronger supervision signal than next-token likelihood. AgentForge achieved a 40.0% resolution rate on SWE-bench Lite, outperforming single-agent baselines by 26-28 points. Ablation studies confirmed that execution feedback and role decomposition each contribute independently to this performance. The system is open-source. It uses GPT-4o, caps the debug loop at three attempts, and retrieves five past tasks and five repository files as context.
Key takeaway
For research scientists developing autonomous software engineering agents, AgentForge demonstrates that mandatory sandboxed execution, combined with a multi-agent architecture of specialized roles, substantially improves issue resolution rates. Prioritize execution feedback loops and structured agent decomposition in your designs, as these factors proved more critical than raw model scale for reliable performance on benchmarks like SWE-bench Lite. Consider open-sourcing your frameworks to foster further research in execution-grounded agents.
Key insights
Execution-grounded verification and specialized multi-agent decomposition significantly enhance LLM performance in autonomous software engineering.
Principles
- Execution feedback is a stronger supervision signal for functional correctness than next-token likelihood.
- Decomposing tasks into specialized agents reduces error accumulation.
- Dual retrieval (episodic memory + repository index) improves grounding.
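
The dual-retrieval principle can be illustrated with a minimal sketch. The summary only states that five past tasks and five repository files are retrieved; the scoring method below (Jaccard overlap on whitespace tokens) and every name in the code (`Document`, `top_k`, `build_context`) are assumptions made for illustration, not AgentForge's actual retrieval method or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A retrievable item: a past task record or a repository file (hypothetical type)."""
    name: str
    text: str


def tokenize(text: str) -> set[str]:
    # Naive whitespace tokenizer; a real system would likely use embeddings.
    return set(text.lower().split())


def score(query: str, doc: Document) -> float:
    # Jaccard similarity between query tokens and document tokens (assumed metric).
    q, d = tokenize(query), tokenize(doc.text)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


def top_k(query: str, docs: list[Document], k: int) -> list[Document]:
    # Rank documents by similarity to the query and keep the best k.
    return sorted(docs, key=lambda doc: score(query, doc), reverse=True)[:k]


def build_context(issue: str, past_tasks: list[Document], repo_files: list[Document],
                  k_tasks: int = 5, k_files: int = 5) -> dict[str, list[Document]]:
    # Two independent retrievals: episodic memory and the repository index.
    # The defaults of 5 and 5 follow the counts given in the summary.
    return {
        "episodic": top_k(issue, past_tasks, k_tasks),
        "repository": top_k(issue, repo_files, k_files),
    }
```

Keeping the two retrievals separate means each source can use its own index and ranking without affecting the other.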
Method
AgentForge employs a five-agent pipeline (Planner, Coder, Tester, Debugger, Critic) with shared memory and mandatory Docker-sandboxed execution of every code patch. It formalizes the process as a Markov decision process (MDP) over repository states.
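
The control flow implied by this description can be sketched as follows. The agent ordering, the Critic acting as a final approval gate, and all function and type names are assumptions for illustration; only the five roles, the mandatory execution of every patch, and the cap of three debug attempts come from the summary. Agents are passed in as plain callables so the sketch runs without an LLM.

```python
from dataclasses import dataclass
from typing import Callable

MAX_DEBUG_ATTEMPTS = 3  # debug-loop cap stated in the summary


@dataclass
class ExecutionResult:
    passed: bool  # did the tests pass in the sandbox?
    log: str      # captured output, fed back to the Debugger


@dataclass
class Outcome:
    resolved: bool
    patch: str
    debug_attempts: int


def run_pipeline(
    task: str,
    plan: Callable[[str], list[str]],               # Planner
    code: Callable[[str, list[str]], str],          # Coder
    write_tests: Callable[[str, str], str],         # Tester
    debug: Callable[[str, str], str],               # Debugger
    critique: Callable[[str, str], bool],           # Critic (assumed: final approval)
    execute: Callable[[str, str], ExecutionResult], # sandboxed execution
) -> Outcome:
    steps = plan(task)
    patch = code(task, steps)
    tests = write_tests(task, patch)

    # Every patch is executed; nothing is accepted on model output alone.
    result = execute(patch, tests)
    attempts = 0
    while not result.passed and attempts < MAX_DEBUG_ATTEMPTS:
        patch = debug(patch, result.log)  # revise using execution feedback
        result = execute(patch, tests)
        attempts += 1

    resolved = result.passed and critique(task, patch)
    return Outcome(resolved=resolved, patch=patch, debug_attempts=attempts)
```

In this sketch a task counts as resolved only when the last sandboxed run passes and the Critic approves, so a patch that still fails after three debug attempts is reported as unresolved.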
In practice
- Use Docker sandboxes for verified code execution.
- Implement distinct agents for planning, coding, testing, and debugging.
- Integrate episodic memory and live repository indexing for context.
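
As one way to act on the first point, the sketch below runs a repository's tests inside a throwaway Docker container. It is not AgentForge's sandbox implementation: the function names, resource limits, mount path, and default test command are all assumptions, and the image must be supplied by the caller and already contain the project's dependencies. Only standard `docker run` flags are used.

```python
import subprocess
from pathlib import Path


def build_sandbox_command(repo_dir: str, image: str,
                          test_command: tuple[str, ...] = ("python", "-m", "pytest", "-q")) -> list[str]:
    """Build a `docker run` invocation for an isolated test run (illustrative defaults)."""
    repo = Path(repo_dir).resolve()
    return [
        "docker", "run",
        "--rm",                    # discard the container afterwards
        "--network", "none",       # no network access for untrusted patches
        "--memory", "2g",          # assumed resource limits
        "--cpus", "2",
        "-v", f"{repo}:/workspace",  # mount the patched repository
        "-w", "/workspace",
        image,
        *test_command,
    ]


def run_in_sandbox(repo_dir: str, image: str, timeout_s: int = 600) -> tuple[bool, str]:
    """Run the tests in the container; return (passed, combined output)."""
    completed = subprocess.run(
        build_sandbox_command(repo_dir, image),
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired on a hung run
    )
    return completed.returncode == 0, completed.stdout + completed.stderr
```

The returned pass/fail flag and log are the kind of execution feedback a debugging agent could consume on its next attempt.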
Topics
- Multi-Agent LLM Frameworks
- Autonomous Software Engineering
- Execution-Grounded Verification
- Docker Sandboxing
- SWE-bench Lite
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.