AgentForge: Execution-Grounded Multi-Agent LLM Framework for Autonomous Software Engineering

2024-08-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

AgentForge is a multi-agent framework for autonomous software engineering that enforces execution-grounded verification for every code change. It addresses the limitations of large language models (LLMs) on real-world software tasks by combining five agents (Planner, Coder, Tester, Debugger, and Critic) that coordinate through shared memory and a mandatory Docker sandbox. The framework formalizes LLM-based software engineering as an iterative decision process over repository states and treats execution feedback as a stronger supervision signal than next-token likelihood. AgentForge achieved a 40.0% resolution rate on SWE-bench Lite, outperforming single-agent baselines by 26 to 28 percentage points, which places those baselines at roughly 12 to 14%. Ablation studies confirmed that execution feedback and role decomposition each contribute independently to performance. The system is open-source and uses GPT-4o; its debug loop is capped at three attempts, and it retrieves five past tasks and five repository files as context.

Key takeaway

For research scientists developing autonomous software engineering agents, AgentForge demonstrates that mandating sandboxed execution and using a multi-agent architecture with specialized roles improves resolution rates substantially: 40.0% on SWE-bench Lite, 26 to 28 percentage points above single-agent baselines. Prioritize robust execution feedback loops and structured agent decomposition in your designs, as these factors proved more critical than raw model scale for reliable performance on benchmarks like SWE-bench Lite. Consider open-sourcing your frameworks to foster further research in execution-grounded agents.

Key insights

Execution-grounded verification and specialized multi-agent decomposition significantly enhance LLM performance in autonomous software engineering.

Principles

Execution feedback is a stronger supervision signal for functional correctness than next-token likelihood.
Decomposing tasks into specialized agents reduces error accumulation.
Dual retrieval (episodic memory + repository index) improves grounding.
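The dual-retrieval principle can be illustrated with a minimal sketch. Only the two context sources and the limit of five items each come from the summary; the function names `top_k` and `build_context`, the dictionary-based stores, and the token-overlap scoring are assumptions made for illustration and may differ from how AgentForge actually ranks candidates.

```python
from typing import Dict, List

TOP_K = 5  # the summary reports five past tasks and five repository files


def _tokens(text: str) -> set:
    """Lowercased whitespace tokens; a stand-in for a real embedding or index."""
    return set(text.lower().split())


def top_k(query: str, corpus: Dict[str, str], k: int = TOP_K) -> List[str]:
    """Rank corpus keys by token overlap with the query (illustrative scoring only)."""
    q = _tokens(query)
    ranked = sorted(
        corpus.items(),
        key=lambda item: (-len(q & _tokens(item[1])), item[0]),
    )
    # Keep at most k entries and drop those with no overlap at all.
    return [name for name, text in ranked[:k] if q & _tokens(text)]


def build_context(
    issue: str,
    past_tasks: Dict[str, str],   # hypothetical episodic memory: task id -> description
    repo_files: Dict[str, str],   # hypothetical repository index: path -> file text
) -> Dict[str, List[str]]:
    """Combine both retrieval sources into one context for the agents."""
    return {
        "episodic": top_k(issue, past_tasks),
        "repository": top_k(issue, repo_files),
    }
```

Keeping the two sources separate in the returned context lets a prompt distinguish between how similar tasks were solved before and what the current code looks like.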

Method

AgentForge employs a five-agent pipeline (Planner, Coder, Tester, Debugger, Critic) with shared memory and mandatory Docker-sandboxed execution of every code patch, and formalizes the process as a Markov decision process (MDP) over repository states.
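One way to picture the control flow of such a pipeline is sketched below. The role names and the three-attempt debug cap come from the summary; the ordering of steps, the `SharedMemory` and `ExecutionResult` classes, and the callable signatures are hypothetical and are not taken from the AgentForge code.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_DEBUG_ATTEMPTS = 3  # the summary reports a debug loop capped at three attempts


@dataclass
class ExecutionResult:
    passed: bool
    log: str


@dataclass
class SharedMemory:
    """Hypothetical shared scratchpad that every agent appends to."""
    events: List[str] = field(default_factory=list)

    def record(self, role: str, note: str) -> None:
        self.events.append(f"{role}: {note}")


def run_pipeline(
    issue: str,
    plan: Callable[[str], str],
    code: Callable[[str], str],
    write_tests: Callable[[str], str],
    execute: Callable[[str, str], ExecutionResult],  # stands in for the sandbox
    debug: Callable[[str, str], str],
    critique: Callable[[str, ExecutionResult], bool],
    memory: SharedMemory,
) -> bool:
    """Return True only if a patch passes execution and the Critic accepts it."""
    steps = plan(issue)
    memory.record("planner", steps)
    patch = code(steps)
    memory.record("coder", patch)
    tests = write_tests(steps)
    memory.record("tester", tests)

    # Every candidate patch is executed; failures feed the Debugger.
    result = execute(patch, tests)
    attempts = 0
    while not result.passed and attempts < MAX_DEBUG_ATTEMPTS:
        attempts += 1
        patch = debug(patch, result.log)
        memory.record("debugger", f"attempt {attempts}")
        result = execute(patch, tests)

    accepted = result.passed and critique(patch, result)
    memory.record("critic", "accepted" if accepted else "rejected")
    return accepted
```

Passing the agents in as callables keeps the sketch independent of any model API, so each role can be swapped for a stub when testing the loop itself.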

In practice

Use Docker sandboxes for verified code execution.
Implement distinct agents for planning, coding, testing, and debugging.
Integrate episodic memory and live repository indexing for context.
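As a starting point for the first recommendation, the sketch below builds a `docker run` invocation that executes a test suite in a throwaway container. The image name `my-project-tests:latest`, the resource limits, the `/workspace` mount point, and the pytest command are all assumptions for illustration; the summary does not describe how AgentForge configures its sandbox.

```python
import os
import subprocess
from typing import List


def docker_test_command(
    repo_dir: str,
    image: str = "my-project-tests:latest",  # hypothetical image with test deps installed
    test_cmd: str = "python -m pytest -q",   # assumed test entry point
) -> List[str]:
    """Build a docker run command that executes the tests in an isolated container."""
    host_path = os.path.abspath(repo_dir)  # bind mounts need an absolute host path
    return [
        "docker", "run",
        "--rm",                    # remove the container when it exits
        "--network", "none",       # no network access from inside the sandbox
        "--memory", "2g",          # illustrative resource limits
        "--cpus", "2",
        "-v", f"{host_path}:/workspace",
        "-w", "/workspace",
        image,
        "sh", "-c", test_cmd,
    ]


def run_in_sandbox(repo_dir: str, timeout_s: int = 600) -> bool:
    """Run the tests in the container; True means the process exited with code 0."""
    proc = subprocess.run(
        docker_test_command(repo_dir),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return proc.returncode == 0
```

Separating command construction from execution makes the sandbox policy easy to inspect and unit-test without a Docker daemon.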

Topics

Multi-Agent LLM Frameworks
Autonomous Software Engineering
Execution-Grounded Verification
Docker Sandboxing
SWE-bench Lite

Code references

raja21068/AutoCodeAI

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.