LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies
Summary
A controlled experiment evaluated 12 multi-agent LLM collaboration topologies for software architecture design, using a $2\times2\times2$ factorial design across 520 runs and 8 design tasks. Three independent automated evaluators (GPT-OSS 120B, Claude Opus 4.6, and Claude Sonnet 4.6) scored each design on a 12-dimensional rubric. Four core findings emerged. First, structural adversarial (v4b), a prompt-engineered variant that mandates full rewrites, ranked #1 with a weighted ensemble score of 4.637/5.0. Second, cross-model review, where one model generates and another reviews, ranked #2 at 4.606. Third, the evaluators agreed on the top and bottom performers (v4b and v3, respectively) but disagreed sharply on v2b (Claude d=1.44 vs. GPT-OSS d=0.45), which suggests that model families weight design qualities differently. Fourth, parallel merge variants were deemed fundamentally broken, consistently ranking in the bottom tier (3.65-3.79) due to token starvation and the "Frankenstein effect."
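The two statistics quoted above, a weighted ensemble score and an effect size d, can be sketched as follows. This is a minimal illustration, assuming d is Cohen's d with a pooled standard deviation and that the ensemble is a weighted mean of per-evaluator scores; the summary does not state the weights or the exact formula, so the function names, weights, and sample numbers below are hypothetical and are not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, baseline):
    """Cohen's d using the pooled sample standard deviation (assumed definition)."""
    n1, n2 = len(treatment), len(baseline)
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(baseline)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(baseline)) / sqrt(pooled_var)


def weighted_ensemble(scores, weights):
    """Weighted mean of per-evaluator scores; the weights are hypothetical."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Illustrative numbers only, not results from the experiment.
variant_scores = [2.0, 4.0, 6.0]
baseline_scores = [1.0, 3.0, 5.0]
print(cohens_d(variant_scores, baseline_scores))

per_evaluator = {"evaluator_a": 4.0, "evaluator_b": 5.0}
print(weighted_ensemble(per_evaluator, {"evaluator_a": 3, "evaluator_b": 1}))
```

Computing d separately per evaluator, as the summary does for v2b, makes it visible when one model family sees a large improvement where another sees a small one.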
Key takeaway
For AI Scientists and Research Scientists designing multi-agent LLM systems for software architecture, prioritize collaboration topologies that enforce structural adversarial rewrites or cross-model review. Your systems will likely achieve higher design quality by demanding comprehensive rewrites rather than iterative patches, and by using distinct LLMs for generation and review. Avoid parallel merge strategies: they consistently ranked in the bottom tier, with token starvation and the "Frankenstein effect" degrading the coherence of the merged design.
Key insights
Structural adversarial and cross-model review topologies produced the two highest-ranked designs among the 12 topologies tested.
Principles
- Demanding rewrites over patches improves LLM design quality.
- Evaluators from different model families weight design qualities differently.
- Parallel merging in LLM collaboration leads to token starvation.
Method
A $2\times2\times2$ factorial design (Authority $\times$ Roles $\times$ Dynamics) was used to evaluate 12 multi-agent LLM collaboration topologies across 520 runs and 8 design tasks, with each design scored by three LLM evaluators.
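A full $2\times2\times2$ factorial design crosses two levels of each factor, giving 8 cells. The sketch below enumerates them. The summary names the three factors but not their levels, so the level labels here are placeholders, and the summary does not say how the 12 topologies map onto these 8 cells.

```python
from itertools import product

# Factor names come from the summary; the level labels are hypothetical.
FACTORS = {
    "authority": ("level_a", "level_b"),
    "roles": ("level_a", "level_b"),
    "dynamics": ("level_a", "level_b"),
}


def design_cells(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


cells = design_cells(FACTORS)
for cell in cells:
    print(cell)
```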
In practice
- Implement prompt-engineered adversarial variants for design refinement.
- Use distinct LLMs for generation and review tasks.
- Avoid parallel merge strategies in multi-agent LLM workflows.
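The first two recommendations can be combined in one loop: one model generates, a different model reviews, and the generator is then told to rewrite the whole design rather than patch it. This is a rough sketch, not the prompts or pipeline used in the experiment. The `Model` type, the `cross_model_refine` function, and the prompt wording are all assumptions, and each model is represented as a plain callable so that any client library can be plugged in.

```python
from typing import Callable

# A model is any function mapping a prompt string to a completion string.
# In practice this would wrap a real client; here it is an assumed interface.
Model = Callable[[str], str]


def cross_model_refine(task: str, generator: Model, reviewer: Model,
                       rounds: int = 2) -> str:
    """Generate a design, then alternate review and full rewrite."""
    design = generator(f"Design a software architecture for: {task}")
    for _ in range(rounds):
        # A different model critiques the current design.
        critique = reviewer(
            "Critique this design and list its structural weaknesses.\n"
            f"Task: {task}\nDesign:\n{design}"
        )
        # Rewrite mandate: ask for a complete new design, not a patch.
        design = generator(
            "Rewrite the design from scratch so that it addresses every "
            "point in the critique. Do not patch the previous version.\n"
            f"Task: {task}\nPrevious design:\n{design}\nCritique:\n{critique}"
        )
    return design


if __name__ == "__main__":
    # Stub models for demonstration only.
    def stub_generator(prompt: str) -> str:
        return f"[design based on {len(prompt)} prompt chars]"

    def stub_reviewer(prompt: str) -> str:
        return "[critique]"

    print(cross_model_refine("an order service", stub_generator, stub_reviewer))
```

Keeping the pipeline sequential also sidesteps the merge step that the parallel variants depend on.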
Topics
- Multi-agent LLMs
- Software Architecture Design
- Collaboration Topologies
- Prompt Engineering
- LLM Evaluation
- Claude Opus
- GPT-OSS
Best for: AI Architect, AI Engineer, Machine Learning Engineer, AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.