LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies

· Source: Artificial Intelligence · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A controlled experiment evaluated 12 multi-agent LLM collaboration topologies for software architecture design, using a $2\times2\times2$ factorial design across 520 runs and 8 design tasks. Three independent automated evaluators (GPT-OSS 120B, Claude Opus 4.6, and Claude Sonnet 4.6) scored each design on a 12-dimensional rubric. Four core findings emerged. First, structural adversarial (v4b), a prompt-engineered variant that mandates rewrites, ranked #1 with a weighted ensemble score of 4.637/5.0. Second, cross-model review, in which one model generates and another reviews, ranked #2 at 4.606. Third, the evaluators agreed on the top and bottom performers (v4b and v3) but disagreed sharply on v2b (Claude d=1.44 vs. GPT-OSS d=0.45), which suggests that model families weight design qualities differently. Fourth, parallel merge variants were deemed fundamentally broken: they consistently ranked in the bottom tier (3.65-3.79), a result attributed to token starvation and the "Frankenstein effect."
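The two statistics quoted above can be illustrated with a minimal sketch. It assumes that "d" denotes Cohen's d with a pooled standard deviation and that the ensemble score is a weighted mean of per-evaluator scores; the summary confirms neither, and the function names, weights, and sample values are hypothetical.

```python
from statistics import mean, stdev


def cohens_d(treatment, baseline):
    """Cohen's d using a pooled sample standard deviation (assumed definition)."""
    n1, n2 = len(treatment), len(baseline)
    s1, s2 = stdev(treatment), stdev(baseline)
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(treatment) - mean(baseline)) / pooled


def weighted_ensemble(scores, weights):
    """Weighted mean of per-evaluator scores; weights need not sum to 1."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


# Hypothetical rubric scores for one topology against a baseline (not study data).
effect = cohens_d([4.0, 5.0, 6.0], [1.0, 2.0, 3.0])

# Hypothetical evaluator names and equal weights (not the study's weighting).
ensemble = weighted_ensemble(
    {"evaluator_a": 4.0, "evaluator_b": 5.0},
    {"evaluator_a": 1.0, "evaluator_b": 1.0},
)
```

Computing the effect size separately per evaluator, as in the v2b comparison, is what exposes disagreement that a single ensemble score would average away.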

Key takeaway

AI Scientists and Research Scientists designing multi-agent LLM systems for software architecture should prioritize collaboration topologies that enforce structural adversarial rewrites or cross-model review. Such systems will likely achieve higher design quality by demanding comprehensive rewrites rather than iterative patches, and by using distinct LLMs for generation and review. Avoid parallel merge strategies: in this experiment they consistently led to token starvation and incoherent "Frankenstein effect" designs.
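As a rough illustration of the recommended pattern, the sketch below combines a separate reviewer model with a rewrite mandate. It is not the study's implementation: the function name, prompt wording, and the `Model` callable type are all hypothetical, and each model is represented as a plain function from prompt to text so that any client library could be plugged in.

```python
from typing import Callable

# Hypothetical abstraction: a model is any function mapping a prompt to text.
Model = Callable[[str], str]


def cross_model_review(task: str, generator: Model, reviewer: Model,
                       rounds: int = 1) -> str:
    """Generate a design, then alternate review and full rewrite."""
    design = generator(f"Design a software architecture for: {task}")
    for _ in range(rounds):
        # A different model critiques the design (cross-model review).
        critique = reviewer(f"Critique this design:\n{design}")
        # Rewrite mandate: request a complete rewrite, not an incremental patch.
        design = generator(
            "Rewrite the design from scratch, addressing every critique point. "
            "Do not patch the previous version.\n"
            f"Task: {task}\nPrevious design:\n{design}\nCritique:\n{critique}"
        )
    return design
```

Passing two different callables for `generator` and `reviewer` gives the cross-model arrangement; passing the same callable twice reduces it to self-review.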

Key insights

Structural adversarial and cross-model review topologies produced the highest-scoring multi-agent LLM software designs in the experiment (4.637 and 4.606 out of 5.0).

Principles

Method

A $2\times2\times2$ factorial design (Authority $\times$ Roles $\times$ Dynamics) was used to evaluate 12 multi-agent LLM collaboration topologies across 520 runs and 8 design tasks, with each design scored by three LLM evaluators.
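A $2\times2\times2$ factorial design crosses two levels of each of three factors, giving $2^3 = 8$ cells. The sketch below enumerates those cells. The factor names come from the summary, but the level labels are placeholders, and the summary does not say how the 12 topologies map onto the 8 cells.

```python
from itertools import product

# Factor names are from the summary; the level labels are hypothetical.
FACTORS = {
    "authority": ("level_a", "level_b"),
    "roles": ("level_a", "level_b"),
    "dynamics": ("level_a", "level_b"),
}


def factorial_cells(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


cells = factorial_cells(FACTORS)  # 2 * 2 * 2 = 8 combinations
```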

In practice

Topics

Best for: AI Architect, AI Engineer, Machine Learning Engineer, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.