LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies

2026-05-31 · Source: Artificial Intelligence · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A controlled experiment evaluated 12 multi-agent LLM collaboration topologies for software architecture design, using a $2\times2\times2$ factorial design across 520 runs and 8 design tasks. Three independent automated evaluators (GPT-OSS 120B, Claude Opus 4.6, and Claude Sonnet 4.6) scored each design against a 12-dimensional rubric. Four core findings emerged. First, structural adversarial (v4b), a prompt-engineered variant that mandates rewrites rather than patches, ranked #1 with a weighted ensemble score of 4.637/5.0. Second, cross-model review, in which one model generates and another reviews, ranked #2 at 4.606. Third, evaluator diversity matters: the evaluators agreed on the top and bottom performers (v4b and v3) but disagreed sharply on v2b (Claude d=1.44 vs. GPT-OSS d=0.45), which suggests that model families weight design qualities differently. Fourth, parallel merge variants were judged fundamentally broken, consistently ranking in the bottom tier (3.65-3.79) because of token starvation and the "Frankenstein effect."
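The summary does not define the d values it quotes. Assuming they are Cohen's d effect sizes comparing a topology's scores against a baseline, a minimal sketch of that calculation follows. The function name and the sample scores are illustrative and are not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, baseline):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(baseline)
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(baseline)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(baseline)) / sqrt(pooled_var)


# Made-up rubric scores on a 1-5 scale, for illustration only.
baseline_scores = [3.0, 3.5, 4.0]
variant_scores = [4.0, 4.5, 5.0]
print(cohens_d(variant_scores, baseline_scores))
```

Under this reading, the same topology can show a large effect with one evaluator and a modest one with another, which is the pattern reported for v2b.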

Key takeaway

AI Scientists and Research Scientists designing multi-agent LLM systems for software architecture should prioritize collaboration topologies that enforce structural adversarial rewrites or cross-model review. In this experiment, design quality was highest when the system demanded comprehensive rewrites rather than iterative patches, and when generation and review were handled by different LLMs. Avoid parallel merge strategies: they consistently led to token starvation and incoherent "Frankenstein effect" designs.

Key insights

Structural adversarial and cross-model review topologies ranked first and second, producing the highest-scoring multi-agent LLM software designs in the experiment.

Principles

Demanding rewrites over patches improves LLM design quality.
Diverse LLM evaluators reveal varied design quality weighting.
Parallel merging in LLM collaboration leads to token starvation.
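The summary reports a weighted ensemble score across three evaluators but gives neither the weights nor the aggregation formula. The sketch below assumes the simplest reading: average each evaluator's 12 rubric dimensions, then take a weighted mean across evaluators. The evaluator names, weights, and scores are placeholders.

```python
from statistics import mean


def ensemble_score(rubric_scores, weights):
    """Weighted mean of each evaluator's average rubric score.

    rubric_scores: evaluator name -> list of per-dimension scores (1-5).
    weights: evaluator name -> non-negative weight (assumed, not from the study).
    """
    total_weight = sum(weights[name] for name in rubric_scores)
    weighted_sum = sum(
        weights[name] * mean(scores) for name, scores in rubric_scores.items()
    )
    return weighted_sum / total_weight


# Hypothetical scores: 12 rubric dimensions per evaluator.
example_scores = {
    "evaluator_a": [4] * 12,
    "evaluator_b": [5] * 12,
    "evaluator_c": [3] * 12,
}
example_weights = {"evaluator_a": 1.0, "evaluator_b": 2.0, "evaluator_c": 1.0}
print(ensemble_score(example_scores, example_weights))
```

Because evaluators from different model families can weight design qualities differently, the choice of weights can shift a topology's rank. Reporting per-evaluator scores alongside the ensemble makes that visible.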

Method

A $2\times2\times2$ factorial design (Authority $\times$ Roles $\times$ Dynamics) was used to evaluate 12 multi-agent LLM collaboration topologies across 520 runs and 8 design tasks, with each design scored by three LLM evaluators.
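A $2\times2\times2$ design crosses three two-level factors into eight conditions. The sketch below enumerates those cells. The factor names come from the summary, but the level names are placeholders, since the summary does not state them. It also does not explain how the 12 topologies map onto the eight cells.

```python
from itertools import product

# Factor names are from the summary; level names are placeholders.
FACTORS = {
    "authority": ("level_0", "level_1"),
    "roles": ("level_0", "level_1"),
    "dynamics": ("level_0", "level_1"),
}


def factorial_cells(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


cells = factorial_cells(FACTORS)
for cell in cells:
    print(cell)
```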

In practice

Implement prompt-engineered adversarial variants for design refinement.
Use distinct LLMs for generation and review tasks.
Avoid parallel merge strategies in multi-agent LLM workflows.
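The first two recommendations can be combined in one loop: a generator model drafts a design, a different model critiques it, and the generator must rewrite the design rather than patch it. The sketch below is an illustration under assumptions, not the study's implementation. Models are plain callables from prompt to text, and the prompts are invented.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def cross_model_refine(task: str, generator: Model, reviewer: Model,
                       rounds: int = 1) -> str:
    """Draft with one model, critique with another, then demand a rewrite."""
    if generator is reviewer:
        raise ValueError("generator and reviewer should be distinct models")
    design = generator(f"Design a software architecture for: {task}")
    for _ in range(rounds):
        critique = reviewer(
            "Critique this design and list its structural flaws:\n" + design
        )
        # Rewrite mandate: ask for a complete new design, not a patch.
        design = generator(
            "Rewrite the design from scratch, addressing every point in the "
            f"critique.\nTask: {task}\nPrevious design:\n{design}\n"
            f"Critique:\n{critique}"
        )
    return design


def make_stub_model(tag: str) -> Model:
    """Stand-in for a real LLM client, used only to exercise the loop."""
    return lambda prompt: f"[{tag}] reply to a {len(prompt)}-character prompt"


print(cross_model_refine("a job queue service",
                         make_stub_model("generator"),
                         make_stub_model("reviewer")))
```

The loop keeps a single design in flight at every step, so it never merges parallel drafts, the strategy the study found fundamentally broken.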

Topics

Multi-agent LLMs
Software Architecture Design
Collaboration Topologies
Prompt Engineering
LLM Evaluation
Claude Opus
GPT-OSS

Best for: AI Architect, AI Engineer, Machine Learning Engineer, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.