Evaluating Multi-Agent LLM Architectures for Rare Disease Diagnosis
Summary
A study submitted on March 6, 2026, evaluated four multi-agent large language model (LLM) architectures for rare disease diagnosis across 302 cases spanning 33 rare disease categories. The architectures tested were Control (a single-agent baseline), Hierarchical, Adversarial, and Collaborative. The Hierarchical topology achieved the highest diagnostic accuracy at 50.0%, slightly ahead of Collaborative (49.8%) and Control (48.5%). The Adversarial topology performed significantly worse, reaching only 27.3% accuracy and exhibiting a large "Reasoning Gap" in which correct diagnoses were raised but then rejected. Across all topologies, performance was strongest for Allergic diseases and Toxic Effects and weakest for Cardiac Malformation and Respiratory cases. Notably, every multi-agent topology, including Adversarial, was more accurate than the single-agent baseline in the Bone and Thoracic disease categories.
Key takeaway
For AI researchers developing diagnostic LLMs, these findings indicate that adding agents does not by itself improve performance. Choose the topology per task rather than by default: favor Hierarchical or Collaborative designs, particularly for the Bone and Thoracic disease categories where multi-agent systems beat the single-agent baseline, and avoid Adversarial configurations, which can introduce artificial doubt and reduce accuracy.
Key insights
Multi-agent LLM architectures do not guarantee improved diagnostic accuracy; topology selection is critical.
Principles
- Increased system complexity does not guarantee better reasoning.
- Adversarial topologies can degrade diagnostic performance.
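To put both principles in numbers, the gap between each multi-agent topology and the single-agent Control baseline can be computed from the accuracies reported in the summary. This is a small arithmetic sketch; the names `REPORTED_ACCURACY` and `delta_vs_control` are illustrative and do not come from the study.

```python
# Overall accuracies (percent) as reported in the summary above.
REPORTED_ACCURACY = {
    "Control": 48.5,
    "Hierarchical": 50.0,
    "Collaborative": 49.8,
    "Adversarial": 27.3,
}


def delta_vs_control(accuracy):
    """Return each topology's accuracy minus Control's, in percentage points."""
    baseline = accuracy["Control"]
    return {
        name: round(value - baseline, 1)
        for name, value in accuracy.items()
        if name != "Control"
    }


deltas = delta_vs_control(REPORTED_ACCURACY)
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {delta:+.1f} pp")
# Hierarchical: +1.5 pp
# Collaborative: +1.3 pp
# Adversarial: -21.2 pp
```

The two best multi-agent topologies gain less than two percentage points over the baseline, while the Adversarial topology loses more than twenty.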
Method
The study introduced a "Reasoning Gap" metric to quantify the difference between internal knowledge retrieval and final diagnostic accuracy in LLM-based diagnostic systems.
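This summary does not give the exact formula for the metric, so the following is only one plausible reading, stated as an assumption: the Reasoning Gap is the share of cases in which the correct diagnosis was raised during deliberation minus the share in which it was the final answer. The `CaseResult` record, its fields, and the toy cases are hypothetical and are not data from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    truth: str             # reference diagnosis for the case
    considered: frozenset  # diagnoses raised during deliberation (hypothetical field)
    final: str             # diagnosis the system finally committed to


def reasoning_gap(results):
    """Assumed definition: retrieval rate minus final accuracy.

    A positive value means the system surfaced the correct diagnosis
    more often than it ended up selecting it.
    """
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    retrieved = sum(r.truth in r.considered for r in results)
    correct = sum(r.final == r.truth for r in results)
    return retrieved / n - correct / n


# Toy cases invented for illustration only.
toy_results = [
    CaseResult("a", frozenset({"a", "b"}), "a"),  # raised and selected
    CaseResult("b", frozenset({"b", "c"}), "c"),  # raised, then rejected
    CaseResult("c", frozenset({"a"}), "a"),       # never raised
    CaseResult("d", frozenset({"d"}), "d"),       # raised and selected
]
print(reasoning_gap(toy_results))  # 0.25 (retrieval 0.75, accuracy 0.50)
```

Under this reading, a topology that often rejects a correct candidate, as the Adversarial one reportedly did, would show a large positive gap.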
In practice
- Consider Hierarchical or Collaborative topologies for diagnosis.
- Avoid Adversarial topologies for critical diagnostic tasks.
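One way to act on these two recommendations is a small selector that picks a topology per disease category from held-out accuracy and never returns Adversarial. This is a minimal sketch under stated assumptions, not the study's method: `select_topology`, `ALLOWED_TOPOLOGIES`, the fallback to Hierarchical (chosen because it had the highest overall accuracy in the summary), and the placeholder scores are all hypothetical.

```python
# Adversarial is deliberately left out, following the recommendation above.
ALLOWED_TOPOLOGIES = ("Control", "Hierarchical", "Collaborative")


def select_topology(category, validation_accuracy, default="Hierarchical"):
    """Pick the allowed topology with the best held-out accuracy for a category.

    validation_accuracy maps category -> {topology name: accuracy}.
    Falls back to `default` when no scores are available for the category.
    """
    scores = validation_accuracy.get(category) or {}
    candidates = {t: s for t, s in scores.items() if t in ALLOWED_TOPOLOGIES}
    if not candidates:
        return default
    return max(candidates, key=candidates.get)


# Placeholder scores invented for illustration; not results from the study.
placeholder_scores = {
    "category_x": {
        "Control": 0.40,
        "Hierarchical": 0.55,
        "Collaborative": 0.50,
        "Adversarial": 0.60,  # ignored because Adversarial is not allowed
    }
}
print(select_topology("category_x", placeholder_scores))     # Hierarchical
print(select_topology("unseen_category", placeholder_scores))  # Hierarchical (default)
```

Keeping the allow-list separate from the scores makes the exclusion of Adversarial an explicit policy decision rather than something that depends on the validation numbers.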
Topics
- Multi-agent LLMs
- Rare Disease Diagnosis
- LLM Architectures
- Diagnostic Accuracy
- Reasoning Gap Metric
Best for: AI Researcher, AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.