MedArena: Comparing LLMs for Medicine-in-the-Wild Clinician Preferences
Summary
MedArena is an interactive evaluation platform that compares large language models (LLMs) for medical applications using real-world clinician preferences, addressing the limitations of static benchmarks. Clinicians submit their own medical queries and then select the preferred response from the outputs of two randomly chosen LLMs. As of November 1, 2025, MedArena had collected 1,571 preferences across 12 LLMs, with Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o as the top three models by Bradley-Terry rating. Only one-third of clinician questions were factual recall tasks; the majority involved treatment selection, clinical documentation, or patient communication, and approximately 20% were multi-turn conversations. Clinicians prioritized depth, detail, and clarity over raw factual accuracy, which points to the importance of readability and clinical nuance in medical LLM responses.
Key takeaway
For AI engineers developing medical LLMs, traditional factual recall benchmarks are insufficient for assessing real-world clinical utility. Optimize models for depth, detail, and clarity in responses, since these factors drive clinician preference in tasks like treatment selection and patient communication. Use interactive evaluation methods like MedArena to gather direct clinician feedback and refine model performance beyond accuracy alone.
Key insights
Real-world clinician preferences reveal that medical LLM utility extends beyond factual recall to include depth, clarity, and clinical nuance.
Principles
- Static benchmarks fail to capture real-world clinical utility.
- Clinician preference favors depth, detail, and clarity over raw factual accuracy.
Method
MedArena uses an interactive platform where clinicians submit queries and select preferred LLM responses from a pair, enabling direct comparison and preference collection for medical LLMs.
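To make the ranking step concrete, here is a minimal sketch of how pairwise preferences like those described above could be turned into Bradley-Terry strengths. The summary only states that models are ranked by Bradley-Terry rating; the fitting procedure shown here (a standard minorization-maximization update), the function names `fit_bradley_terry` and `win_probability`, and the placeholder model names are assumptions for illustration, not the MedArena implementation.

```python
from collections import defaultdict


def fit_bradley_terry(preferences, n_iter=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses iterative minorization-maximization updates. Strengths are
    rescaled after each pass so they average to 1.0.
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    models = set()
    for winner, loser in preferences:
        if winner == loser:
            continue  # a model compared with itself carries no information
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))
    if not models:
        return {}

    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom if denom > 0 else strength[m]
        # Rescale so the strengths average to 1.0 (the scale is arbitrary).
        total = sum(updated.values())
        updated = {m: s * len(models) / total for m, s in updated.items()}
        delta = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if delta < tol:
            break
    return strength


def win_probability(strength, a, b):
    """Modelled probability that a response from `a` is preferred over `b`."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical preference log: each tuple is (preferred model, other model).
votes = (
    [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
    + [("model_b", "model_c")] * 3 + [("model_c", "model_b")]
    + [("model_a", "model_c")] * 3 + [("model_c", "model_a")]
)
ratings = fit_bradley_terry(votes)
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Only the ratio between two strengths is meaningful, so `win_probability(ratings, "model_a", "model_b")` is the quantity to read off. Leaderboards often report a log-scaled version of these strengths, and a model with no wins gets a strength of zero in this unregularized sketch.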
In practice
- Evaluate medical LLMs using interactive, clinician-driven platforms.
- Prioritize response depth, detail, and clarity in medical LLM development.
Topics
- Medical LLMs
- LLM Evaluation
- Clinician Preferences
- MedArena Platform
- Clinical Decision Support
Best for: AI Engineer, NLP Engineer, AI Scientist, AI Researcher, Machine Learning Engineer, Domain Expert
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.