MedArena: Comparing LLMs for Medicine-in-the-Wild Clinician Preferences

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, AI in Healthcare · Depth: Intermediate, short

Summary

MedArena is an interactive evaluation platform that compares large language models (LLMs) for medical applications using real-world clinician preferences, addressing the limitations of static benchmarks. Clinicians submit their own medical queries and then select the preferred response from the outputs of two randomly chosen LLMs. As of November 1, 2025, MedArena had collected 1,571 preferences across 12 LLMs, with Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o emerging as the top three models by Bradley-Terry rating. The study found that only one-third of clinician questions were factual recall tasks; the majority involved treatment selection, clinical documentation, or patient communication, and approximately 20% were multi-turn conversations. Clinicians prioritized depth, detail, and clarity over raw factual accuracy, which points to the importance of readability and clinical nuance in medical LLM responses.

Key takeaway

If you are an AI engineer developing medical LLMs, traditional factual-recall benchmarks alone are insufficient for assessing real-world clinical utility. Focus on optimizing models for depth, detail, and clarity in responses, since these factors drive clinician preference in tasks such as treatment selection and patient communication. Use interactive evaluation methods like MedArena to gather direct clinician feedback and refine model performance beyond accuracy alone.

Key insights

Real-world clinician preferences reveal that medical LLM utility extends beyond factual recall to include depth, clarity, and clinical nuance.

Method

MedArena is an interactive platform where clinicians submit their own queries and select the preferred response from a pair of LLM outputs. The collected preferences allow direct, head-to-head comparison of medical LLMs.
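The summary reports rankings by Bradley-Terry rating but does not say how the ratings are fitted. As a rough illustration only, the sketch below turns pairwise preferences into Bradley-Terry strengths using the standard iterative minorization-maximization (MM) update. The model names, the preference counts, and the function names `fit_bradley_terry` and `win_probability` are all hypothetical and are not taken from MedArena.

```python
from collections import defaultdict


def fit_bradley_terry(preferences, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic MM update: p_i <- wins_i / sum_j n_ij / (p_i + p_j).
    Assumes winner != loser in every pair and that each model has at
    least one win (otherwise its strength collapses to zero).
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    models = set()
    for winner, loser in preferences:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strengths = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strengths[m] + strengths[other])
            updated[m] = wins[m] / denom if denom > 0 else strengths[m]
        # Strengths are only defined up to scale, so fix the mean at 1.
        total = sum(updated.values())
        updated = {m: s * len(models) / total for m, s in updated.items()}
        delta = max(abs(updated[m] - strengths[m]) for m in models)
        strengths = updated
        if delta < tol:
            break
    return strengths


def win_probability(strengths, a, b):
    """Modelled probability that a clinician prefers model a over model b."""
    return strengths[a] / (strengths[a] + strengths[b])


# Made-up preference data: each tuple is (preferred model, other model).
toy_preferences = (
    [("model_a", "model_b")] * 3 + [("model_b", "model_a")] * 1
    + [("model_a", "model_c")] * 2 + [("model_c", "model_a")] * 1
    + [("model_b", "model_c")] * 2 + [("model_c", "model_b")] * 1
)

strengths = fit_bradley_terry(toy_preferences)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
print(round(win_probability(strengths, "model_a", "model_c"), 3))
```

Because the strengths are identifiable only up to a common scale, the sketch normalizes them to a mean of 1. Public leaderboards often report a log-scaled version of these strengths instead, which preserves the ranking.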

Best for: AI Engineer, NLP Engineer, AI Scientist, AI Researcher, Machine Learning Engineer, Domain Expert

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.