Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?
Summary
A study evaluated a "Large Language Model (LLM) Jury" as an alternative to expert clinician panels for assessing medical AI systems, aiming to address the high cost and slow pace of human evaluation. The LLM jury, comprising three frontier AI models (Anthropic's Claude Opus 4.1, Google's Gemini 2.5 Pro, and OpenAI's o3), scored 3,333 diagnoses across 300 real-world hospital cases from a middle-income country. Performance was benchmarked against primary expert clinician panels and independent human re-scoring panels across four dimensions: diagnosis, differential diagnosis, clinical reasoning, and negative treatment risk (converted to patient safety). Key findings indicate that uncalibrated LLM jury scores are systematically lower than clinician scores, but the jury preserves ordinal agreement, agrees more closely with the primary expert panels than the human re-score panels do, and has a lower probability of severe safety errors. Post-hoc calibration using isotonic regression significantly improved alignment with human expert evaluations, supporting the LLM jury's potential as a reliable proxy for expert clinician evaluation in medical AI benchmarking.
Key takeaway
For AI Scientists and Research Scientists developing or deploying medical AI, this study provides evidence that a calibrated, multi-model LLM jury can serve as a reliable proxy for expert clinician evaluation. Consider integrating such a jury into your evaluation pipeline to assess medical AI outputs at scale and consistently, especially to identify high-risk diagnoses that need targeted human review. This approach can reduce the burden and cost of traditional human expert panels while maintaining or improving evaluation quality.
Key insights
A calibrated multi-LLM jury can serve as a reliable proxy for expert clinician evaluation in medical AI benchmarking, agreeing more closely with primary expert panels than independent human re-score panels do.
Principles
- LLM juries can preserve ordinal agreement with human experts.
- Calibration improves LLM jury alignment with human expert scores.
- Multi-model LLM juries can reduce severe error rates.
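One way to see what "preserving ordinal agreement" means is a rank correlation between jury and expert scores. The sketch below uses Kendall's tau (the simple tau-a variant) on made-up scores. The summary does not say which agreement statistic the study used, so the metric, the data, and the function name are illustrative assumptions.

```python
from itertools import combinations


def kendall_tau_a(a, b):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (a1, b1), (a2, b2) in combinations(zip(a, b), 2):
        direction = (a1 - a2) * (b1 - b2)
        if direction > 0:
            concordant += 1   # both raters order this pair the same way
        elif direction < 0:
            discordant += 1   # the raters order this pair oppositely
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for five diagnoses (not data from the study).
jury_scores = [2, 3, 4, 5, 7]      # jury scores run lower than the experts'
expert_scores = [4, 6, 5, 8, 9]

tau = kendall_tau_a(jury_scores, expert_scores)

# Adding a constant offset to the jury scores leaves the ranking, and so tau, unchanged.
jury_shifted = [s + 2 for s in jury_scores]
tau_shifted = kendall_tau_a(jury_shifted, expert_scores)
```

A rank statistic ignores a uniform downward shift in scores, which is why a jury can score systematically lower than clinicians and still agree with them on ordering.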
Method
A multi-model LLM jury evaluates medical diagnoses across four dimensions. Scores are aggregated and then calibrated using isotonic regression to align with expert human panel judgments, enhancing reliability and accuracy.
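A minimal sketch of the aggregate-then-calibrate step, assuming mean aggregation and a toy set of paired scores. Both are illustrative assumptions: the summary does not state the aggregation rule or the score scale. The code implements isotonic regression with the pool-adjacent-violators algorithm in plain Python and predicts with a step function. All function names are hypothetical.

```python
from bisect import bisect_left
from statistics import mean


def aggregate_jury(scores_per_model):
    """Average the per-model scores item by item (assumed aggregation rule)."""
    return [mean(item) for item in zip(*scores_per_model)]


def fit_isotonic(x, y):
    """Fit a non-decreasing step function y ~ f(x) by pool-adjacent-violators.

    Returns (thresholds, values): inputs up to thresholds[i] map to values[i].
    """
    # Pool observations that share the same x so ties get one fitted value.
    grouped = {}
    for xi, yi in zip(x, y):
        total, count = grouped.get(xi, (0.0, 0))
        grouped[xi] = (total + yi, count + 1)

    blocks = []  # each block is [sum_of_y, count, largest_x_in_block]
    for xi in sorted(grouped):
        total, count = grouped[xi]
        blocks.append([total, count, xi])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, largest_x = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = largest_x

    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def calibrate(score, thresholds, values):
    """Map a raw jury score to the calibrated scale, clamping beyond the range."""
    i = min(bisect_left(thresholds, score), len(values) - 1)
    return values[i]


# Hypothetical per-model scores for five diagnoses (not data from the study).
claude = [2, 3, 4, 5, 6]
gemini = [1, 3, 5, 4, 7]
o3 = [3, 3, 3, 6, 8]
expert = [4, 6, 5, 8, 9]  # hypothetical expert panel scores for the same items

jury_raw = aggregate_jury([claude, gemini, o3])
thresholds, values = fit_isotonic(jury_raw, expert)
calibrated = [calibrate(s, thresholds, values) for s in jury_raw]
```

In a real pipeline the calibration map would be fitted on one set of expert-scored cases and applied to held-out ones. Fitting and evaluating on the same items, as this toy example does, overstates the improvement.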
In practice
- Implement isotonic regression for LLM judge score calibration.
- Use multi-LLM juries to identify high-risk ward diagnoses.
- Prompt LLMs with patient age/demographics for nuanced risk assessment.
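The triage idea in the list above can be sketched as a simple routing rule: send a diagnosis to a clinician when any juror gives a low patient-safety score, or when the jurors disagree widely. The 0-10 scale, both thresholds, the case identifiers, and the function name are hypothetical. The summary does not describe how the study flags cases.

```python
def needs_human_review(safety_scores, min_safe=4, max_spread=3):
    """Flag a diagnosis if any juror scores safety below min_safe,
    or if the jurors' scores differ by more than max_spread."""
    lowest = min(safety_scores)
    spread = max(safety_scores) - lowest
    return lowest < min_safe or spread > max_spread


# Hypothetical patient-safety scores from three jurors on an assumed 0-10 scale.
cases = {
    "dx-001": [9, 8, 9],  # high and consistent: no review
    "dx-002": [3, 7, 8],  # one juror sees a safety problem
    "dx-003": [7, 7, 2],  # one juror sees a severe problem
    "dx-004": [6, 6, 7],  # moderate and consistent: no review
    "dx-005": [5, 9, 9],  # no low score, but the jurors disagree
}

flagged = [case_id for case_id, scores in cases.items() if needs_human_review(scores)]
```

Keying on the minimum score rather than the mean is a deliberately conservative choice for safety: one juror's concern is enough to trigger review. The thresholds would need tuning against expert-labelled cases.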
Topics
- Large Language Models
- Medical AI Evaluation
- LLM-as-a-Judge
- Clinical Diagnosis Scoring
- Isotonic Regression
Best for: AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.