When eight popular large language models (LLMs) answer common parent questions about early maxillary expansion, how reliable are they—and how readable is what they say?
Summary
A study evaluated eight large language models (LLMs)—DeepSeek V3, Gemini 2.5 Flash, Claude 4.5 Sonnet, MediSearch, Copilot, GPT-5, GPT-4o, and Grok—on their reliability and readability when answering 20 common parent questions about early maxillary expansion in children. Researchers, including four orthodontists, scored responses for accuracy against scientific evidence and clinical practice, and for comprehensiveness. Readability was assessed using Flesch Reading Ease (FRES) and Flesch-Kincaid Grade Level (FKGL). Key findings indicate significant differences between models, with DeepSeek V3 and Grok performing best in clinical accuracy and comprehensiveness. Copilot, GPT-5, and GPT-4o were most readable but less accurate. Notably, no model achieved the recommended 6th-grade reading level for patient education materials, and MediSearch performed worst overall.
Key takeaway
CTOs and VPs of Engineering deploying LLMs in healthcare contexts must recognize that model choice is a material variable affecting accuracy, comprehensiveness, and readability. They should implement governance for approved models and use cases, separating educational content from decision support, and plan for ongoing post-deployment evaluation and human-in-the-loop review, especially for pediatric or high-risk applications, to mitigate the risks of misleading information and delayed care.
Key insights
LLM performance in healthcare varies significantly across accuracy, comprehensiveness, and readability, and no single model excels in all three.
Principles
- Model choice strongly shapes LLM output quality.
- Readability often trades off with clinical reliability.
- Confident, plausible-sounding LLM answers can be a healthcare hazard.
Method
The study posed 20 real-world parent questions about early maxillary expansion to each model. Orthodontists scored the responses for accuracy and comprehensiveness, and readability was measured with FRES and FKGL.
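As a rough illustration of the two readability metrics named here, the sketch below computes FRES and FKGL from the standard published formulas. The syllable counter is a simple vowel-group heuristic and the function names are hypothetical; the study's own tooling is not described in this summary, so its scores could differ from what this sketch produces.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables by counting vowel groups (heuristic, not exact)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in '-le' endings such as 'table'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) for a block of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Flesch Reading Ease: higher means easier to read.
    fres = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Flesch-Kincaid Grade Level: approximate US school grade.
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fres, fkgl


if __name__ == "__main__":
    fres, fkgl = readability("The cat sat on the mat.")
    print(f"FRES={fres:.2f}, FKGL={fkgl:.2f}")
```

An FKGL at or below 6 would correspond to the 6th-grade target mentioned in the summary. Short inputs give unstable scores, so a check like this is more meaningful on full model responses than on single sentences.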
In practice
- Evaluate LLMs using real-world queries and users.
- Prioritize "accuracy-preserving simplification" in health LLMs.
- Implement versioning and transparency for LLM updates.
Topics
- Large Language Models
- Healthcare AI
- AI Reliability
- Patient Education
- Clinical Decision Support
Best for: CTO, VP of Engineering/Data, Executive, AI Engineer, Policy Maker, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Pascal’s Substack.