When eight popular large language models (LLMs) answer common parent questions about early maxillary expansion, how reliable are they—and how readable is what they say?

2025-11-28 · Source: Pascal’s Substack · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI in Healthcare · Depth: Intermediate, medium

Summary

A study evaluated eight large language models (LLMs)—DeepSeek V3, Gemini 2.5 Flash, Claude 4.5 Sonnet, MediSearch, Copilot, GPT-5, GPT-4o, and Grok—on their reliability and readability when answering 20 common parent questions about early maxillary expansion in children. Researchers, including four orthodontists, scored responses for accuracy against scientific evidence and clinical practice, and for comprehensiveness. Readability was assessed using Flesch Reading Ease (FRES) and Flesch-Kincaid Grade Level (FKGL). Key findings indicate significant differences between models, with DeepSeek V3 and Grok performing best in clinical accuracy and comprehensiveness. Copilot, GPT-5, and GPT-4o were most readable but less accurate. Notably, no model achieved the recommended 6th-grade reading level for patient education materials, and MediSearch performed worst overall.

Key takeaway

CTOs and VPs of Engineering deploying LLMs in healthcare contexts should treat model choice as a material variable that affects accuracy, comprehensiveness, and readability. Establish governance for approved models and use cases, and separate educational content from decision support. Plan for ongoing post-deployment evaluation and human-in-the-loop review, especially for pediatric or high-risk applications, to reduce the risk of misleading information and delayed care.

Key insights

LLM performance in healthcare varies significantly across accuracy, comprehensiveness, and readability, with no single model excelling in all three.

Principles

Model choice strongly shapes LLM output quality.
Readability often trades off with clinical reliability.
Plausible confidence from LLMs can be a healthcare hazard.

Method

The study posed 20 real-world parent questions about early maxillary expansion to each model. Orthodontists scored the responses for accuracy and comprehensiveness, and readability was measured with FRES and FKGL.
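
For readers unfamiliar with the two readability metrics, the following is a minimal sketch of how FRES and FKGL can be computed from word, sentence, and syllable counts using the standard published formulas. The function names and the vowel-group syllable counter are illustrative assumptions, not the tooling used in the study; dedicated readability tools count syllables more carefully, so scores will differ slightly.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) for a block of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Flesch Reading Ease: higher means easier to read.
    fres = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Flesch-Kincaid Grade Level: approximates a US school grade.
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fres, fkgl


if __name__ == "__main__":
    fres, fkgl = readability("The cat sat on the mat.")
    print(f"FRES={fres:.2f}, FKGL={fkgl:.2f}")
```

Because FKGL approximates a school grade, a 6th-grade target corresponds to an FKGL of roughly 6. Both formulas reward short sentences and short words, which is one reason simpler answers can score well on readability while omitting clinical detail.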

In practice

Evaluate LLMs using real-world queries and users.
Prioritize "accuracy-preserving simplification" in health LLMs.
Implement versioning and transparency for LLM updates.

Topics

Large Language Models
Healthcare AI
AI Reliability
Patient Education
Clinical Decision Support

Best for: CTO, VP of Engineering/Data, Executive, AI Engineer, Policy Maker, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Pascal’s Substack.