Please don’t trust your chatbot for medical advice

2025-06-07 · Source: Marcus on AI · Field: Health & Wellbeing — Clinical Care & Medical Practice, Medical Devices & Health Technology, Healthcare Systems & Policy · Depth: Novice, short

Summary

Recent studies highlight significant limitations and risks associated with using large language models (LLMs) for medical advice, particularly when accessed by the general public. A BMJ study involving Gemini, DeepSeek, Meta AI, ChatGPT, and Grok found nearly half of responses to medical questions were problematic, exhibiting hallucinations, fabricated citations, and overconfidence. Separately, research in JAMA Network Open assessed 21 frontier models, concluding they remain limited in early diagnostic reasoning and are unreliable for unsupervised patient-facing clinical decision-making. Two additional Nature Medicine studies reinforced these concerns, showing LLMs identified relevant conditions in fewer than 34.5% of cases and undertriaged 52% of gold-standard emergencies, raising critical safety issues for consumer-scale deployment.

Key takeaway

For healthcare providers and AI developers considering integrating LLMs into patient-facing applications, these converging studies underscore the need for caution. Prioritize rigorous validation and robust human oversight, especially for diagnostic reasoning and triage systems, to avoid amplifying misinformation and to protect patient safety. Do not deploy consumer-scale AI triage until its safety has been prospectively validated.

Key insights

LLMs are unreliable for medical advice, frequently generating misinformation with overconfidence and lacking clinical reasoning.

Principles

LLMs are "frequently wrong, never in doubt."
LLMs express their outputs with the same confidence whether or not they are correct.
Patients struggle to guide LLMs effectively.

In practice

Avoid using LLMs for unsupervised patient-facing clinical decisions.
Educate the public on LLM limitations in healthcare.
Implement crisis safeguards before deploying AI triage systems.

Topics

Large Language Models
Medical Misinformation
Clinical Reasoning
Patient Safety
Diagnostic Limitations

Best for: CTO, VP of Engineering/Data, Director of AI/ML, General Interest, AI Ethicist, Policy Maker

Editorial summary, takeaway, and curation by AIssential. Original article published by Marcus on AI.