Half of AI health answers are wrong even though they sound convincing – new study

2026-04-20 · Source: Artificial intelligence (AI) – The Conversation · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Medical Devices & Health Technology · Depth: Novice, short

Summary

Seven researchers ran a systematic stress test of the health information given by five popular chatbots: ChatGPT, Gemini, Grok, Meta AI, and DeepSeek. In the study, published in BMJ Open, each chatbot was asked 50 health and medical questions on topics including cancer, vaccines, stem cells, nutrition, and athletic performance. Experts rated about half of the answers as problematic: 30% somewhat problematic and nearly 20% highly problematic. Grok performed worst, with 58% of its responses rated problematic, followed by ChatGPT at 52% and Meta AI at 50%. The chatbots handled vaccines and cancer best, while open-ended questions proved hardest, with 32% of those answers rated highly problematic. Reference lists were unreliable: the median completeness score was just 40%, and no chatbot produced a single fully accurate list across 25 attempts. The chatbots often fabricated papers or links.

Key takeaway

Healthcare professionals and AI product managers evaluating chatbot integration should recognize that current free-tier models are unreliable for direct medical advice. Teams should prioritize robust verification mechanisms and educate users about chatbot limitations, especially for open-ended health queries. Position these tools not as standalone medical authorities but as supplementary aids that need human oversight and external validation to reduce significant health risks.

Key insights

Chatbots frequently provide problematic and unreliable health information, especially in answers to open-ended queries and in the references they cite.

Principles

Language models predict words, not facts.
Training data quality impacts output reliability.
Open-ended queries increase problematic responses.

Method

Researchers used a "red teaming" stress-testing technique, deliberately crafting prompts to elicit misleading answers from chatbots, then had two experts independently rate each response for accuracy and reliability.
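The rating step can be illustrated with a small tally. This is a minimal sketch using invented ratings, not the study's data or its scoring procedure. The label names, the `summarize` helper, and the rule of counting a response as problematic only when both raters flag it are all assumptions made for illustration.

```python
# Hypothetical labels; the study's actual rating scale may differ.
PROBLEMATIC = {"somewhat problematic", "highly problematic"}


def summarize(ratings):
    """Summarize paired expert ratings.

    ratings: list of (rater_a_label, rater_b_label) pairs, one per response.
    Returns the share of responses where both raters gave the same label,
    and the share that both raters flagged as problematic (an assumed,
    conservative rule, not the study's).
    """
    if not ratings:
        raise ValueError("no ratings supplied")
    total = len(ratings)
    agree = sum(1 for a, b in ratings if a == b)
    flagged = sum(
        1 for a, b in ratings if a in PROBLEMATIC and b in PROBLEMATIC
    )
    return {"agreement": agree / total, "problematic_share": flagged / total}


# Invented example data for four responses.
example = [
    ("not problematic", "not problematic"),
    ("somewhat problematic", "highly problematic"),
    ("highly problematic", "highly problematic"),
    ("not problematic", "somewhat problematic"),
]
result = summarize(example)
print(result)
```

Requiring both raters to flag a response keeps the tally conservative; a real analysis would more likely use a defined procedure for resolving disagreements.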

In practice

Verify chatbot health claims independently.
Treat chatbot references as suggestions to check.
Be wary of confident responses that lack disclaimers.
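One way to act on the second point is to pull every DOI and link out of a chatbot answer so that each can be looked up by hand. The sketch below is a hypothetical helper, not something described in the study: `extract_citations`, the simplified regular expressions, and the sample text (including its made-up DOI and URL) are assumptions. It only builds a checklist and cannot tell whether a cited paper exists or says what the chatbot claims.

```python
import re

# Simplified patterns: good enough to build a checklist, not to validate anything.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
URL_PATTERN = re.compile(r"https?://[^\s\"<>]+")

# Punctuation that often trails a citation inside a sentence.
TRAILING = ".,;)"


def extract_citations(answer):
    """Return the DOIs and URLs found in a chatbot answer, for manual checking."""
    dois = {d.rstrip(TRAILING) for d in DOI_PATTERN.findall(answer)}
    urls = {u.rstrip(TRAILING) for u in URL_PATTERN.findall(answer)}
    return {"dois": sorted(dois), "urls": sorted(urls)}


# Invented sample answer; the DOI and URL are placeholders, not real sources.
sample = (
    "This is supported by Smith et al. (doi:10.1234/example.5678) "
    "and by https://example.org/study."
)
found = extract_citations(sample)
print(found)
```

Each extracted item still has to be opened and read; a link that resolves is not evidence that the source supports the claim.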

Topics

AI Chatbots
Health Information Accuracy
Medical Misinformation
Large Language Models
Reference Reliability

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Research Scientist, General Interest

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial intelligence (AI) – The Conversation.