Half of AI health answers are wrong even though they sound convincing – new study
Summary
A systematic health-information stress test conducted by seven researchers evaluated five popular chatbots: ChatGPT, Gemini, Grok, Meta AI, and DeepSeek. Published in BMJ Open, the study asked each chatbot 50 health and medical questions across topics such as cancer, vaccines, stem cells, nutrition, and athletic performance. Experts rated about half of the answers as problematic: 30% as somewhat problematic and nearly 20% as highly problematic. Grok performed worst, with 58% of its responses rated problematic, followed by ChatGPT at 52% and Meta AI at 50%. Chatbots handled vaccines and cancer best, while open-ended questions proved most challenging, with 32% of those answers rated highly problematic. Reference lists were unreliable: the median completeness score was just 40%, and no chatbot produced a single fully accurate list across 25 attempts, often fabricating papers or links.
Key takeaway
Healthcare professionals and AI product managers evaluating chatbot integration should recognize that current free-tier models are unreliable for direct medical advice. Teams should prioritize robust verification mechanisms and user education on chatbot limitations, especially for open-ended health queries. Do not position these tools as standalone medical authorities; treat them as supplementary aids that require human oversight and external validation to mitigate significant health risks.
Key insights
Chatbots frequently give problematic health information, especially in response to open-ended queries, and the references they cite are often incomplete or fabricated.
Principles
- Language models predict words, not facts.
- Training data quality impacts output reliability.
- Open-ended queries increase problematic responses.
Method
Researchers used a "red teaming" stress-testing technique, deliberately crafting prompts to elicit misleading answers from chatbots, then had two experts independently rate each response for accuracy and reliability.
In practice
- Verify chatbot health claims independently.
- Treat chatbot references as suggestions to check.
- Be wary of confident responses that lack disclaimers.
Topics
- AI Chatbots
- Health Information Accuracy
- Medical Misinformation
- Large Language Models
- Reference Reliability
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Research Scientist, General Interest
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial intelligence (AI) – The Conversation.