There are more AI health tools than ever—but how well do they work?

2026-03-30 · Source: MIT Technology Review · Field: Health & Wellbeing — Medical Devices & Health Technology, Healthcare Systems & Policy · Depth: Intermediate, medium

Summary

Microsoft recently launched Copilot Health, integrating medical records for user queries, while Amazon made its LLM-based Health AI widely available, joining OpenAI's ChatGPT Health and Anthropic's Claude in a growing trend of consumer-facing AI health tools. This surge is driven by both advancements in generative AI, enabling better health question responses, and significant user demand, with Microsoft reporting 50 million daily health questions on Copilot. While these tools could improve healthcare access and potentially aid in triage by helping users decide on medical attention, experts like those at Mount Sinai and Oxford Internet Institute emphasize the critical need for rigorous, independent evaluation to ensure safety and efficacy before widespread public release. Current company-led benchmarks, such as OpenAI's HealthBench, show progress but have limitations, and studies like Google's AMIE demonstrate the potential of medical LLMs in controlled settings, though Google is not rushing its public release.

Key takeaway

For AI Product Managers developing health-oriented LLMs, you should prioritize independent, third-party evaluation and robust human-centric testing before public release. While internal benchmarks like HealthBench are useful, external validation, potentially through frameworks like MedHELM or controlled human studies, is essential to build trust and mitigate risks associated with diagnosis or treatment advice, especially given the ease with which users might ignore disclaimers.

Key insights

The rapid release of AI health chatbots necessitates rigorous, independent evaluation to ensure safety and efficacy.

Principles

Demand for AI health tools is high due to healthcare access issues.
Independent evaluation is crucial for high-stakes AI applications.
User medical expertise impacts AI health tool effectiveness.

Method

Google's AMIE study involved patients discussing medical concerns with an LLM before seeing a physician, demonstrating a method for evaluating AI diagnostic accuracy and safety in a controlled, human-centric setting.

In practice

Use MedHELM framework for comprehensive LLM medical task evaluation.
Design benchmarks for multi-turn health conversations.
Prioritize independent, third-party AI health tool assessments.

Topics

AI Health Chatbots
Large Language Models
Independent Medical Evaluation
Healthcare Access
Medical Triage

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Product Manager, Research Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by MIT Technology Review.