There are more AI health tools than ever—but how well do they work?

· Source: MIT Technology Review · Field: Health & Wellbeing — Medical Devices & Health Technology, Healthcare Systems & Policy · Depth: Intermediate, medium

Summary

Microsoft recently launched Copilot Health, which draws on users' medical records to answer their questions, and Amazon made its LLM-based Health AI widely available. Both join OpenAI's ChatGPT Health and Anthropic's Claude in a growing field of consumer-facing AI health tools. The surge is driven by advances in generative AI, which have improved responses to health questions, and by strong user demand: Microsoft reports 50 million health questions a day on Copilot. These tools could improve access to healthcare and support triage by helping users decide whether to seek medical attention. But experts, including researchers at Mount Sinai and the Oxford Internet Institute, stress that rigorous, independent evaluation is needed to establish safety and efficacy before widespread public release. Company-led benchmarks such as OpenAI's HealthBench show progress but have limitations. Studies such as Google's work on AMIE demonstrate the potential of medical LLMs in controlled settings, though Google is not rushing a public release.

Key takeaway

AI product managers developing health-oriented LLMs should prioritize independent, third-party evaluation and robust human-centric testing before public release. Internal benchmarks like HealthBench are useful, but external validation, for example through frameworks like MedHELM or controlled human studies, is essential to build trust and to mitigate the risks of giving diagnosis or treatment advice, especially because users can easily ignore disclaimers.

Key insights

The rapid release of AI health chatbots necessitates rigorous, independent evaluation to ensure safety and efficacy.

Method

Google's AMIE study involved patients discussing medical concerns with an LLM before seeing a physician, demonstrating a method for evaluating AI diagnostic accuracy and safety in a controlled, human-centric setting.
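
One simple way to score diagnostic accuracy in a study of this shape is top-k accuracy: the share of encounters in which the physician's later diagnosis appears among the LLM's first k suggestions. The sketch below is illustrative only and is not Google's evaluation code. The `Encounter` type, its field names, the sample data, and the use of exact string matching are all assumptions; a real study would likely rely on clinician adjudication rather than string comparison, and would assess safety separately.

```python
from dataclasses import dataclass


@dataclass
class Encounter:
    """One hypothetical study encounter (field names are illustrative)."""
    llm_differential: list[str]  # diagnoses proposed by the LLM, best guess first
    physician_diagnosis: str     # reference diagnosis from the later physician visit


def _normalize(label: str) -> str:
    # Assumption: labels match after trimming whitespace and lowercasing.
    return label.strip().lower()


def top_k_accuracy(encounters: list[Encounter], k: int) -> float:
    """Fraction of encounters whose reference diagnosis is in the LLM's top k."""
    if not encounters:
        raise ValueError("no encounters to score")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = 0
    for encounter in encounters:
        top_k = {_normalize(d) for d in encounter.llm_differential[:k]}
        if _normalize(encounter.physician_diagnosis) in top_k:
            hits += 1
    return hits / len(encounters)


# Made-up data for illustration; not taken from any study.
SAMPLE = [
    Encounter(["migraine", "tension headache"], "Tension headache"),
    Encounter(["GERD", "angina"], "angina"),
    Encounter(["viral pharyngitis"], "strep throat"),
    Encounter(["asthma", "bronchitis"], "asthma"),
]
```

Calling `top_k_accuracy(SAMPLE, 1)` and `top_k_accuracy(SAMPLE, 2)` on the made-up sample shows how the score rises as more of the LLM's ranked list is counted, which is why reporting a single k can make a tool look better or worse than it is.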

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Product Manager, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by MIT Technology Review.