There are more AI health tools than ever—but how well do they work?
Summary
Microsoft recently launched Copilot Health, integrating medical records for user queries, while Amazon made its LLM-based Health AI widely available, joining OpenAI's ChatGPT Health and Anthropic's Claude in a growing trend of consumer-facing AI health tools. This surge is driven by both advancements in generative AI, enabling better health question responses, and significant user demand, with Microsoft reporting 50 million daily health questions on Copilot. While these tools could improve healthcare access and potentially aid in triage by helping users decide on medical attention, experts like those at Mount Sinai and Oxford Internet Institute emphasize the critical need for rigorous, independent evaluation to ensure safety and efficacy before widespread public release. Current company-led benchmarks, such as OpenAI's HealthBench, show progress but have limitations, and studies like Google's AMIE demonstrate the potential of medical LLMs in controlled settings, though Google is not rushing its public release.
Key takeaway
For AI Product Managers developing health-oriented LLMs, you should prioritize independent, third-party evaluation and robust human-centric testing before public release. While internal benchmarks like HealthBench are useful, external validation, potentially through frameworks like MedHELM or controlled human studies, is essential to build trust and mitigate risks associated with diagnosis or treatment advice, especially given the ease with which users might ignore disclaimers.
Key insights
The rapid release of AI health chatbots necessitates rigorous, independent evaluation to ensure safety and efficacy.
Principles
- Demand for AI health tools is high due to healthcare access issues.
- Independent evaluation is crucial for high-stakes AI applications.
- User medical expertise impacts AI health tool effectiveness.
Method
Google's AMIE study involved patients discussing medical concerns with an LLM before seeing a physician, demonstrating a method for evaluating AI diagnostic accuracy and safety in a controlled, human-centric setting.
In practice
- Use MedHELM framework for comprehensive LLM medical task evaluation.
- Design benchmarks for multi-turn health conversations.
- Prioritize independent, third-party AI health tool assessments.
Topics
- AI Health Chatbots
- Large Language Models
- Independent Medical Evaluation
- Healthcare Access
- Medical Triage
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Product Manager, Research Scientist
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by MIT Technology Review.