Developing and evaluating a chatbot to support maternal health care
Summary
A chatbot to support maternal health care in India has been developed through a collaboration among academic researchers, a health tech company, a public health nonprofit, and a hospital. The system addresses three challenges: user queries that are short, underspecified, and code-mixed; the need for regional context; and the difficulty of routing queries safely when only partial symptom context is available. It integrates stage-aware triage, which routes high-risk queries to expert templates, with hybrid retrieval over curated maternal and newborn guidelines and evidence-conditioned generation from a large language model (LLM). A core contribution is an evaluation workflow for high-stakes deployment under limited expert supervision. It comprises a labeled triage benchmark (N=150), on which the system reaches 86.7% emergency recall; a synthetic multi-evidence retrieval benchmark (N=100); an LLM-as-judge comparison on real queries (N=781); and expert validation.
Key takeaway
For AI architects designing health chatbots for low-resource, multilingual environments: prioritize a "defense-in-depth" system architecture that integrates stage-aware triage, hybrid retrieval, and evidence-conditioned generation. Pair it with a multi-method evaluation workflow that covers both component-level and end-to-end testing and explicitly measures trade-offs, such as missed emergencies versus over-escalation, to support safety and trustworthiness in deployment.
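The trade-off between missed emergencies and over-escalation can be measured from a labeled triage set. The sketch below is one way to compute it, assuming binary-style labels where `"emergency"` is the positive class; the function name and the toy labels are illustrative and are not the source's benchmark data.

```python
def triage_tradeoff(gold, pred, positive="emergency"):
    """Summarize missed emergencies versus over-escalation on labeled data."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)  # missed emergencies
    fp = sum(g != positive and p == positive for g, p in pairs)  # over-escalations
    tn = sum(g != positive and p != positive for g, p in pairs)
    return {
        "emergency_recall": tp / (tp + fn) if tp + fn else float("nan"),
        "missed_emergencies": fn,
        "over_escalation_rate": fp / (fp + tn) if fp + tn else float("nan"),
    }


# Toy labels for illustration only.
gold = ["emergency"] * 4 + ["routine"] * 6
pred = ["emergency", "emergency", "emergency", "routine",
        "emergency", "routine", "routine", "routine", "routine", "routine"]
report = triage_tradeoff(gold, pred)
```

On these toy labels, one of four emergencies is missed (recall 0.75) and one of six routine queries is escalated. Reporting both numbers side by side keeps a recall gain from hiding a jump in unnecessary escalations.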
Key insights
Trustworthy medical chatbots in noisy, multilingual settings require defense-in-depth design and multi-method evaluation.
Principles
- Combine triage, retrieval, and generation.
- Ground answers in regional context.
- Evaluate with component and end-to-end tests.
Method
The proposed evaluation workflow, intended for high-stakes deployments, combines four parts: a labeled triage benchmark, a synthetic multi-evidence retrieval benchmark, LLM-as-judge comparisons using criteria co-designed with clinicians, and expert validation.
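For the multi-evidence retrieval benchmark, a question may need several guideline passages at once, so a single hit is not enough. The source does not state which metrics it uses. The sketch below assumes two plausible ones: the fraction of gold passages found in the top k, and whether all of them were found. The passage IDs are made up.

```python
def evidence_recall_at_k(retrieved, gold, k):
    """Fraction of the gold evidence passages that appear in the top-k results."""
    gold = set(gold)
    if not gold:
        raise ValueError("gold evidence set must not be empty")
    return len(gold & set(retrieved[:k])) / len(gold)


def full_coverage_at_k(retrieved, gold, k):
    """True only if every gold evidence passage appears in the top-k results."""
    return set(gold) <= set(retrieved[:k])


# Hypothetical passage IDs for one benchmark question.
retrieved = ["d3", "d7", "d1", "d9"]
gold = {"d1", "d7"}
recall_at_2 = evidence_recall_at_k(retrieved, gold, k=2)
covered_at_3 = full_coverage_at_k(retrieved, gold, k=3)
```

Here only one of the two gold passages is in the top 2, while the top 3 covers both. Full coverage is the stricter check and is the one that matters when an answer depends on combining evidence.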
In practice
- Use stage-aware triage for high-risk queries.
- Implement hybrid retrieval for guidelines.
- Use LLM-as-judge comparisons on real user queries.
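Hybrid retrieval typically merges a lexical ranking with a dense (embedding-based) ranking. The source does not say how its system fuses the two. The sketch below assumes reciprocal rank fusion (RRF), a common choice, with the conventional constant `k=60`; the guideline IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical guideline passage IDs from two retrievers.
lexical = ["g2", "g1", "g4"]   # e.g. a keyword-based ranking
dense = ["g1", "g3", "g2"]     # e.g. an embedding-similarity ranking
fused = reciprocal_rank_fusion([lexical, dense])
```

RRF uses only rank positions, so the two retrievers' raw scores never need to be put on a common scale. Passages ranked well by both lists (`g1` and `g2` here) rise above those found by only one.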
Topics
- Maternal Health Chatbot
- Low-Resource Settings
- LLM Evaluation
- Hybrid Retrieval
- Triage Systems
Best for: AI Scientist, Research Scientist, AI Architect, AI Engineer, AI Researcher, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.