Databricks for Good and Virtue Foundation: Partnering to Connect Medical Volunteers to Critical Health Services in 72 Countries
Summary
Databricks for Good has partnered with Virtue Foundation since 2024 to improve global health delivery by connecting medical volunteers to critical services in 72 low- and lower-middle-income countries. Together they built a Databricks-based platform that aggregates data from thousands of healthcare facilities and non-profits. The core Foundational Data Refresh (FDR) pipeline ingests data from Overture Maps and Bright Data and uses OpenAI's GPT models to extract structured information from over 25 million web pages. The extraction is split into targeted steps to reduce token consumption and is orchestrated by Lakeflow Jobs, with Spark and Photon providing scalable distributed processing. Entity resolution is handled by Splink, a probabilistic record linkage framework, which saw a 15x performance improvement with Photon. A prototype VF Agent, built with LangGraph and Databricks services, supports natural language queries for data analysis.
Key takeaway
For MLOps engineers scaling LLM-powered data pipelines, this project demonstrates a robust architecture for handling messy, multi-terabyte web data. Adopt multi-step LLM inference for precision and cost efficiency, integrate probabilistic record linkage for entity resolution, and use distributed processing with checkpointing for production-grade reliability. Consider Databricks' unified platform for orchestrating complex, interdependent tasks; the 15x Splink speedup reported with Photon shows the kind of performance gain available.
Key insights
Production-grade LLM pipelines and unified data platforms can transform disparate web data into actionable global health insights.
Principles
- Decomposing LLM extraction tasks into narrow steps reduces token use and improves precision.
- Probabilistic record linkage effectively unifies messy entities across diverse data sources.
- Scalable distributed processing is essential for high-throughput LLM inference and data workloads.
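The first principle can be illustrated with a minimal sketch. It assumes a cheap classification step that routes each page to a narrow extractor and skips irrelevant pages entirely. The functions `classify_page`, `extract_facility`, and `extract_nonprofit` are hypothetical stand-ins for LLM calls (here replaced by keyword rules so the example runs offline); they are not taken from the original pipeline.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str


def classify_page(page: Page) -> str:
    """Stand-in for a cheap, narrow LLM call that only labels the page."""
    text = page.text.lower()
    if "hospital" in text or "clinic" in text:
        return "facility"
    if "foundation" in text or "ngo" in text:
        return "nonprofit"
    return "irrelevant"


def extract_facility(page: Page) -> dict:
    """Stand-in for a targeted extraction prompt used only on facility pages."""
    return {"url": page.url, "kind": "facility"}


def extract_nonprofit(page: Page) -> dict:
    """Stand-in for a targeted extraction prompt used only on non-profit pages."""
    return {"url": page.url, "kind": "nonprofit"}


# Route each label to its own narrow extractor (hypothetical mapping).
EXTRACTORS = {"facility": extract_facility, "nonprofit": extract_nonprofit}


def run_extraction(pages):
    """Classify first, then run the expensive extraction only where it applies."""
    results, skipped = [], 0
    for page in pages:
        extractor = EXTRACTORS.get(classify_page(page))
        if extractor is None:
            skipped += 1  # no extraction tokens spent on irrelevant pages
            continue
        results.append(extractor(page))
    return results, skipped


pages = [
    Page("a.example", "District Hospital with 40 beds"),
    Page("b.example", "A foundation supporting rural surgery"),
    Page("c.example", "Cookie policy and terms of use"),
]
results, skipped = run_extraction(pages)
```

In this arrangement the expensive prompts see only pages that passed the classifier, which is one plausible way narrow steps reduce token consumption.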
Method
The Foundational Data Refresh (FDR) pipeline ingests web data from Overture Maps and Bright Data, uses OpenAI GPT models for multi-step information extraction, and is orchestrated by Lakeflow Jobs on Databricks, with Spark and Photon handling distributed processing. Entity resolution uses Splink, and the prototype VF Agent, built with LangGraph, enables natural language querying.
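To make the entity resolution step more concrete, here is a rough sketch of the Fellegi-Sunter scoring idea that probabilistic record linkage tools such as Splink build on. It does not use the Splink API. The field names, the `m` and `u` probabilities, and the prior are all illustrative assumptions, not values from the project.

```python
import math

# Assumed parameters per compared field:
#   m = P(field agrees | records are the same entity)
#   u = P(field agrees | records are different entities)
PARAMS = {
    "name": {"m": 0.9, "u": 0.01},
    "city": {"m": 0.8, "u": 0.1},
    "phone": {"m": 0.7, "u": 0.001},
}
PRIOR = 0.001  # assumed probability that two random records match


def match_probability(agreements: dict, prior: float = PRIOR) -> float:
    """Combine per-field agreement evidence into a match probability."""
    # Start from the prior odds, expressed as a log2 match weight.
    weight = math.log2(prior / (1 - prior))
    for field, agrees in agreements.items():
        m, u = PARAMS[field]["m"], PARAMS[field]["u"]
        if agrees:
            weight += math.log2(m / u)  # agreement pushes towards a match
        else:
            weight += math.log2((1 - m) / (1 - u))  # disagreement pushes away
    odds = 2 ** weight
    return odds / (1 + odds)


all_agree = match_probability({"name": True, "city": True, "phone": True})
name_only = match_probability({"name": True, "city": False, "phone": False})
```

Fields with a very low `u`, such as a phone number, carry far more weight when they agree than common values like a city name, which is why two facility records can be linked despite inconsistent spelling elsewhere.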
In practice
- Break complex LLM extraction into classification and specific extraction steps.
- Implement status-based checkpointing to manage expensive LLM calls in pipelines.
- Utilize probabilistic record linkage for robust deduplication of real-world facility data.
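Status-based checkpointing from the list above might look like the following minimal sketch. It assumes each record carries a `status` field that is updated after every LLM call, so a rerun only touches records that are not yet done. The in-memory list stands in for a persisted table, and `fake_llm_extract` is a hypothetical placeholder for the real model call.

```python
PENDING, DONE, FAILED = "pending", "done", "failed"


def fake_llm_extract(text: str) -> dict:
    """Hypothetical stand-in for an expensive LLM call; fails on bad input."""
    if "boom" in text:
        raise RuntimeError("simulated LLM failure")
    return {"length": len(text)}


def process_batch(records, extract=fake_llm_extract) -> int:
    """Run extraction only for records not marked done; return calls made."""
    calls = 0
    for rec in records:
        if rec["status"] == DONE:
            continue  # checkpoint hit: skip work that was already paid for
        calls += 1
        try:
            rec["result"] = extract(rec["text"])
            rec["status"] = DONE
        except Exception as exc:
            # Record the failure so a later run can retry just this row.
            rec["status"] = FAILED
            rec["error"] = str(exc)
    return calls


records = [
    {"id": 1, "text": "clinic page", "status": PENDING},
    {"id": 2, "text": "boom", "status": PENDING},
    {"id": 3, "text": "hospital page", "status": PENDING},
]
first_run_calls = process_batch(records)
second_run_calls = process_batch(records)  # only the failed record is retried
```

In a production pipeline the status would be written to durable storage after each call or micro-batch, so an interrupted job resumes without repeating completed LLM requests.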
Topics
- Global Health Initiatives
- LLM Data Extraction
- Databricks Platform
- Entity Resolution
- Apache Spark
- Multi-Agent AI
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Databricks.