When his hobbies went on hiatus, this Kaggler made fighting COVID-19 with data his mission | A…

2020-07-29 · Source: Kaggle Blog - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, medium

Summary

David Mezzetti, founder of NeuML, contributed to the Kaggle CORD-19 challenge by developing a solution to help researchers navigate the extensive COVID-19 literature. His approach combined a sentence embeddings-based search index with a custom BERT QA model to extract specific answers and generate summary tables for predefined questions. Mezzetti's background in ETL, data engineering, and NLP, along with an existing codebase from a prior project (codequestion), informed his strategy. The solution involved an ETL process to load the CORD-19 dataset into a SQLite database, breaking text into sentences, and mapping them to embeddings using BM25 + fastText. A Random Forest classifier was also built to determine study design, enhancing the value of search results by prioritizing more robust research.

Key takeaway

For data scientists working with large, fast-changing text datasets, pairing a search index with a question-answering model can surface specific answers rather than raw documents. Before committing to particular machine learning models, learn the data's strengths and weaknesses through iterative exploration and collaboration with domain experts. This keeps the solution focused on user needs and on actionable insights.

Key insights

Combining sentence embeddings with a custom BERT QA model and study design classification enhances medical document search and information extraction.

Principles

Prioritize study design in medical literature search.
Iterative data exploration and expert feedback are crucial.
Data preparation and feature engineering consume significant time.

Method

An ETL process loads the CORD-19 text into a SQLite database and splits it into sentences. Each sentence is mapped to an embedding using BM25 + fastText to build the search index, and a custom BERT QA model then extracts specific answers.
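The loading step can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: the table names (articles, sections), the ARTICLES sample records, and the regex-based split_sentences helper are all assumptions made for the example, and a real ETL over CORD-19 would need a more careful sentence tokenizer.

```python
import re
import sqlite3

# Hypothetical stand-in for CORD-19 records; the real dataset has many more fields.
ARTICLES = [
    {"id": "a1", "title": "Incubation period study",
     "text": "The median incubation period was 5 days. "
             "Symptoms appeared within 12 days in most cases."},
    {"id": "a2", "title": "Mask usage review",
     "text": "Masks reduced transmission in household settings. "
             "More trials are needed."},
]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def load(articles, path=":memory:"):
    """Load articles and their sentences into a SQLite database."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT)")
    db.execute(
        "CREATE TABLE sections (id INTEGER PRIMARY KEY, article TEXT, text TEXT)"
    )
    for article in articles:
        db.execute(
            "INSERT INTO articles VALUES (?, ?)", (article["id"], article["title"])
        )
        # One row per sentence, linked back to its source article.
        for sentence in split_sentences(article["text"]):
            db.execute(
                "INSERT INTO sections (article, text) VALUES (?, ?)",
                (article["id"], sentence),
            )
    db.commit()
    return db


db = load(ARTICLES)
sentence_count = db.execute("SELECT COUNT(*) FROM sections").fetchone()[0]
```

Storing one sentence per row keeps each unit small enough to embed and search on its own while preserving the link to the article it came from.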

In practice

Use sentence embeddings for document search.
Implement BERT QA models for precise answer extraction.
Classify study design to filter research quality.
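The first item above can be illustrated with a small, self-contained sketch of one plausible reading of "BM25 + fastText": each sentence embedding is a BM25-weighted average of its word vectors, and search ranks sentences by cosine similarity to the query embedding. The VECTORS table holds made-up 3-dimensional values standing in for real fastText vectors, and BM25Weights, embed, and search are hypothetical names for this example only.

```python
import math
from collections import Counter

# Toy word vectors standing in for fastText output (hypothetical values).
VECTORS = {
    "incubation": [0.9, 0.1, 0.0],
    "period": [0.8, 0.2, 0.0],
    "days": [0.7, 0.1, 0.1],
    "masks": [0.0, 0.9, 0.1],
    "transmission": [0.1, 0.8, 0.2],
    "reduced": [0.1, 0.6, 0.3],
    "vaccine": [0.0, 0.1, 0.9],
    "trials": [0.1, 0.2, 0.8],
}
DIM = len(next(iter(VECTORS.values())))

SENTENCES = [
    "the median incubation period was five days",
    "masks reduced transmission in household settings",
    "vaccine trials are ongoing",
]


def tokenize(text):
    return text.lower().split()


class BM25Weights:
    """Per-term BM25 weights computed against a fixed sentence collection."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.n = len(docs)
        self.avgdl = sum(len(d) for d in docs) / self.n
        self.df = Counter(t for d in docs for t in set(d))

    def idf(self, term):
        n = self.df.get(term, 0)
        return math.log(1 + (self.n - n + 0.5) / (n + 0.5))

    def weights(self, tokens):
        tf = Counter(tokens)
        norm = self.k1 * (1 - self.b + self.b * len(tokens) / self.avgdl)
        return {
            t: self.idf(t) * f * (self.k1 + 1) / (f + norm) for t, f in tf.items()
        }


def embed(tokens, scorer):
    """BM25-weighted average of the word vectors; unknown words are skipped."""
    total, vec = 0.0, [0.0] * DIM
    for term, weight in scorer.weights(tokens).items():
        if term in VECTORS:
            vec = [v + weight * x for v, x in zip(vec, VECTORS[term])]
            total += weight
    return [v / total for v in vec] if total else vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query, index, scorer):
    """Return (score, sentence) pairs, best match first."""
    q = embed(tokenize(query), scorer)
    return sorted(((cosine(q, e), s) for s, e in index), reverse=True)


docs = [tokenize(s) for s in SENTENCES]
scorer = BM25Weights(docs)
index = [(s, embed(d, scorer)) for s, d in zip(SENTENCES, docs)]
results = search("incubation days", index, scorer)
best = results[0][1]
```

Weighting by BM25 lets rare, informative terms dominate the sentence vector instead of common words. In a fuller system, the top-ranked sentences would then be passed to a QA model and ordered with study design in mind, which this sketch does not attempt.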

Topics

Kaggle CORD-19 Challenge
Sentence Embeddings
BERT QA Model
Natural Language Processing
Data Engineering

Best for: Data Scientist, AI Data Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Kaggle Blog - Medium.