Tools to Improve Training Data

2022-11-23 · Source: Explosion (Explosion.ai) · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, extended

Summary

The "Talking Language AI" series features Vincent Warmerdam, a Machine Learning Engineer at Explosion (creators of spaCy and Prodigy), discussing tools to improve training data quality. He highlights concrete data quality failures: a 212-story house that appeared in Microsoft Flight Simulator because of an OpenStreetMap error, missing annotations in the Udacity self-driving car dataset, duplicate images in CIFAR-10, and mislabeled examples in Google's GoEmotions dataset. Warmerdam argues that traditional metrics like accuracy can be misleading when the underlying data is poor. He introduces four human-in-the-loop tools: `human-learn` for rule-based modeling, `doubtlab` for identifying bad labels, `embetter` for scikit-learn-compatible embeddings, and `bulk` for visual topic discovery and subset creation from unlabeled data using UMAP. These tools aim to make iterating on data quality more accessible and effective.

Key takeaway

For Data Scientists and ML Engineers building or maintaining models, data quality deserves attention before it causes problems. Accuracy metrics can look high while underlying data issues still produce unreliable predictions in production. Use tools like `doubtlab` to systematically identify and prioritize mislabeled data for re-annotation, and use `bulk` with `embetter` to efficiently create high-quality labeled subsets from unlabeled data, especially at project inception. Iterating on the data in this way leads to more robust models and a deeper understanding of your dataset.

Key insights

Data quality issues can render high model accuracy meaningless, necessitating human-in-the-loop tools for robust ML.

Principles

Metrics alone are insufficient for data quality.
Combine rule-based and ML systems to find discrepancies.
Iterate on data, not just models.

Method

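`doubtlab` identifies likely bad labels by flagging examples for several reasons: a model is uncertain, a model is confident in a class other than the given label, or two models disagree. It then prioritizes the examples flagged by the most reasons. `bulk` uses embeddings and UMAP to plot unlabeled data in two dimensions for visual topic discovery and subset selection.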
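The flag-then-prioritize idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not `doubtlab`'s own API: the function names, thresholds, and the toy probabilities below are all hypothetical, and each model's output is assumed to be a dict mapping class name to probability.

```python
from collections import Counter

def low_confidence(probas, threshold=0.6):
    """Flag rows where the top class probability falls below the threshold."""
    return {i for i, p in enumerate(probas) if max(p.values()) < threshold}

def confident_wrong(probas, labels, threshold=0.8):
    """Flag rows where the model is confident in a class other than the label."""
    flagged = set()
    for i, (p, y) in enumerate(zip(probas, labels)):
        pred = max(p, key=p.get)
        if pred != y and p[pred] >= threshold:
            flagged.add(i)
    return flagged

def disagreement(probas_a, probas_b):
    """Flag rows where two models predict different classes."""
    return {
        i
        for i, (a, b) in enumerate(zip(probas_a, probas_b))
        if max(a, key=a.get) != max(b, key=b.get)
    }

def rank_by_overlap(*flag_sets):
    """Order flagged rows by how many reasons flagged them, most first."""
    counts = Counter(i for flags in flag_sets for i in flags)
    return sorted(counts, key=lambda i: (-counts[i], i))

# Toy data (hypothetical): given labels and two models' class probabilities.
labels = ["pos", "neg", "pos", "neg", "pos"]
model_a = [
    {"pos": 0.95, "neg": 0.05},
    {"pos": 0.90, "neg": 0.10},  # confident, but the label says "neg"
    {"pos": 0.55, "neg": 0.45},  # uncertain
    {"pos": 0.15, "neg": 0.85},
    {"pos": 0.58, "neg": 0.42},  # uncertain
]
model_b = [
    {"pos": 0.90, "neg": 0.10},
    {"pos": 0.30, "neg": 0.70},  # disagrees with model_a
    {"pos": 0.40, "neg": 0.60},  # disagrees with model_a
    {"pos": 0.20, "neg": 0.80},
    {"pos": 0.70, "neg": 0.30},
]

priority = rank_by_overlap(
    low_confidence(model_a),
    confident_wrong(model_a, labels),
    disagreement(model_a, model_b),
)
print(priority)  # rows to review first, most-flagged at the front
```

Rows 1 and 2 are each flagged by two reasons and row 4 by one, so they come out in that order. In a real project the probabilities would come from fitted models rather than hand-written dicts.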

In practice

Use `human-learn` to establish strong rule-based baselines.
Apply `doubtlab` to prioritize data for re-annotation.
Use `embetter` for quick few-shot classification.
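As a rough illustration of the first item, a rule-based baseline can be as small as a plain function wrapped in a fit/predict interface. The sketch below uses only the standard library; `RuleClassifier`, `keyword_rule`, the keyword list, and the toy texts are hypothetical and do not reproduce `human-learn`'s actual classes.

```python
class RuleClassifier:
    """Hypothetical wrapper that gives a plain function fit/predict methods."""

    def __init__(self, func):
        self.func = func

    def fit(self, X, y=None):
        # Nothing is learned: the rule is written by a human.
        return self

    def predict(self, X):
        return [self.func(x) for x in X]

def keyword_rule(text):
    """Label a text 'neg' if it contains any hand-picked negative keyword."""
    negative_words = {"refund", "broken", "terrible"}
    words = set(text.lower().split())
    return "neg" if words & negative_words else "pos"

# Toy data (hypothetical); the last label is deliberately suspicious.
texts = [
    "Great product, works well",
    "Arrived broken and I want a refund",
    "Terrible support experience",
    "Love it",
]
labels = ["pos", "neg", "neg", "neg"]

baseline = RuleClassifier(keyword_rule).fit(texts, labels)
preds = baseline.predict(texts)

# Baseline accuracy, plus the rows where the rule and the label disagree.
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
mismatches = [i for i, (p, y) in enumerate(zip(preds, labels)) if p != y]
print(accuracy, mismatches)
```

A baseline like this gives a learned model a number to beat, and the rows where the rule and the label disagree are natural candidates for a second look.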

Topics

Data Quality
Machine Learning Engineering
Human-in-the-Loop AI
Scikit-learn
Embeddings
Data Annotation
UMAP

Best for: Machine Learning Engineer, Data Scientist, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).