Tools to Improve Training Data

Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Intermediate, extended

Summary

In this installment of the "Talking Language AI" series, Vincent Warmerdam, a Machine Learning Engineer at Explosion (the creators of spaCy and Prodigy), discusses tools for improving training data quality. He opens with concrete data quality failures: a 212-story house that appeared in Flight Simulator because of an OpenStreetMap error, missing annotations in the Udacity self-driving car dataset, duplicate images in CIFAR-10, and mislabeled examples in Google's GoEmotions dataset. Warmerdam argues that standard metrics such as accuracy can mislead when the underlying data is poor, since the score is computed against the same flawed labels. He then introduces four human-in-the-loop tools: `human-learn` for rule-based modeling, `doubtlab` for finding bad labels, `embetter` for scikit-learn compatible embeddings, and `bulk` for visual topic discovery and subset creation from unlabeled data using UMAP. Together, these tools aim to make iterating on data quality more accessible and effective.

Key takeaway

For Data Scientists and ML Engineers who build or maintain models, addressing data quality proactively is crucial. Accuracy metrics can look high while underlying data issues still cause unreliable predictions in production. Use a tool like `doubtlab` to systematically find mislabeled examples and prioritize them for re-annotation, and use `bulk` together with `embetter` to create high-quality labeled subsets from unlabeled data, especially at the start of a project. Iterating this way leads to more robust models and a deeper understanding of your dataset.
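To make the subset-creation idea concrete, here is a minimal sketch of selecting a region in a 2D projection and labeling everything inside it at once. It is not the `bulk` API: `bulk` does this interactively in a browser. The coordinates below are hard-coded stand-ins for what an embedding plus UMAP projection might produce, and `Point`, `select_box`, and the example texts are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Point:
    """One unlabeled text with its (assumed) 2D projection coordinates."""
    text: str
    x: float
    y: float


def select_box(points, x_min, x_max, y_min, y_max):
    """Return the points that fall inside an axis-aligned box.

    A rectangular selection stands in for the interactive lasso that a
    visual tool would offer.
    """
    return [
        p for p in points
        if x_min <= p.x <= x_max and y_min <= p.y <= y_max
    ]


# Hypothetical data: coordinates are made up, not computed by UMAP.
points = [
    Point("refund not received", 0.9, 1.1),
    Point("where is my refund", 1.0, 0.8),
    Point("app crashes on login", 4.2, 3.9),
    Point("money back please", 1.2, 1.0),
    Point("cannot sign in", 4.0, 4.1),
]

# Select the cluster near (1, 1) and give every member the same label.
selected = select_box(points, 0.5, 1.5, 0.5, 1.5)
labeled = [(p.text, "refund") for p in selected]

for text, label in labeled:
    print(f"{label}: {text}")
```

The point of the sketch is that texts with similar meaning tend to land close together in the projection, so one selection can yield many candidate labels that a human then reviews.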

Key insights

Data quality issues can render high model accuracy meaningless, necessitating human-in-the-loop tools for robust ML.

Principles

Method

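`doubtlab` finds likely bad labels by flagging examples for several reasons: the model is uncertain, the model is confident in a class other than the given label, or two models disagree. It then prioritizes examples by overlap, so those flagged by the most reasons are reviewed first. `bulk` combines embeddings with UMAP to project unlabeled data into 2D for visual topic discovery and subset selection.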
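The flag-and-rank idea can be sketched in plain Python. This is an illustration of the approach described above, not `doubtlab`'s actual API: the function names, thresholds, and probability values are all assumptions chosen for the example, and the two probability lists stand in for the outputs of two already-trained classifiers.

```python
def uncertain(probas, threshold=0.6):
    """Flag rows whose highest class probability is below `threshold`."""
    return [max(p) < threshold for p in probas]


def confidently_wrong(probas, labels, threshold=0.8):
    """Flag rows where the model puts at least `threshold` on a class
    that differs from the given label."""
    flags = []
    for p, y in zip(probas, labels):
        best = max(range(len(p)), key=p.__getitem__)
        flags.append(best != y and p[best] >= threshold)
    return flags


def disagree(probas_a, probas_b):
    """Flag rows where two models predict different classes."""
    def argmax(p):
        return max(range(len(p)), key=p.__getitem__)
    return [argmax(a) != argmax(b) for a, b in zip(probas_a, probas_b)]


def rank_by_overlap(*reason_flags):
    """Return indices flagged by at least one reason, most-flagged first.

    Ties keep their original order because Python's sort is stable.
    """
    counts = [sum(flags) for flags in zip(*reason_flags)]
    flagged = [i for i, c in enumerate(counts) if c > 0]
    return sorted(flagged, key=lambda i: -counts[i])


# Hypothetical labels and predicted probabilities for five examples.
labels = [0, 1, 0, 1, 1]
model_a = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90], [0.30, 0.70], [0.58, 0.42]]
model_b = [[0.90, 0.10], [0.40, 0.60], [0.20, 0.80], [0.35, 0.65], [0.60, 0.40]]

ranked = rank_by_overlap(
    uncertain(model_a),
    confidently_wrong(model_a, labels),
    disagree(model_a, model_b),
)
print(ranked)  # [1, 2, 4]
```

Example 1 comes first because two reasons flag it (low confidence and model disagreement); examples 2 and 4 are each flagged once. The ranked indices are a review queue for a human annotator, not a list of confirmed errors.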

Best for: Machine Learning Engineer, Data Scientist, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.