Tools to Improve Training Data
Summary
In this episode of the "Talking Language AI" series, Vincent Warmerdam, a Machine Learning Engineer at Explosion (creators of spaCy and Prodigy), discusses tools for improving training data quality. He opens with concrete data quality failures: a 212-story house that appeared in Flight Simulator because of an OpenStreetMap error, missing annotations in the Udacity self-driving car dataset, duplicate images in CIFAR-10, and mislabeled examples in Google's GoEmotions dataset. Warmerdam argues that traditional metrics such as accuracy can be misleading when the underlying data is poor. He then introduces four human-in-the-loop tools: `human-learn` for rule-based modeling, `doubtlab` for identifying bad labels, `embetter` for scikit-learn-compatible embeddings, and `bulk` for visual topic discovery and subset creation from unlabeled data using UMAP. Together, these tools aim to make iterating on data quality more accessible and effective.
Key takeaway
For data scientists and ML engineers who build or maintain models, addressing data quality proactively is crucial. Accuracy metrics can look high while underlying data issues still cause unreliable predictions in production. Use `doubtlab` to systematically identify mislabeled examples and prioritize them for re-annotation, and use `bulk` with `embetter` to create high-quality labeled subsets from unlabeled data efficiently, especially at the start of a project. Iterating on the data in this way leads to more robust models and a deeper understanding of the dataset.
Key insights
Data quality issues can render high model accuracy meaningless, necessitating human-in-the-loop tools for robust ML.
Principles
- Metrics alone are insufficient for data quality.
- Combine rule-based and ML systems to find discrepancies.
- Iterate on data, not just models.
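The second principle can be illustrated with a minimal sketch in plain Python. It does not use the `human-learn` API; the texts, the keyword rule, and the `model_preds` values are hypothetical stand-ins chosen only to show the idea of comparing a hand-written rule against a model's output.

```python
# Hypothetical toy data; 1 = positive sentiment, 0 = negative sentiment.
texts = [
    "great product, love it",
    "terrible, would not buy again",
    "love the design",
    "awful support",
]

# Assumed outputs of some already-trained classifier (made up for illustration).
model_preds = [1, 0, 0, 0]


def rule_predict(text):
    """Hand-written keyword rule standing in for domain knowledge."""
    positive_words = {"great", "love"}
    tokens = text.replace(",", " ").split()
    return 1 if any(tok in positive_words for tok in tokens) else 0


# Apply the rule to every text, then collect the positions where it
# disagrees with the model. These are the examples worth a human look.
rule_preds = [rule_predict(t) for t in texts]
disagreements = [
    i for i, (rule, model) in enumerate(zip(rule_preds, model_preds))
    if rule != model
]

for i in disagreements:
    print(i, texts[i], "rule:", rule_preds[i], "model:", model_preds[i])
```

A disagreement does not say which side is wrong. It only narrows the set of examples a person should inspect, which is the point of combining the two systems.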
Method
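`doubtlab` identifies potentially bad labels by flagging examples where a model is uncertain, where it confidently predicts a class other than the given label, or where models disagree with each other. Examples flagged for several reasons at once are prioritized for review. `bulk` projects embeddings with UMAP so that topics can be discovered visually and subsets of the data selected.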
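The prioritization idea behind `doubtlab` can be sketched in plain Python. This is an illustration under stated assumptions, not the library's API: the function name `doubt_ranking`, the thresholds `low` and `high`, and the probability values are all hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of class probabilities."""
    return max(range(len(row)), key=row.__getitem__)


def doubt_ranking(probas_a, probas_b, labels, low=0.6, high=0.8):
    """Rank example indices by how many doubt reasons flag them.

    probas_a, probas_b: per-example class probabilities from two models.
    labels: the labels currently stored in the dataset.
    low, high: assumed thresholds for "uncertain" and "confident".
    """
    scored = []
    for i, (pa, pb, y) in enumerate(zip(probas_a, probas_b, labels)):
        pred_a, pred_b = argmax(pa), argmax(pb)
        reasons = {
            # Model A has no strong opinion about this example.
            "uncertain": max(pa) < low,
            # Model A is confident, but in a class other than the label.
            "confident_wrong": pred_a != y and max(pa) >= high,
            # The two models predict different classes.
            "disagree": pred_a != pred_b,
        }
        hits = [name for name, flagged in reasons.items() if flagged]
        if hits:
            scored.append((len(hits), i, hits))
    # Most overlapping reasons first; ties keep the original order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(i, hits) for _, i, hits in scored]


# Hypothetical probabilities for four examples and two classes.
probas_a = [[0.9, 0.1], [0.55, 0.45], [0.1, 0.9], [0.52, 0.48]]
probas_b = [[0.8, 0.2], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
labels = [0, 0, 0, 1]

ranking = doubt_ranking(probas_a, probas_b, labels)
for index, hits in ranking:
    print(index, hits)
```

Examples that no reason flags are left out of the ranking entirely, so annotation time goes to the rows where several independent signals overlap.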
In practice
- Use `human-learn` to establish strong rule-based baselines.
- Apply `doubtlab` to prioritize data for re-annotation.
- Use `embetter` for quick few-shot classification.
Topics
- Data Quality
- Machine Learning Engineering
- Human-in-the-Loop AI
- Scikit-learn
- Embeddings
- Data Annotation
- UMAP
Best for: Machine Learning Engineer, Data Scientist, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).