Finding Bad Image Data using UMAP and Prodigy
Summary
This article describes a pragmatic approach to finding and filtering "bad examples" in large training datasets, using the Google Quick Draw dataset (50 million drawings across 345 categories) as a case study. The method combines UMAP for dimensionality reduction with human-in-the-loop annotation in Prodigy, a data labeling tool developed by Explosion. Images are encoded as numeric vectors and reduced to two dimensions with UMAP; the resulting scatter plots reveal distinct "side clusters" that are likely to contain misdrawn or irrelevant examples, such as drawings in the Eiffel Tower category that do not show the Eiffel Tower. These suspicious examples are then fed into a customized Prodigy setup that supports two labeling interfaces: one image at a time, or a captcha-like grid. The grid proved much faster (200 images in 1 minute versus 5 minutes), although it does not let the annotator skip individual images. The workflow puts data quality first and offers a repeatable pattern for exploring datasets.
Key takeaway
If you are a data scientist or MLOps engineer building models on user-generated or crowdsourced data, build data quality checks into your workflow proactively. This UMAP-Prodigy workflow offers a pragmatic pattern for finding and filtering problematic training examples before they degrade a model. Consider customizing the labeling interface and adding duplicate prevention to improve annotation speed and accuracy, so that your models are trained on reliable data.
Key insights
UMAP-driven dimensionality reduction combined with human-in-the-loop annotation effectively identifies bad examples in large image datasets.
Principles
- Data quality is paramount for ML training.
- Human-in-the-loop improves data verification.
- Dimensionality reduction reveals data anomalies.
Method
Encode images as numeric vectors, apply UMAP for 2D dimensionality reduction, plot scatter charts to identify anomalous clusters, then use a custom Prodigy setup for targeted manual labeling and duplicate prevention.
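The steps above can be sketched roughly as follows. This is a minimal illustration, not the original article's code: it assumes the images are available as a NumPy array of grayscale bitmaps, uses flattened pixel values as the "numeric vectors" (the article may have used a different encoding), and relies on the `umap-learn` package. The helper names `encode_images`, `embed_2d`, and `select_region` are hypothetical.

```python
import numpy as np


def encode_images(images):
    """Flatten each HxW grayscale image into a 1D float vector scaled to [0, 1].

    Assumption: raw pixels serve as the encoding; a learned embedding
    could be swapped in here instead.
    """
    arr = np.asarray(images, dtype=np.float32)
    return arr.reshape(len(arr), -1) / 255.0


def embed_2d(vectors, random_state=42):
    """Reduce the vectors to 2D with UMAP (requires the umap-learn package)."""
    import umap  # imported lazily so the other helpers work without it

    reducer = umap.UMAP(n_components=2, random_state=random_state)
    return reducer.fit_transform(vectors)


def select_region(embedding, x_range, y_range):
    """Return indices of points inside a rectangle chosen by eye on the scatter plot."""
    x, y = embedding[:, 0], embedding[:, 1]
    mask = (
        (x >= x_range[0]) & (x <= x_range[1])
        & (y >= y_range[0]) & (y <= y_range[1])
    )
    return np.flatnonzero(mask)
```

In use, one would plot the output of `embed_2d(encode_images(images))` as a scatter chart, pick the bounds of a suspicious side cluster by eye, and pass the indices from `select_region` on to the labeling step.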
In practice
- Use UMAP to find clusters of dissimilar images.
- Customize Prodigy recipes for specific labeling tasks.
- Implement duplicate prevention in labeling workflows.
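The duplicate-prevention and grid-labeling ideas can be sketched in plain Python. This is an illustrative stand-in rather than the article's Prodigy recipe: it assumes each task is a dictionary with hypothetical `id` and `image` keys, uses a content hash as the duplicate key (Prodigy has its own hashing mechanism, which is not shown here), and groups tasks into fixed-size batches that a grid-style interface could display. The function names are hypothetical.

```python
import hashlib
import json


def task_hash(task):
    """Stable hash of a task's content, used as the duplicate key."""
    payload = json.dumps(task, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def dedupe(stream, seen=None):
    """Yield only tasks whose hash has not been seen before.

    Pass in a `seen` set built from already-annotated examples to
    avoid showing them again in a later session.
    """
    seen = set() if seen is None else seen
    for task in stream:
        key = task_hash(task)
        if key not in seen:
            seen.add(key)
            yield task


def to_grids(tasks, size=16):
    """Group single-image tasks into grid tasks of up to `size` options each."""
    batch = []
    for task in tasks:
        batch.append({"id": task["id"], "image": task["image"]})
        if len(batch) == size:
            yield {"options": batch}
            batch = []
    if batch:  # emit the final, possibly smaller, grid
        yield {"options": batch}
```

Because both functions are generators, they can be chained over a large stream of candidates without loading everything into memory, for example `to_grids(dedupe(tasks), size=16)`.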
Topics
- Data Quality
- Training Data
- UMAP
- Dimensionality Reduction
- Prodigy
- Data Annotation
- Machine Learning Engineering
Best for: Machine Learning Engineer, Data Scientist, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).