Finding Bad Image Data using UMAP and Prodigy

2022-04-27 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, extended

Summary

The article describes a pragmatic approach to finding and filtering "bad examples" in large training datasets, using the Google Quick Draw dataset (50 million drawings across 345 categories) as a case study. The method combines UMAP for dimensionality reduction with human-in-the-loop annotation in Prodigy, a data labeling tool developed by Explosion. Images are encoded as numeric vectors and reduced with UMAP to a 2D scatter plot, where distinct "side clusters" tend to contain misdrawn or irrelevant examples, such as drawings in the Eiffel Tower category that do not show the Eiffel Tower. These suspicious examples are then fed into a customized Prodigy setup that supports both a single-image interface and a captcha-like grid interface. The grid proved much faster (200 images in 1 minute versus 5 minutes one image at a time), although it does not allow skipping individual images. The workflow emphasizes data quality and offers a repeatable pattern for exploring datasets.

Key takeaway

Data scientists and MLOps engineers who build models on user-generated or crowdsourced data should implement data quality checks proactively. The UMAP-Prodigy workflow is a pragmatic pattern for finding and filtering problematic training examples before they degrade a model. Customizing the labeling interface and adding duplicate prevention can improve annotation speed and accuracy, so that models are trained on reliable data.

Key insights

UMAP-driven dimensionality reduction combined with human-in-the-loop annotation effectively identifies bad examples in large image datasets.

Principles

Data quality is paramount for ML training.
Human-in-the-loop improves data verification.
Dimensionality reduction reveals data anomalies.

Method

Encode images as numeric vectors, apply UMAP for 2D dimensionality reduction, plot scatter charts to identify anomalous clusters, then use a custom Prodigy setup for targeted manual labeling and duplicate prevention.
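
The steps above can be sketched roughly as follows. This is a minimal illustration, not the article's code: embed_2d assumes the umap-learn package is installed and that each image has already been encoded as a numeric vector, and flag_far_points is a hypothetical stand-in for picking out side clusters by eye on the scatter plot. It simply flags points that lie unusually far from the median of the 2D embedding, and the factor threshold is an arbitrary assumption.

```python
import math
import statistics


def embed_2d(vectors, random_state=42):
    """Reduce image vectors to 2D coordinates with UMAP.

    Assumes the umap-learn package is installed; it is imported lazily
    so that the helper below can be used without it.
    """
    import umap

    reducer = umap.UMAP(n_components=2, random_state=random_state)
    return reducer.fit_transform(vectors).tolist()


def flag_far_points(points, factor=3.0):
    """Return indices of 2D points far from the bulk of the embedding.

    Hypothetical heuristic: a point is flagged when its distance to the
    median centre exceeds `factor` times the median distance.
    """
    cx = statistics.median(p[0] for p in points)
    cy = statistics.median(p[1] for p in points)
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    cutoff = factor * statistics.median(dists)
    return [i for i, d in enumerate(dists) if d > cutoff]
```

In use, the indices returned by flag_far_points(embed_2d(vectors)) would point at the images worth sending to manual review. A real side cluster can sit close to the main one, so inspecting the plot remains the more reliable way to choose candidates.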

In practice

Use UMAP to find clusters of dissimilar images.
Customize Prodigy recipes for specific labeling tasks.
Implement duplicate prevention in labeling workflows.
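
Duplicate prevention can be illustrated with a generic sketch that uses only the Python standard library. It does not use Prodigy's own hashing or database API. The names example_hash and filter_unseen are hypothetical, as is the assumption that each example is a JSON-serializable dict such as {"image": "a.png"}.

```python
import hashlib
import json


def example_hash(example):
    """Stable hash of an example dict (keys sorted so order does not matter)."""
    payload = json.dumps(example, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def filter_unseen(stream, seen_hashes):
    """Yield only examples whose hash is not yet in seen_hashes.

    seen_hashes is updated in place, so repeats later in the same
    stream are skipped as well.
    """
    for example in stream:
        h = example_hash(example)
        if h in seen_hashes:
            continue
        seen_hashes.add(h)
        yield example
```

Seeding seen_hashes with the hashes of examples that are already annotated keeps an image from being shown to the annotator twice across sessions.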

Topics

Data Quality
Training Data
UMAP
Dimensionality Reduction
Prodigy
Data Annotation
Machine Learning Engineering

Best for: Machine Learning Engineer, Data Scientist, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.