Finetuning and Bulk Labelling Images with Prodigy

2022-10-27 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, extended

Summary

This article shows how to bulk label images with Prodigy by first generating image embeddings that group similar images together. It starts with color histograms as embeddings, which work well on emoji images but fall short on more complex datasets such as pet photos. It then turns to embeddings from pre-trained Convolutional Neural Networks (CNNs), comparing models such as Xception and MobileNet and noting that Xception produces more distinct clusters. The core technique is to fine-tune a pre-trained CNN by adding a new task head and training it on a small set of Prodigy-annotated examples, such as 50 cat and dog images. The fine-tuned embeddings form much clearer clusters in the bulk labeling interface, which makes it quick to select a specific category, or a group of ambiguous instances, for further annotation.
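
The color-histogram baseline mentioned in the summary can be sketched in a few lines. This is a minimal illustration, not the code from the original article: the function name `color_histogram`, the choice of 4 bins per channel, and the toy pixel lists are all assumptions made here for demonstration.

```python
def color_histogram(pixels, bins=4):
    """Embed an image as a normalized per-channel color histogram.

    `pixels` is an iterable of (r, g, b) tuples with values in 0..255.
    Returns a list of 3 * bins floats; each channel's slice sums to 1.
    """
    pixels = list(pixels)
    if not pixels:
        raise ValueError("image has no pixels")
    width = 256 / bins  # size of one bin on the 0..255 scale
    hist = [0.0] * (3 * bins)
    for rgb in pixels:
        for channel, value in enumerate(rgb):
            hist[channel * bins + int(value // width)] += 1
    return [count / len(pixels) for count in hist]


# Two toy "images" as flat pixel lists (hypothetical data).
mostly_yellow = [(250, 220, 10)] * 90 + [(0, 0, 0)] * 10
mostly_red = [(240, 20, 20)] * 100

yellow_vec = color_histogram(mostly_yellow)
red_vec = color_histogram(mostly_red)
```

With real files, the pixel tuples could come from an image library (for example Pillow's `Image.open(path).convert("RGB").getdata()`). Because the vector only counts colors and ignores shape, it can separate flat, colorful emoji but has little to say about the difference between a cat and a dog.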

Key takeaway

For Machine Learning Engineers aiming to efficiently label large image datasets, implementing a fine-tuning step with a pre-trained CNN and a small set of Prodigy annotations is crucial. This approach dramatically improves embedding quality, enabling clearer clustering and faster selection of relevant or ambiguous examples in your bulk labeling workflow. Prioritize researching robust pre-trained models and integrate this human-in-the-loop technique early to accelerate your data annotation process and build better initial models.

Key insights

Fine-tuning pre-trained models with minimal annotations significantly improves image embeddings for more effective bulk labeling.

Principles

Fine-tuning a pre-trained model with a new task head can steer embeddings towards specific interests.
The choice of pre-trained model (size, training domain) significantly impacts embedding quality.
Bulk labeling serves as a pre-processing step, not a source of perfect labels.

Method

Attach a new task head to a frozen pre-trained CNN, train the head with a small set of Prodigy-annotated examples, then use the intermediate layer's output as improved embeddings for UMAP visualization and bulk labeling.
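
The method above can be sketched without a deep learning framework. The example below is an assumption-laden toy, not the article's implementation: the "frozen" vectors are synthetic stand-ins for the output of a pre-trained CNN, the head is a small hand-written network with one tanh hidden layer, and the names `forward` and `train_head` are hypothetical. It only aims to show the shape of the idea: the backbone output is never updated, only the head is trained, and the head's hidden activations become the new embeddings.

```python
import math
import random


def forward(W, v, b, x):
    """Return (hidden activations, predicted probability) for one input."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    logit = sum(vj * hj for vj, hj in zip(v, hidden)) + b
    return hidden, 1.0 / (1.0 + math.exp(-logit))


def train_head(features, labels, hidden_dim=2, lr=0.5, epochs=200, seed=0):
    """Train only the new head with SGD on log loss; inputs stay frozen."""
    rng = random.Random(seed)
    dim = len(features[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(hidden_dim)]
    v = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            hidden, p = forward(W, v, b, x)
            err = p - y  # derivative of log loss with respect to the logit
            for j in range(hidden_dim):
                # Backpropagate through tanh before v[j] is updated.
                grad_pre = err * v[j] * (1.0 - hidden[j] ** 2)
                v[j] -= lr * err * hidden[j]
                for i in range(dim):
                    W[j][i] -= lr * grad_pre * x[i]
            b -= lr * err
    return W, v, b


# Synthetic stand-ins for frozen CNN embeddings of 20 annotated images:
# the class signal sits in the middle dimension, the rest is noise.
data_rng = random.Random(1)
frozen, labels = [], []
for k in range(20):
    y = k % 2  # 0 = "cat", 1 = "dog" (hypothetical labels)
    signal = 1.0 if y else -1.0
    frozen.append([data_rng.uniform(-0.3, 0.3), signal, data_rng.uniform(-0.3, 0.3)])
    labels.append(y)

W, v, b = train_head(frozen, labels)

# The hidden layer's output is the "fine-tuned" embedding for each image.
new_embeddings = [forward(W, v, b, x)[0] for x in frozen]
accuracy = sum(
    (forward(W, v, b, x)[1] > 0.5) == bool(y) for x, y in zip(frozen, labels)
) / len(labels)
```

In a real pipeline, `new_embeddings` would be computed for every image, annotated or not, and then passed to a dimensionality reduction step such as UMAP before plotting. Training only the head keeps the step cheap, which is what makes a few dozen annotations enough to be useful.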

In practice

Use Prodigy's flexible "mark" recipe for diverse image annotation tasks.
Research pre-trained models for domain relevance and size, favoring larger models like Xception when accuracy matters.
Anticipate some errors in bulk-labeled selections, using them as a starting point for further annotation.
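
One way to treat a bulk selection as a starting point rather than a final answer is to export it with a provisional guess and review it afterwards. The sketch below is hypothetical: the helper names, the rectangular selection (a stand-in for an interactive lasso), and the file paths are invented for illustration, and the `"image"` and `"meta"` keys reflect an assumption about the JSONL task format the annotation tool expects.

```python
import json


def select_in_box(points, x_range, y_range):
    """Return the indices of 2D points that fall inside a rectangle."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    return [
        i
        for i, (x, y) in enumerate(points)
        if x_min <= x <= x_max and y_min <= y <= y_max
    ]


def to_review_tasks(paths, indices, guess):
    """Build JSONL lines carrying the image path and a provisional label."""
    return [
        json.dumps({"image": paths[i], "meta": {"bulk_guess": guess}})
        for i in indices
    ]


# Hypothetical 2D projection of four images and their file paths.
projected = [(0.1, 0.2), (0.3, 0.1), (5.0, 5.2), (0.2, 4.9)]
paths = ["img/a.png", "img/b.png", "img/c.png", "img/d.png"]

picked = select_in_box(projected, x_range=(0.0, 1.0), y_range=(0.0, 1.0))
tasks = to_review_tasks(paths, picked, guess="cat")
```

Storing the guess under `meta` instead of as an accepted label keeps the selection clearly marked as unverified until an annotator has confirmed or rejected each image.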

Topics

Bulk Labeling
Image Classification
Prodigy
Convolutional Neural Networks
Fine-tuning
Image Embeddings
UMAP

Best for: Machine Learning Engineer, AI Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.