Training an insults classifier with Prodigy in ~1 hour

2017-09-06 · Source: Explosion (Explosion.ai) · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, extended

Summary

Prodigy, Explosion AI's annotation tool, can be used to train an insults classifier in approximately one hour, as demonstrated by co-founder Ines. The process starts with four seed terms, which are expanded into a terminology list of over fifty insults using GloVe word vectors and the "terms.teach" recipe. Next, 500 Reddit comments are annotated as training examples with the "textcat.teach" recipe, which uses Prodigy's active learning approach to choose what to show the annotator. The model is then trained with "textcat.batch-train" on a spaCy English model with vectors, achieving 85% accuracy against a baseline. The resulting compact model is compatible with spaCy v2 and can be loaded for real-time inference, though initial testing revealed limitations with self-deprecating insults.

Key takeaway

Machine Learning Engineers who need to rapidly prototype text classification models for content moderation or similar applications should consider active learning tools such as Prodigy. The approach moves quickly from concept to working prototype, reaching usable accuracy (85% in about one hour in this demonstration) with little initial data. Iterating on code and data together shortens the time needed to validate a feature's feasibility and build a first model.

Key insights

Prodigy's active learning approach rapidly trains text classifiers by efficiently guiding human annotation with model predictions.

Principles

Active learning focuses annotation on uncertain examples.
Simple, binary interfaces enhance annotation speed and quality.
Bootstrapping terminology with word vectors is efficient.
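
The first principle can be illustrated with a small sketch of uncertainty sampling. This is not Prodigy's own sorting logic; it is a simplified stand-in that assumes each candidate text already has a model score between 0 and 1, and it picks the texts whose score is closest to 0.5. The function names and the example scores are hypothetical.

```python
def uncertainty(score):
    """Map a score in [0, 1] to an uncertainty in [0, 1], highest at 0.5."""
    return 1.0 - abs(score - 0.5) * 2.0


def most_uncertain(scored_texts, n):
    """Return the n texts whose predicted score is closest to 0.5."""
    ranked = sorted(scored_texts, key=lambda pair: uncertainty(pair[1]), reverse=True)
    return [text for text, _ in ranked[:n]]


# Hypothetical (text, score) pairs; the scores are made up for illustration.
candidates = [
    ("comment a", 0.97),
    ("comment b", 0.52),
    ("comment c", 0.08),
    ("comment d", 0.41),
]

# The annotator would be shown these two first.
queue = most_uncertain(candidates, 2)
```

Confident predictions such as "comment a" are pushed to the back of the queue, so annotation time goes to the examples the model is least sure about.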

Method

Bootstrap terminology using "terms.teach" with seed terms and word vectors. Annotate training examples with "textcat.teach" using active learning. Batch train the model with "textcat.batch-train" and evaluate.
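
The bootstrapping step can be sketched as ranking vocabulary words by cosine similarity to the seed terms. The sketch below is an assumption about the general idea, not the "terms.teach" recipe itself: it uses invented three-dimensional toy vectors in place of GloVe vectors, and the names `cosine` and `suggest_terms` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def suggest_terms(vectors, seeds, n):
    """Rank non-seed words by similarity to the average of the seed vectors."""
    dims = len(next(iter(vectors.values())))
    target = [sum(vectors[s][i] for s in seeds) / len(seeds) for i in range(dims)]
    scored = [
        (cosine(vec, target), word)
        for word, vec in vectors.items()
        if word not in seeds
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:n]]


# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "idiot": [0.9, 0.1, 0.0],
    "moron": [0.8, 0.2, 0.1],
    "fool": [0.85, 0.15, 0.05],
    "table": [0.0, 0.9, 0.3],
    "river": [0.1, 0.2, 0.9],
}

suggestions = suggest_terms(toy_vectors, seeds=["idiot", "moron"], n=1)
```

In an interactive workflow, each accepted suggestion would be added to the seed set before the next round, so the list grows from a handful of terms toward a fuller terminology list.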

In practice

Train sentiment analysis or chatbot intent models.
Filter long texts with custom recipes for efficiency.
Use rule-based matchers to refine training data.
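
Filtering long texts can be done with a small generator placed in front of the annotation stream. The sketch below assumes each example is a dictionary with a "text" key; the function name `filter_short` and the `max_chars` threshold are hypothetical and would need tuning for real data.

```python
def filter_short(stream, max_chars=300):
    """Yield only the examples whose text has at most max_chars characters."""
    for example in stream:
        if len(example["text"]) <= max_chars:
            yield example


# Hypothetical comments standing in for a real data source.
comments = [
    {"text": "short comment"},
    {"text": "x" * 500},
    {"text": "another short one"},
]

kept = list(filter_short(comments, max_chars=300))
```

Because the filter is a generator, it processes one example at a time and can sit in front of a large or unbounded stream without loading it into memory.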

Topics

Prodigy
Active Learning
Text Classification
Insults Classifier
spaCy
Content Moderation

Best for: Data Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).