Training an insults classifier with Prodigy in ~1 hour
Summary
Explosion AI's annotation tool Prodigy can be used to train an insults classifier in approximately one hour, as its co-founder Ines Montani demonstrates. The process starts with four seed terms, which are expanded into a terminology list of over fifty insults using GloVe word vectors and the "terms.teach" recipe. Next, 500 Reddit comments are annotated as training examples with the "textcat.teach" recipe, which uses Prodigy's active learning approach to choose what to annotate. The model is then trained with "textcat.batch-train" on top of a spaCy English model with vectors, achieving 85% accuracy against a baseline. The resulting compact model is compatible with spaCy v2 and can be loaded for real-time inference, though initial testing revealed limitations with self-deprecating insults.
Key takeaway
Machine learning engineers who need to prototype text classification models quickly, for content moderation or similar applications, should consider active learning tools like Prodigy. This approach takes you from concept to working prototype with viable accuracy (85% after about one hour in this demonstration) from very little initial data: four seed terms and 500 annotated comments. Because you can iterate on both code and data, it significantly reduces the time needed to validate whether a feature is feasible and to build a first model.
Key insights
Prodigy's active learning approach rapidly trains text classifiers by efficiently guiding human annotation with model predictions.
Principles
- Active learning focuses annotation on uncertain examples.
- Simple, binary interfaces enhance annotation speed and quality.
- Bootstrapping terminology with word vectors is efficient.
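The first principle can be illustrated with plain uncertainty sampling. The sketch below is not Prodigy's internal sorter, which the summary does not describe; it only assumes that each candidate example carries a model score between 0 and 1 and that scores near 0.5 mark the examples the model is least sure about. The function name `most_uncertain`, the `score` key, and the sample data are hypothetical.

```python
def most_uncertain(scored_examples, k=3):
    """Return the k examples whose score lies closest to 0.5.

    Assumption: "score" is the model's predicted probability for the label,
    so a distance of 0 from 0.5 means maximum uncertainty.
    """
    return sorted(scored_examples, key=lambda ex: abs(ex["score"] - 0.5))[:k]


# Hypothetical scored comments; the scores are made up for illustration.
examples = [
    {"text": "comment A", "score": 0.97},
    {"text": "comment B", "score": 0.52},
    {"text": "comment C", "score": 0.08},
    {"text": "comment D", "score": 0.41},
]

# The annotator would be shown these first.
queue = most_uncertain(examples, k=2)
for ex in queue:
    print(ex["text"], ex["score"])
```

Examples the model already scores confidently (A and C here) add little information when annotated, so they are pushed to the back of the queue.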
Method
Bootstrap terminology using "terms.teach" with seed terms and word vectors. Annotate training examples with "textcat.teach" using active learning. Batch train the model with "textcat.batch-train" and evaluate.
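The bootstrapping step can be sketched as ranking vocabulary terms by their similarity to the seed terms. This is a simplified illustration rather than the "terms.teach" implementation: it assumes cosine similarity against the average of the seed vectors, performs a single ranking pass without the accept/reject feedback loop, and uses tiny made-up 3-dimensional vectors in place of real GloVe vectors. The names `cosine`, `suggest_terms`, and `toy_vectors`, and the example seed words, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def suggest_terms(vectors, seeds, n=2):
    """Rank non-seed terms by similarity to the average of the seed vectors."""
    dims = len(next(iter(vectors.values())))
    centroid = [sum(vectors[s][i] for s in seeds) / len(seeds) for i in range(dims)]
    candidates = [term for term in vectors if term not in seeds]
    ranked = sorted(candidates, key=lambda t: cosine(vectors[t], centroid), reverse=True)
    return ranked[:n]


# Made-up toy vectors; real word vectors have hundreds of dimensions.
toy_vectors = {
    "idiot": [0.9, 0.1, 0.0],
    "moron": [0.8, 0.2, 0.1],
    "fool": [0.85, 0.15, 0.05],
    "jerk": [0.7, 0.3, 0.1],
    "table": [0.0, 0.1, 0.9],
    "river": [0.1, 0.0, 0.8],
}

suggestions = suggest_terms(toy_vectors, seeds=["idiot", "moron"], n=2)
print(suggestions)
```

In an interactive workflow, each accepted or rejected suggestion would feed back into the ranking, which is how a handful of seeds can grow into a list of fifty or more terms.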
In practice
- Train sentiment analysis or chatbot intent models.
- Filter long texts with custom recipes for efficiency.
- Use rule-based matchers to refine training data.
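Filtering long texts can be as small as a generator placed in front of the annotation stream. The sketch below assumes the stream yields dictionaries with a "text" key, which is how Prodigy represents text tasks; the function name `filter_long_texts`, the 300-character cutoff, and the sample comments are hypothetical, and wiring the filter into an actual custom recipe is left out.

```python
def filter_long_texts(stream, max_chars=300):
    """Yield only the examples whose text is short enough to annotate quickly."""
    for example in stream:
        if len(example["text"]) <= max_chars:
            yield example


# Hypothetical incoming comments.
comments = [
    {"text": "short comment"},
    {"text": "x" * 1000},  # too long, will be skipped
    {"text": "another short one"},
]

kept = list(filter_long_texts(comments, max_chars=300))
print(len(kept))
```

Because it is a generator, the filter processes one example at a time and works equally well on a stream that is too large to hold in memory.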
Topics
- Prodigy
- Active Learning
- Text Classification
- Insults Classifier
- spaCy
- Content Moderation
Best for: Data Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).