Training a new entity type with Prodigy – annotation powered by active learning

· Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, extended

Summary

Explosion AI's Prodigy tool, co-developed by spaCy's author, can be used to train a recognizer for a new entity type from scratch, with the emphasis on creating the data. The first step is to build a terminology list interactively with "prodigy terms.teach" and the word vectors in spaCy's "en_core_web_lg" model (GloVe vectors). The worked example identifies drug terms in text from a US opiate user community: starting from seed terms such as "heroin," "benzos," and "weed," about 300 clicks over a few minutes can yield a list of 118 terms. Next, "prodigy terms.to-patterns" converts the list into match patterns for Named Entity Recognition (NER). The "prodigy ner.teach" command then drives NER annotation, using active learning with uncertainty sampling so that the annotator sees the examples the statistical model is least sure about. After 15-30 minutes and 600 annotations, the model is batch-trained with "prodigy ner.batch-train" and reaches 87.5% accuracy on a small dataset. The output is a spaCy model that can be deployed to production. Prodigy is self-hosted and sold under a one-time license.
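
The term-list step relies on word vectors: terms close to the seed terms in vector space are proposed as candidates. The sketch below illustrates that idea only. It is not Prodigy's implementation; the three-dimensional vectors are made-up stand-ins for real GloVe vectors, and `suggest_terms` is a hypothetical helper.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_terms(
    seeds: List[str], vectors: Dict[str, List[float]], top_n: int = 2
) -> List[str]:
    """Rank non-seed words by similarity to the average seed vector."""
    seed_vecs = [vectors[s] for s in seeds]
    # Average the seed vectors dimension by dimension.
    centroid = [sum(dim) / len(seed_vecs) for dim in zip(*seed_vecs)]
    candidates = [w for w in vectors if w not in seeds]
    ranked = sorted(
        candidates, key=lambda w: cosine(vectors[w], centroid), reverse=True
    )
    return ranked[:top_n]


# Toy vectors (assumption): real vectors would come from en_core_web_lg.
toy_vectors = {
    "heroin": [0.9, 0.1, 0.0],
    "benzos": [0.8, 0.2, 0.1],
    "weed": [0.7, 0.3, 0.0],
    "xanax": [0.85, 0.15, 0.05],
    "oxy": [0.9, 0.2, 0.0],
    "table": [0.0, 0.1, 0.9],
    "monday": [0.1, 0.9, 0.2],
}

suggestions = suggest_terms(["heroin", "benzos", "weed"], toy_vectors, top_n=2)
print(suggestions)
```

In the interactive workflow, each suggestion would be accepted or rejected by the annotator, and the accepted terms would feed the next round of suggestions.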

Key takeaway

For NLP Engineers needing to develop custom entity recognition models quickly, Prodigy offers an efficient workflow. You can rapidly build domain-specific lexicons and train robust NER models, even with limited initial data, by leveraging active learning and spaCy's capabilities. This approach minimizes annotation effort, allowing you to deploy specialized models in production without extensive manual labeling or cloud service dependencies. Consider Prodigy for projects requiring rapid iteration on new text classification or entity extraction tasks.

Key insights

Active learning with Prodigy and spaCy efficiently creates custom NLP models from minimal annotations.
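
To make the active-learning idea concrete, here is a minimal sketch of uncertainty sampling: out of a pool of candidates, the annotator is shown those whose model score is closest to 0.5. This is a simplified batch version written for illustration, not Prodigy's own sorter; `uncertainty_sample` and the scores in `toy_scores` are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def uncertainty_sample(
    texts: Iterable[str],
    score: Callable[[str], float],
    batch_size: int = 3,
) -> List[Tuple[str, float]]:
    """Return the batch_size texts whose score is closest to 0.5."""
    scored = [(text, score(text)) for text in texts]
    # A smaller distance from 0.5 means the model is less certain.
    scored.sort(key=lambda pair: abs(pair[1] - 0.5))
    return scored[:batch_size]


# Hypothetical scores standing in for a model's probability
# that a candidate span is a DRUG entity.
toy_scores = {
    "heroin": 0.98,
    "subs": 0.55,
    "table": 0.02,
    "bars": 0.48,
    "tar": 0.70,
}

queue = uncertainty_sample(toy_scores, toy_scores.get, batch_size=2)
print(queue)
```

Confident cases such as "heroin" and "table" are skipped, so each click goes to an example that is likely to change the model.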

Method

1. Create a terminology list with "prodigy terms.teach," using seed terms and word vectors.
2. Convert the terms to spaCy patterns with "prodigy terms.to-patterns."
3. Annotate NER examples with "prodigy ner.teach," leveraging active learning.
4. Batch-train the final spaCy model with "prodigy ner.batch-train."
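
Step 2 can be pictured as a small data transformation. The sketch below turns a term list into JSONL lines, assuming the token-pattern format used by spaCy's rule-based matcher (one dict per token, matched on the lowercase form). The helper names and the "DRUG" label are illustrative assumptions, and the exact output of "prodigy terms.to-patterns" may differ, for example in how it tokenizes multi-word terms.

```python
import json
from typing import Dict, Iterable, List


def terms_to_patterns(terms: Iterable[str], label: str) -> List[Dict]:
    """Build one token pattern per term, matched case-insensitively."""
    patterns = []
    for term in terms:
        # Whitespace splitting is a simplification of spaCy's tokenizer.
        tokens = [{"lower": word.lower()} for word in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(patterns: List[Dict]) -> str:
    """Serialize patterns as one JSON object per line."""
    return "\n".join(json.dumps(p) for p in patterns)


# Hypothetical accepted terms from the terminology step.
patterns = terms_to_patterns(["heroin", "Black Tar"], label="DRUG")
print(to_jsonl(patterns))
```

A file in this shape gives the annotation step a starting point: pattern matches can be suggested as candidate entities before the statistical model has learned anything.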

Best for: AI Engineer, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.