Training a Named Entity Recognition Model with Prodigy and Transfer Learning

2020-03-16 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, extended

Summary

This walkthrough demonstrates an end-to-end workflow for training a Named Entity Recognition (NER) model that identifies food ingredients in text, using Prodigy for annotation together with transfer learning. The process begins by building a phrase list and match patterns with `sense2vec.teach` and sense2vec vectors trained on 2015 Reddit data. A sample of 10,000 Reddit comments from the r/Cooking subreddit is then annotated manually with `ner.manual`, with the patterns pre-highlighting likely matches. A first model is trained with `prodigy train` on top of pretrained token-to-vector representations and reaches reasonable accuracy. An active learning step with `ner.correct` then has the annotator correct the model's predictions, expanding the dataset to over 1,000 annotations. The model is retrained on the larger dataset and shows improved accuracy. The final model is applied to over 2 million Reddit comments, covering 7 years of data in 5.6 CPU hours, to extract ingredient mentions and compute monthly counts. The results are visualized as a bar chart race in Flourish, highlighting how mentions of various ingredients change over time.

Key takeaway

For NLP engineers building custom NER models, this workflow offers a demonstrated way to reach good accuracy with less manual effort. Validate your annotation policy early, and use active learning with tools like Prodigy to scale your training data efficiently. This lets you iterate quickly and deploy domain-specific models even with few initial labeled examples, which in turn makes it practical to analyze large text corpora.

Key insights

Efficient NER model training for specific domains is achieved through semi-automatic annotation and transfer learning with tools like Prodigy.

Principles

Data is core; develop it like code.
Validate label schemes and policies early.
Pretrained representations enhance small datasets.

Method

The method involves creating phrase lists, manual annotation with pattern assistance, initial model training with pretrained token-to-vector layers, active learning for correction, final model retraining, and large-scale entity extraction for trend analysis.
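
The last step of the method, turning extracted entities into counts for trend analysis, can be sketched in plain Python. This is a minimal illustration rather than the original article's code: it assumes each comment has already been run through the trained model and reduced to a pair of a Unix timestamp and a list of ingredient strings. The `monthly_counts` function, the record shape, and the choice to count each ingredient at most once per comment are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def monthly_counts(records):
    """Group ingredient mentions by calendar month (UTC).

    `records` is an iterable of (created_utc, ingredients) pairs, where
    `created_utc` is a Unix timestamp and `ingredients` is a list of
    entity strings. This record shape is hypothetical.
    """
    counts = defaultdict(Counter)
    for created_utc, ingredients in records:
        month = datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m")
        # Lowercase and deduplicate so an ingredient counts once per comment
        # (an assumption; counting every mention is equally defensible).
        counts[month].update({name.lower() for name in ingredients})
    return counts


# Hypothetical sample data standing in for model output on Reddit comments.
records = [
    (1420070400, ["Garlic", "butter"]),   # 2015-01-01 UTC
    (1420156800, ["garlic"]),             # 2015-01-02 UTC
    (1422748800, ["avocado", "garlic"]),  # 2015-02-01 UTC
]

counts = monthly_counts(records)
for month in sorted(counts):
    print(month, counts[month].most_common())
```

A table of month by ingredient in this shape is roughly what a bar chart race tool expects as input, after pivoting the months into columns.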

In practice

Use Prodigy for semi-automatic annotation.
Employ `sense2vec.teach` for phrase list generation.
Apply `ner.correct` for active learning.
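
To make the link between a phrase list and match patterns concrete, the sketch below converts a few phrases into token-based patterns in the JSONL shape that Prodigy pattern files use (a `label` plus a `pattern` list of token attribute dicts). It is only an illustration: the `phrases_to_patterns` helper, the `INGRED` label name, and the whitespace tokenization are assumptions for this example, and in practice Prodigy's own recipes can generate the pattern file from a terms dataset.

```python
import json


def phrases_to_patterns(phrases, label):
    """Build one case-insensitive token pattern per phrase.

    Splitting on whitespace is a simplification; a real tokenizer may
    split some phrases differently (for example around hyphens).
    """
    patterns = []
    for phrase in phrases:
        tokens = phrase.split()
        if not tokens:
            continue  # skip empty entries in the phrase list
        patterns.append(
            {"label": label, "pattern": [{"lower": tok.lower()} for tok in tokens]}
        )
    return patterns


def to_jsonl(patterns):
    """Serialize patterns as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(p) for p in patterns)


# Hypothetical phrase list; "INGRED" is an assumed label name.
phrases = ["garlic", "Olive oil", "cottage cheese"]
patterns = phrases_to_patterns(phrases, "INGRED")
print(to_jsonl(patterns))
```

Matching on the lowercase form of each token means "Olive oil" and "olive oil" are both pre-highlighted during manual annotation, which is usually what you want for a noisy source like Reddit.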

Topics

Named Entity Recognition
Prodigy Annotation Tool
Transfer Learning
Active Learning
spaCy NLP Library
Data Annotation Workflows

Best for: NLP Engineer, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.