Practical transfer learning for NLP with spaCy and Prodigy

2019-05-09 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Intermediate, extended

Summary

The article introduces a modern NLP workflow built on spaCy and Prodigy that addresses common failure points in machine learning projects. It describes the "swamp of uncertainty" and the "quicksand of sunk costs" in traditional model development, where models struggle with general language understanding. Explosion AI's spaCy implements "Language Modeling with Approximate Outputs" (LMAO), a production-ready transfer learning approach that pre-trains a model to predict word vectors. The cited figures are 10,000 words/second, a 3% accuracy improvement, and a 3MB model. The workflow also integrates Prodigy for active learning, which lets data scientists iterate faster on complex labeling tasks by focusing on the examples the model is uncertain about. Finally, it stresses continuous model maintenance with Prodigy Scale to counter language evolution, so that models stay relevant and accurate over time.

Key takeaway

NLP engineers building production systems should integrate transfer learning and active learning early to mitigate the "swamp of uncertainty." Pre-train models with spaCy's LMAO for faster initial results, then use Prodigy's active learning to refine the annotation strategy. Plan for ongoing model maintenance with tools like Prodigy Scale so that deployed models remain accurate as language evolves and costly degradation is avoided.

Key insights

Modern NLP workflows combine pre-training, active learning, and continuous maintenance to accelerate development and improve model longevity.

Principles

Pre-train models on vast unlabeled text for general language understanding.
Involve data scientists directly in active learning for labeling.
Anticipate language evolution; plan for continuous model updates.
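
The second principle relies on active learning, where the model's own uncertainty decides which examples reach the annotator. The sketch below shows one common way to do this for binary decisions: rank examples by how close their score is to 0.5. It is not Prodigy's code. The function name `most_uncertain` and the `(score, text)` pair format are hypothetical, chosen only for illustration.

```python
def most_uncertain(scored_examples, k):
    """Return the k examples whose scores are closest to 0.5.

    scored_examples: iterable of (score, text) pairs, where score is the
    model's probability for the label (hypothetical format).
    """
    # A score near 0.5 means the model cannot decide, so a human answer
    # on that example is assumed to be the most informative.
    ranked = sorted(scored_examples, key=lambda pair: abs(pair[0] - 0.5))
    return ranked[:k]


if __name__ == "__main__":
    pool = [
        (0.95, "clearly positive"),
        (0.52, "borderline case"),
        (0.10, "clearly negative"),
        (0.45, "another borderline case"),
    ]
    for score, text in most_uncertain(pool, 2):
        print(f"{score:.2f}  {text}")
```

Confident predictions (0.95, 0.10) are skipped, so annotation time goes to the borderline cases first.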

Method

Pre-train a CNN to predict word vectors (LMAO) using spaCy's `pretrain` command. Integrate Prodigy for active learning, iteratively labeling uncertain examples. Use Prodigy Scale for ongoing annotation and quality monitoring.
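
To make the pre-training objective concrete, here is a minimal sketch of the "approximate outputs" idea: instead of a softmax over the whole vocabulary, the model outputs one vector per token and is scored against that word's pre-trained vector. This is not spaCy's implementation. The L2 loss, the single linear layer standing in for the CNN, the random data, and the names `l2_vector_loss` and `train_toy_model` are all assumptions made for illustration.

```python
import numpy as np


def l2_vector_loss(predicted, target):
    """Mean squared L2 distance between predicted and target vectors.

    Returns the loss and its gradient with respect to `predicted`.
    """
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    diff = predicted - target
    loss = float(np.mean(np.sum(diff ** 2, axis=1)))
    grad = 2.0 * diff / diff.shape[0]
    return loss, grad


def train_toy_model(steps=200, lr=0.05, seed=0):
    """Fit a linear layer to predict word vectors; return (first, last) loss."""
    rng = np.random.default_rng(seed)
    n_tokens, n_features, width = 32, 16, 8  # hypothetical sizes
    # Stand-in for the context encodings a CNN would produce per token.
    features = rng.normal(size=(n_tokens, n_features))
    # Stand-in for the pre-trained word vector of each token.
    targets = rng.normal(size=(n_tokens, width))
    weights = np.zeros((n_features, width))

    first_loss, _ = l2_vector_loss(features @ weights, targets)
    for _ in range(steps):
        _, grad = l2_vector_loss(features @ weights, targets)
        # Backpropagate through the linear layer and take a gradient step.
        weights -= lr * (features.T @ grad)
    last_loss, _ = l2_vector_loss(features @ weights, targets)
    return first_loss, last_loss


if __name__ == "__main__":
    first, last = train_toy_model()
    print(f"loss before: {first:.3f}, after: {last:.3f}")
```

The output layer here has `width` columns rather than one column per vocabulary entry, which is what keeps this kind of objective cheap compared with a full softmax.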

In practice

Use spaCy's LMAO for fast, CPU-friendly pre-training.
Employ active learning to refine label schemes and reduce annotation effort.
Implement continuous annotation to adapt to evolving language.
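
The last point can be made operational with a simple check: keep annotating a small sample of fresh production text, score the deployed model on it, and flag the model when accuracy falls too far below its original baseline. The sketch below is one hypothetical way to do that and does not use any Prodigy Scale API. The function name `needs_retraining`, the boolean result format, and the 0.05 tolerance are assumptions.

```python
def needs_retraining(baseline_accuracy, recent_results, tolerance=0.05):
    """Flag a model whose accuracy on fresh annotations has dropped.

    baseline_accuracy: accuracy measured when the model was deployed.
    recent_results: list of booleans, True where the model's prediction
        matched a newly collected human annotation (hypothetical format).
    tolerance: largest acceptable drop before retraining is suggested.
    """
    if not recent_results:
        # No fresh annotations yet, so there is no evidence of drift.
        return False
    recent_accuracy = sum(recent_results) / len(recent_results)
    return (baseline_accuracy - recent_accuracy) > tolerance


if __name__ == "__main__":
    fresh_batch = [True] * 8 + [False] * 2  # 80% correct on new text
    print(needs_retraining(0.90, fresh_batch))
```

A fixed tolerance is the simplest choice. In practice the threshold and the sample size would depend on how costly errors are for the application.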

Topics

Natural Language Processing
Transfer Learning
Active Learning
spaCy
Prodigy
Data Annotation
Model Maintenance

Best for: Data Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.