FAQ #1: Tips & tricks for NLP, annotation & training with Prodigy and spaCy

· Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, long

Summary

Ines, co-founder of Explosion AI and lead developer of spaCy and Prodigy, offers practical advice on NLP annotation and model training. The guidance covers how to structure annotation tasks and how to choose between binary and manual workflows depending on the project goal, such as building a gold-standard dataset or improving a model with active learning. For Named Entity Recognition (NER) with active learning, it recommends rejecting partially correct spans so that the model receives precise feedback, and it clarifies when to "skip" an example (the data is irrelevant) versus "reject" it (the model's suggestion is incorrect). The article also addresses long texts, recommending sentence- or paragraph-level labeling, which helps both model learning and annotator focus. It then compares fine-tuning a pre-trained model, which requires less data but risks "catastrophic forgetting," with training from scratch, which suits custom label schemes. Finally, it highlights the power of combining statistical models with application-specific rules, such as expanding NER entities to include titles, for extracting complex relationships.

Key takeaway

NLP engineers designing annotation workflows should choose between binary and manual annotation based on the project's data and goals. Prefer binary tasks for active learning and for improving existing categories, since they speed up data collection and quality control. With custom labels or noisy data, consider training a model from scratch or chaining classifiers to manage complexity and to avoid the "catastrophic forgetting" that fine-tuning can cause. In active learning for NER, always reject partially correct spans so that the feedback stays precise.
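The rule about partially correct spans, and the difference between skipping and rejecting, can be sketched in a few lines. This is an illustration only: it assumes spans are `(start_char, end_char, label)` tuples and that the annotator knows the correct span. The function name `binary_answer` is hypothetical, and the answer strings mirror the accept / reject / ignore choices of a binary annotation interface.

```python
from typing import Optional, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label); an assumed format


def binary_answer(suggested: Span, correct: Optional[Span], relevant: bool = True) -> str:
    """Decide the feedback for one suggested entity span."""
    if not relevant:
        # The text itself is irrelevant to the task: skip it rather than
        # telling the model its suggestion was wrong.
        return "ignore"
    if correct is not None and suggested == correct:
        # Only an exact match on boundaries and label counts as correct.
        return "accept"
    # Wrong label, wrong boundaries (including partial overlap), or no entity.
    return "reject"


gold = (0, 10, "ORG")  # "Apple Inc." in "Apple Inc. opened a new office."

print(binary_answer((0, 10, "ORG"), gold))                  # exact match
print(binary_answer((0, 5, "ORG"), gold))                   # "Apple" only: partial
print(binary_answer((0, 10, "ORG"), gold, relevant=False))  # off-topic text
```

Treating a partial match as a rejection keeps the signal unambiguous: an accepted example always means the boundaries and the label were both right.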

Key insights

Effective NLP annotation and training require strategic task design, precise feedback, and combining statistical models with rules.
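As a rough illustration of combining a statistical model with a rule, the sketch below extends PERSON entities to include a preceding title. It is library-free and makes several assumptions: entities are `(start_token, end_token_exclusive, label)` tuples, the `TITLES` set is an illustrative list, and `model_ents` stands in for the output of a trained NER model. In spaCy the same idea would typically live in a custom pipeline component that runs after the entity recognizer.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start_token, end_token_exclusive, label)

TITLES = {"Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms.", "Mrs", "Mrs."}  # illustrative list


def expand_person_entities(tokens: List[str], ents: List[Entity]) -> List[Entity]:
    """Extend each PERSON entity one token to the left if that token is a title."""
    expanded = []
    for start, end, label in ents:
        if label == "PERSON" and start > 0 and tokens[start - 1] in TITLES:
            start -= 1  # pull the title into the entity span
        expanded.append((start, end, label))
    return expanded


tokens = ["Dr.", "Alex", "Smith", "chaired", "the", "meeting", "."]
model_ents = [(1, 3, "PERSON")]  # stand-in for a statistical model's prediction

print(expand_person_entities(tokens, model_ents))  # [(0, 3, 'PERSON')]
```

The model only has to learn the general notion of a person name, while the application-specific convention (titles belong to the entity) is handled by a rule that is easy to inspect and change.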

Method

For noisy data, chain two classifiers: the first filters out noise, and the second, trained for the actual objective, runs only on the cleaned subset.
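A minimal sketch of this chaining pattern, assuming each classifier can be wrapped as a plain Python callable. The two toy functions below are hypothetical stand-ins for trained models, and their rules are invented purely for the example.

```python
from typing import Callable, Iterable, List, Tuple


def chain_classifiers(
    texts: Iterable[str],
    is_noise: Callable[[str], bool],
    classify: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Run the objective classifier only on texts the noise filter lets through."""
    clean = [t for t in texts if not is_noise(t)]
    return [(t, classify(t)) for t in clean]


# Toy stand-ins for two trained models; both rules are hypothetical.
def toy_noise_filter(text: str) -> bool:
    return len(text.split()) < 3  # treat very short fragments as noise


def toy_topic_classifier(text: str) -> str:
    return "SPORTS" if "match" in text.lower() else "OTHER"


docs = ["Click here", "The match ended in a draw", "Parliament passed the new bill"]
for text, label in chain_classifiers(docs, toy_noise_filter, toy_topic_classifier):
    print(label, "|", text)
```

Keeping the two steps separate means each model has one simple job, and the noise filter can be annotated, evaluated, and retrained independently of the objective classifier.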

Best for: Machine Learning Engineer, NLP Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.