FAQ #1: Tips & tricks for NLP, annotation & training with Prodigy and spaCy
Summary
Ines, co-founder of Explosion AI and lead developer of spaCy and Prodigy, offers practical advice for NLP annotation and model training. The guidance covers how to structure annotation tasks and how to choose between binary and manual workflows, depending on whether the goal is a gold-standard dataset or active learning. For Named Entity Recognition (NER) with active learning, it recommends rejecting partially correct spans so the model receives precise feedback, and it clarifies when to skip irrelevant data versus reject an incorrect model suggestion. The article also addresses long texts, recommending sentence- or paragraph-level labeling so that the model learns better and annotators stay focused. It then compares fine-tuning a pre-trained model, which requires less data but risks "catastrophic forgetting," with training from scratch, which suits custom label schemes. Finally, it highlights the value of combining statistical models with application-specific rules, such as expanding NER entities to include titles, when extracting complex relationships.
Key takeaway
For NLP engineers designing annotation workflows: choose between binary and manual annotation based on your project's data and goals. Prefer binary tasks for active learning and for improving existing categories, since they speed up data collection and quality control. When dealing with custom labels or noisy data, consider training models from scratch or chaining classifiers to manage complexity and avoid the "catastrophic forgetting" that fine-tuning can cause. In binary NER annotation, reject partially correct spans so the model gets precise feedback.
Key insights
Effective NLP annotation and training require strategic task design, precise feedback, and combining statistical models with rules.
Principles
- Break down complex annotation tasks into binary decisions.
- Reject partially correct spans for precise model feedback.
- Combine statistical models with application-specific rules.
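The second principle can be illustrated with a minimal sketch in plain Python. It is not Prodigy's implementation; the `Span` type, the `binary_answer` helper, and the example sentence are hypothetical names introduced here. The sketch assumes the annotator knows the correct spans and accepts a suggestion only when both its boundaries and its label match exactly.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical span record: character offsets plus an entity label."""
    start: int  # offset where the span begins
    end: int    # offset just past the last character
    label: str


def binary_answer(suggested: Span, correct: List[Span]) -> str:
    """Accept only an exact match on boundaries and label; reject otherwise."""
    return "accept" if suggested in correct else "reject"


text = "Jane Doe joined Acme Corp"
correct = [Span(0, 8, "PERSON"), Span(16, 25, "ORG")]

answers = [
    binary_answer(Span(0, 8, "PERSON"), correct),  # exact match
    binary_answer(Span(0, 4, "PERSON"), correct),  # only "Jane": partially correct
    binary_answer(Span(0, 8, "ORG"), correct),     # right boundaries, wrong label
]
print(answers)
```

Treating a partial match as a rejection keeps the feedback unambiguous: an accepted span is always exactly right.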
Method
For noisy data, chain two classifiers: the first filters out noise, and the second performs the actual classification task on the cleaned subset.
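A minimal sketch of this chaining, assuming each classifier is exposed as a plain Python callable. The `chain_classifiers` function and the two toy rules standing in for trained models are hypothetical and only illustrate the control flow.

```python
from typing import Callable, Iterable, List, Tuple


def chain_classifiers(
    texts: Iterable[str],
    is_noise: Callable[[str], bool],
    classify: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Run the noise filter first, then classify only the texts that pass."""
    results = []
    for text in texts:
        if is_noise(text):
            continue  # dropped by the first classifier
        results.append((text, classify(text)))
    return results


# Toy stand-ins for trained models (assumptions for illustration only).
def toy_is_noise(text: str) -> bool:
    return len(text.split()) < 3


def toy_classify(text: str) -> str:
    return "POSITIVE" if "great" in text.lower() else "OTHER"


texts = [
    "click here",
    "This library is great for annotation",
    "The model needs more data",
]
labeled = chain_classifiers(texts, toy_is_noise, toy_classify)
print(labeled)
```

Keeping the two stages separate means the second classifier only has to learn the actual task, not also what noise looks like.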
In practice
- Use binary annotation for active learning and category improvement.
- Label long texts at sentence/paragraph level for classification.
- Expand NER entities with titles using syntactic rules.
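The last point can be sketched in plain Python. This is a simplified, library-agnostic illustration: it uses a lookup on the preceding token rather than a real syntactic rule, and the `TITLES` set, the `Entity` tuple layout, and `expand_person_entities` are assumptions made for the example. In spaCy, the same idea would operate on the entity spans in `doc.ents`.

```python
from typing import List, Tuple

# Hypothetical entity layout: (start_token, end_token_exclusive, label)
Entity = Tuple[int, int, str]

# Assumed list of titles; a real application would choose its own.
TITLES = {"Dr", "Dr.", "Mr", "Mr.", "Mrs", "Mrs.", "Ms", "Ms.", "Prof", "Prof."}


def expand_person_entities(tokens: List[str], entities: List[Entity]) -> List[Entity]:
    """Extend each PERSON entity one token to the left if that token is a title."""
    expanded = []
    for start, end, label in entities:
        if label == "PERSON" and start > 0 and tokens[start - 1] in TITLES:
            start -= 1  # pull the title into the entity
        expanded.append((start, end, label))
    return expanded


tokens = ["Dr.", "Alex", "Smith", "chaired", "the", "meeting", "in", "Berlin"]
entities = [(1, 3, "PERSON"), (7, 8, "GPE")]  # as a statistical model might predict

expanded = expand_person_entities(tokens, entities)
print([" ".join(tokens[s:e]) for s, e, _ in expanded])
```

The statistical model supplies the general-purpose entities, and the rule adds the application-specific detail without any retraining.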
Topics
- NLP Annotation
- Prodigy
- spaCy
- Active Learning
- Named Entity Recognition
- Model Training Strategies
Best for: Machine Learning Engineer, NLP Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).