Introducing Span Categorization in Prodigy and spaCy
Summary
Span Categorization (spancat) is a new spaCy component for extracting and classifying long, overlapping text spans. It differs from Named Entity Recognition (NER) in three ways: it gives explicit control over which candidate spans are considered, it returns a confidence score for each prediction, and it is less sensitive to exact span boundaries. The component has two parts: a suggester that extracts span candidates and a classifier that predicts label probabilities for each candidate.
For data annotation, Prodigy provides recipes such as "spans.manual". The example dataset contains 25,000 recipes and is labeled with "ingredient" and "instruction". Annotation guidelines configured through "prodigy.json" help keep labels consistent. Labeling can be sped up in two ways: rule-based patterns pre-select likely spans, and temporary spancat models (reaching F-scores around 0.45) predict labels that annotators then correct with "spans.correct". The finished annotations are exported to spaCy's format.
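The suggester and classifier split can be pictured with a small plain-Python sketch. It does not use spaCy's API: `ngram_suggester`, `categorize`, and `toy_scorer` are hypothetical names, and the scores are made up for illustration. spaCy ships its own n-gram suggester, registered as `spacy.ngram_suggester.v1`.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def ngram_suggester(tokens: List[str], sizes: List[int]) -> List[Span]:
    """Propose every contiguous token window whose length is in `sizes`."""
    candidates = []
    for size in sizes:
        # range() is empty when the text is shorter than the window size.
        for start in range(len(tokens) - size + 1):
            candidates.append((start, start + size))
    return candidates


def categorize(
    tokens: List[str],
    sizes: List[int],
    scorer: Callable[[List[str]], Dict[str, float]],
    threshold: float = 0.5,
) -> List[Tuple[Span, str, float]]:
    """Keep every (span, label) pair whose score reaches the threshold.

    Candidates are scored independently, so overlapping spans can all be kept.
    """
    results = []
    for start, end in ngram_suggester(tokens, sizes):
        for label, score in scorer(tokens[start:end]).items():
            if score >= threshold:
                results.append(((start, end), label, score))
    return results


def toy_scorer(span_tokens: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a trained classifier; the scores are invented."""
    text = " ".join(span_tokens).lower()
    if text == "olive oil":
        return {"ingredient": 0.9}
    if text == "oil":
        return {"ingredient": 0.7}
    return {"ingredient": 0.1}


tokens = "Heat the olive oil".split()
spans = categorize(tokens, sizes=[1, 2], scorer=toy_scorer)
for (start, end), label, score in spans:
    print(tokens[start:end], label, score)
```

Both "oil" and "olive oil" survive the threshold here, which is the kind of overlap that a single NER tag sequence cannot represent.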
Key takeaway
NLP engineers and data scientists who build text extraction systems should consider spaCy's new Span Categorization (spancat) when Named Entity Recognition (NER) falls short on long, overlapping, or context-sensitive spans. The component gives greater control over span candidates and provides meaningful confidence scores. Prodigy's "spans.manual" recipe, match patterns, and temporary models help annotate and refine datasets efficiently, producing high-quality training data for complex span extraction.
Key insights
Span Categorization (spancat) in spaCy extracts and classifies long, overlapping text spans, providing explicit control and confidence scores.
Principles
- Spancat is better suited than NER to long, overlapping, or context-dependent spans.
- Annotation guidelines are crucial for dataset consistency across annotators.
- Iterative training of temporary models significantly accelerates data labeling.
Method
Annotate spans using Prodigy's "spans.manual" recipe, define labels and guidelines, then accelerate with rule-based patterns and temporary models via "spans.correct" for iterative refinement.
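As a rough illustration of the rule-based pre-selection step, the sketch below writes a small match-pattern file in the JSONL shape Prodigy accepts (one object per line with a `label` and a token-based `pattern`). The helper names and the example phrases are hypothetical; the resulting file would typically be passed to "spans.manual" through its `--patterns` option.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def make_pattern(label: str, phrase: str) -> Dict:
    """Build one token pattern that matches `phrase` case-insensitively."""
    tokens = phrase.lower().split()
    return {"label": label, "pattern": [{"lower": tok} for tok in tokens]}


def write_patterns(path: Path, entries: Dict[str, List[str]]) -> int:
    """Write one JSON object per line and return how many patterns were written."""
    lines = [
        json.dumps(make_pattern(label, phrase))
        for label, phrases in entries.items()
        for phrase in phrases
    ]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Example phrases are invented for illustration, not taken from the dataset.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "patterns.jsonl"
    count = write_patterns(out, {"ingredient": ["olive oil", "garlic"]})
    print(count)
    print(out.read_text(encoding="utf-8"))
```

Splitting on whitespace is a simplification: real patterns should follow the tokenization of the spaCy pipeline used in the recipe.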
In practice
- Implement custom suggester functions to control span candidate generation.
- Embed HTML guidelines in "prodigy.json" for consistent annotation rules.
- Utilize temporary spancat models to pre-label data for faster correction cycles.
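The guideline setup can be sketched as follows, assuming Prodigy's `instructions` setting, which points to a text or HTML file shown to annotators in the interface. The labeling rules in the HTML and the helper name `write_guideline_config` are hypothetical; a real project would place "prodigy.json" where Prodigy looks for it, such as the working directory.

```python
import json
import tempfile
from pathlib import Path

# Illustrative rules only; the real guidelines come from the project team.
GUIDELINES_HTML = """<h3>Labeling rules (illustrative)</h3>
<ul>
  <li><b>ingredient</b>: the food item itself, without quantities.</li>
  <li><b>instruction</b>: a full preparation step.</li>
</ul>
"""


def write_guideline_config(folder: Path) -> Path:
    """Write the guidelines page plus a prodigy.json that references it."""
    html_path = folder / "guidelines.html"
    html_path.write_text(GUIDELINES_HTML, encoding="utf-8")
    # Assumption: "instructions" holds the path to the guidelines file.
    config = {"instructions": str(html_path)}
    config_path = folder / "prodigy.json"
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config_path


with tempfile.TemporaryDirectory() as tmp:
    config_path = write_guideline_config(Path(tmp))
    loaded = json.loads(config_path.read_text(encoding="utf-8"))
    print(loaded["instructions"])
```

Keeping the rules in one shared file means every annotator sees the same definitions for each label.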
Topics
- Span Categorization
- spaCy
- Prodigy
- Text Annotation
- Named Entity Recognition
- NLP Components
Best for: Machine Learning Engineer, NLP Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).