spaCy’s entity recognition model: incremental parsing with Bloom embeddings & residual CNNs
Summary
Explosion AI released spaCy v2, updating its natural language processing library with deep learning methods for tasks such as named entity recognition (NER), tagging, and parsing. The new version introduces a statistical model that combines a transition-based framework, Bloom embeddings for learning dense word representations, and residual convolutional neural networks (CNNs) for contextual encoding. spaCy v2's en_core_web_lg model achieves a 92% unlabeled attachment score (UAS) for dependency parsing and 97.2% accuracy for part-of-speech tagging. Its en_core_web_sm model shows a 25% error reduction in NER compared with spaCy v1. The article also discusses Explosion AI's ecosystem, including the deep learning library Thinc and the annotation tool Prodigy, and stresses that effective NER depends on current, domain-specific training data because general corpora go out of date quickly.
Key takeaway
NLP engineers building or enhancing named entity recognition systems should prioritize acquiring and maintaining current, domain-specific training data. Relying solely on general pre-trained models risks outdated entity recognition, as the article's "Trump" examples illustrate. Consider spaCy v2's transition-based approach with Bloom embeddings and residual CNNs for robust performance, and use annotation tools like Prodigy to efficiently produce the evaluation and fine-tuning data your specific use cases demand.
Key insights
spaCy v2's NER model combines transition-based parsing, Bloom embeddings, and residual CNNs. The architecture alone is not enough, though: the article stresses that accurate results still depend on domain-specific training data.
Principles
- Open-source value extends beyond code to complementary tools and services.
- Domain-specific, current training data is paramount for effective NER.
- Transition-based structured prediction allows flexible, arbitrary feature functions.
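The third principle can be made concrete with a minimal sketch of a transition system for NER. It assumes a BILUO-style action set (BEGIN, IN, LAST, UNIT, OUT); the names `state_features` and `apply_actions` are hypothetical and are not spaCy's internal API. Because every decision is made against an explicit state, a feature function can read anything in that state, such as the label of the last completed entity.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive
Action = Tuple[str, str]     # (move, label); label is "" for OUT


def state_features(tokens: List[str], i: int, open_start: Optional[int],
                   spans: List[Span]) -> Dict[str, object]:
    """Illustrative feature function: it may inspect any part of the state."""
    return {
        "word": tokens[i],
        "prev_word": tokens[i - 1] if i > 0 else "<s>",
        "inside_entity": open_start is not None,
        "last_entity_label": spans[-1][2] if spans else "<none>",
    }


def apply_actions(tokens: List[str], actions: List[Action]) -> List[Span]:
    """Replay one action per token and return the entity spans they build."""
    if len(tokens) != len(actions):
        raise ValueError("need exactly one action per token")
    spans: List[Span] = []
    open_start: Optional[int] = None  # start of the entity being built
    open_label = ""
    for i, (move, label) in enumerate(actions):
        if move in ("BEGIN", "UNIT", "OUT"):
            if open_start is not None:
                raise ValueError(f"{move} at token {i} while an entity is open")
            if move == "BEGIN":
                open_start, open_label = i, label
            elif move == "UNIT":
                spans.append((i, i + 1, label))
        elif move in ("IN", "LAST"):
            if open_start is None or label != open_label:
                raise ValueError(f"{move} at token {i} has no matching open entity")
            if move == "LAST":
                spans.append((open_start, i + 1, label))
                open_start = None
        else:
            raise ValueError(f"unknown move: {move}")
    if open_start is not None:
        raise ValueError("input ended while an entity was still open")
    return spans


tokens = ["Explosion", "AI", "released", "spaCy"]
actions = [("BEGIN", "ORG"), ("LAST", "ORG"), ("OUT", ""), ("UNIT", "PRODUCT")]
spans = apply_actions(tokens, actions)
```

In a trained system a model would score the valid actions at each step using features like these; here the action sequence is supplied by hand so the state logic stays visible.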
Method
spaCy's NER uses an "embed, encode, attend, predict" framework. It extracts token features, embeds them with Bloom embeddings, and then encodes context with residual trigram CNNs. Features selected by hand from the parser state are then fed to a multi-layer perceptron, which predicts the next transition action.
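The embed and encode steps can be sketched in plain Python. This is an illustration under stated assumptions, not spaCy's or Thinc's implementation: `BloomEmbedding`, `bucket`, and `residual_trigram_layer` are hypothetical names, the hash is salted BLAKE2b from the standard library, the table sizes are arbitrary, and a ReLU stands in for whatever nonlinearity the real model uses. The idea shown is that each string feature is hashed several times into a small table and the selected rows are summed, so distinct keys rarely collide on every row; each token vector is then updated from its trigram window with a residual connection.

```python
import hashlib
import random
from typing import List

Vector = List[float]


def bucket(key: str, seed: int, n_rows: int) -> int:
    """Hash a string feature to a table row; the seed picks the hash function."""
    digest = hashlib.blake2b(key.encode("utf8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") % n_rows


class BloomEmbedding:
    """Small shared table; each key sums the rows chosen by several hashes."""

    def __init__(self, n_rows: int, width: int, n_hashes: int = 4, seed: int = 0):
        rng = random.Random(seed)
        self.n_rows, self.width, self.n_hashes = n_rows, width, n_hashes
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                      for _ in range(n_rows)]

    def rows_for(self, key: str) -> List[int]:
        return [bucket(key, s, self.n_rows) for s in range(self.n_hashes)]

    def __call__(self, key: str) -> Vector:
        vec = [0.0] * self.width
        for row in self.rows_for(key):
            for j, value in enumerate(self.table[row]):
                vec[j] += value
        return vec


def residual_trigram_layer(X: List[Vector], W: List[List[float]]) -> List[Vector]:
    """out[i] = X[i] + relu(W @ concat(X[i-1], X[i], X[i+1])), zero-padded."""
    if not X:
        return []
    pad = [0.0] * len(X[0])
    out = []
    for i, x in enumerate(X):
        left = X[i - 1] if i > 0 else pad
        right = X[i + 1] if i + 1 < len(X) else pad
        window = left + x + right  # list concatenation: 3 * width values
        hidden = [max(0.0, sum(w * v for w, v in zip(row, window))) for row in W]
        out.append([h + xi for h, xi in zip(hidden, x)])
    return out


embed = BloomEmbedding(n_rows=50, width=4)
X = [embed(word.lower()) for word in ["Explosion", "released", "spaCy"]]
rng = random.Random(1)
W = [[rng.uniform(-0.5, 0.5) for _ in range(12)] for _ in range(4)]  # 4 x (3*4)
encoded = residual_trigram_layer(X, W)
```

The table here has 50 rows regardless of vocabulary size, which is the memory trade-off the technique is meant to offer: unseen words still get a vector, at the cost of occasional partial collisions.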
In practice
- Use Prodigy for efficient, binary-decision-focused data annotation.
- Fine-tune pre-trained models on domain-specific entities.
- Use CNNs as the contextual encoder to gain parallel processing and a bounded context window.
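The bounded context in the last item follows from simple arithmetic: each stacked convolution layer lets a token see a fixed number of extra neighbors on either side. The helper below is a small sketch of that calculation; `receptive_field` is a hypothetical name, and the depth of 4 in the usage line is an example value rather than a confirmed spaCy setting.

```python
def receptive_field(n_layers: int, window: int = 1) -> int:
    """Tokens that can influence one output vector after n_layers stacked
    convolutions, each seeing `window` tokens on either side."""
    if n_layers < 0 or window < 0:
        raise ValueError("n_layers and window must be non-negative")
    return 2 * window * n_layers + 1


# Trigram layers (window=1) stacked four deep: 4 tokens of context per side.
context = receptive_field(4)
```

Unlike a recurrent encoder, this limit is fixed by the architecture, so every position can be computed independently and in parallel.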
Topics
- spaCy v2
- Named Entity Recognition
- Bloom Embeddings
- Residual CNNs
- Transition-Based Parsing
- Data Annotation Tools
Best for: AI Engineer, Research Scientist, NLP Engineer, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.