spaCy’s entity recognition model: incremental parsing with Bloom embeddings & residual CNNs

Source: Explosion (Explosion.ai) · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

Explosion AI released spaCy v2, updating its natural language processing library with deep learning methods for named entity recognition (NER), tagging, and parsing. The new version introduces a statistical model built on a transition-based framework, with Bloom embeddings for learning dense word representations and residual convolutional neural networks (CNNs) for contextual encoding. spaCy v2's en_core_web_lg model achieves a 92% unlabeled attachment score (UAS) for dependency parsing and 97.2% accuracy for part-of-speech tagging. Its en_core_web_sm model shows a 25% error reduction in NER compared to spaCy v1. The article also covers Explosion AI's ecosystem, including the deep learning library Thinc and the annotation tool Prodigy. It stresses that effective NER depends on current, domain-specific training data, because general corpora go out of date quickly.

Key takeaway

NLP engineers building or enhancing named entity recognition systems should prioritize acquiring and maintaining current, domain-specific training data. Relying solely on general pre-trained models risks outdated entity recognition, as the article's "Trump" examples illustrate. Consider spaCy v2's transition-based approach with Bloom embeddings and residual CNNs for robust performance, and use an annotation tool such as Prodigy to efficiently produce the evaluation and fine-tuning data your use cases demand.
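To make the evaluation advice concrete, here is a minimal sketch of span-level scoring on a small domain-specific gold set. The function name `score_entities`, the character offsets, and the labels are all hypothetical, chosen only for illustration. The sketch assumes exact-match scoring, where a prediction counts only if its start, end, and label all agree with a gold entity.

```python
def score_entities(gold, predicted):
    """Span-level precision, recall and F1 for one document.

    Each entity is a (start_char, end_char, label) tuple. A prediction
    is correct only if all three fields match a gold entity exactly.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotations for one sentence from your own domain.
gold = [(0, 5, "PERSON"), (20, 30, "ORG")]
predicted = [(0, 5, "PERSON"), (20, 30, "GPE"), (35, 41, "DATE")]

p, r, f = score_entities(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=0.33 R=0.50 F=0.40
```

Running the same scorer on a general benchmark and on your own freshly annotated sample is one simple way to see how far a pre-trained model has drifted from your data.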

Key insights

spaCy v2's NER model combines transition-based parsing, Bloom embeddings, and residual CNNs. The article pairs this architecture with a second point: effective NER depends on current, domain-specific training data.
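The transition-based idea can be illustrated with a small sketch: the model reads tokens left to right and picks one action per token, and the action sequence determines the entity spans. The version below assumes a BILUO-style action set (begin, inside, last, unit, outside). The function `apply_actions`, the example tokens, and the labels are hypothetical, and the actions are supplied by hand rather than predicted, so this shows only the state-update half of the approach, not spaCy's own implementation.

```python
def apply_actions(tokens, actions):
    """Replay one transition per token and return the entity spans.

    Actions: "O" (outside), "U-LABEL" (single-token entity),
    "B-LABEL" (begin), "I-LABEL" (inside), "L-LABEL" (last).
    Spans are (start_token, end_token_exclusive, label) tuples.
    """
    if len(tokens) != len(actions):
        raise ValueError("expected exactly one action per token")
    entities = []
    start, open_label = None, None  # state: the currently open entity
    for i, action in enumerate(actions):
        move, _, label = action.partition("-")
        if move in ("O", "B", "U"):
            # These moves are only valid when no entity is open.
            if start is not None:
                raise ValueError(f"{action!r} at {i} while an entity is open")
            if move == "B":
                start, open_label = i, label
            elif move == "U":
                entities.append((i, i + 1, label))
        elif move in ("I", "L"):
            # These moves must continue the open entity with the same label.
            if start is None or label != open_label:
                raise ValueError(f"{action!r} at {i} does not continue an entity")
            if move == "L":
                entities.append((start, i + 1, label))
                start, open_label = None, None
        else:
            raise ValueError(f"unknown action {action!r}")
    if start is not None:
        raise ValueError("sequence ended inside an entity")
    return entities


tokens = ["Explosion", "AI", "released", "spaCy", "v2"]
actions = ["B-ORG", "L-ORG", "O", "U-PRODUCT", "O"]
spans = apply_actions(tokens, actions)
print(spans)  # [(0, 2, 'ORG'), (3, 4, 'PRODUCT')]
```

Because the validity checks depend only on the current state, a model can restrict its prediction at each step to the actions that are legal in that state, which is what makes the incremental formulation attractive.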

Method

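spaCy's NER follows an "embed, encode, attend, predict" framework. In the embed step, token features are extracted and mapped to dense vectors with Bloom embeddings. In the encode step, residual trigram CNNs make each token vector sensitive to its surrounding context. The attend step is handled by manual feature extraction: the vectors relevant to the current parser state are selected by hand rather than by a learned attention mechanism. In the predict step, a multi-layer perceptron takes those features and predicts the next transition action.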
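The embed and encode steps can be sketched in plain Python. This is a toy illustration under stated assumptions, not spaCy's or Thinc's code: the table size, vector width, number of hash functions, the use of MD5 as the hash, and ReLU as the nonlinearity are all choices made here for brevity, and the weights are random rather than trained. The Bloom embedding hashes each key several times into a small table and sums the selected rows, so distinct keys rarely share every row. The trigram layer mixes each token with its two neighbours and adds the input back as a residual connection.

```python
import hashlib
import random

WIDTH = 4      # vector width (tiny, for illustration)
N_ROWS = 50    # table rows, far fewer than a real vocabulary
N_HASHES = 3   # hash functions per key (an assumption of this sketch)

rng = random.Random(0)
table = [[rng.uniform(-0.1, 0.1) for _ in range(WIDTH)] for _ in range(N_ROWS)]


def bloom_rows(key):
    """Map a string key to N_HASHES row indices using seeded hashes."""
    rows = []
    for seed in range(N_HASHES):
        digest = hashlib.md5(f"{seed}:{key}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % N_ROWS)
    return rows


def bloom_embed(key):
    """Sum the table rows selected by each hash of the key."""
    vec = [0.0] * WIDTH
    for row in bloom_rows(key):
        for j in range(WIDTH):
            vec[j] += table[row][j]
    return vec


def trigram_residual_layer(vectors, weights):
    """One encoding layer: window of (left, self, right), then add the input."""
    pad = [0.0] * WIDTH
    padded = [pad] + vectors + [pad]
    out = []
    for i in range(len(vectors)):
        window = padded[i] + padded[i + 1] + padded[i + 2]  # 3 * WIDTH values
        hidden = [max(0.0, sum(w * x for w, x in zip(row, window)))
                  for row in weights]
        out.append([h + x for h, x in zip(hidden, vectors[i])])  # residual
    return out


def random_weights():
    return [[rng.uniform(-0.1, 0.1) for _ in range(3 * WIDTH)]
            for _ in range(WIDTH)]


tokens = ["Explosion", "AI", "released", "spaCy", "v2"]
embedded = [bloom_embed(t.lower()) for t in tokens]

# Two stacked layers: each output vector now depends on two tokens either side.
encoded = embedded
for layer_weights in [random_weights(), random_weights()]:
    encoded = trigram_residual_layer(encoded, layer_weights)
```

Stacking layers is what widens the context: each extra trigram layer extends the receptive field by one token in each direction, while the residual connection keeps the original token information available to later layers.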

Best for: AI Engineer, Research Scientist, NLP Engineer, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).