spaCy’s entity recognition model: incremental parsing with Bloom embeddings & residual CNNs

2017-11-12 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

Explosion AI released spaCy v2, which updates the natural language processing library with deep learning models for named entity recognition (NER), tagging, and parsing. The new statistical model combines a transition-based framework, Bloom embeddings for learning dense word representations, and residual convolutional neural networks (CNNs) for contextual encoding. The en_core_web_lg model achieves a 92% unlabeled attachment score (UAS) for dependency parsing and 97.2% accuracy for part-of-speech tagging, and the en_core_web_sm model reduces NER error by 25% compared to spaCy v1. The article also covers Explosion AI's ecosystem, including the deep learning library Thinc and the annotation tool Prodigy. It emphasizes that effective NER depends on current, domain-specific training data, because general-purpose corpora go out of date quickly.

Key takeaway

NLP engineers building or improving named entity recognition systems should prioritize acquiring and maintaining current, domain-specific training data. Relying solely on general pre-trained models risks outdated entity recognition, as the article's "Trump" examples illustrate. Consider spaCy v2's transition-based approach with Bloom embeddings and residual CNNs for robust performance, and use an annotation tool such as Prodigy to produce the evaluation and fine-tuning data your specific use cases demand.

Key insights

spaCy v2's NER model combines transition-based parsing, Bloom embeddings, and residual CNNs. The article stresses that its accuracy still depends on domain-specific training data.

Principles

Open-source value extends beyond code to complementary tools and services.
Domain-specific, current training data is paramount for effective NER.
Transition-based structured prediction allows flexible, arbitrary feature functions.
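
The last principle can be made concrete with a minimal sketch of a transition system for NER. It assumes a BILUO-style action set (Begin, In, Last, Unit, Out). The `State` class, the `apply_action` and `features` helpers, and the feature names are hypothetical illustrations, not spaCy's internal API. The point is that `features` may read anything in the state, including entities already predicted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class State:
    words: List[str]
    i: int = 0            # index of the next unprocessed word
    open_start: int = -1  # start of the entity being built, or -1 if none
    open_label: str = ""
    ents: List[Tuple[int, int, str]] = field(default_factory=list)


def valid_actions(state: State) -> List[str]:
    # Inside an open entity we may only continue (I) or close (L) it.
    return ["I", "L"] if state.open_start >= 0 else ["B", "U", "O"]


def apply_action(state: State, action: str, label: str = "") -> None:
    if action not in valid_actions(state):
        raise ValueError(f"{action} is not valid in this state")
    if action == "B":    # begin a multi-word entity
        state.open_start, state.open_label = state.i, label
    elif action == "L":  # last word: close the open entity
        state.ents.append((state.open_start, state.i + 1, state.open_label))
        state.open_start, state.open_label = -1, ""
    elif action == "U":  # single-word entity
        state.ents.append((state.i, state.i + 1, label))
    # "I" (inside) and "O" (outside) only advance the buffer.
    state.i += 1


def features(state: State) -> Dict[str, str]:
    # Hypothetical feature function: any function of the state is allowed.
    w = state.words
    return {
        "next_word": w[state.i] if state.i < len(w) else "<end>",
        "prev_word": w[state.i - 1] if state.i > 0 else "<start>",
        "open_entity_first_word": w[state.open_start] if state.open_start >= 0 else "<none>",
        "last_entity_label": state.ents[-1][2] if state.ents else "<none>",
    }


state = State(["Apple", "hired", "Tim", "Cook"])
for action, label in [("U", "ORG"), ("O", ""), ("B", "PERSON"), ("L", "")]:
    apply_action(state, action, label)
print(state.ents)  # [(0, 1, 'ORG'), (2, 4, 'PERSON')]
```

In a trained system the action sequence would come from a classifier scoring `valid_actions(state)` at each step. Here it is hard-coded to show how actions build entity spans.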

Method

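spaCy's NER uses an "embed, encode, attend, predict" framework. It extracts features for each token and maps them to dense vectors with Bloom embeddings (embed), then applies residual trigram CNNs so that each token vector reflects its surrounding context (encode). For each parser state, manually selected features are extracted from these vectors (attend) and fed to a multi-layer perceptron that predicts the next transition action (predict).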
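
The embed and encode steps can be sketched in plain Python under several simplifying assumptions. The sketch embeds a single lowercased-form feature per token, uses `zlib.crc32` with a seed prefix as a stand-in hash function, uses ReLU as the nonlinearity, and leaves the weights random and untrained. All function names and sizes are hypothetical and do not reproduce spaCy's or Thinc's implementation.

```python
import random
import zlib
from typing import List

Vector = List[float]


def make_table(n_rows: int, width: int, seed: int = 0) -> List[Vector]:
    # Random weights stand in for parameters that training would learn.
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(n_rows)]


def bloom_embed(key: str, table: List[Vector], n_hashes: int = 4) -> Vector:
    # Bloom embedding: hash the key several ways into a small table and sum
    # the selected rows, so the table can be far smaller than the vocabulary.
    out = [0.0] * len(table[0])
    for seed in range(n_hashes):
        row = zlib.crc32(f"{seed}:{key}".encode("utf8")) % len(table)
        out = [a + b for a, b in zip(out, table[row])]
    return out


def residual_trigram_layer(X: List[Vector], W: List[Vector]) -> List[Vector]:
    # y_i = x_i + relu(W [x_{i-1}; x_i; x_{i+1}]), zero-padded at the edges.
    pad = [0.0] * len(X[0])
    out = []
    for i, x in enumerate(X):
        left = X[i - 1] if i > 0 else pad
        right = X[i + 1] if i + 1 < len(X) else pad
        window = left + x + right  # list concatenation, length 3 * width
        hidden = [max(0.0, sum(w * v for w, v in zip(row, window))) for row in W]
        out.append([a + h for a, h in zip(x, hidden)])  # residual connection
    return out


def encode(words: List[str], table: List[Vector], layers: List[List[Vector]]) -> List[Vector]:
    X = [bloom_embed(word.lower(), table) for word in words]
    for W in layers:
        X = residual_trigram_layer(X, W)
    return X


width = 8
table = make_table(n_rows=100, width=width)
layers = [make_table(n_rows=width, width=3 * width, seed=s) for s in (1, 2)]
vectors = encode(["Apple", "is", "hiring", "in", "Berlin"], table, layers)
print(len(vectors), len(vectors[0]))  # 5 8
```

Because unrelated words can share some table rows, collisions are expected. Using several hashes makes it unlikely that two words share all of their rows, which is what lets a small table serve a large vocabulary.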

In practice

Use Prodigy for efficient, binary-decision-focused data annotation.
Fine-tune pre-trained models on domain-specific entities.
Use CNNs as the contextual encoder: they process tokens in parallel and give each token a bounded context window.
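
The "bounded context" point comes down to simple arithmetic. Assuming each layer looks one token to either side (a trigram window), stacking layers widens the visible context linearly. The `receptive_field` helper and the depths shown are illustrative and not taken from the article.

```python
def receptive_field(depth: int, half_window: int = 1) -> int:
    """Number of tokens that can influence one output position."""
    return 2 * depth * half_window + 1


for depth in (1, 2, 4):
    print(depth, receptive_field(depth))
# 1 3
# 2 5
# 4 9
```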

Topics

spaCy v2
Named Entity Recognition
Bloom Embeddings
Residual CNNs
Transition-Based Parsing
Data Annotation Tools

Best for: AI Engineer, Research Scientist, NLP Engineer, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.