spaCy v3: Custom trainable relation extraction component
Summary
Sofie, a core developer of spaCy, details the implementation of a custom trainable relation extraction component in spaCy v3, which introduces a flexible training configuration system and easier integration of custom components. The tutorial focuses on building a machine learning model in Thinc that predicts relations between named entities, specifically biomedical relations between genes and proteins. It covers creating a spaCy pipeline component, training it with the new configuration system, and improving its performance with a pretrained Transformer model from HuggingFace. Initial evaluation on a BioNLP Shared Task 2011 dataset yielded a 42% F-score, which rose to 72% once a HuggingFace Transformer was integrated, highlighting how much pretrained models can improve accuracy, especially on smaller datasets.
Key takeaway
For NLP engineers building custom relation extraction systems, spaCy v3 provides the pieces needed to plug in your own models. Use the new configuration system to define custom trainable components, and consider integrating HuggingFace Transformers via "spacy-transformers". In the article's biomedical experiment, this raised the F-score by 30 percentage points, from 42% to 72%, even on a small dataset. The accompanying spaCy project offers a baseline implementation that you can adapt to your own domain.
Key insights
spaCy v3 enables custom trainable components and Transformer integration for significant NLP performance gains.
Principles
- spaCy v3's config system enhances transparency and reproducibility.
- Shared embedding layers (Tok2Vec, Transformer) improve pipeline efficiency and accuracy.
- Pretrained Transformer models can significantly boost performance on small datasets.
Method
Implement a custom spaCy v3 component by subclassing "TrainablePipe", defining a Thinc ML model that handles the forward and backward passes, and registering the component and its model architecture so that the new configuration system can reference them.
In practice
- Use "spacy-transformers" for HuggingFace model integration.
- Define custom attributes (e.g., "doc._.rel") for storing relation annotations.
- Utilize "spaCy projects" for managing end-to-end NLP workflows.
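The custom attribute mentioned above could be set up along these lines. This is a minimal sketch under stated assumptions: the storage layout (a dict keyed by the start offsets of the two entities), the "GGP" entity label, the "Binds" relation label, the placeholder score, and the helper "candidate_pairs" are all hypothetical and only serve to show how "doc._.rel" might hold relation annotations.

```python
from typing import Dict, List, Tuple

from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

# Register the custom attribute; force=True keeps the sketch re-runnable.
Doc.set_extension("rel", default={}, force=True)


def candidate_pairs(doc: Doc, max_length: int) -> List[Tuple[Span, Span]]:
    """Hypothetical helper: ordered pairs of distinct entities whose start
    tokens are at most max_length tokens apart."""
    pairs = []
    for ent1 in doc.ents:
        for ent2 in doc.ents:
            distance = abs(ent2.start - ent1.start)
            if ent1.start != ent2.start and distance <= max_length:
                pairs.append((ent1, ent2))
    return pairs


# Toy document built from pre-split words, so no tokenizer or trained
# pipeline is needed for this illustration.
words = ["KRAS", "binds", "RAF1", "in", "tumor", "cells"]
doc = Doc(Vocab(), words=words)
doc.ents = [Span(doc, 0, 1, label="GGP"), Span(doc, 2, 3, label="GGP")]

# Placeholder scores stand in for the predictions of a trained model.
relations: Dict[Tuple[int, int], Dict[str, float]] = {}
for ent1, ent2 in candidate_pairs(doc, max_length=10):
    relations[(ent1.start, ent2.start)] = {"Binds": 0.5}
doc._.rel = relations

for (start1, start2), label_scores in doc._.rel.items():
    print(doc[start1].text, doc[start2].text, label_scores)
```

Keying on entity start offsets keeps the annotation independent of the "Span" objects themselves, so it survives copying or serializing the "Doc".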
Topics
- spaCy v3
- Relation Extraction
- HuggingFace Transformers
- Thinc Deep Learning
- Custom NLP Components
- Biomedical NLP
Best for: Machine Learning Engineer, NLP Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).