spaCy v3: State-of-the-art NLP from Prototype to Production

2021-06-04 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

spaCy v3 is a major release of the open-source Python library for natural language processing, which has been downloaded more than 20 million times since its first alpha in early 2015. The release integrates Transformer models through the Hugging Face Transformers library and PyTorch, giving high accuracy on tasks such as named entity recognition with, for example, RoBERTa-based pipelines. It adds a configuration system that uses Python's config parser syntax with JSON values and registered functions, which makes complex ML setups easier to describe and validate. A new project system, inspired by DVC, organizes multi-stage ML workflows with asset management, command execution, and remote caching, easing collaboration and production deployment. The update also improves the developer experience with type hints and better error handling, and integrates with tools such as Weights & Biases, Ray, and Streamlit, making spaCy more production-ready and extensible.

Key takeaway

For NLP engineers and ML teams building production applications, spaCy v3 shortens the path from prototype to deployment. Explore its Transformer-based pipelines where accuracy matters most, and use the new configuration and project systems to manage complex, multi-stage workflows. Consider integrating Weights & Biases for experiment tracking and Ray for distributed execution.

Key insights

spaCy v3 delivers production-ready NLP by integrating Transformers, a robust config system, and project management for scalable, highly accurate applications.

Principles

Production-focused design prioritizes deployability and developer experience.
Deep learning systems require robust configuration management.
Waiting for research trends to settle before committing to features makes feature planning more reliable.

Method

The spaCy v3 configuration system uses registered functions to build object trees bottom-up, validating arguments with Pydantic models for robust ML pipeline setup.
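
The mechanism described above can be illustrated with a small standard-library sketch. This is not spaCy's implementation: the "@factory" key, the REGISTRY dictionary, and the two registered functions are hypothetical names chosen for illustration, and a simple type-hint check stands in for the Pydantic validation the summary mentions. It only shows the general idea of parsing config-parser sections with JSON values into a tree, then resolving that tree bottom-up so each registered function receives already-built objects as arguments.

```python
import configparser
import inspect
import json

# Hypothetical registry (not spaCy's): maps a string name to a function.
REGISTRY = {}


def register(name):
    """Decorator that stores a function in the registry under a string name."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


@register("linear_schedule.v1")
def linear_schedule(start: float, end: float, steps: int):
    # Evenly spaced values from start to end, inclusive.
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]


@register("optimizer.v1")
def make_optimizer(name: str, learn_rate: list):
    return {"name": name, "learn_rate": learn_rate}


CONFIG_TEXT = """
[optimizer]
@factory = "optimizer.v1"
name = "adam"

[optimizer.learn_rate]
@factory = "linear_schedule.v1"
start = 0.01
end = 0.001
steps = 3
"""


def parse(text):
    """Turn dotted section names into a nested dict; decode values as JSON."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    tree = {}
    for section in parser.sections():
        node = tree
        for part in section.split("."):
            node = node.setdefault(part, {})
        for key, value in parser[section].items():
            node[key] = json.loads(value)
    return tree


def validate(func, kwargs):
    """Stand-in for Pydantic: check argument names and simple type hints."""
    signature = inspect.signature(func)
    signature.bind(**kwargs)  # raises TypeError on missing or unknown names
    for name, value in kwargs.items():
        expected = signature.parameters[name].annotation
        if expected is not inspect.Parameter.empty and not isinstance(value, expected):
            raise TypeError(f"{func.__name__}: {name!r} should be {expected.__name__}")


def resolve(node):
    """Build the object tree bottom-up: children first, then the parent."""
    if not isinstance(node, dict):
        return node
    resolved = {key: resolve(value) for key, value in node.items()}
    if "@factory" not in resolved:
        return resolved
    func = REGISTRY[resolved.pop("@factory")]
    validate(func, resolved)
    return func(**resolved)


pipeline = resolve(parse(CONFIG_TEXT))
print(pipeline["optimizer"]["name"])
```

Because the learning-rate section is resolved before its parent, the optimizer function receives a finished schedule and never sees raw config values. A mistyped value such as steps = "three" fails in validate before any function is called.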

In practice

Use Transformer-based pipelines for improved accuracy in NER.
Structure multi-stage ML projects with the spaCy project system.
Define custom pipeline components, including ones that wrap your own models, using the "@Language.component" and "@Language.factory" decorators.
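
As a minimal sketch of the last point, the example below registers one stateless component and one configurable factory on a blank English pipeline. It assumes spaCy v3 is installed. The component names, the LengthFlagger class, and the two custom Doc attributes are hypothetical and exist only for illustration. No trained model is needed.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical custom attributes used only in this illustration.
Doc.set_extension("token_count", default=0, force=True)
Doc.set_extension("is_long", default=False, force=True)


@Language.component("token_counter")
def token_counter(doc: Doc) -> Doc:
    """Stateless component: a plain function that takes and returns a Doc."""
    doc._.token_count = len(doc)
    return doc


class LengthFlagger:
    """Stateful component: keeps a setting passed in by its factory."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens

    def __call__(self, doc: Doc) -> Doc:
        doc._.is_long = len(doc) > self.max_tokens
        return doc


@Language.factory("length_flagger", default_config={"max_tokens": 5})
def create_length_flagger(nlp: Language, name: str, max_tokens: int):
    # The factory receives the pipeline, the component name, and its settings.
    return LengthFlagger(max_tokens)


nlp = spacy.blank("en")
nlp.add_pipe("token_counter")
nlp.add_pipe("length_flagger", config={"max_tokens": 3})

doc = nlp("spaCy v3 makes custom components easy to register.")
print(nlp.pipe_names)
print(doc._.token_count, doc._.is_long)
```

The function form suits simple, stateless steps. The factory form is the better fit when a component needs settings, since those settings can then be overridden per pipeline, as max_tokens is here.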

Topics

spaCy v3
Transformer Models
NLP Pipelines
MLOps
Configuration Systems
Data Annotation

Best for: MLOps Engineer, Machine Learning Engineer, NLP Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion.