State-of-the-Art Transformer Pipelines in spaCy

2023-11-10 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Intermediate, extended

Summary

The presentation details state-of-the-art Transformer pipelines in spaCy, a Python library for natural language processing that is installed with "pip install spacy". It outlines core components such as tokenization, tok2vec for token embeddings, lemmatization, and span and document classification. For long documents, spaCy splits the text into overlapping spans, runs the Transformer on each span, and reduces the resulting representations; this builds on the "curated-transformers" library, which supports models such as BERT as well as LLMs (Llama 2, Falcon). The talk covers inference, showing how to fetch pre-trained models with "spacy download" and build a FastAPI web service around them. It also details pipeline training on the GMB corpus (over 10,000 English public domain texts), emphasizing configuration through a single file validated by Pydantic, and workflow management with spaCy projects for steps such as data conversion, training ("spacy train"), and evaluation ("spacy evaluate"). A key discussion point is that supervised models often outperform LLMs such as GPT-3 and Claude 2 on predictive tasks, even when only a small amount of annotated data is available.

Key takeaway

NLP engineers building predictive applications should evaluate fine-tuned Transformer models in spaCy against LLMs, especially when domain-specific annotated data is available. Even a small amount of labeled data can let a spaCy pipeline achieve higher accuracy and efficiency than large language models such as GPT-3 or Claude 2 on tasks like text classification or named entity recognition. Use spaCy projects to streamline training and deployment workflows and to keep them reproducible and resource-efficient.

Key insights

spaCy offers robust, configurable Transformer pipelines for NLP tasks, often outperforming LLMs on predictive tasks with modest data.

Principles

Tokenization is foundational for NLP, separating words from punctuation.
Supervised models often outperform LLMs on predictive tasks, even with a small amount of annotated data.
Centralized configuration enhances reproducibility in NLP pipelines.
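
The first principle can be illustrated with a minimal sketch. It assumes spaCy is installed and uses a blank English pipeline, which contains only the rule-based tokenizer and therefore needs no pre-trained model. The example sentence is made up for illustration and does not come from the talk.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer,
# so no pre-trained model has to be downloaded for this sketch.
nlp = spacy.blank("en")

# Illustrative input; punctuation is attached to the words in the raw text.
doc = nlp("Hello, world!")

# Each token is a separate unit: words and punctuation are split apart.
tokens = [token.text for token in doc]
print(tokens)  # ['Hello', ',', 'world', '!']
```

Later components such as tok2vec or a named entity recognizer operate on these tokens rather than on the raw string.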

Method

spaCy pipelines process raw text through a tokenizer, tok2vec for embeddings, and optional components like lemmatizers or NER. Long documents are handled by splitting into overlapping spans, processing with a Transformer, and reducing representations.
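
The long-document handling described above can be sketched in plain Python. This is not spaCy's or curated-transformers' actual implementation: the function names, the "window" and "stride" parameters, and the choice of averaging as the reduction step are all assumptions made for illustration, and the Transformer itself is replaced by precomputed per-span vectors.

```python
from typing import List, Sequence

Vector = List[float]


def split_into_spans(n_tokens: int, window: int, stride: int) -> List[range]:
    """Return overlapping index ranges that together cover all tokens.

    Hypothetical helper: `window` is the span length and `stride` the step
    between span starts; stride < window yields overlapping spans.
    """
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window")
    if n_tokens == 0:
        return []
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append(range(start, end))
        if end >= n_tokens:
            break
        start += stride
    return spans


def reduce_by_mean(
    n_tokens: int,
    spans: Sequence[range],
    span_outputs: Sequence[Sequence[Vector]],
) -> List[Vector]:
    """Average the vectors a token received from every span covering it.

    Averaging is an assumed reduction; the source only says the
    representations are "reduced".
    """
    sums: list = [None] * n_tokens
    counts = [0] * n_tokens
    for span, outputs in zip(spans, span_outputs):
        for idx, vec in zip(span, outputs):
            if sums[idx] is None:
                sums[idx] = list(vec)
            else:
                sums[idx] = [a + b for a, b in zip(sums[idx], vec)]
            counts[idx] += 1
    return [[v / counts[i] for v in sums[i]] for i in range(n_tokens)]


# Stand-in for the Transformer: each token's vector is
# [token index, index of the span that produced it].
spans = split_into_spans(10, window=4, stride=2)
outputs = [[[float(i), float(k)] for i in span] for k, span in enumerate(spans)]
merged = reduce_by_mean(10, spans, outputs)
```

With ten tokens, a window of 4, and a stride of 2, the spans cover tokens 0-3, 2-5, 4-7, and 6-9, so most tokens are seen twice and receive the average of both views.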

In practice

Use "spacy download" to fetch pre-trained models for standard NLP tasks.
Employ "spaCy projects" to manage end-to-end NLP model training workflows.
Consider fine-tuning a Transformer model instead of relying on an LLM for predictive tasks when annotated data is available.

Topics

spaCy Library
Transformer Pipelines
NLP Model Training
Named Entity Recognition
Document Classification
Large Language Models

Best for: NLP Engineer, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.