NLP: From Prototype to Production

2023-02-24 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Intermediate, extended

Summary

Ines Montani, co-founder and CEO of Explosion, discussed taking Natural Language Processing (NLP) from prototype to production, focusing on the company's tools, spaCy and Prodigy. spaCy, an open-source Python library launched in 2015, provides production-ready NLP components that prioritize speed and accuracy. Prodigy, the company's commercial annotation tool, supports efficient data labeling for machine learning models. Montani highlighted that production NLP addresses specific business needs, whereas research focuses on general knowledge. Key skills for production include strong software development, basic linguistics, and domain expertise. Technical challenges include continuous improvement, adapting to evolving data, and robust evaluation. Montani also addressed Large Language Models (LLMs), which she sees as valuable tools for data generation and annotation, especially for creating high-quality datasets for smaller, specialized models, and she emphasized the need for local, privacy-preserving solutions. She confirmed that spaCy v4 is in development, promising API refinements and efficiency improvements.

Key takeaway

For AI Engineers building NLP solutions, prioritize understanding your specific business problem and its required output over chasing peak academic accuracy. Your focus should be on developing maintainable, efficient systems, even if that means starting with simpler baselines like regular expressions. Employ LLMs as tools for data generation and annotation to rapidly build high-quality, specialized datasets, but ensure your deployment strategy accounts for data privacy and local execution needs.

Key insights

Production NLP prioritizes specific business utility and maintainability over raw academic accuracy.

Principles

Strong software development skills are crucial for production NLP.
Basic linguistic understanding aids effective NLP solution design.
Start with simple baselines (e.g., regex) before complex ML.
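
As an illustration of the regex-first principle, here is a minimal sketch of a rule-based baseline built only on Python's standard library. The task (pulling money amounts out of text), the pattern, and the function name `extract_money` are assumptions chosen for illustration; they do not come from the talk.

```python
import re

# Hypothetical baseline pattern: matches amounts such as "$1,200.50" or "USD 300".
# A currency marker, then digits with optional thousands separators and decimals.
MONEY_PATTERN = re.compile(r"(?:\$\s?|USD\s)\d+(?:,\d{3})*(?:\.\d+)?")


def extract_money(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, matched_text) for every money-like span in the text."""
    return [(m.start(), m.end(), m.group()) for m in MONEY_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "Invoice total: $1,200.50, deposit USD 300."
    for start, end, span in extract_money(sample):
        print(start, end, span)
```

A baseline like this is cheap to write and easy to inspect, and its misses give a concrete picture of which cases a statistical model would need to cover.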

Method

Utilize large language models (LLMs) in annotation workflows to generate and suggest labels, then manually correct to create high-quality, specialized datasets for smaller downstream models.
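
The suggest-then-correct workflow described above can be sketched as a small loop. This is an assumption-laden outline, not Prodigy's API or the speaker's implementation: `annotate`, `fake_llm_suggest`, and `scripted_review` are hypothetical names, and the "LLM" here is a keyword stub so the example runs without any model or network access.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Example:
    text: str
    label: str
    source: str  # "accepted" if the suggestion was kept, "corrected" otherwise


def annotate(
    texts: Iterable[str],
    suggest: Callable[[str], str],
    review: Callable[[str, str], Optional[str]],
) -> list[Example]:
    """Collect reviewed labels: the model suggests, a human accepts, corrects, or skips."""
    dataset = []
    for text in texts:
        suggestion = suggest(text)
        final = review(text, suggestion)
        if final is None:  # reviewer skipped this example
            continue
        source = "accepted" if final == suggestion else "corrected"
        dataset.append(Example(text, final, source))
    return dataset


# Hypothetical stand-in for an LLM call; a real setup would query a model here.
def fake_llm_suggest(text: str) -> str:
    return "COMPLAINT" if "refund" in text.lower() else "OTHER"


# Hypothetical stand-in for a human reviewer with one scripted correction.
def scripted_review(text: str, suggestion: str) -> Optional[str]:
    corrections = {"Where is my order?": "COMPLAINT"}
    return corrections.get(text, suggestion)


if __name__ == "__main__":
    texts = ["I want a refund", "Where is my order?", "Thanks!"]
    for example in annotate(texts, fake_llm_suggest, scripted_review):
        print(example)
```

Recording whether each label was accepted or corrected gives a rough measure of how much the suggestions can be trusted, and the resulting dataset is what a smaller downstream model would be trained on.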

In practice

Employ LLMs for initial data generation and labeling suggestions.
Prioritize local, privacy-preserving models for sensitive data.
Evaluate model utility in context, not just accuracy scores.
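
To make the last point concrete, here is a small sketch of how a single accuracy score can hide a failure on the class the business cares about. The labels and data are invented for illustration, and the helper names `accuracy` and `recall_per_label` are assumptions rather than functions from any library.

```python
from collections import Counter


def accuracy(gold: list[str], pred: list[str]) -> float:
    """Fraction of examples where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def recall_per_label(gold: list[str], pred: list[str]) -> dict[str, float]:
    """For each gold label, the fraction of its examples that were predicted correctly."""
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, pred) if g == p)
    return {label: hits[label] / totals[label] for label in totals}


if __name__ == "__main__":
    # Invented data: 18 routine tickets, 2 urgent ones, and a model that never predicts URGENT.
    gold = ["OTHER"] * 18 + ["URGENT"] * 2
    pred = ["OTHER"] * 20
    print(accuracy(gold, pred))
    print(recall_per_label(gold, pred))
```

If catching urgent tickets is the reason the system exists, the per-label view is the one that reflects its usefulness, even though the overall score looks healthy.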

Topics

Natural Language Processing
spaCy Library
Prodigy Annotation Tool
Large Language Models
MLOps
Information Extraction

Best for: NLP Engineer, Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.