David Dodson: spaCy in the News: Quartz’s NLP pipeline
Summary
Quartz developed Scio2, a living natural language processing pipeline built on spaCy, to analyze global business news with domain-specific context. Rather than relying on a general-purpose model, the pipeline is tuned for time-sensitive news analysis. Its corpus comprises 70,000 articles, totaling 101.4 million text blocks, and its base model is trained on 85,000 labeled sentences. A core component is the language graph, which stores analyzed content, tracks linguistic changes over time (e.g., the emergence of "5G"), and serves as a dynamic resource for extracting new training data to retrain spaCy models. The system performs rich analysis, including custom entity recognition (such as a "construct" label for evolving terms) and classification of stylistic elements such as active/passive voice, in line with Quartz's editorial style guide. Together, these enable real-time content analysis and models that evolve with the language they describe.
Key takeaway
NLP engineers building systems for rapidly evolving content should prioritize domain-specific pipelines over general-purpose AI approaches. Implement a dynamic language graph to store analyzed content and track linguistic shifts, enabling continuous model retraining with relevant, time-sensitive data. This strategy helps keep your models accurate and contextually aware as new terminology and narratives emerge.
Key insights
Domain-specific NLP pipelines, like Scio2, excel in time-sensitive news analysis by dynamically adapting to evolving language and context.
Principles
- Embrace domain-specific AI over general AI.
- Time sensitivity is critical for news analysis.
- Language and entities evolve, requiring dynamic models.
Method
Content is analyzed by spaCy, then added to a language graph. This graph stores relationships, tracks linguistic evolution, and serves as a dynamic source for extracting training data to retrain spaCy models.
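The storage and tracking role of the graph can be illustrated with a minimal sketch. It is not Scio2's implementation: the `LanguageGraph` class, its method names, the article IDs, and the dates are all hypothetical, and the entity strings are hard-coded where a real pipeline would take them from spaCy output such as `doc.ents`.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date


@dataclass
class LanguageGraph:
    """Toy language graph: term nodes linked to the articles that mention them."""

    # term -> list of (article_id, publication date) edges
    mentions: dict = field(default_factory=lambda: defaultdict(list))

    def add_article(self, article_id, published, entities):
        """Store the entity strings found in one analyzed article."""
        for term in entities:
            self.mentions[term].append((article_id, published))

    def first_seen(self, term):
        """Earliest publication date at which the term appears, or None."""
        dates = [published for _, published in self.mentions.get(term, [])]
        return min(dates) if dates else None

    def yearly_counts(self, term):
        """Mentions per year, a crude view of how usage changes over time."""
        counts = defaultdict(int)
        for _, published in self.mentions.get(term, []):
            counts[published.year] += 1
        return dict(sorted(counts.items()))


graph = LanguageGraph()
# Hypothetical data: in practice the entities would come from a spaCy pipeline.
graph.add_article("a1", date(2016, 3, 1), ["4G", "Verizon"])
graph.add_article("a2", date(2018, 6, 12), ["5G", "Verizon"])
graph.add_article("a3", date(2019, 1, 20), ["5G"])
graph.add_article("a4", date(2019, 9, 5), ["5G", "Huawei"])

print(graph.first_seen("5G"))
print(graph.yearly_counts("5G"))
```

Keying edges by publication date is what lets the graph answer questions such as when "5G" first entered the corpus; a production system would likely use a graph database and richer relationships than this dictionary.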
In practice
- Implement a language graph to store the results of corpus analysis.
- Dynamically extract training data from the graph.
- Use spaCy for real-time content classification.
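The second step above, extracting training data from the graph, might look roughly like the following sketch. The record layout, the `extract_training_data` helper, the `CONSTRUCT` label spelling, and the sample sentences are assumptions made for illustration; the source does not describe how Scio2 stores or queries its graph.

```python
from datetime import date

# Hypothetical records as they might be read out of a language graph:
# each links a sentence to an entity term found in it and a publication date.
records = [
    {"text": "Carriers race to launch 5G networks.", "term": "5G",
     "label": "CONSTRUCT", "published": date(2019, 2, 1)},
    {"text": "4G coverage expanded last year.", "term": "4G",
     "label": "CONSTRUCT", "published": date(2015, 7, 9)},
    {"text": "Regulators weigh 5G spectrum auctions.", "term": "5G",
     "label": "CONSTRUCT", "published": date(2019, 5, 30)},
]


def extract_training_data(records, since):
    """Build (text, annotations) pairs from records published on or after `since`."""
    examples = []
    for rec in records:
        if rec["published"] < since:
            continue  # keep only recent, time-relevant sentences
        start = rec["text"].find(rec["term"])
        if start == -1:
            continue  # term not present verbatim; skip rather than guess offsets
        end = start + len(rec["term"])
        examples.append((rec["text"], {"entities": [(start, end, rec["label"])]}))
    return examples


train_data = extract_training_data(records, since=date(2019, 1, 1))
for text, annotations in train_data:
    print(text, annotations)
```

Pairs in this shape, a text plus character-offset entity annotations, can be converted into spaCy v3 training examples with `Example.from_dict(nlp.make_doc(text), annotations)`, which is one way such an extract could feed a retraining run.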
Topics
- Natural Language Processing
- spaCy
- Domain-Specific AI
- Language Graph
- News Analysis
- Entity Recognition
Best for: AI Engineer, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).