Intro to NLP with spaCy (4): Detecting programming languages

2020-03-02 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, extended

Summary

This video, the fourth in a spaCy tutorial series, shows how to train a custom Named Entity Recognition (NER) model to detect programming languages, moving from the rule-based methods of earlier episodes to a machine learning approach. It begins by reviewing spaCy's core `nlp` object and its modular pipeline, in which the NER component is responsible for entity detection. Training data is prepared by converting labeled text into `(text, {"entities": [(start, end, "LABEL")]})` tuples, reusing the existing matchers to generate examples quickly. The tutorial then builds a training loop and improves it with mini-batching, compounding batch sizes, and dropout for better stability and speed. The initial loop took eight minutes for 20 iterations; the optimized loop halved that time. The video concludes by testing the trained model on new text, where it identifies programming languages such as "Python" and "JavaScript".

Key takeaway

For NLP engineers building custom entity recognition systems, this video shows how to move from rule-based methods to a trainable spaCy NER model. Reuse existing matchers to generate initial training data quickly, and use mini-batching and dropout in the training loop for faster, more stable learning. The result is a model that can detect domain-specific entities such as programming languages without a hand-written rule for every case.

Key insights

Training a custom spaCy Named Entity Recognition (NER) model automates programming language detection by learning from labeled data.

Principles

ML models infer rules from data.
spaCy's NLP pipeline is modular.
Batching and dropout improve training.

Method

Create a blank spaCy NLP model, add a custom NER component, generate training data as `(text, {"entities": [(start, end, "LABEL")]})` tuples, then train with an optimized loop using mini-batching and dropout.
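
The tuple format named in the method can be illustrated without spaCy. The following is a minimal sketch, not the video's code: a regular expression stands in for the spaCy matcher, and the label `PROG_LANG`, the language list, and the helper `to_training_example` are assumptions made for illustration. Offsets are character positions into the text, with an exclusive end.

```python
import re

# Hypothetical label and language list, chosen for illustration only.
LABEL = "PROG_LANG"
LANGUAGES = ["Python", "JavaScript", "Java"]

# Longest names first so "JavaScript" is tried before "Java".
_NAMES = sorted(LANGUAGES, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(n) for n in _NAMES) + r")\b")


def to_training_example(text):
    """Return (text, {"entities": [(start, end, LABEL), ...]}).

    The regex is a stand-in for an existing rule-based matcher; start and
    end are character offsets, and end is exclusive.
    """
    entities = [(m.start(), m.end(), LABEL) for m in _PATTERN.finditer(text)]
    return (text, {"entities": entities})


texts = ["I write Python and JavaScript.", "No languages mentioned here."]
TRAIN_DATA = [to_training_example(t) for t in texts]
```

Texts with no match are kept with an empty entity list, so the data also contains negative examples. Whether the video keeps or filters such texts is not stated in this summary.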

In practice

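Use `nlp.pipe` to process many texts as a batched stream instead of calling the pipeline on one text at a time.
Generate training data by running the existing matchers over raw text.
Use mini-batches with a compounding (gradually growing) batch size for more stable learning.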
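
The compounding mini-batch idea in the list above can be sketched in plain Python. This is an illustrative skeleton under stated assumptions, not the video's training loop: `compounding`, `minibatches`, and `train` are hand-written helpers, `update` is a hypothetical callback standing in for the model's update step, and the schedule parameters and dropout rate are placeholder values. Only the 20 iterations come from the summary.

```python
import random


def compounding(start, stop, compound):
    """Yield batch sizes that grow geometrically from start, capped at stop."""
    size = start
    while True:
        yield min(size, stop)
        size *= compound


def minibatches(items, sizes):
    """Split items into consecutive batches whose lengths follow `sizes`."""
    items = list(items)
    i = 0
    while i < len(items):
        n = max(1, int(next(sizes)))
        yield items[i:i + n]
        i += n


def train(train_data, update, n_iter=20, drop=0.2, seed=0):
    """Shuffle, batch, and call `update` once per batch for n_iter passes.

    `update(texts, annotations, drop=...)` is a hypothetical stand-in for
    the real model update; `drop` is the dropout rate passed through to it.
    """
    rng = random.Random(seed)
    data = list(train_data)
    for _ in range(n_iter):
        rng.shuffle(data)
        # Restart the schedule each pass: small batches first, then larger.
        sizes = compounding(1.0, 16.0, 1.5)
        for batch in minibatches(data, sizes):
            texts = [text for text, _ in batch]
            annotations = [ann for _, ann in batch]
            update(texts, annotations, drop=drop)
```

Small early batches give frequent updates while the weights are still far from useful, and larger later batches give smoother gradient estimates and fewer update calls, which is the usual motivation for a compounding schedule.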

Topics

spaCy
Named Entity Recognition
NLP Pipelines
Machine Learning
Training Data Generation
Programming Language Detection

Best for: Machine Learning Engineer, NLP Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).