Intro to NLP with spaCy (3): Detecting programming languages

2019-12-07 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, extended

Summary

This session, part three of the "Intro to NLP with spaCy" series, walks through an iterative approach to detecting programming languages in text with spaCy's Matcher. It builds on earlier work that used part-of-speech tags and multi-token pattern matching, and shifts the focus to evaluating and refining the Matcher. The work involves building a custom Jupyter UI for visualizing matches, expanding the patterns to cover languages such as C#, C++, and Java, and, most importantly, labeling data by hand. Labeling 500 instances from Stack Overflow questions surfaced domain context, for example that "ASP.NET" is a tool rather than a language. Error analysis of false positives and false negatives, including version numbers (e.g., "PHP 5") and tool names (e.g., "SQL Server"), guided further pattern adjustments. The approach is evaluated quantitatively with scikit-learn's confusion matrix and classification report, which show improved precision and recall and establish a baseline for future deep learning comparisons.
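To make the pattern-refinement idea concrete, here is a minimal sketch of a Matcher that handles the two cases named above, version numbers and tool names. It assumes spaCy v3's `matcher.add(key, patterns)` signature; the patterns, the `PROG_LANG` key, and the `find_languages` helper are illustrative assumptions, not the rules used in the original session.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# A blank English pipeline is enough here: the patterns below only use
# token text attributes, so no trained model has to be downloaded.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Illustrative patterns (assumptions, not the session's exact rules).
matcher.add(
    "PROG_LANG",
    [
        [{"LOWER": "java"}],
        # Absorb an optional version number so "PHP 5" is a single match.
        [{"LOWER": "php"}, {"IS_DIGIT": True, "OP": "?"}],
        [{"LOWER": "sql"}],
    ],
)


def find_languages(text):
    """Return matched language mentions as strings, in document order."""
    doc = nlp(text)
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    # Overlapping candidates such as "PHP" and "PHP 5" are reduced
    # to the longest span.
    spans = filter_spans(spans)
    kept = []
    for span in spans:
        next_token = doc[span.end] if span.end < len(doc) else None
        # Post-filter: "SQL Server" names a tool, not the language.
        if (
            span.text.lower() == "sql"
            and next_token is not None
            and next_token.lower_ == "server"
        ):
            continue
        kept.append(span.text)
    return kept


print(find_languages("I moved from PHP 5 to Java, but the data stays in SQL Server."))
```

The post-filter is one possible design choice: Matcher patterns have no negative lookahead, so excluding "SQL Server" is done in plain Python after matching.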

Key takeaway

For NLP engineers building rule-based text classification systems: prioritize manual data labeling and an iterative error analysis workflow. Labeling by hand uncovers domain context, such as "ASP.NET" being a tool rather than a programming language, and exposes specific pattern shortcomings, such as the handling of version numbers. Analyze false positives and false negatives systematically, then refine the Matcher patterns. Track each change with scikit-learn's classification report so that improvements are measured rather than assumed, and so that the rule-based system provides a baseline for later model comparisons.

Key insights

Manual labeling and iterative error analysis are crucial for refining rule-based NLP systems and uncovering critical domain context.

Principles

Labeling reveals critical domain context.
Iterative refinement improves rule-based systems.
Metrics guide targeted improvements.

Method

Iteratively refine spaCy Matcher patterns by creating a custom Jupyter UI for error exploration, manually labeling data, analyzing false positives/negatives, and evaluating with scikit-learn classification metrics.
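The error-analysis step in this method can be sketched as a small helper that sorts labeled examples into false positives and false negatives for manual review. The `split_errors` function, the example dictionary layout, and the rows themselves are hypothetical and made up to show the mechanics; they are not data from the original session.

```python
def split_errors(examples):
    """Group misclassified examples by error type.

    Each example is a dict with:
      'text' - the raw text,
      'gold' - the hand-assigned label (True if a language is mentioned),
      'pred' - the label produced by the rule-based matcher.
    """
    buckets = {"false_positive": [], "false_negative": []}
    for example in examples:
        if example["pred"] and not example["gold"]:
            buckets["false_positive"].append(example["text"])
        elif example["gold"] and not example["pred"]:
            buckets["false_negative"].append(example["text"])
    return buckets


# Made-up rows; the gold and predicted labels are illustrative only.
examples = [
    {"text": "Backups on SQL Server keep failing", "gold": False, "pred": True},
    {"text": "Is this deprecated in PHP 5?", "gold": True, "pred": False},
    {"text": "Java generics question", "gold": True, "pred": True},
    {"text": "How do I center a div?", "gold": False, "pred": False},
]

errors = split_errors(examples)
for kind, texts in errors.items():
    print(kind)
    for text in texts:
        print("  ", text)
```

Reading the texts in each bucket is what suggests the next pattern change, after which the matcher is re-run and the buckets are compared again.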

In practice

Develop custom UIs for model behavior exploration.
Manually label data to uncover domain context.
Use scikit-learn for confusion matrices and classification reports.
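As a rough illustration of the last item, the sketch below feeds hand-assigned labels and matcher predictions into scikit-learn's `confusion_matrix` and `classification_report`. The label vectors and the class names are invented for the example and do not reproduce the numbers from the original session.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels: 1 = text mentions the target language, 0 = it does not.
# y_true stands for manual labels, y_pred for the matcher's output.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Precision and recall for the positive class, derived from the matrix.
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(cm)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(
    classification_report(
        y_true,
        y_pred,
        labels=[0, 1],
        target_names=["no language", "language"],
    )
)
```

Re-running the same report after each pattern change gives a like-for-like comparison of precision and recall across iterations.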

Topics

spaCy
Natural Language Processing
Rule-based Systems
Text Classification
Data Labeling
Error Analysis
Classification Metrics

Best for: Machine Learning Engineer, NLP Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.