How Far Can Classical NLP Go? From Bag-of-Words to Stacking on Spooky Author Identification
Summary
The article explores how far classical Natural Language Processing (NLP) techniques can go on the Kaggle Spooky Author Identification competition. The task is to classify single sentences from gothic fiction as written by Edgar Allan Poe, Mary Wollstonecraft Shelley, or H. P. Lovecraft, which rewards stylistic cues over content. The author built a sequence of models: a Vowpal Wabbit (VW) word baseline, a richer VW model that adds punctuation and character n-grams, a tuned TF-IDF ensemble, and finally a stacked sparse-text ensemble. The strongest classical pipeline reached 0.8687 accuracy and 0.3504 log loss on a 70/30 holdout split. The final stacked submission scored 0.30414 private and 0.33621 public log loss on Kaggle, showing that classical NLP, with careful representation and validation, can be highly effective for stylistic text classification.
Key takeaway
For Machine Learning Engineers tackling stylistic text classification, prioritize detailed feature engineering and a carefully validated ensemble. Capture subtle stylistic cues with character n-grams, punctuation, and sparse word features, then combine the resulting models in a well-tuned stacked ensemble. This approach achieved a 0.30414 private log loss on Kaggle's Spooky Author Identification, a competitive result for a short-text task using only classical models.
Key insights
Classical NLP, with careful feature engineering and ensemble stacking, excels at stylistic text classification.
Principles
- Stylistic text classification benefits from fine-grained features.
- Ensemble stacking improves the quality of predicted probabilities (as measured by log loss) over simple averaging.
- Rigorous validation setups prevent misleading performance estimates.
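Because the competition is scored on log loss rather than accuracy, it may help to see how the two can diverge. The sketch below is a minimal pure-Python illustration; the labels and probability rows are made-up values, not results from the article, and the helper names are hypothetical.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

def accuracy(y_true, probs):
    """Fraction of rows whose highest-probability class is the true class."""
    hits = 0
    for label, row in zip(y_true, probs):
        predicted = max(range(len(row)), key=row.__getitem__)
        hits += predicted == label
    return hits / len(y_true)

# Hypothetical predictions for three sentences over classes 0, 1, 2.
y_true = [0, 1, 2]
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.9, 0.05, 0.05]]
hedged = [[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.4, 0.2, 0.4]]

# Both models get the same two sentences right, but the confident model
# is punished much harder for its one confident mistake.
print(accuracy(y_true, confident), multiclass_log_loss(y_true, confident))
print(accuracy(y_true, hedged), multiclass_log_loss(y_true, hedged))
```

Two models with identical accuracy can therefore rank very differently on the leaderboard metric, which is one reason to validate on log loss directly.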
Method
The method builds a sequence of classical models (VW, TF-IDF), adds style-aware features such as punctuation and character n-grams, and then stacks the models' out-of-fold predictions with a meta-learner.
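The stacking step could be sketched roughly as follows with scikit-learn. This is an illustrative assumption, not the article's actual pipeline: the two base models, the logistic-regression meta-learner, the fold count, and the toy sentences (invented for this example, not quotes from the dataset) are all stand-ins.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline

# Invented toy sentences; EAP/HPL/MWS are assumed author codes.
texts = [
    "The raven tapped, and tapped again, upon my chamber door.",
    "I shrieked; yet the beating of the heart grew louder still.",
    "My nerves, ever acute, now trembled at the faintest sound.",
    "Nameless horrors slumbered beneath the cyclopean ruins.",
    "The eldritch chanting rose from the abyss below the town.",
    "No sane mind could describe that blasphemous, formless thing.",
    "My heart was filled with tenderness for my beloved friend.",
    "I wept for the misery of the creature I had abandoned.",
    "The gentle valley soothed my sorrow and calmed my spirit.",
]
labels = np.array(["EAP"] * 3 + ["HPL"] * 3 + ["MWS"] * 3)

# Two base models over different sparse views of the same text.
base_models = [
    make_pipeline(TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
                  LogisticRegression(max_iter=1000)),
    make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
                  LogisticRegression(max_iter=1000)),
]

# Out-of-fold probabilities: every row is predicted by a model that
# never saw that row during training, so the meta-learner is not
# trained on leaked predictions.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
oof = [cross_val_predict(m, texts, labels, cv=cv, method="predict_proba")
       for m in base_models]
meta_X = np.hstack(oof)  # shape: (n_samples, n_models * n_classes)

meta = LogisticRegression(max_iter=1000).fit(meta_X, labels)

# At prediction time, refit each base model on all training text and
# feed its probabilities to the meta-learner.
new_texts = ["A formless horror crept from the ruins."]
new_X = np.hstack([m.fit(texts, labels).predict_proba(new_texts)
                   for m in base_models])
final_probs = meta.predict_proba(new_X)
```

The key design point is that the meta-learner only ever trains on out-of-fold predictions; fitting it on in-sample base-model outputs would overstate how reliable those models are.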
In practice
- Use Vowpal Wabbit for fast, sparse linear text models.
- Incorporate character n-grams and punctuation for style analysis.
- Apply NB-SVM-style Logistic Regression, which scales sparse features by naive Bayes log-count ratios before fitting a linear model.
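As a rough illustration of the first two points, the sketch below turns a sentence into one line of Vowpal Wabbit's text input format, with separate namespaces for words, punctuation, and character n-grams. The namespace names (`w`, `p`, `c`), the n-gram range, the label mapping, and the helper functions are assumptions made for this example; the article's exact feature layout is not given here.

```python
import re

# Assumed mapping from author codes to VW multiclass labels (1..k).
AUTHOR_TO_LABEL = {"EAP": 1, "HPL": 2, "MWS": 3}

def sanitize(token):
    """Replace characters reserved by VW's text format ('|' and ':')."""
    return token.replace("|", "_bar_").replace(":", "_colon_")

def char_ngrams(text, n_min=2, n_max=4):
    """Character n-grams over lowercased text, whitespace shown as '_'."""
    s = re.sub(r"\s+", "_", text.lower().strip())
    return [sanitize(s[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)]

def to_vw_line(author, text):
    """Build one VW example with word, punctuation, and char namespaces."""
    words = re.findall(r"[a-z']+", text.lower())
    # Encode each punctuation mark by its code point, e.g. ';' -> p59,
    # so reserved characters never appear as raw feature names.
    punct = [f"p{ord(ch)}" for ch in text
             if not ch.isalnum() and not ch.isspace()]
    chars = char_ngrams(text)
    return "{} |w {} |p {} |c {}".format(
        AUTHOR_TO_LABEL[author],
        " ".join(words), " ".join(punct), " ".join(chars))

line = to_vw_line("EAP", "It was dark; very dark.")
print(line)
```

Lines in this shape could then be passed to VW's one-against-all multiclass mode (`--oaa 3`); which VW options the author actually used is not stated in this summary.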
Topics
- Authorship Attribution
- Classical NLP
- Text Classification
- Vowpal Wabbit
- TF-IDF
- Ensemble Stacking
- Feature Engineering
Best for: Machine Learning Engineer, NLP Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.