How Far Can Classical NLP Go? From Bag-of-Words to Stacking on Spooky Author Identification

2026-06-29 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, long

Summary

The article explores how far classical Natural Language Processing (NLP) techniques can go on the Kaggle Spooky Author Identification competition. The task is to classify single sentences from gothic fiction as written by Edgar Allan Poe, Mary Wollstonecraft Shelley, or H. P. Lovecraft, which puts the emphasis on stylistic cues rather than content. The author built four models in sequence: a Vowpal Wabbit (VW) word baseline, a richer VW model with punctuation and character n-grams, a tuned TF-IDF ensemble, and a stacked sparse-text ensemble. The strongest classical pipeline achieved 0.8687 accuracy and 0.3504 log loss on a 70/30 holdout split. The final stacked submission scored 0.30414 private and 0.33621 public log loss on Kaggle. The article concludes that classical NLP, given careful representation and validation, remains effective for stylistic text classification.
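
The summary reports both accuracy and log loss, and the two can diverge. As a reminder of why, here is a minimal sketch of multiclass log loss next to accuracy. The function names and all the probabilities are hypothetical illustrations, not values or code from the article.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip so that a probability of exactly 0 or 1 cannot produce log(0).
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

def accuracy(y_true, probs):
    """Fraction of rows whose highest-probability class is the true class."""
    hits = sum(1 for label, row in zip(y_true, probs) if row.index(max(row)) == label)
    return hits / len(y_true)

# Hypothetical predictions over three classes (indices 0, 1, 2).
y_true = [0, 1, 2]
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

for name, probs in [("confident", confident), ("hesitant", hesitant)]:
    print(name, accuracy(y_true, probs), round(multiclass_log_loss(y_true, probs), 4))
```

Both toy prediction sets classify every row correctly, yet the hesitant one pays a much higher log loss. That is why a log-loss leaderboard rewards well-calibrated probabilities and not just correct labels.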

Key takeaway

For Machine Learning Engineers tackling stylistic text classification, prioritize detailed feature engineering and robust ensemble methods. Capture subtle stylistic cues through character n-grams, punctuation, and sparse word features, then combine the resulting models in a well-tuned stacked ensemble. In the article, this approach reached a 0.30414 private log loss on Kaggle's Spooky Author Identification, which suggests classical pipelines remain competitive on short-text tasks.

Key insights

Classical NLP, with careful feature engineering and ensemble stacking, excels at stylistic text classification.

Principles

Stylistic text classification benefits from fine-grained features.
Ensemble stacking improves probability quality over simple averaging.
Rigorous validation setups prevent misleading performance estimates.
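
To make "fine-grained features" concrete, here is a minimal sketch of a style-aware feature extractor that combines character n-grams, punctuation counts, and word tokens. The function names, the feature-name prefixes, the 2-to-4 n-gram range, and the sample sentence are all assumptions chosen for illustration. The article's actual feature set may differ.

```python
from collections import Counter
import string

def char_ngrams(text, n_min=2, n_max=4):
    """Count overlapping character n-grams; case is kept as a possible style cue."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[f"c{n}:{text[i:i + n]}"] += 1
    return counts

def punctuation_counts(text):
    """Count each punctuation mark separately (commas, semicolons, dashes, ...)."""
    return Counter(f"p:{ch}" for ch in text if ch in string.punctuation)

def style_features(text):
    """Merge character, punctuation, and whitespace-split word features into one sparse bag."""
    feats = char_ngrams(text)
    feats.update(punctuation_counts(text))
    feats.update(f"w:{tok}" for tok in text.lower().split())
    return feats

# Invented sample sentence, not a quotation from any of the three authors.
sample = "Oh, no; not again!"
feats = style_features(sample)
print(feats["p:;"], feats["c2:no"], feats["w:not"])
```

The prefixes keep the three feature families in separate namespaces, so a linear model can weight a semicolon differently from a word that happens to contain one.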

Method

The method builds a sequence of classical models (Vowpal Wabbit, TF-IDF), adds style-aware features such as punctuation and character n-grams, and then stacks the models' out-of-fold predictions with a meta-learner.
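
The stacking step can be sketched in plain Python as follows. This is an illustrative toy under stated assumptions, not the article's pipeline: `TinyNB`, the softmax meta-learner, the fold assignment, and the invented sentences are all hypothetical stand-ins for the VW and TF-IDF models. In practice a library routine such as scikit-learn's `cross_val_predict` with `method="predict_proba"` would typically produce the out-of-fold matrix.

```python
import math
from collections import Counter

CLASSES = [0, 1, 2]  # stand-ins for the three authors

def word_tokens(text):
    return text.lower().split()

def char_trigrams(text):
    t = text.lower()
    return [t[i:i + 3] for i in range(len(t) - 2)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

class TinyNB:
    """Multinomial naive Bayes with add-one smoothing (toy base model)."""
    def __init__(self, featurize):
        self.featurize = featurize

    def fit(self, texts, labels):
        self.counts = {c: Counter() for c in CLASSES}
        self.prior = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(self.featurize(text))
        self.vocab_size = len(set().union(*self.counts.values()))
        return self

    def predict_proba(self, text):
        n_docs = sum(self.prior.values())
        scores = []
        for c in CLASSES:
            denom = sum(self.counts[c].values()) + self.vocab_size + 1
            s = math.log((self.prior[c] + 1) / (n_docs + len(CLASSES)))
            s += sum(math.log((self.counts[c][f] + 1) / denom) for f in self.featurize(text))
            scores.append(s)
        return softmax(scores)

def out_of_fold(make_model, texts, labels, n_folds):
    """Predict each row with a model that never saw that row during fitting."""
    oof = [None] * len(texts)
    for k in range(n_folds):
        train = [i for i in range(len(texts)) if i % n_folds != k]
        model = make_model().fit([texts[i] for i in train], [labels[i] for i in train])
        for i in range(k, len(texts), n_folds):
            oof[i] = model.predict_proba(texts[i])
    return oof

def predict_softmax(W, x):
    # Each row of W holds one weight per feature plus a trailing bias term.
    return softmax([sum(w * v for w, v in zip(row, x)) + row[-1] for row in W])

def fit_softmax_regression(X, y, lr=0.5, epochs=200):
    """Meta-learner: softmax regression trained by plain stochastic gradient descent."""
    W = [[0.0] * (len(X[0]) + 1) for _ in CLASSES]
    for _ in range(epochs):
        for x, target in zip(X, y):
            probs = predict_softmax(W, x)
            for c in CLASSES:
                grad = probs[c] - (1.0 if c == target else 0.0)
                for j, v in enumerate(x):
                    W[c][j] -= lr * grad * v
                W[c][-1] -= lr * grad
    return W

# Invented toy sentences, grouped by class so that i % 3 gives stratified folds.
texts = [
    "the pale moon hung over the silent tomb",
    "a silent dread crept over the pale walls",
    "the tomb was silent and the walls were pale",
    "my heart was full of sorrow and longing",
    "she felt sorrow and a deep longing for home",
    "his heart was full of longing for her",
    "the ancient city lay beneath the black sea",
    "beneath the sea the ancient thing stirred",
    "the black city held an ancient nameless thing",
]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

makers = [lambda: TinyNB(word_tokens), lambda: TinyNB(char_trigrams)]
oof_per_model = [out_of_fold(m, texts, labels, n_folds=3) for m in makers]
# Meta-features: each base model's out-of-fold class probabilities, concatenated.
meta_X = [sum((oof[i] for oof in oof_per_model), []) for i in range(len(texts))]
meta_W = fit_softmax_regression(meta_X, labels)

# At prediction time the base models are refit on all training rows.
full_models = [m().fit(texts, labels) for m in makers]

def stacked_proba(text):
    feats = sum((model.predict_proba(text) for model in full_models), [])
    return predict_softmax(meta_W, feats)

print(stacked_proba("the silent tomb"))
```

The key design point is that the meta-learner is trained only on out-of-fold probabilities. If the base models scored rows they had already been fitted on, the meta-features would look more confident than they will on unseen data, and the stack would learn to overtrust them.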

In practice

Use Vowpal Wabbit for fast, sparse linear text models.
Incorporate character n-grams and punctuation for style analysis.
Apply NB-SVM-style logistic regression, which reweights sparse features by naive Bayes log-count ratios before fitting.
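
As a rough illustration of the last item, here is a sketch of the log-count ratio weighting that NB-SVM-style models (in the spirit of Wang and Manning's NB-SVM) apply before fitting a linear classifier. Only the weighting step is shown, and the logistic regression that would be trained on the scaled features is omitted. The function names, the one-vs-rest framing, the smoothing value, and the toy tokens are assumptions, and the article's implementation may differ.

```python
import math
from collections import Counter

def log_count_ratio(docs, labels, positive, alpha=1.0):
    """r[f] = log((p[f] / |p|_1) / (q[f] / |q|_1)), one-vs-rest.

    p and q are smoothed presence counts of feature f in the positive
    class and in all other classes, respectively.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    p = {f: alpha for f in vocab}
    q = {f: alpha for f in vocab}
    for doc, label in zip(docs, labels):
        target = p if label == positive else q
        for tok in set(doc):  # binarized: count presence, not frequency
            target[tok] += 1
    p_norm, q_norm = sum(p.values()), sum(q.values())
    return {f: math.log((p[f] / p_norm) / (q[f] / q_norm)) for f in vocab}

def nb_scale(doc, ratio):
    """Replace each present feature's value (1) with its log-count ratio."""
    return {tok: ratio[tok] for tok in set(doc) if tok in ratio}

# Toy tokenized documents with hypothetical class labels "x" and "y".
docs = [["dark", "tomb"], ["dark", "raven"], ["bright", "lake"], ["bright", "hill"]]
labels = ["x", "x", "y", "y"]

ratio = log_count_ratio(docs, labels, positive="x")
scaled = nb_scale(["dark", "lake"], ratio)
print(scaled)
```

Features that are more common in the positive class get a positive weight and features typical of the other classes get a negative one, so the downstream linear model starts from a naive Bayes prior over which features matter. For three authors, the same computation would be repeated once per class.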

Topics

Authorship Attribution
Classical NLP
Text Classification
Vowpal Wabbit
TF-IDF
Ensemble Stacking
Feature Engineering

Code references

Nahid-ahmdv/Spooky_Author_Identification

Best for: Machine Learning Engineer, NLP Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.