How Far Can Classical NLP Go? From Bag-of-Words to Stacking on Spooky Author Identification
Summary
A classical NLP experiment on Kaggle's Spooky Author Identification task shows how far traditional methods can go on stylistic text classification. The task is to attribute single sentences to one of three authors: Edgar Allan Poe, Mary Shelley, or H. P. Lovecraft. The project progressed from a Vowpal Wabbit (VW) word baseline to a tuned stacked ensemble. Adding punctuation and character n-grams raised VW holdout accuracy from 0.8332 to 0.8553. A TF-IDF ensemble then improved probability quality, and the final stacked model reached 0.8687 accuracy and 0.3504 log loss on a 70/30 holdout split. The final Kaggle submission scored 0.30414 private and 0.33621 public log loss. Sparse count-based features also outperformed averaged dense embeddings on this short-text stylistic task.
Key takeaway
Machine learning engineers tackling stylistic text classification should try a robust classical NLP pipeline before defaulting to complex deep learning models. Detailed feature engineering, including punctuation and character n-grams, combined with ensemble methods such as stacking, can yield highly competitive results: the project reached a 0.30414 private log loss on Kaggle's Spooky Author Identification. Validate carefully and track probability-quality metrics such as log loss alongside accuracy, because accuracy ignores how confident each prediction is and can hide real gains.
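To make the difference between accuracy and log loss concrete, here is a minimal sketch in plain Python. The two sets of predictions are hypothetical numbers chosen for illustration, not results from the article; classes 0, 1, and 2 simply stand in for the three authors.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0) when a model assigns zero probability.
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

def accuracy(y_true, probs):
    """Fraction of rows whose highest-probability class is the true class."""
    hits = 0
    for label, row in zip(y_true, probs):
        predicted = max(range(len(row)), key=row.__getitem__)
        hits += int(predicted == label)
    return hits / len(y_true)

# Hypothetical predictions for three sentences (illustration only).
y_true = [0, 1, 2]
confident = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.45, 0.15, 0.40]]
hesitant = [[0.40, 0.30, 0.30], [0.30, 0.40, 0.30], [0.45, 0.15, 0.40]]

for name, probs in [("confident", confident), ("hesitant", hesitant)]:
    print(name, "accuracy:", accuracy(y_true, probs),
          "log loss:", round(log_loss(y_true, probs), 4))
```

Both models get the same two sentences right, so their accuracy is identical, but the model that puts more probability on the correct author has the lower log loss. This is why a change that barely moves accuracy can still be a clear improvement on a log-loss leaderboard.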
Key insights
Classical NLP, with careful feature engineering and stacking, excels at stylistic authorship attribution.
Principles
- Stylistic tasks benefit from sparse n-gram and character features.
- Punctuation and character n-grams capture writing style.
- Stacking improves probability estimates.
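The first two principles can be illustrated with a small feature extractor. This is a sketch under assumptions: the tokenizer regex, the n-gram range of 2 to 4, the `w=` and `c=` prefixes, and the sample sentence are all made up for illustration, and the article's actual feature settings are not given in this summary.

```python
import re
from collections import Counter

# Words OR single punctuation marks; punctuation is kept as its own token
# instead of being discarded, since it carries stylistic signal.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def word_and_punct_tokens(text):
    return TOKEN_RE.findall(text.lower())

def char_ngrams(text, n_min=2, n_max=4):
    """All overlapping character n-grams, including spaces and punctuation."""
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]

def sparse_features(text):
    """Sparse count features: a dict-like map from feature name to count."""
    feats = Counter()
    for tok in word_and_punct_tokens(text):
        feats["w=" + tok] += 1
    for gram in char_ngrams(text):
        feats["c=" + gram] += 1
    return feats

sample = "Alas; it was not so!"  # invented sentence, not from the dataset
feats = sparse_features(sample)
print(feats["w=;"], feats["w=!"], feats["c=; "])
```

A plain bag-of-words tokenizer would drop the semicolon and exclamation mark entirely. Here they survive both as standalone tokens and inside character n-grams such as `"; "`, which is the kind of habit-level signal that distinguishes one author's style from another's.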
Method
The project built a sequence of classical models: Vowpal Wabbit baselines, a tuned TF-IDF ensemble, and a stacked sparse-text ensemble using out-of-fold predictions, with careful hyperparameter tuning and evaluation.
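The out-of-fold stacking step might look roughly like the following scikit-learn sketch. The base models, n-gram ranges, fold count, and the toy corpus are assumptions for illustration; the summary does not specify which base models or hyperparameters the project used.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration; labels A/B/C stand in for three authors.
texts = [
    "the raven croaked above the silent chamber door",
    "a silent chamber held the pale and dying lamp",
    "the pale lamp flickered; the raven did not stir",
    "above the door, the raven watched the dying night",
    "the creature wandered over the frozen alpine ice",
    "my creature fled across the ice and into the snow",
    "over frozen snow the wretched creature wept alone",
    "the alpine snow buried every trace of the wretch",
    "the ancient cult whispered of nameless cyclopean things",
    "nameless things stirred beneath the cyclopean ruins",
    "beneath the ruins, an ancient whisper called the cult",
    "the cult feared what the cyclopean ruins concealed",
]
labels = np.array(["A"] * 4 + ["B"] * 4 + ["C"] * 4)

# Two hypothetical base models: word n-grams and character n-grams.
base_models = {
    "word_lr": make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    ),
    "char_nb": make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 4)),
        MultinomialNB(),
    ),
}

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

# Out-of-fold probabilities: each row is predicted by a model that never
# saw that row during training, so the meta-model is not fit on leaked labels.
oof_blocks = [
    cross_val_predict(model, texts, labels, cv=cv, method="predict_proba")
    for model in base_models.values()
]
meta_X = np.hstack(oof_blocks)  # shape: (n_samples, n_models * n_classes)

meta_model = LogisticRegression(max_iter=1000).fit(meta_X, labels)

# Refit every base model on the full training set for use at prediction time.
for model in base_models.values():
    model.fit(texts, labels)

def stacked_predict_proba(new_texts):
    base_probs = [m.predict_proba(new_texts) for m in base_models.values()]
    return meta_model.predict_proba(np.hstack(base_probs))

probs = stacked_predict_proba(["the sea was grey and cold tonight"])
print(probs.round(3))
```

The design choice that matters is training the meta-model only on out-of-fold predictions. If the base models predicted on rows they were trained on, their probabilities would look overconfident and the meta-model would learn to trust them too much.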
In practice
- Use Vowpal Wabbit for fast linear text models.
- Implement NB-SVM-style Logistic Regression for text classification.
- Combine base model predictions via stacking for better log loss.
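For readers unfamiliar with the NB-SVM idea, the following is a minimal sketch of one common formulation: scale binarized n-gram counts by a Naive Bayes log-count ratio, then fit a one-vs-rest Logistic Regression on the scaled features. The class name `NBLogReg`, the smoothing `alpha`, the regularization `C`, and the toy corpus are all assumptions for illustration; the article's exact variant is not described in this summary.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def log_count_ratio(X, is_pos, alpha=1.0):
    """Smoothed log ratio of feature frequencies in positive vs. other rows."""
    p = alpha + np.asarray(X[is_pos].sum(axis=0)).ravel()
    q = alpha + np.asarray(X[~is_pos].sum(axis=0)).ravel()
    return np.log((p / p.sum()) / (q / q.sum()))

class NBLogReg:
    """Hypothetical helper: one-vs-rest Logistic Regression on NB-scaled features."""

    def __init__(self, C=4.0, alpha=1.0):
        self.C = C
        self.alpha = alpha

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.ratios_, self.models_ = [], []
        for c in self.classes_:
            is_pos = y == c
            r = log_count_ratio(X, is_pos, self.alpha)
            clf = LogisticRegression(C=self.C, max_iter=1000)
            # Multiplying by diag(r) scales each feature column by its ratio.
            clf.fit(X @ sparse.diags(r), is_pos)
            self.ratios_.append(r)
            self.models_.append(clf)
        return self

    def predict_proba(self, X):
        scores = np.column_stack([
            clf.predict_proba(X @ sparse.diags(r))[:, 1]
            for r, clf in zip(self.ratios_, self.models_)
        ])
        # Renormalize the one-vs-rest scores so each row sums to 1.
        return scores / scores.sum(axis=1, keepdims=True)

# Toy corpus invented for illustration; labels stand in for three authors.
train_texts = [
    "the raven watched the silent door",
    "a pale lamp in the silent chamber",
    "the creature fled across the ice",
    "frozen snow hid the wretched creature",
    "the cult whispered of nameless things",
    "nameless ruins feared by the cult",
]
train_labels = np.array(["A", "A", "B", "B", "C", "C"])

vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_texts)
model = NBLogReg().fit(X_train, train_labels)

new_texts = ["the creature crossed the snow", "a silent raven"]
nb_probs = model.predict_proba(vectorizer.transform(new_texts))
print(nb_probs.round(3))
```

Probabilities produced this way can serve as one of the base-model inputs to a stacked ensemble, alongside other sparse-text models.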
Topics
- Classical NLP
- Authorship Attribution
- Text Classification
- Stacked Ensemble
- TF-IDF
- Feature Engineering
Best for: Machine Learning Engineer, Data Scientist, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Naturallanguageprocessing on Medium.