How Far Can Classical NLP Go? From Bag-of-Words to Stacking on Spooky Author Identification

· Source: Naturallanguageprocessing on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, long

Summary

A classical NLP experiment on Kaggle's Spooky Author Identification task demonstrates the effectiveness of traditional methods for stylistic text classification. The goal was to tell Edgar Allan Poe, Mary Shelley, and H. P. Lovecraft apart from single sentences, and the project progressed from a Vowpal Wabbit word baseline to a tuned stacked ensemble. Adding punctuation and character n-grams raised VW holdout accuracy from 0.8332 to 0.8553. A TF-IDF ensemble further improved probability quality, and the final stacked model reached 0.8687 accuracy and 0.3504 log loss on a 70/30 holdout split. The final Kaggle submission scored 0.30414 private and 0.33621 public log loss. Sparse count-based features outperformed averaged dense embeddings on this short-text stylistic task.
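
The summary does not say how the punctuation and character n-gram features were built, so the following is only a minimal sketch of the general idea in plain Python. The function names, the n-gram range, and the example sentence are assumptions made for illustration; they are not taken from the original article or from Vowpal Wabbit.

```python
from collections import Counter
import string


def char_ngrams(text, n_min=2, n_max=4):
    """Count overlapping character n-grams, keeping spaces and punctuation."""
    lowered = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(lowered) - n + 1):
            counts[lowered[start:start + n]] += 1
    return counts


def punctuation_counts(text):
    """Count each ASCII punctuation mark that occurs in the text."""
    return Counter(ch for ch in text if ch in string.punctuation)


# Made-up sentence, not from the competition data.
sentence = "It was, I think; a dreary night!"
features = char_ngrams(sentence, n_min=2, n_max=3)
punct = punctuation_counts(sentence)
```

Character n-grams that span a comma or semicolon, such as `", "` or `"; "`, carry stylistic signal that a word-only tokenizer would discard, which is one plausible reason such features help on authorship tasks.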

Key takeaway

Machine learning engineers tackling stylistic text classification should try a robust classical NLP pipeline before defaulting to complex deep learning models. Detailed feature engineering, including punctuation and character n-grams, combined with ensemble methods such as stacking, can be highly competitive: the final submission reached a private log loss of 0.30414 on Kaggle's Spooky Author Identification. Validate carefully and track probability-quality metrics such as log loss, which can reveal gains that accuracy alone understates.
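
To make the point about probability quality concrete, here is a small sketch of multiclass log loss in plain Python. The two sets of predictions are invented for illustration, and the `eps` clipping is a common convention assumed here rather than something the source states.

```python
import math


def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip so that a probability of exactly 0 or 1 cannot produce log(0).
        p = min(max(probs[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(true_labels)


def accuracy(true_labels, predicted_probs):
    """Share of rows where the most probable class is the true class."""
    hits = sum(
        max(probs, key=probs.get) == label
        for label, probs in zip(true_labels, predicted_probs)
    )
    return hits / len(true_labels)


labels = ["Poe", "Shelley"]

# Invented predictions: both models pick the right author every time.
sharp = [
    {"Poe": 0.9, "Shelley": 0.05, "Lovecraft": 0.05},
    {"Poe": 0.1, "Shelley": 0.8, "Lovecraft": 0.1},
]
hesitant = [
    {"Poe": 0.4, "Shelley": 0.3, "Lovecraft": 0.3},
    {"Poe": 0.3, "Shelley": 0.4, "Lovecraft": 0.3},
]

sharp_loss = multiclass_log_loss(labels, sharp)        # about 0.16
hesitant_loss = multiclass_log_loss(labels, hesitant)  # about 0.92
```

Both toy models have the same accuracy, yet their log losses differ widely, so a change that only sharpens the probabilities shows up in log loss and not in accuracy.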

Key insights

Classical NLP, with careful feature engineering and stacking, excels at stylistic authorship attribution.

Principles

Method

The project built a sequence of classical models: Vowpal Wabbit baselines, a tuned TF-IDF ensemble, and a stacked sparse-text ensemble using out-of-fold predictions, with careful hyperparameter tuning and evaluation.
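
The stacking step described above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the article's pipeline: the two base models, the n-gram ranges, the token pattern, the logistic-regression meta-learner, and the helper names `build_base_models`, `fit_stack`, and `predict_stack` are all hypothetical choices. The only detail taken from the source is that the meta-model is trained on out-of-fold predictions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_base_models():
    """Two sparse-text base models (assumed for illustration)."""
    word_model = make_pipeline(
        # Token pattern keeps punctuation marks as their own tokens.
        TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                        token_pattern=r"(?u)\b\w+\b|[^\w\s]"),
        LogisticRegression(max_iter=1000),
    )
    char_model = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
        MultinomialNB(),
    )
    return [word_model, char_model]


def fit_stack(texts, labels, n_splits=5, seed=0):
    """Train base models and a meta-model on out-of-fold probabilities."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    base_models = build_base_models()
    # Each row's probabilities come from a model that never saw that row,
    # so the meta-model is not trained on leaked predictions.
    oof_blocks = [
        cross_val_predict(model, texts, labels, cv=folds, method="predict_proba")
        for model in base_models
    ]
    meta_features = np.hstack(oof_blocks)
    meta_model = LogisticRegression(max_iter=1000).fit(meta_features, labels)
    # Refit the base models on all training data for use at prediction time.
    for model in base_models:
        model.fit(texts, labels)
    return base_models, meta_model


def predict_stack(base_models, meta_model, texts):
    """Stack base-model probabilities and return meta-model probabilities."""
    meta_features = np.hstack([m.predict_proba(texts) for m in base_models])
    return meta_model.predict_proba(meta_features)
```

Calling `fit_stack(train_texts, train_labels)` and then `predict_stack(base_models, meta_model, new_texts)` returns one probability row per text, with columns ordered as in `meta_model.classes_`. The same fold object is reused for every base model so that all out-of-fold columns are aligned to identical splits.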

Best for: Machine Learning Engineer, Data Scientist, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Naturallanguageprocessing on Medium.