From Logistic Regression to GPT-2: Building a Complete Spam Detection & Sentiment Analysis Pipeline
Summary
This article details a two-phase pipeline for spam detection and sentiment analysis, benchmarking eight models across classical ML, deep learning, and transformer paradigms on the UCI SMS Spam Collection dataset. Phase 1 evaluates Logistic Regression, SVM, Random Forest, XGBoost, LSTM, BiLSTM, BERT, and GPT-2. Because the dataset is imbalanced (87% ham, 13% spam), accuracy is a misleading metric, so the models are compared on F1-score, Precision-Recall AUC, and ROC-AUC instead. BERT emerges as the top performer, with 11 total errors and a ROC-AUC of 1.00. Phase 2 enriches the dataset with sentiment labels, using BiLSTM for classification and VADER for sentiment scoring, and finds that 72.2% of spam messages carry a positive sentiment, compared to 41.7% of ham.
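To make the point about accuracy concrete, here is a minimal sketch (not the article's code) that scores a trivial "always predict ham" baseline on a hypothetical 100-message test set with the same 87/13 class ratio. The helper name `binary_metrics` and the toy labels are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical test set mirroring the 87% ham / 13% spam ratio.
y_true = ["ham"] * 87 + ["spam"] * 13

# A "model" that never flags spam at all.
always_ham = ["ham"] * len(y_true)

baseline = binary_metrics(y_true, always_ham)
print(baseline)
# accuracy is 0.87, while precision, recall, and F1 for spam are all 0.0
```

The baseline catches no spam whatsoever, yet its accuracy matches the share of ham in the data, which is why the spam-class F1-score is the more informative number here.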
Key takeaway
Machine Learning Engineers building text classifiers on imbalanced datasets should prioritize evaluation metrics such as F1-score and Precision-Recall AUC over raw accuracy. Use confusion matrices to understand specific failure modes, especially false negatives, and consider transformer models like BERT for superior performance, even though they require more resources. GPT-2's well-calibrated probability estimates also offer flexibility in choosing a decision threshold that favors recall.
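Threshold flexibility can be sketched as follows. This is an illustrative example under stated assumptions: the spam probabilities below are made up rather than taken from GPT-2 or any model in the article, and the function names are hypothetical. The idea is to lower the decision threshold from the default 0.5 until a target recall is reached, then inspect the precision that remains.

```python
def recall_precision_at(threshold, y_true, spam_probs):
    """Recall and precision for the spam class at a given threshold."""
    preds = [1 if p >= threshold else 0 for p in spam_probs]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, preds))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, preds))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, preds))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


def highest_threshold_meeting_recall(y_true, spam_probs, target_recall):
    """Return (threshold, recall, precision) for the highest candidate
    threshold whose recall reaches the target, or None if none does."""
    for threshold in sorted(set(spam_probs), reverse=True):
        recall, precision = recall_precision_at(threshold, y_true, spam_probs)
        if recall >= target_recall:
            return threshold, recall, precision
    return None


# Hypothetical labels (1 = spam) and predicted spam probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
spam_probs = [0.95, 0.80, 0.60, 0.30, 0.40, 0.20, 0.10, 0.05, 0.02, 0.01]

default = recall_precision_at(0.5, y_true, spam_probs)
result = highest_threshold_meeting_recall(y_true, spam_probs, target_recall=1.0)
print(default)  # (0.75, 1.0): one spam message is missed at threshold 0.5
print(result)   # (0.3, 1.0, 0.8): full recall, at the cost of one false positive
```

Moving the threshold only makes sense when the probabilities are trustworthy, which is why calibration matters for this kind of tuning.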
Key insights
For imbalanced text classification, prioritize F1-score and Precision-Recall AUC over accuracy and ROC-AUC.
Principles
- Accuracy is deceptive for imbalanced classification.
- Confusion matrices reveal model failure modes.
- Word clouds are powerful for early NLP feature insight.
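As a small illustration of reading failure modes from a confusion matrix, the sketch below tallies (actual, predicted) pairs and then pulls out the false negatives for inspection. The messages and predictions are invented for the example and do not come from the SMS Spam Collection or from any model in the article.

```python
from collections import Counter

# Hypothetical (actual, predicted, text) triples.
results = [
    ("spam", "spam", "WINNER! Claim your free prize now"),
    ("spam", "ham", "Hi, your account needs a quick update, reply to confirm"),
    ("ham", "ham", "Are we still on for lunch?"),
    ("ham", "spam", "Free tonight? Call me"),
    ("ham", "ham", "Running late, see you at 6"),
    ("spam", "spam", "Txt STOP to opt out of premium offers"),
]

# Each key is an (actual, predicted) cell of the 2x2 confusion matrix.
counts = Counter((actual, predicted) for actual, predicted, _ in results)

print("actual \\ predicted   ham  spam")
for actual in ("ham", "spam"):
    row = [counts[(actual, predicted)] for predicted in ("ham", "spam")]
    print(f"{actual:<20} {row[0]:>3}  {row[1]:>4}")

# False negatives: spam that slipped through as ham.
false_negatives = [text for actual, predicted, text in results
                   if actual == "spam" and predicted == "ham"]
# False positives: legitimate messages flagged as spam.
false_positives = [text for actual, predicted, text in results
                   if actual == "ham" and predicted == "spam"]

print("Missed spam:", false_negatives)
print("Flagged ham:", false_positives)
```

Reading the misclassified texts themselves, not just the counts, is usually what reveals why a model fails, for example conversational spam that lacks the usual promotional vocabulary.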
Method
A two-phase pipeline first benchmarks eight models on an 80/20 train-test split, then enriches the dataset with VADER sentiment labels and pairs them with the spam predictions of the selected classifier (BiLSTM in this article) for cross-signal analysis.
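The two steps of the method can be sketched roughly as below. Several things here are assumptions rather than details from the article: the split is stratified by class (the summary only says 80/20), the compound scores are hypothetical stand-ins for VADER output, and the 0.05 cutoff is the commonly cited VADER convention for labeling a compound score positive or negative. All function names are illustrative.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def sentiment_label(compound, cutoff=0.05):
    """Bucket a compound score in [-1, 1] into positive/neutral/negative."""
    if compound >= cutoff:
        return "positive"
    if compound <= -cutoff:
        return "negative"
    return "neutral"


def positive_share(rows):
    """rows: (class_label, compound_score) pairs -> share of positives per class."""
    totals, positives = Counter(), Counter()
    for class_label, compound in rows:
        totals[class_label] += 1
        if sentiment_label(compound) == "positive":
            positives[class_label] += 1
    return {label: positives[label] / totals[label] for label in totals}


# Phase 1: 80/20 split on a hypothetical 87/13 label list.
labels = ["ham"] * 87 + ["spam"] * 13
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 80 20

# Phase 2: cross-signal view of class vs. sentiment (hypothetical scores).
rows = [("spam", 0.80), ("spam", 0.60), ("spam", -0.30), ("spam", 0.10),
        ("ham", 0.50), ("ham", 0.00), ("ham", -0.40), ("ham", 0.02)]
shares = positive_share(rows)
print(shares)  # {'spam': 0.75, 'ham': 0.25}
```

In a real run the compound scores would come from a sentiment analyzer and the class labels from the trained classifier; the tabulation step stays the same.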
In practice
- Use F1-score for imbalanced classification.
- Analyze confusion matrices to understand error types.
- Retain stopwords if they carry discriminative signal.
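One quick way to check whether a stopword carries discriminative signal is to compare how often it appears in each class before deciding to strip it. The sketch below does this on a handful of invented messages; the corpus, the candidate words, and the helper name `document_rate` are all hypothetical.

```python
def document_rate(token, docs):
    """Fraction of documents that contain the token at least once."""
    return sum(token in doc.lower().split() for doc in docs) / len(docs)


# Hypothetical toy corpora, not drawn from the SMS Spam Collection.
spam_docs = [
    "You have won a prize",
    "Call now to claim your reward",
    "Your account has been selected",
    "Win cash today",
]
ham_docs = [
    "I will call later",
    "See you at home",
    "Are you coming to the party",
    "Did he get my message",
]

# Words a standard stopword list would typically remove.
candidates = ["your", "you", "to"]

# Absolute difference in document rate between the two classes.
gaps = {
    word: abs(document_rate(word, spam_docs) - document_rate(word, ham_docs))
    for word in candidates
}

for word, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(f"{word:<5} spam={document_rate(word, spam_docs):.2f} "
          f"ham={document_rate(word, ham_docs):.2f} gap={gap:.2f}")
```

In this toy data "your" appears in half of the spam messages and in no ham, while "to" is equally common in both, so only the former would be worth keeping as a feature.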
Topics
- Spam Detection
- Text Classification
- Class Imbalance
- Transformer Models
- Sentiment Analysis
Best for: Machine Learning Engineer, Data Scientist, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.