Predicting Online News Popularity: A Machine Learning Project That Taught Me More About Data…
Summary
A machine learning project analyzed 39,644 Mashable articles, using 58 pre-publication attributes to predict social share counts. Raw share counts were extremely right-skewed, so the target variable was log-transformed. RMSE fell from 14,782 on the raw scale to 0.88 on the log scale; the two figures are in different units, so they are not directly comparable. Three models were compared: Linear Regression (R²=0.113), Decision Tree (R²=0.067), and Random Forest (R²=0.149). The Random Forest, tuned with mtry=5 and ntree=300, achieved a 32% relative improvement in R² over the linear baseline and an out-of-bag variance explained of 16.25%. Key predictors included keyword quality, self-referencing content, and specific LDA topics, while sentiment polarity and weekend publication had minimal impact. The final R² of 0.149 is roughly in line with the established ceiling of 15-20% explained variance for this problem, which suggests that the remaining variance is driven by post-publication social dynamics.
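The two headline numbers in the summary can be checked with a few lines of arithmetic. The sketch below uses made-up share counts (not the Mashable data) and assumes `log1p` as the transform, since the summary does not say which log variant was used. It shows why an error measured in shares and an error measured in log-shares live on different scales, and reproduces the roughly 32% relative R² gain from the reported figures.

```python
import math

# Hypothetical, made-up share counts with a long right tail (not the Mashable data).
shares = [120, 450, 800, 950, 1_100, 1_400, 2_300, 5_000, 48_000, 310_000]


def rmse_of_mean_predictor(values):
    """RMSE when every item is predicted as the mean (equal to the standard deviation)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


# log1p(x) = log(1 + x); an assumed choice of transform.
log_shares = [math.log1p(s) for s in shares]

rmse_raw = rmse_of_mean_predictor(shares)      # measured in shares
rmse_log = rmse_of_mean_predictor(log_shares)  # measured in log-shares

# Relative R² gain of the Random Forest over the linear baseline (reported values).
r2_linear, r2_forest = 0.113, 0.149
relative_gain = (r2_forest - r2_linear) / r2_linear

print(f"RMSE on raw scale: {rmse_raw:,.0f}")
print(f"RMSE on log scale: {rmse_log:.2f}")
print(f"Relative R² gain: {relative_gain:.1%}")
```

A handful of extreme values dominates the raw-scale error, while the log scale compresses them. That is the practical reason to compare models by R² on a common scale rather than by RMSE across scales.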
Key takeaway
For AI Product Managers evaluating content promotion strategies: pre-publication data explains only about 15% of the variance in an article's log-transformed share count. A Random Forest trained on log-transformed share counts can still help identify likely underperformers before publication, so promotional budgets can be reallocated away from bad bets. Gradient-boosted trees are worth exploring for potential further improvements.
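As a rough illustration of how such screening might work: a model trained on log-transformed shares produces log-scale predictions, which have to be converted back to shares before they can be compared with a business threshold. The article IDs, predicted values, and the 1,000-share cutoff below are all hypothetical, and the sketch assumes `log1p` was the transform, so `expm1` is its inverse.

```python
import math

# Hypothetical log-scale predictions (log1p of shares) keyed by made-up article IDs.
predicted_log_shares = {
    "article-a": 6.2,
    "article-b": 7.4,
    "article-c": 8.9,
}

# Hypothetical cutoff: flag anything predicted to earn fewer than 1,000 shares.
SHARE_THRESHOLD = 1_000


def flag_underperformers(log_preds, threshold):
    """Return article IDs whose back-transformed prediction falls below the threshold."""
    flagged = []
    for article_id, log_pred in log_preds.items():
        predicted_shares = math.expm1(log_pred)  # inverse of log1p
        if predicted_shares < threshold:
            flagged.append(article_id)
    return flagged


flagged = flag_underperformers(predicted_log_shares, SHARE_THRESHOLD)
print(flagged)
```

With only about 15% of variance explained, flags like these are rough signals for prioritization, not reliable forecasts for any single article.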
Key insights
Log-transforming skewed target variables is critical for effective predictive modeling of online content popularity.
Principles
- Ensemble models capture complex non-linear interactions better than single trees.
- Pre-publication data has a practical limit for predicting social virality.
- Keyword quality and content self-referencing strongly predict article shares.
Method
Predicting online news popularity involves log-transforming share counts, using Random Forest with tuned hyperparameters, and evaluating pre-publication attributes like keyword quality and topic alignment.
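A minimal sketch of that workflow in scikit-learn, under stated assumptions: the original project's `mtry` and `ntree` settings suggest R's randomForest package, so `max_features=5` and `n_estimators=300` are used here as the nearest scikit-learn counterparts. The data is synthetic (random features with a deliberately weak signal), so the scores it prints say nothing about the Mashable results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: 1,000 rows, 58 features (the real features are not reproduced).
X = rng.normal(size=(1_000, 58))
# Skewed synthetic "shares": a weak signal from two features plus heavy noise.
shares = np.expm1(7.0 + 0.4 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=1.0, size=1_000))

y = np.log1p(shares)  # train on the log scale, as the summary describes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(
    n_estimators=300,  # assumed counterpart of ntree=300
    max_features=5,    # assumed counterpart of mtry=5
    oob_score=True,    # out-of-bag R², similar in spirit to OOB variance explained
    random_state=0,
    n_jobs=-1,
)
model.fit(X_train, y_train)

test_predictions = model.predict(X_test)
test_r2 = r2_score(y_test, test_predictions)
print(f"OOB R²: {model.oob_score_:.3f}, test R²: {test_r2:.3f}")

# Rank features by impurity-based importance and keep the top five column indices.
top_features = np.argsort(model.feature_importances_)[::-1][:5]
print("Most important feature indices:", top_features)
```

Impurity-based importances are a quick first look at which attributes matter. Permutation importance on held-out data is a common cross-check when features are correlated.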
In practice
- Prioritize articles with high-performing keywords.
- Link to previously viral content for audience familiarity.
- Focus on emotional intensity over sentiment polarity.
Topics
- Online News Popularity Prediction
- Random Forest Model
- Log Transformation
- Predictive Features
- Editorial Intelligence
Best for: Data Scientist, Director of AI/ML, AI Product Manager
Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.