Predicting Online News Popularity: A Machine Learning Project That Taught Me More About Data…

· Source: NLP on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Publishing & Journalism · Depth: Intermediate, short

Summary

A machine learning project analyzed 39,644 Mashable articles, using 58 pre-publication attributes to predict social share counts. Raw share counts were extremely right-skewed, so the target variable was log-transformed. The reported Root Mean Squared Error (RMSE) fell from 14,782 to 0.88, but the two figures are on different scales (raw shares versus log shares), so the drop mostly reflects the change of units rather than a like-for-like gain in accuracy. Three models were compared: Linear Regression (R²=0.113), Decision Tree (R²=0.067), and Random Forest (R²=0.149). The Random Forest, tuned with mtry=5 and ntree=300, achieved a 32% relative improvement in R² over the linear baseline and an out-of-bag variance explained of 16.25%. Key predictors included keyword quality, self-referencing content, and specific LDA topics, while sentiment polarity and weekend publication had minimal impact. The R² of 0.149 is in line with the established ceiling of 15-20% explained variance for this problem, which suggests that the remaining variance is driven by post-publication social dynamics.
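The headline figures can be checked with a few lines of arithmetic. The sketch below recomputes the relative R² gain from the two reported values and converts the log-scale RMSE into a multiplicative error factor. It assumes the natural logarithm was used, which the summary does not state; the variable names are illustrative.

```python
import math

# R² values reported in the summary.
r2_linear = 0.113
r2_forest = 0.149

# Relative improvement of the Random Forest over the linear baseline.
relative_gain = (r2_forest - r2_linear) / r2_linear
print(f"relative R² gain: {relative_gain:.1%}")

# An RMSE on log-transformed shares is a multiplicative error, not a share count.
# Assumption: natural log; the summary does not say which base was used.
rmse_log = 0.88
typical_factor = math.exp(rmse_log)
print(f"typical error factor: x{typical_factor:.2f}")
```

Under that assumption, an RMSE of 0.88 means predictions are typically off by a factor of roughly 2.4 in either direction, which is why it cannot be compared directly with the raw-scale figure of 14,782 shares.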

Key takeaway

For AI Product Managers evaluating content promotion strategies: pre-publication data explains only about 15% of the variance in an article's (log-transformed) social shares. That is enough to flag likely underperformers before publication, but not enough to forecast individual share counts. Consider using a Random Forest model trained on log-transformed share counts to screen articles, so that promotional budgets can be reallocated more effectively and bad bets avoided. Gradient-boosted trees are worth exploring for potential further improvement.

Key insights

Log-transforming skewed target variables is critical for effective predictive modeling of online content popularity.
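A small illustration of why the transform matters, using synthetic data rather than the Mashable dataset: the sketch draws log-normal "share counts" (an assumption chosen only because it produces a long right tail) and compares skewness before and after `log1p`. The `skewness` helper is written here for the example.

```python
import math
import random
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

random.seed(0)
# Synthetic stand-in for share counts: log-normal, so heavily right-skewed.
shares = [round(random.lognormvariate(7.5, 0.9)) for _ in range(10_000)]
# log1p(x) = log(1 + x), which stays defined even for articles with zero shares.
log_shares = [math.log1p(s) for s in shares]

print(f"raw skewness: {skewness(shares):.2f}")
print(f"log skewness: {skewness(log_shares):.2f}")
```

On the raw scale a handful of viral articles dominate any squared-error loss; after the transform the distribution is close to symmetric, so the model is no longer fitted mainly to outliers.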

Principles

Method

The method combines three elements: log-transforming the share counts, fitting a Random Forest with tuned hyperparameters (mtry=5, ntree=300), and evaluating pre-publication attributes such as keyword quality and LDA topic alignment.
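The method could be sketched as follows. This is not the original project's code: the parameter names mtry and ntree suggest it was written in R, so the example uses scikit-learn instead, on the assumption that `max_features` and `n_estimators` are the closest counterparts. The data is synthetic, with 58 random features standing in for the pre-publication attributes.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the 58 pre-publication attributes (not the Mashable data).
n_articles, n_features = 2_000, 58
X = rng.normal(size=(n_articles, n_features))
# Only three features carry a weak signal; most of the variance is noise.
signal = 0.4 * X[:, 0] + 0.3 * X[:, 1] - 0.2 * X[:, 2]
shares = np.exp(7.0 + signal + rng.normal(scale=0.9, size=n_articles))

y = np.log1p(shares)  # train on log-transformed share counts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

forest = RandomForestRegressor(
    n_estimators=300,  # assumed counterpart of ntree=300
    max_features=5,    # assumed counterpart of mtry=5
    oob_score=True,    # out-of-bag R², comparable to "variance explained"
    random_state=0,
)
forest.fit(X_train, y_train)

print(f"OOB R²:  {forest.oob_score_:.3f}")
print(f"test R²: {r2_score(y_test, forest.predict(X_test)):.3f}")
```

The out-of-bag score gives an estimate of generalization without a separate validation split, and `forest.feature_importances_` can then be inspected to rank the attributes.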

Best for: Data Scientist, Director of AI/ML, AI Product Manager

Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.