Genre Classification & Music Recommendation at Scale: What 12.9GB of Spark Data Taught Me
Summary
A data science project used PySpark to explore genre classification and song recommendation on the 12.9GB Million Song Dataset (MSD), which expands to 103GB on HDFS. The project highlighted how difficult messy, imbalanced real-world data is to handle at scale. For genre classification, an initial correlation analysis found 16 perfectly correlated feature pairs within the 20-column "Area Method of Moments" audio feature set, which is a problem for linear models. In binary classification of the "Electronic" genre (9.67% of tracks), Gradient Boosted Trees with observation reweighting achieved the best AUROC, 0.7175; the result underlines that AUROC and recall are more informative than accuracy on imbalanced data. Scaling to 21 genres made the imbalance worse: "Pop_Rock" has 237,000 tracks, while "Holiday" has 200. A collaborative filtering recommender trained with Alternating Least Squares (ALS) on 48.3 million play records from 1 million users and 384,000 songs yielded low Precision@10 (0.0286) and NDCG@10 (0.0282), partly because play counts are heavily skewed and contain outliers.
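The two ranking metrics quoted in the summary can be illustrated per user with a short plain-Python sketch. It assumes binary relevance (a song either appears in the user's held-out plays or it does not); the original article may have computed the metrics differently, and the song IDs below are made up for illustration.

```python
import math

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items found in the relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG: discounted hits divided by the best achievable score."""
    top_k = recommended[:k]
    # A hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(top_k) if item in relevant)
    # The ideal ranking places every relevant item (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical user: ten recommendations, three held-out songs, two of them recommended.
recommended = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s10"]
relevant = {"s3", "s7", "s42"}

p10 = precision_at_k(recommended, relevant)
n10 = ndcg_at_k(recommended, relevant)
print(f"Precision@10={p10:.4f}  NDCG@10={n10:.4f}")
```

Averaging these per-user values over all test users gives dataset-level figures of the kind reported in the summary.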
Key takeaway
For data scientists building large-scale music classification or recommendation systems: prioritize rigorous data preprocessing and validation over immediate model building. Visualize features to catch issues such as multicollinearity, and use metrics beyond accuracy, such as AUROC and recall, on imbalanced datasets. Investigate and cap outlier play counts before training recommenders so that the model reflects genuine user preferences rather than a few extreme values.
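The summary credits observation reweighting for the best AUROC but does not say which weighting scheme was used. A common choice, shown here as an assumption rather than the author's method, is inverse class frequency: each class receives a weight so that both classes contribute equally to the loss. The labels are synthetic and only mimic the roughly 9.67% positive rate.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Map each class to n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Synthetic labels: 967 "Electronic" (1) and 9033 "other" (0) tracks.
labels = [1] * 967 + [0] * 9033

weights = inverse_frequency_weights(labels)
sample_weights = [weights[y] for y in labels]  # one weight per observation

print({cls: round(w, 4) for cls, w in weights.items()})
```

In a Spark pipeline, per-row weights like these would typically be stored in a column and handed to an estimator that accepts a `weightCol` parameter.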
Key insights
Real-world music data projects demand robust data engineering and careful feature validation before modeling.
Principles
- Visualize features to detect multicollinearity early.
- Accuracy misleads when classes are imbalanced.
- Outliers can significantly distort recommender systems.
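To make the second principle concrete, the sketch below scores a trivial classifier that always predicts the majority class on synthetic labels with about 9.7% positives (close to the "Electronic" share in the summary). It is plain Python for illustration, not the project's PySpark code.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Share of true positives that the classifier actually finds."""
    positives = sum(1 for t in y_true if t == 1)
    found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return found / positives if positives else 0.0

def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic data: 97 positives, 903 negatives.
y_true = [1] * 97 + [0] * 903
y_pred = [0] * 1000      # always predict the majority class
scores = [0.0] * 1000    # constant score, so no ranking ability

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
auc = auroc(y_true, scores)
print(f"accuracy={acc:.3f}  recall={rec:.3f}  AUROC={auc:.3f}")
```

The accuracy looks strong even though the classifier never identifies a single positive track, which is exactly what recall and AUROC expose.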
Method
The project used PySpark for schema generation, correlation analysis, evaluating classification models with resampling, and collaborative filtering with ALS.
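The correlation-analysis step can be sketched in plain Python as a search for feature pairs whose Pearson correlation is at or near 1 in absolute value. The column names, values, and the 0.999 threshold are hypothetical; the original project ran its analysis in PySpark on the real audio features.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / (std_x * std_y)

def highly_correlated_pairs(columns, threshold=0.999):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 4)))
    return pairs

# Hypothetical feature columns: f2 is an exact multiple of f1.
columns = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "f3": [5.0, 3.0, 4.0, 1.0, 2.0],
}

pairs = highly_correlated_pairs(columns)
print(pairs)
```

Dropping one column from each flagged pair is one simple way to remove the redundancy before fitting a linear model.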
In practice
- Use AUROC and recall for imbalanced classification.
- Cap or filter outlier play counts in recommender data.
- Automate schema generation for feature datasets.
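Capping outlier play counts, as the list above suggests, can be sketched as clipping every count at a chosen percentile. The 99th percentile, the nearest-rank percentile definition, and the toy play records below are all assumptions for illustration; the source does not state which cap was used.

```python
import math

def percentile_cap(values, pct=99.0):
    """Nearest-rank percentile: smallest value covering at least pct% of the data."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]

def cap_play_counts(plays, pct=99.0):
    """Clip each (user, song, count) record at the pct-th percentile of counts."""
    cap = percentile_cap([count for _, _, count in plays], pct)
    capped = [(user, song, min(count, cap)) for user, song, count in plays]
    return capped, cap

# Toy data: 99 ordinary listeners with 1 to 5 plays, plus one extreme account.
plays = [(f"u{i}", f"s{i % 7}", 1 + i % 5) for i in range(99)]
plays.append(("u_outlier", "s0", 9000))

capped, cap = cap_play_counts(plays, pct=99.0)
print("cap:", cap, "max after capping:", max(c for _, _, c in capped))
```

On a dataset of this size in Spark, the cap would more likely come from an approximate quantile computation than from a full sort.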
Topics
- PySpark
- Genre Classification
- Music Recommendation
- Million Song Dataset
- Data Imbalance
- Collaborative Filtering
- Feature Engineering
Best for: Data Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning on Medium.