Dirty Data, Broken Models: Cleaning an Audiobook Dataset for Machine Learning
Summary
This article details the data cleaning process for an Audible audiobook dataset, preparing it for machine learning applications. It identifies critical data quality issues, particularly within the "stars" column, which initially contained 665 unique, unstandardized text values like "4.5 out of 5 stars181 ratings". The cleaning involved extracting "star_rating" and "rating_count" into separate numerical fields, resulting in standardized "star_rating" values such as 1.0, 1.5, up to 5.0. Additionally, the "author" and "narrator" columns were cleaned by removing prefixes like "Writtenby:" and "Narratedby:". The transformed dataset is now suitable for analysis and will be used in a subsequent project phase to build a recommendation system using clustering algorithms like K-Means, Hierarchical Clustering, and DBSCAN.
Key takeaway
For data scientists and ML engineers working with real-world datasets, cleaning raw data is a prerequisite for reliable models. If your dataset contains combined text fields like "4.5 out of 5 stars181 ratings" or prefixed strings like "Writtenby:", parse them into separate, standardized fields before modeling. Numerical, consistent features such as "star_rating" and "rating_count" directly affect the accuracy of downstream machine learning models, including recommendation systems.
Key insights
Real-world datasets demand meticulous cleaning to transform inconsistent, raw data into structured features suitable for machine learning models.
Principles
- High-quality data improves model accuracy.
- Unstructured text often contains multiple features.
- Preprocessing enhances data readability.
Method
Identify combined text fields, extract numerical components using regex, convert to numeric types, and remove unnecessary prefixes from string fields to standardize data for ML.
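The extraction step can be sketched with Python's standard `re` module. This is a minimal illustration, not the original article's code: the pattern and the `parse_stars` helper are assumptions based only on the single example value "4.5 out of 5 stars181 ratings", and a real dataset may contain other formats that need their own handling.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern, assuming values look like "4.5 out of 5 stars181 ratings".
# The count may contain thousands separators, and "rating" may be singular.
RATING_RE = re.compile(
    r"(?P<stars>\d+(?:\.\d+)?)\s*out of 5 stars\s*(?P<count>[\d,]+)\s*ratings?"
)


def parse_stars(raw: str) -> Tuple[Optional[float], Optional[int]]:
    """Return (star_rating, rating_count), or (None, None) if the text does not match."""
    match = RATING_RE.search(raw)
    if match is None:
        # Leave unparseable values empty rather than guessing a rating.
        return None, None
    star_rating = float(match.group("stars"))
    rating_count = int(match.group("count").replace(",", ""))
    return star_rating, rating_count


print(parse_stars("4.5 out of 5 stars181 ratings"))  # (4.5, 181)
```

Returning `(None, None)` for non-matching text keeps missing ratings explicit, so they can be imputed or dropped later instead of being silently coerced to zero.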
In practice
- Apply regex to parse complex rating strings.
- Convert extracted rating values to floats.
- Strip "Writtenby:" and "Narratedby:" prefixes.
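Prefix removal for the "author" and "narrator" columns can be sketched in a similar way. The sample row and the `strip_prefix` helper below are hypothetical and only illustrate the idea; the names in the row are invented placeholders, not values from the dataset.

```python
def strip_prefix(value: str, prefix: str) -> str:
    """Remove `prefix` from the start of `value` if present, then trim whitespace."""
    if value.startswith(prefix):
        value = value[len(prefix):]
    return value.strip()


# Hypothetical sample row; real values come from the audiobook dataset.
rows = [
    {"author": "Writtenby:Jane Doe", "narrator": "Narratedby:John Roe"},
]

cleaned = [
    {
        "author": strip_prefix(row["author"], "Writtenby:"),
        "narrator": strip_prefix(row["narrator"], "Narratedby:"),
    }
    for row in rows
]

print(cleaned)  # [{'author': 'Jane Doe', 'narrator': 'John Roe'}]
```

Checking `startswith` before slicing means values that lack the prefix pass through unchanged, which makes the step safe to run more than once.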
Topics
- Data Cleaning
- Machine Learning Datasets
- Feature Engineering
- Audiobook Data
- Data Preprocessing
- Recommendation Systems
Best for: Machine Learning Engineer, Data Scientist, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning on Medium.