5 categorical encodings, 5 wrong answers: one-hot, label, target, ordered target, and what CatBoost actually does
Summary
The article opens by framing categorical features as a challenge in nearly every tabular machine learning project: country codes, device types, browsers, postcodes, merchant IDs, customer segments, and SKUs. These features appear at the very start of almost any structured-data analysis. The title points to a discussion of five encoding approaches: one-hot, label, target, ordered target, and what CatBoost actually does.
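For readers who want the named encodings side by side, the following is a minimal plain-Python sketch based on their textbook definitions. It is not code from the original article: the function names, the `browsers` and `clicked` toy data, and the `prior` and `weight` smoothing parameters are all assumptions made for illustration. The ordered variant walks the rows in their given order, whereas CatBoost applies the same idea over random permutations of the training rows.

```python
from collections import defaultdict


def one_hot(values):
    """One 0/1 indicator column per distinct category (columns sorted by name)."""
    cats = sorted(set(values))
    return [[1 if v == c else 0 for c in cats] for v in values]


def label_encode(values):
    """Arbitrary integer ID per category; the implied ordering carries no meaning."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    return [index[v] for v in values]


def target_encode(values, targets):
    """Mean target per category over ALL rows, so each row sees its own label."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, y in zip(values, targets):
        sums[v] += y
        counts[v] += 1
    return [sums[v] / counts[v] for v in values]


def ordered_target_encode(values, targets, prior=0.5, weight=1.0):
    """Mean target per category over EARLIER rows only, smoothed toward `prior`.

    `prior` and `weight` are illustrative smoothing parameters (assumed values).
    """
    sums, counts = defaultdict(float), defaultdict(int)
    encoded = []
    for v, y in zip(values, targets):
        # Encode first, using only statistics from rows already processed.
        encoded.append((sums[v] + weight * prior) / (counts[v] + weight))
        # Only then add the current row's label to the running statistics.
        sums[v] += y
        counts[v] += 1
    return encoded


# Hypothetical toy data: browser per session and whether the user clicked.
browsers = ["chrome", "safari", "chrome", "firefox", "chrome", "safari"]
clicked = [1, 0, 1, 0, 0, 1]

for name, fn in [("label", label_encode), ("one-hot", one_hot)]:
    print(name, fn(browsers))
print("target", target_encode(browsers, clicked))
print("ordered target", ordered_target_encode(browsers, clicked))
```

The contrast between the last two functions is the part worth noticing: plain target encoding lets every row contribute its own label to its encoded value, while the ordered version only uses rows that came before it.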
Key takeaway
If you are a machine learning engineer starting a tabular project, expect categorical features such as country codes or merchant IDs from the outset. Handling them properly is foundational for later model development, so budget time early in data preparation to understand and encode them.
Key insights
Categorical features are ubiquitous in tabular ML projects.
Topics
- Categorical Features
- Feature Encoding
- Tabular Machine Learning
- CatBoost
- One-Hot Encoding
- Label Encoding
Best for: Machine Learning Engineer, Data Scientist, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Valeriy’s Substack.