5 categorical encodings, 5 wrong answers: one-hot, label, target, ordered target, and what CatBoost actually does

· Source: Valeriy’s Substack · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, quick

Summary

The article opens by framing categorical features as a ubiquitous challenge in tabular machine learning: country codes, device types, browsers, postcodes, merchant IDs, customer segments, and SKUs appear at the start of nearly every structured-data project. Its title promises a comparison of five encoding approaches: one-hot, label, target, ordered target, and the approach CatBoost actually uses.
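For readers who have not met the four named encodings, the following is a minimal sketch in plain Python, not code from the article. The toy `countries` and `clicked` data, the function names, and the `prior` and `weight` smoothing parameters are all assumptions made for illustration. The ordered variant only shows the general idea of computing a target statistic from earlier rows; it is not a reproduction of CatBoost's implementation.

```python
from collections import defaultdict

# Toy data, invented for illustration: one categorical column and a binary target.
countries = ["US", "DE", "US", "FR", "DE", "US"]
clicked = [1, 0, 0, 1, 1, 1]


def one_hot(values):
    """One 0/1 indicator column per distinct category (columns in sorted order)."""
    cats = sorted(set(values))
    return [[int(v == c) for c in cats] for v in values]


def label(values):
    """One arbitrary integer per category (here: position in sorted order)."""
    codes = {c: i for i, c in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def target_mean(values, y):
    """Per-category mean of the target over ALL rows.

    Each row's encoding includes that row's own label, which leaks the target.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, y):
        sums[v] += t
        counts[v] += 1
    return [sums[v] / counts[v] for v in values]


def ordered_target(values, y, prior=0.5, weight=1.0):
    """Per-category target statistic using only EARLIER rows, smoothed toward a prior.

    `prior` and `weight` are hypothetical smoothing parameters for this sketch.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    encoded = []
    for v, t in zip(values, y):
        # Encode first, using history only; then add this row to the history.
        encoded.append((sums[v] + prior * weight) / (counts[v] + weight))
        sums[v] += t
        counts[v] += 1
    return encoded


if __name__ == "__main__":
    print(one_hot(countries))
    print(label(countries))
    print(target_mean(countries, clicked))
    print(ordered_target(countries, clicked))
```

The difference between the last two functions is the point of the comparison: the plain target mean lets every row see its own label, while the ordered version encodes a row before adding its label to the running totals.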

Key takeaway

If you are a machine learning engineer starting a tabular project, expect categorical features such as country codes or merchant IDs from the outset. Account for them in early data preparation, because how they are encoded underpins every later modeling step.

Key insights

Categorical features are ubiquitous in tabular ML projects.

Best for: Machine Learning Engineer, Data Scientist, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Valeriy’s Substack.