Where CatBoost beats XGBoost and LightGBM — and what the book is honest about

Source: Valeriy’s Substack · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, medium

Summary

This analysis compares CatBoost with XGBoost and LightGBM, drawing on architectural details from the book "Mastering CatBoost" and on independent benchmark studies. No single library dominates all regression tasks, but CatBoost has the strongest mean rank among GBDTs across diverse tabular datasets: it takes 19 best-score wins (17.1%) on the 111 datasets of Shmuel et al. and a mean rank of 5.06 on the 176 datasets of McElfresh et al. It does particularly well with high-cardinality categorical features, mixed-type features, and moderate-to-high noise, which the analysis attributes to its native ordered-target-statistics encoding and ordered boosting. LightGBM is favored for training speed on large datasets and a low memory footprint, while XGBoost benefits from community adoption, flexible tree construction, and strong GPU implementations. The book stresses that performance is dataset-dependent and advocates direct benchmarking.
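To make the encoding named above concrete, here is a minimal sketch of ordered target statistics for a single categorical column. It assumes the commonly published form of the idea: each row is encoded using only the targets of rows that come before it in a random permutation, smoothed toward a prior. The function name `ordered_target_stats` and its parameters (`prior`, `a`, `order`) are hypothetical names chosen for illustration; this is not CatBoost's actual implementation, which is considerably more involved.

```python
import random


def ordered_target_stats(categories, targets, prior, a=1.0, order=None, seed=0):
    """Encode one categorical column with ordered target statistics (sketch).

    For each row, the encoding is
        (sum of earlier targets in the same category + a * prior) / (earlier count + a)
    where "earlier" is defined by a permutation of the rows. Because a row never
    sees its own target, the encoding does not leak the label it will be used
    to predict.
    """
    n = len(categories)
    if order is None:
        # Assumption: a single random permutation, fixed by `seed`.
        order = list(range(n))
        random.Random(seed).shuffle(order)

    sums, counts = {}, {}
    encoded = [0.0] * n
    for i in order:
        cat = categories[i]
        s = sums.get(cat, 0.0)
        k = counts.get(cat, 0)
        encoded[i] = (s + a * prior) / (k + a)
        # Only now does row i's target become visible to later rows.
        sums[cat] = s + targets[i]
        counts[cat] = k + 1
    return encoded


cats = ["a", "a", "b", "a"]
ys = [1, 0, 1, 1]
result = ordered_target_stats(cats, ys, prior=0.5, a=1.0, order=[0, 1, 2, 3])
```

The first row of each category falls back to the prior, and later rows drift toward the running category mean. A plain (non-ordered) target mean would instead use every row's target, including the row's own, which is the leakage this scheme is meant to avoid.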

Key takeaway

Machine learning engineers evaluating gradient boosting libraries should prioritize benchmarking CatBoost when their datasets include high-cardinality categorical features or the task requires robust quantile regression. Its native ordered-target-statistics encoding and ordered boosting offer distinct advantages. Even so, always run a fair comparison against LightGBM (for speed) and XGBoost (for ecosystem maturity) on your specific workload, because no single library outperforms the others everywhere.

Key insights

CatBoost has the best mean rank among the GBDT libraries in the cited benchmarks on diverse tabular datasets, and it does especially well with categorical features because of its ordered-target-statistics encoding and ordered boosting.

Principles

Method

Compare CatBoost, XGBoost, and LightGBM using identical preprocessing, comparable hyperparameter searches, and the same evaluation protocol.
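A minimal sketch of such a protocol is shown here, assuming every model exposes scikit-learn style `fit(X, y)` and `predict(X)` methods. The helper names (`make_folds`, `compare_models`) and the two baseline classes are hypothetical stand-ins so the example runs without any third-party library. The sketch covers only the shared splits and shared metric; giving each library a comparable hyperparameter search budget is left out.

```python
import random
from statistics import mean, median


def make_folds(n_rows, n_splits=5, seed=0):
    """Build (train_idx, test_idx) pairs once so every model sees the same splits."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    parts = [idx[k::n_splits] for k in range(n_splits)]
    folds = []
    for k in range(n_splits):
        train = sorted(i for j, p in enumerate(parts) if j != k for i in p)
        folds.append((train, sorted(parts[k])))
    return folds


def rmse(y_true, y_pred):
    """Root mean squared error, used as the single shared metric."""
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def compare_models(factories, X, y, n_splits=5, seed=0):
    """Return the mean cross-validated RMSE per model name.

    `factories` maps a name to a zero-argument callable returning a fresh,
    unfitted model, so no state carries over between folds.
    """
    folds = make_folds(len(X), n_splits, seed)
    scores = {}
    for name, make_model in factories.items():
        fold_scores = []
        for train_idx, test_idx in folds:
            model = make_model()
            model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            preds = model.predict([X[i] for i in test_idx])
            fold_scores.append(rmse([y[i] for i in test_idx], preds))
        scores[name] = mean(fold_scores)
    return scores


class MeanBaseline:
    """Stand-in model: always predicts the training mean."""

    def fit(self, X, y):
        self.value_ = mean(y)
        return self

    def predict(self, X):
        return [self.value_] * len(X)


class MedianBaseline(MeanBaseline):
    """Stand-in model: always predicts the training median."""

    def fit(self, X, y):
        self.value_ = median(y)
        return self


# Toy data; in a real comparison the factories would build the three
# boosting regressors instead of these baselines.
X = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]
scores = compare_models({"mean": MeanBaseline, "median": MedianBaseline}, X, y)
```

In a real run the factories would return, for example, `CatBoostRegressor`, `XGBRegressor`, and `LGBMRegressor` instances. Because the folds are built once and reused, differences in the scores reflect the models rather than the splits.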

Best for: AI Engineer, Machine Learning Engineer, Data Scientist, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Valeriy’s Substack.