CatBoost: Powerful, efficient ML for large tabular datasets

CatBoost is making waves in open-source ML because it is often the top-performing approach for tasks as diverse as classification, regression, ranking, and recommendation. This is especially true when working with tabular data that include categorical variables.

With this justifiable excitement in mind, this “Five-Minute Friday” episode of SuperDataScience, hosted by our Chief Data Scientist, Jon Krohn, is dedicated to CatBoost (short for “category” and “boosting”).

CatBoost has been around since 2017, when it was released by Yandex, a tech giant based in Moscow. In a nutshell, CatBoost, like the more established (and regularly Kaggle-leaderboard-topping) approaches XGBoost and LightGBM, is at its heart a decision-tree algorithm that leverages gradient boosting. So that explains the “boost” part of CatBoost.

The “cat” (“category”) part comes from CatBoost’s superior handling of categorical features. If you’ve trained models with categorical data before, you’ve likely experienced the tedium of the preprocessing and feature engineering it requires. CatBoost comes to the rescue here, dealing with categorical variables efficiently through a novel algorithm that eliminates the need for extensive preprocessing or manual feature engineering. It handles categorical features automatically: low-cardinality features can be one-hot encoded, while the rest are converted to numbers with a form of target encoding (ordered target statistics) that, for each training row, uses only the targets of rows that come before it in a random permutation of the data.
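To make the target-encoding idea concrete, here is a minimal pure-Python sketch of an ordered target statistic. It is an illustration rather than CatBoost's actual implementation: the function name `ordered_target_stats`, the smoothing parameter `prior_weight`, and the toy data are all assumptions made for this example. Each row is encoded using only the targets of same-category rows that appear earlier in a random permutation, so a row's own label never leaks into its own feature value.

```python
import random


def ordered_target_stats(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode categorical values with an ordered target statistic.

    Hypothetical helper for illustration only. Each row is encoded from the
    targets of same-category rows visited earlier in a random permutation,
    smoothed toward `prior`.
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # random visiting order over the rows

    sums, counts = {}, {}  # running target sum and row count per category
    encoded = [0.0] * len(categories)
    for i in order:
        cat = categories[i]
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Only rows seen so far contribute; the current row's target is excluded.
        encoded[i] = (s + prior_weight * prior) / (n + prior_weight)
        # Update the running statistics after encoding the current row.
        sums[cat] = s + targets[i]
        counts[cat] = n + 1
    return encoded


# Toy data (made up): a categorical feature and a binary target.
colors = ["red", "blue", "red", "red", "blue"]
clicked = [1, 0, 1, 0, 1]
prior = sum(clicked) / len(clicked)  # global mean used as the prior
encoded = ordered_target_stats(colors, clicked, prior)
print(encoded)
```

The first row visited for each category has no history, so it simply receives the prior. Later rows drift toward the running mean of their category.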

In addition to CatBoost’s superior handling of categorical features, the algorithm also makes use of:
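• A specialized gradient boosting scheme known as Ordered Boosting, which trains on random permutations of the training data so that the gradient for each example is estimated by a model that has not seen that example. This reduces the target leakage (“prediction shift”) that can bias standard gradient boosting.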
• Symmetric (also called oblivious) decision trees, which apply the same split condition to every node at a given depth. This regular structure enables a faster training time relative to XGBoost and a comparable training time to LightGBM, which is famous for its speed.
• Regularization techniques, such as the well-known L2 regularization, which together with the ordered boosting and symmetric trees already discussed make CatBoost less likely to overfit to training data than other boosted-tree algorithms.
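As a rough illustration of why symmetric trees are cheap to evaluate, the sketch below represents a tree as one (feature, threshold) pair per level plus a flat list of leaf values. The function name `predict_symmetric_tree`, the bit ordering, and the toy numbers are assumptions for this example and are not taken from CatBoost's internals. Because every node at a given depth shares the same split, a prediction reduces to building a binary index and doing a single lookup.

```python
def predict_symmetric_tree(x, splits, leaf_values):
    """Evaluate one symmetric (oblivious) tree on a feature vector `x`.

    Hypothetical helper for illustration only.
    splits: list of (feature_index, threshold), one pair per tree level.
    leaf_values: list of 2 ** len(splits) leaf predictions.
    """
    index = 0
    for feature_index, threshold in splits:
        # Every node on this level uses the same test, so one bit per level
        # is enough to identify the leaf.
        bit = 1 if x[feature_index] > threshold else 0
        index = (index << 1) | bit
    return leaf_values[index]


# Toy depth-2 tree (made-up numbers): level 0 tests feature 0, level 1 tests feature 1.
splits = [(0, 0.5), (1, 10.0)]
leaf_values = [0.1, 0.2, 0.3, 0.4]

prediction_a = predict_symmetric_tree([0.7, 3.0], splits, leaf_values)
prediction_b = predict_symmetric_tree([0.2, 20.0], splits, leaf_values)
print(prediction_a, prediction_b)
```

A depth-d tree of this kind needs only d comparisons and no pointer chasing per prediction, which is one reason the structure suits fast, vectorized evaluation.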

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
