XGBoost: The Ultimate Classifier, with Matt Harrison

Data Science

XGBoost is typically the most powerful ML option whenever you’re working with structured data. In this SuperDataScience episode, our Chief Data Scientist, Jon Krohn, talks to world-leading XGBoost expert, Matt Harrison, about how it works and how to make the most of it.

• Matt is the author of seven best-selling books on Python and machine learning.
• His most recent book, “Effective XGBoost”, was published in March.
• Teaches “Exploratory Data Analysis with Python” at Stanford University.
• Through his consultancy MetaSnake, he’s taught Python at leading global organizations like NASA, Netflix, and Qualcomm.
• Previously worked as a CTO and Software Engineer.
• Holds a degree in Computer Science from Stanford.

This episode will appeal primarily to practicing data scientists, whether you’re new to XGBoost or looking to deepen your expertise by learning from a world-leading educator on the library.

In this episode, Matt details:
• Why XGBoost is the go-to library for attaining the highest accuracy when building a classification model.
• Modeling situations where XGBoost should not be your first choice.
• The XGBoost hyperparameters to adjust to squeeze every bit of juice out of your tabular training data and his recommended library for automating hyperparameter selection.
• His top Python libraries for other XGBoost-related tasks such as data preprocessing, visualizing model performance, and model explainability.
• Languages beyond Python that have convenient wrappers for applying XGBoost.
• Best practices for communicating XGBoost results to non-technical stakeholders.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.


Getting Value From A.I.

In February 2023, our Chief Data Scientist, Jon Krohn, delivered this keynote on “Getting Value from A.I.” to open the second day of Hg Capital’s “Digital Forum” in London.


The Chinchilla Scaling Laws

The Chinchilla Scaling Laws dictate the amount of training data needed to optimally train a Large Language Model (LLM) of a given size. For Five-Minute Friday, our Chief Data Scientist, Jon Krohn, covers this ratio and the LLMs that have arisen from it.
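As a back-of-the-envelope sketch, the Chinchilla result is often summarized as a rule of thumb of roughly 20 training tokens per model parameter. The helper below uses that approximation (not the paper’s exact fitted constants) to estimate a compute-optimal token budget:

```python
# Rough sketch of the Chinchilla rule of thumb: ~20 training tokens
# per model parameter (an approximation, not the paper's fitted values).
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training-token count for a model size."""
    return n_params * tokens_per_param

# e.g. a 70-billion-parameter model would call for ~1.4 trillion tokens
budget = chinchilla_optimal_tokens(70e9)
print(f"{budget:.2e} tokens")
```

Under this rule, most early LLMs were substantially undertrained for their size, which is the ratio Jon unpacks in the episode.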


StableLM: Open-Source “ChatGPT”-Like LLMs You Can Fit on One GPU

The folks who open-sourced Stable Diffusion have now released “StableLM”, their first language models. Pre-trained on an unprecedented amount of data for single-GPU LLMs (1.5 trillion tokens!), these are small but mighty.
