Scikit-learn’s Past, Present and Future, with scikit-learn co-founder Dr. Gaël Varoquaux

Data Science

In this episode of SuperDataScience, our Chief Data Scientist, Jon Krohn, traveled to Paris to interview Dr. Gaël Varoquaux, co-founder of scikit-learn, the standard library for machine learning worldwide (downloaded over 1.4 million times per day).

More on Gaël:

• Actively leads development of the ubiquitous scikit-learn Python library, to which several thousand people have contributed open-source code.

• Is Research Director at the famed Inria (the French National Institute for Research in Digital Science and Technology), where he leads the Soda (“social data”) team that is focused on making a major positive social impact with data science.

• Has been recognized with the Innovation Prize from the French Academy of Sciences and many other awards for his invaluable work.

This episode will likely be of primary interest to hands-on practitioners like data scientists and ML engineers, but anyone who’d like to understand the cutting edge of open-source machine learning should listen in.

In this episode, Gaël details:

• The genesis, present capabilities and fast-moving future direction of scikit-learn.

• How best to apply scikit-learn to your particular ML problem (see the brief sketch after this list).

• How ever-larger datasets and GPU-based accelerations impact the scikit-learn project.

• How (whether you write code or not!) you can get started contributing to a mega-impactful open-source project like scikit-learn yourself.

• Hugely successful social-impact data science projects his Soda team has delivered recently.

• Why statistical rigor is more important than ever and how software tools could nudge us in the direction of making more statistically sound decisions.
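As a taste of the scikit-learn topic above, here is a minimal sketch of what applying scikit-learn to a tabular classification problem typically looks like; the dataset and model choices below are illustrative assumptions, not recommendations from the episode.

```python
# A minimal, hypothetical example of applying scikit-learn to a classification task.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in tabular dataset and hold out a test split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chain preprocessing and the estimator so the same steps apply at train and test time.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")
```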

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.


Getting Value From A.I.

In February 2023, our Chief Data Scientist, Jon Krohn, delivered this keynote on “Getting Value from A.I.” to open the second day of Hg Capital’s “Digital Forum” in London.

read full post

The Chinchilla Scaling Laws

The Chinchilla Scaling Laws dictate the amount of training data needed to optimally train a Large Language Model (LLM) of a given size. For Five-Minute Friday, our Chief Data Scientist, Jon Krohn, covers this ratio and the LLMs that have arisen from it.
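As a rough illustration of that ratio, the sketch below encodes the widely cited Chinchilla rule of thumb of about 20 training tokens per model parameter; the factor of 20 is an approximation from the Chinchilla paper (Hoffmann et al., 2022), not a figure quoted from this episode.

```python
# Minimal sketch of the Chinchilla compute-optimal rule of thumb:
# roughly 20 training tokens per model parameter (an approximation).

def chinchilla_optimal_tokens(n_parameters: float, tokens_per_param: float = 20.0) -> float:
    """Estimate the compute-optimal number of training tokens for a model size."""
    return n_parameters * tokens_per_param

# Example: a 70-billion-parameter model (the size of the original Chinchilla)
# comes out to roughly 1.4 trillion training tokens, matching the paper.
print(f"{chinchilla_optimal_tokens(70e9) / 1e12:.1f} trillion tokens")
```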

read full post

StableLM: Open-Source “ChatGPT”-Like LLMs You Can Fit on One GPU

The folks who open-sourced Stable Diffusion have now released “StableLM”, their first language models. Pre-trained on an unprecedented amount of data for single-GPU LLMs (1.5 trillion tokens!), these are small but mighty.

read full post