Technical Intro to Transformers and LLMs, with Kirill Eremenko


In this SuperDataScience episode hosted by our Chief Data Scientist, Jon Krohn, the indefatigable SuperDataScience Founder Kirill Eremenko gives a detailed technical intro to Transformers and how they’re scaled up to give Large Language Models like GPT-4, Llama 2 and Gemini their mind-blowing abilities.

If you don’t already know him, Kirill:

• Is Founder and CEO of SuperDataScience, an e-learning platform that is the namesake of this podcast.

• Founded the SuperDataScience Podcast in 2016 and hosted the show until he passed the reins to Jon three years ago.

• Has reached more than 2.6 million students through the courses he’s published on Udemy, making him Udemy’s most popular data science instructor.

Today’s episode is perhaps the most technical in this podcast’s history, so it will probably appeal mostly to hands-on practitioners like data scientists and ML engineers, particularly those who already have some understanding of deep learning.

In this episode, Kirill details:

• The history of the Attention mechanism in natural-language models.

• How compute-efficient Attention is enabled by the Transformer, a transformative deep-neural-network architecture.

• How Transformers work, across each of five distinct data-processing stages.

• How Transformers are scaled up to power the mind-blowing capabilities of the LLMs behind modern Generative A.I. models.

• Why knowing all of this is so helpful — and lucrative — in a data science career.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at



Getting Value From A.I.

In February 2023, our Chief Data Scientist, Jon Krohn, delivered this keynote on “Getting Value from A.I.” to open the second day of Hg Capital’s “Digital Forum” in London.


The Chinchilla Scaling Laws

The Chinchilla Scaling Laws dictate the amount of training data needed to optimally train a Large Language Model (LLM) of a given size. For Five-Minute Friday, our Chief Data Scientist, Jon Krohn, covers this ratio and the LLMs that have arisen from it.


StableLM: Open-Source “ChatGPT”-Like LLMs You Can Fit on One GPU

The folks who open-sourced Stable Diffusion have now released “StableLM”, their first language models. Pre-trained on an unprecedented amount of data for single-GPU LLMs (1.5 trillion tokens!), these are small but mighty.
