Full Encoder-Decoder Transformers Fully Explained, with Kirill Eremenko


In February 2024, Kirill Eremenko was on the SuperDataScience Podcast hosted by our Chief Data Scientist, Jon Krohn, to detail Decoder-Only Transformers (like the GPT series). It was Jon’s most popular episode ever, so Kirill came right back to detail an even more sophisticated architecture: Encoder-Decoder Transformers.

If you don’t already know him, Kirill:

• Is Founder and CEO of SuperDataScience, an e-learning platform that is the namesake of this podcast.

• Founded the SuperDataScience Podcast in 2016 and hosted the show until he passed Jon the reins a little over three years ago.

• Has reached more than 2.7 million students through the courses he’s published on Udemy, making him Udemy’s most popular data science instructor.

Kirill was most recently on the show for Episode #747 to provide a technical introduction to the Transformer module that underpins all the major modern Large Language Models (LLMs) like the GPT, Gemini, Llama and BERT architectures.

That episode, #747, and today’s are perhaps the two most technical episodes of this podcast ever, so they will probably appeal most to hands-on practitioners like data scientists and ML engineers, particularly those who already have some understanding of deep neural networks.

In this episode, Kirill:

• Reviews the key Transformer theory that we covered in Episode #747, namely the individual neural-network components of the Decoder-Only architecture that prevails in generative LLMs like the GPT series models.

• Builds on that to detail the full Encoder-Decoder Transformer architecture, which Google introduced with the original Transformer in the “Attention Is All You Need” paper and which is also used in other models that excel at both natural-language understanding and generation, such as T5 and BART.

• Discusses the performance and capability pros and cons of full Encoder-Decoder architectures relative to Decoder-Only architectures like GPT and Encoder-Only architectures like BERT.
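For readers who want something concrete to hold onto while listening, here is a minimal NumPy sketch of the three attention patterns these architectures are built from. It is an illustration under simplifying assumptions, not code from the episode: the learned query/key/value projections, multiple heads, feed-forward layers and positional encodings are all omitted, and the names `attention`, `source` and `target`, as well as the toy sizes, are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention. `mask` is True where attending is allowed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        # Disallowed positions get a very negative score, so their weight is ~0.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8                               # toy embedding size (assumption)
source = rng.normal(size=(5, d_model))    # 5 input tokens, random stand-in embeddings
target = rng.normal(size=(3, d_model))    # 3 output tokens generated so far

# Encoder-style self-attention (as in BERT): every token sees every other token.
enc_out, enc_w = attention(source, source, source)

# Decoder-style masked self-attention (as in GPT): a token sees only itself
# and earlier tokens, enforced with a lower-triangular mask.
causal = np.tril(np.ones((3, 3), dtype=bool))
dec_out, dec_w = attention(target, target, target, mask=causal)

# Cross-attention (the encoder-decoder link): queries come from the decoder,
# keys and values come from the encoder output.
cross_out, cross_w = attention(dec_out, enc_out, enc_out)

print(enc_w.shape, dec_w.shape, cross_w.shape)
```

In this simplified picture, an Encoder-Only model uses just the first pattern, a Decoder-Only model just the second, and a full Encoder-Decoder model combines all three.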

The SuperDataScience Podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.


Getting Value From A.I.

In February 2023, our Chief Data Scientist, Jon Krohn, delivered this keynote on “Getting Value from A.I.” to open the second day of Hg Capital’s “Digital Forum” in London.


The Chinchilla Scaling Laws

The Chinchilla Scaling Laws dictate the amount of training data needed to optimally train a Large Language Model (LLM) of a given size. For Five-Minute Friday, our Chief Data Scientist, Jon Krohn, covers this ratio and the LLMs that have arisen from it.


StableLM: Open-Source “ChatGPT”-Like LLMs You Can Fit on One GPU

The folks who open-sourced Stable Diffusion have now released “StableLM”, their first Language Models. Pre-trained on an unprecedented amount of data for single-GPU LLMs (1.5 trillion tokens!), these are small but mighty.
