The Chinchilla Scaling Laws


In this episode of the SuperDataScience podcast, our Chief Data Scientist, Jon Krohn, revisits the Chinchilla Scaling Laws, which he first introduced in episode 670 while discussing the LLaMA model architecture. Derived from a study by Google DeepMind, these laws are crucial for understanding how to train large language models efficiently, particularly models like those behind ChatGPT and GPT-4.

Chinchilla is a 70-billion-parameter model, named in contrast to the larger Gopher model. It applies the scaling laws to balance model size against the amount of training data, and it outperforms its larger counterparts. This illustrates that larger isn’t always better; the right ratio of training data to parameters can yield models that are more efficient and just as powerful.

The researchers trained over 400 transformer models, ranging from 70 million to 16 billion parameters, on datasets of 5 to 500 billion tokens. From these runs they established a compute-optimal ratio of roughly 20:1 between training tokens and model parameters. For a model with one billion parameters, this means training on about 20 billion tokens. The ratio gives practitioners a rule of thumb for building models that are both robust and cost-effective.
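To make the 20:1 rule concrete, here is a minimal Python sketch. The function names are hypothetical, chosen for illustration. The training-cost estimate uses the common approximation of about 6 floating-point operations per parameter per token, which is an assumption added here and not something stated in the episode summary.

```python
import math

TOKENS_PER_PARAM = 20        # approximate compute-optimal ratio from the text
FLOPS_PER_PARAM_TOKEN = 6    # assumption: common rule-of-thumb training cost


def chinchilla_optimal_tokens(n_params, ratio=TOKENS_PER_PARAM):
    """Training tokens suggested by the ~20:1 rule for a given model size."""
    return n_params * ratio


def training_flops(n_params, n_tokens):
    """Rough training compute, assuming about 6 * N * D FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal_split(flops_budget, ratio=TOKENS_PER_PARAM):
    """Invert the two rules above: given a FLOPs budget, return (params, tokens).

    With D = ratio * N and C = 6 * N * D, it follows that
    N = sqrt(C / (6 * ratio)).
    """
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * ratio))
    return n_params, n_params * ratio


if __name__ == "__main__":
    n = 1_000_000_000                      # one billion parameters
    d = chinchilla_optimal_tokens(n)       # 20 billion tokens
    print(f"{n:,} parameters -> {d:,} tokens")
    print(f"approximate training compute: {training_flops(n, d):.2e} FLOPs")
```

Because the ratio is fixed, doubling the parameter count also doubles the suggested token count, so the estimated compute grows roughly fourfold. This is only a rule of thumb under the stated assumptions, not a substitute for the fitted laws in the original study.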

The implications of the Chinchilla Scaling Laws are profound. They suggest a shift from merely expanding model size to matching the amount of training data to the model, which could lead to cost-effective and powerful models for a broader range of applications. Moreover, the emergence of open-source models like Cerebras-GPT, which adhere to these laws, further democratizes access to high-quality language models.

While the Chinchilla Scaling Laws offer a path to more efficient large language model training, they also point to a looming “AI Brick Wall”: the practical limit on scaling imposed by astronomical costs and data requirements. Innovators must weigh both the potential and the limitations as they push the boundaries of AI capabilities.

This episode sheds light on the significant strides and considerations in optimizing large language models. As the field continues to evolve rapidly, these insights not only guide researchers and developers but also spark curiosity and innovation in utilizing these powerful tools.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at


