ChatGPT Code Interpreter: 5 Hacks for Data Scientists

Data Science

The ChatGPT Code Interpreter is surreal: It creates and executes Python code for whatever task you describe, debugs its own runtime errors, displays charts, handles file uploads and downloads, and suggests sensible next steps all along the way.

Whether you write code yourself today or not, you can take advantage of GPT-4’s stellar natural-language input/output capabilities to interact with the Code Interpreter. The mind-blowing experience is equivalent to having an expert data analyst, data scientist or software developer with you to instantaneously respond to your questions or requests.

As an example of these jaw-dropping capabilities, our Chief Data Scientist, Jon Krohn, uses this SuperDataScience episode to demonstrate the ChatGPT Code Interpreter’s full automation of data analysis and machine learning. If you watch the episode on YouTube, you can even see the Code Interpreter in action as he interacts with it using only natural language.

Over the course of this episode/video, the Code Interpreter:
1. Receives a sample data file that he provides.
2. Uses natural language to describe all of the variables that are in the file.
3. Performs a four-step Exploratory Data Analysis (EDA), including histograms, scatterplots comparing key variables, and key summary statistics (all explained in natural language).
4. Preprocesses all of the variables for machine learning.
5. Selects an appropriate baseline ML model, trains it and quantitatively evaluates its performance.
6. Suggests alternative models and approaches (e.g., grid search) to get even better performance and then automatically carries these out.
7. Optionally provides Python code every step of the way and is delighted to answer any questions he has about the code.
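For readers who'd like to see what the middle of that workflow looks like as code, here is a minimal sketch of steps 3–6 (summary statistics, preprocessing, a baseline model, and a grid search). It is not the Code Interpreter's actual output; the synthetic dataset, column names, and hyperparameter grid are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for an uploaded data file
X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=42)
df = pd.DataFrame(X, columns=[f"x{i}" for i in range(4)])
df["target"] = y

# Step 3: key summary statistics for the EDA
print(df.describe())

# Step 4: preprocessing (train/test split and feature scaling)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Step 5: baseline model, trained and quantitatively evaluated
baseline = LinearRegression().fit(X_train_s, y_train)
print(f"Baseline R^2: {baseline.score(X_test_s, y_test):.3f}")

# Step 6: an alternative model tuned via grid search
grid = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train_s, y_train)
print(f"Tuned Ridge R^2: {grid.score(X_test_s, y_test):.3f}")
```

The magic of the Code Interpreter is that it writes, runs, and debugs boilerplate like this for you from a plain-English request.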

Even as an experienced data scientist, Jon estimates that in many everyday situations the Code Interpreter could decrease his development time by a staggering 90% or more.

The big caveat with all of this is whether you’re comfortable sharing your code with OpenAI. Jon wouldn’t provide proprietary company code to it without clearing that with his firm first, and if you do use proprietary code with it, turn “Chat history & training” off in your ChatGPT Plus settings. To sidestep the data-privacy issue entirely, you could alternatively run Meta’s newly released “Code Llama — Instruct 34B” Large Language Model on your own infrastructure. Code Llama won’t, however, match the Code Interpreter in many circumstances, and it will require some technical savvy to get up and running.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.


Getting Value From A.I.

In February 2023, our Chief Data Scientist, Jon Krohn, delivered this keynote on “Getting Value from A.I.” to open the second day of Hg Capital’s “Digital Forum” in London.

read full post

The Chinchilla Scaling Laws

The Chinchilla Scaling Laws dictate the amount of training data needed to optimally train a Large Language Model (LLM) of a given size. For Five-Minute Friday, our Chief Data Scientist, Jon Krohn, covers this ratio and the LLMs that have arisen from it.
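As a quick illustration of that ratio: the Chinchilla result works out to roughly 20 training tokens per model parameter for compute-optimal training. The sketch below applies this rule of thumb to two illustrative model sizes (the sizes are examples, not figures from the episode).

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20) -> float:
    """Approximate compute-optimal training-token count for an LLM
    of n_params parameters, per the ~20-tokens-per-parameter heuristic."""
    return n_params * tokens_per_param

# Illustrative model sizes: 7B and 70B parameters
for n_params in (7e9, 70e9):
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{n_params / 1e9:.0f}B params -> ~{tokens / 1e12:.1f}T tokens")
```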

read full post

StableLM: Open-Source “ChatGPT”-Like LLMs You Can Fit on One GPU

The folks who open-sourced Stable Diffusion have now released “StableLM”, their first Language Models. Pre-trained on an unprecedented amount of data for single-GPU LLMs (1.5 trillion tokens!), these are small but mighty.

read full post