If you’ve been using fine-tuned open-source LLMs (e.g. for generative A.I. functionality or natural-language conversations with your users), it’s very likely time you switch your starting model over to Llama 2.
Here’s why:
• It’s open-source and, unlike the original LLaMA, can be used commercially.
• Like the Alpaca and Vicuña models that used LLaMA 1 as pretrained starting point, the “Llama 2-chat” variants are fine-tuned for chat applications (using a data set of over 1 million human annotations).
• For both pre-trained and chat-fine-tuned variants, the Llama 2 model family has four sizes: 7 billion, 13 billion (fits on a single GPU), 34 billion (not released publicly) and 70 billion model parameters (best performance on NLG benchmark tasks).
• The 70B chat-fine-tuned variant offers ChatGPT-level performance on a broad range of natural-language benchmarks (it’s the first open-source model to do this convincingly; you can experience this yourself via the free Hugging Face chat interface where Llama-2-70B-chat has become the default) and is generally now the leading open-source LLM.
• See the Llama 2 page for a table of details across 11 external benchmarks, which (according to Meta themselves so perhaps take with a grain of salt) shows how 13B Llama 2 is comparable to 40B Falcon, the previous top-ranked open-source LLM across a range of benchmarks. The 70B Llama 2 sets the new state of the art, on some benchmarks by a considerable margin (N.B.: on tasks involving code or math, Llama 2 is not necessarily the best open-source option out there, however.)
• Time awareness: “Is earth flat or round?” in 2023 versus “in 800 CE context” relates to different answers.
• Has double the context window (4k tokens) of the original LLaMA, which is a big jump from about eight pages to 16 pages of context.
• Uses a two-stage RLHF (reinforcement learning from human feedback) approach that is key to its outstanding generative capacity.
• A new method called “Ghost Attention” (GAtt) allows it to perform especially well in “multi-turn” (ongoing back and forth) conversation.
• Extensive safety and alignment testing (probably more extensive than any other open-source LLM), including (again, Meta self-reported) charts from the Llama 2 technical paper showing A.I. safety violation percentages far below any other open-source LLM and even better than ChatGPT. (The exception being the 34B Llama 2 model, which perhaps explains why this is the only Llama 2 model size that Meta didn’t release publicly.)
Like Hugging Face, at my company Nebula.io we’ve switched to Llama 2 as the starting point for our task-specific fine-tuning and have been blown away.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.