This lesson covers how to prepare high-quality data for fine-tuning a model. OpenAI recommends providing at least a few hundred high-quality examples, ideally vetted by human experts, and notes that adding more examples is usually the most reliable way to improve performance. Whether the source is support tickets or customer service emails, gathering relevant data and having humans verify its accuracy and relevance is crucial. OpenAI expects the training data in JSON-L format, where each line is valid JSON in its own right and ends with a newline character. While JSON-L formatting may seem tedious, OpenAI offers a CLI data preparation tool that simplifies the process and ensures proper formatting.
The lesson also explains the separator that marks where each prompt ends, the whitespace that starts each completion, and the stop sequence that marks where each completion ends. Tom emphasizes that OpenAI's CLI tool handles the conversion of data into the required format, so learners can start from simpler CSV data when preparing their datasets for fine-tuning.
[00:00] Okay, so we need some data to fine-tune our model, so let's deal with that next. And the first question is, how much data do you need to fine-tune a model? Well, the advice from OpenAI is you should provide at least a few hundred high-quality examples
[00:17] ideally vetted by human experts. And what's more, OpenAI says increasing the number of examples is usually the best and most reliable way of improving performance. So if you were setting this up in the real world, you would basically grab all of your company's support tickets and customer service emails
[00:36] and anything else relevant that you could lay your hands on. And then, as OpenAI recommends, you would have a human check it for accuracy and relevance. Now, for the app we're building today, we're going to use a relatively small amount of data, but the principle is exactly the same. That is what I want to show you, the principle of how to fine-tune a model.
[00:55] Now, how we format that data is really important. OpenAI wants the data to be in JSON-L format, and the docs give us an example. Now, JSON-L is data formatted with JSON on each line.
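As a rough illustration of what JSON-L looks like in practice, here is a minimal Python sketch that writes records as one JSON object per line and checks that a piece of text meets the two rules (every line is valid JSON, every line ends with a newline). The `prompt` and `completion` key names and the sample records are assumptions made for illustration, not data from the lesson.

```python
import json


def to_jsonl(records):
    """Serialize dicts as JSON-L: one JSON object per line, each ending in a newline."""
    return "".join(json.dumps(record) + "\n" for record in records)


def is_valid_jsonl(text):
    """Return True if text ends with a newline and every line parses as JSON."""
    if not text.endswith("\n"):
        return False
    # Drop the empty string that follows the final newline.
    for line in text.split("\n")[:-1]:
        try:
            json.loads(line)
        except json.JSONDecodeError:
            return False
    return True


# Hypothetical example records; the key names are an assumption.
records = [
    {"prompt": "Where is my order?", "completion": "Let me check that for you."},
    {"prompt": "How do I reset my password?", "completion": "Use the reset link on the login page."},
]
jsonl_text = to_jsonl(records)
```

Note that the file as a whole is not a JSON array. Each line stands alone, which is why a blank line or a missing final newline counts as invalid in this check.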
[01:09] Each line has to be valid JSON in its own right, and each line must end with a newline character. Now, if you haven't used JSON-L before, don't worry. We won't be writing it ourselves. We'll be using a special tool to create it. We'll actually be working with much simpler CSV or comma-separated values data,
[01:28] and we'll let OpenAI's tool do the heavy lifting. Now, as well as wanting JSON-L format, OpenAI gives us some further criteria for the format of our data. It wants each prompt to end with a separator to show where the prompt ends and the completion begins.
[01:46] It wants each completion to start with a whitespace character, and it wants each completion to end with a stop sequence to inform the model where the completion ends. Now, the stop sequence is something that we haven't needed yet, but we will talk about it when we get to that point in the project, and everything will become clear.
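The three formatting rules just described (a separator at the end of each prompt, a leading whitespace on each completion, and a stop sequence at the end of each completion) can be sketched in Python as below. The separator and stop sequence values here are placeholders chosen purely for illustration; the lesson does not specify which values to use, and `format_example` is a hypothetical helper, not part of any OpenAI tool.

```python
# Placeholder values, assumed for illustration only.
SEPARATOR = "\n\n###\n\n"
STOP_SEQUENCE = " END"


def format_example(prompt, completion, separator=SEPARATOR, stop=STOP_SEQUENCE):
    """Apply the three formatting rules to one prompt/completion pair."""
    prompt = prompt.strip()
    completion = completion.strip()
    return {
        # Rule 1: the prompt ends with a separator.
        "prompt": prompt + separator,
        # Rule 2: the completion starts with a whitespace character.
        # Rule 3: the completion ends with a stop sequence.
        "completion": " " + completion + stop,
    }


example = format_example("Where is my order?", "Let me check that for you.")
```

Whatever values are chosen, the same separator and stop sequence need to be applied consistently across every example so the model sees one clear pattern.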
[02:04] To be honest, all of this sounds like a bit of a pain, and actually, in their example data, they didn't even show us exactly what they wanted with the stop sequences and the separator, so we don't even have that to help us. But, good news is right here.
[02:20] "You can use our CLI data preparation tool to easily convert your data into this format." So, we are going to let that tool do everything for us. So, don't worry at all if that looks intimidating. Okay, in the next scrim, let's take a look at the data we'll be using.
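To make the CSV-to-JSON-L conversion less mysterious, here is a minimal Python sketch of the general idea. This is not how OpenAI's CLI tool is implemented, and it does not add separators or stop sequences; it only shows rows of CSV becoming one JSON object per line. The `prompt` and `completion` column names and the sample rows are assumptions for illustration.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text):
    """Convert CSV text with 'prompt' and 'completion' columns into JSON-L text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        record = {"prompt": row["prompt"], "completion": row["completion"]}
        lines.append(json.dumps(record))
    # Every line, including the last, ends with a newline character.
    return "".join(line + "\n" for line in lines)


# Hypothetical sample data; the column names are an assumption.
sample_csv = (
    "prompt,completion\n"
    '"Where is my order?","Let me check that for you."\n'
    '"Do you ship abroad?","Yes, to most countries."\n'
)
converted = csv_to_jsonl(sample_csv)
```

Using `csv.DictReader` rather than splitting on commas by hand matters here, because real support messages often contain commas and quotes inside a single field.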