In a recent update, OpenAI introduced fine-tuning for GPT-3.5 Turbo, which lets us train the model on our own data. Fine-tuning is a powerful tool for adapting a GPT model to a specific domain or task. This article walks through the steps from the official OpenAI documentation for fine-tuning gpt-3.5-turbo.
Preparing the Data
Before starting a fine-tuning run, you first need to prepare your data. Each line of the training file is a JSON object in the following format:
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
Each example contains a list of chat messages. Every message has a "role" field indicating the speaker (system, user, or assistant) and a "content" field holding the message text.
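If you are assembling the training file in Python, each example can be serialized as one JSON object per line. A minimal sketch (the `examples` list and the `mydata.jsonl` filename are illustrative):

```python
import json

# Illustrative training examples in the chat format shown above
examples = [
    {"messages": [
        {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."},
    ]},
]

# Write one JSON object per line (the JSONL format expected for fine-tuning)
with open("mydata.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

`ensure_ascii=False` keeps any non-ASCII text readable in the file; the API accepts either form.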
Formatting and Validating the Data
After loading the data, we need to format and validate it to make sure it matches the structure expected by the Chat Completions API. Here is a Python script for loading and inspecting the data:
import json
from collections import defaultdict

# Next, we specify the data path and open the JSONL file
data_path = "<YOUR_JSON_FILE_HERE>"

# Load dataset
with open(data_path) as f:
    dataset = [json.loads(line) for line in f]

# We can inspect the data quickly by checking the number of examples and the first item
# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

# The data formatting and validation code is omitted here; it checks each
# example for structural problems and tallies them in `format_errors`
# (a defaultdict(int)).

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")
This code checks whether your data meets the requirements and prints any formatting errors it finds.
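The elided validation step can be sketched as follows. This is a simplified version covering the structural basics; the error keys and exact checks here are illustrative, not the official validator:

```python
from collections import defaultdict

def validate_dataset(dataset):
    # Tally each kind of structural problem found in the examples
    format_errors = defaultdict(int)
    for ex in dataset:
        if not isinstance(ex, dict):
            format_errors["data_type"] += 1
            continue
        messages = ex.get("messages")
        if not messages:
            format_errors["missing_messages_list"] += 1
            continue
        for message in messages:
            if "role" not in message or "content" not in message:
                format_errors["message_missing_key"] += 1
            if message.get("role") not in ("system", "user", "assistant"):
                format_errors["unrecognized_role"] += 1
            if not isinstance(message.get("content"), str):
                format_errors["missing_content"] += 1
        if not any(m.get("role") == "assistant" for m in messages):
            format_errors["example_missing_assistant_message"] += 1
    return format_errors
```

Every training example needs at least one assistant message, since those are the completions the model learns to produce.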
Checking Data Length
Before fine-tuning, you also need to check whether any example exceeds the 4,096-token limit. The following code counts the tokens in each conversation:
import tiktoken
import numpy as np

# Token counting functions
encoding = tiktoken.get_encoding("cl100k_base")
# ... (num_tokens_from_messages, num_assistant_tokens_from_messages,
#      and print_distribution are defined here)

# Last, we can look at the results of the different formatting operations
# before proceeding with creating a fine-tuning job:

# Warnings and token counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")

n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")
This tells you how many examples exceed the 4,096-token limit; any examples over the limit will be truncated during fine-tuning.
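The helper functions elided above can be sketched as follows. The per-message overhead constants match the cl100k_base chat format; `encode` is kept as a parameter here so the helpers do not hard-code tiktoken, and in the script above it would be bound to `encoding.encode`:

```python
def num_tokens_from_messages(messages, encode, tokens_per_message=3, tokens_per_name=1):
    # Count tokens in a conversation, including the fixed per-message
    # overhead of the chat format
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens

def num_assistant_tokens_from_messages(messages, encode):
    # Only assistant messages count toward the trained completion tokens
    return sum(len(encode(m["content"])) for m in messages if m["role"] == "assistant")

def print_distribution(values, name):
    # Summarize a list of per-example counts
    values = sorted(values)
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {values[0]}, {values[-1]}")
    print(f"mean / median: {sum(values) / len(values)}, {values[len(values) // 2]}")
```

These mirror the token-counting approach in OpenAI's cookbook; the exact overhead constants can differ between model versions, so treat the counts as estimates.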
Uploading the Data File
Once the data is validated, upload the file to the OpenAI platform so it can be used for fine-tuning. With the OpenAI SDK:
openai.File.create(
    file=open("mydata.jsonl", "rb"),
    purpose='fine-tune'
)
Creating a Fine-Tuning Job
Next, create the fine-tuning job with the OpenAI SDK. Example:
openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")
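Fine-tuning runs asynchronously, so you typically poll the job until it reaches a terminal status. A minimal polling helper might look like this; the `retrieve` callable is injected so the sketch does not hard-code the SDK, and with the SDK shown above you would pass `openai.FineTuningJob.retrieve`:

```python
import time

def wait_for_job(retrieve, job_id, poll_seconds=60):
    # Poll a fine-tuning job until it finishes and return the final job object
    while True:
        job = retrieve(job_id)
        if job["status"] in ("succeeded", "failed", "cancelled"):
            return job
        time.sleep(poll_seconds)
```

When the job succeeds, its `fine_tuned_model` field holds the model name to use for inference.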
Using the Fine-Tuned Model
Once fine-tuning is complete, you can use the resulting model for chat completions. Example:
completion = openai.ChatCompletion.create(
    model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)
print(completion.choices[0].message)
With this code you can interact with your fine-tuned model and generate custom conversational output.
Fine-Tuning Pricing
The cost of a fine-tuned model has two parts, the initial training cost and the usage cost:
- Training: $0.008 / 1K tokens
- Usage (input): $0.012 / 1K tokens
- Usage (output): $0.016 / 1K tokens
You can estimate the total cost from the size of your fine-tuning job and your expected usage. See OpenAI's pricing page for details.
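The prices above make for a quick back-of-the-envelope calculation. The helper below is illustrative; the rates are the USD-per-1K-token prices listed above, and training cost scales with the number of epochs:

```python
def estimate_cost_usd(training_tokens, n_epochs=3,
                      input_tokens=0, output_tokens=0,
                      train_rate=0.008, input_rate=0.012, output_rate=0.016):
    # All rates are USD per 1K tokens
    training = training_tokens / 1000 * train_rate * n_epochs
    usage = input_tokens / 1000 * input_rate + output_tokens / 1000 * output_rate
    return training + usage

# Example: a 100,000-token training file trained for 3 epochs
# costs 100 * 0.008 * 3 = $2.40 in training alone
print(f"${estimate_cost_usd(100_000, n_epochs=3):.2f}")
```

Training is billed per token per epoch, which is why large files or many epochs drive the cost up quickly.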
With this tutorial, you can start fine-tuning OpenAI's GPT-3.5 Turbo to fit your own needs and tasks. Good luck!