7.2. Fine-Tuning to follow instructions

The goal of this section is to show how to fine-tune an already pre-trained model so that it follows instructions rather than just generating text, for example by responding to tasks as a chatbot.

Dataset

To fine-tune an LLM to follow instructions, you need a dataset containing instructions and responses. There are different formats for training an LLM to follow instructions, for example:

Alpaca prompt style example:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Calculate the area of a circle with a radius of 5 units.

### Response:
The area of a circle is calculated using the formula \( A = \pi r^2 \). Plugging in the radius of 5 units:

\( A = \pi (5)^2 = \pi \times 25 = 25\pi \) square units.

Phi-3 prompt style example:

<|User|>
Can you explain what gravity is in simple terms?

<|Assistant|>
Absolutely! Gravity is a force that pulls objects toward each other.

By training the LLM on this kind of dataset instead of on raw text alone, the LLM learns that it needs to give specific responses to the questions it receives.

Therefore, one of the first things to do with a dataset that contains requests and answers is to model that data into the desired prompt format. For example:

# Code from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/ch07.ipynb
def format_input(entry):
    # Alpaca-style prompt: fixed preamble followed by the instruction
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    # The "### Input:" section is only added when the entry has a non-empty input
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

# `data` is the list of instruction entries loaded earlier in the notebook
model_input = format_input(data[50])

desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input + desired_response)
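The same idea carries over to other templates. As a rough sketch, a formatter for the Phi-3 style shown earlier might look like the following. The function names `format_input_phi3` and `format_example_phi3` are hypothetical, and the `<|User|>` and `<|Assistant|>` tags simply follow the example in this section rather than any official tokenizer template.

```python
def format_input_phi3(entry):
    """Build a Phi-3 style prompt (tags copied from the example in this section)."""
    user_text = entry["instruction"]
    # Assumption: an optional "input" field is appended to the user turn
    if entry.get("input"):
        user_text += f"\n{entry['input']}"
    return f"<|User|>\n{user_text}\n\n<|Assistant|>\n"


def format_example_phi3(entry):
    """Full training text: prompt followed by the desired response."""
    return format_input_phi3(entry) + entry["output"]


if __name__ == "__main__":
    sample = {
        "instruction": "Can you explain what gravity is in simple terms?",
        "input": "",
        "output": "Absolutely! Gravity is a force that pulls objects toward each other.",
    }
    print(format_example_phi3(sample))
```

Keeping the prompt builder separate from the response makes it easy to know how many tokens belong to the prompt, which is useful if the prompt is later masked out of the loss.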

Then, as always, the dataset needs to be split into training, validation and test sets.
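A minimal sketch of such a split is shown below. The 85% / 5% / 10% proportions, the optional shuffle and the function name `split_dataset` are assumptions made for illustration; pick whatever ratios suit the size of your dataset.

```python
import random


def split_dataset(data, train_frac=0.85, test_frac=0.10, seed=None):
    """Return (train, validation, test); validation gets whatever is left over."""
    items = list(data)
    if seed is not None:
        # Shuffle a copy so the caller's list is left untouched
        random.Random(seed).shuffle(items)
    train_end = int(len(items) * train_frac)
    test_end = train_end + int(len(items) * test_frac)
    train = items[:train_end]
    test = items[train_end:test_end]
    val = items[test_end:]
    return train, val, test


if __name__ == "__main__":
    entries = [{"instruction": f"task {i}", "input": "", "output": ""} for i in range(100)]
    train_data, val_data, test_data = split_dataset(entries, seed=123)
    print(len(train_data), len(val_data), len(test_data))
```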

Batching & Data Loaders

Next, all the inputs and expected outputs need to be batched for training. This requires the following steps:

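Tokenize the texts
Pad all the samples to the same length (usually the length is as large as the context length used to pre-train the LLM)
Create the expected tokens by shifting the input by 1 in a custom collate function
Replace some padding tokens with -100 to exclude them from the training loss: after the first endoftext token, replace all the other endoftext tokens with -100 (using cross_entropy(..., ignore_index=-100) means that targets equal to -100 are ignored)
[Optional] Also mask with -100 all the tokens belonging to the question, so the LLM only learns how to generate the answer. In the Alpaca style this means masking everything up to ### Response: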
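The steps above can be sketched as a small collate function. This version works on plain Python lists of token IDs so it runs without PyTorch; in a real training setup the returned lists would be converted to tensors. The name `collate_batch`, the `prompt_lengths` argument and the use of token ID 50256 (GPT-2's endoftext token) as padding are assumptions for illustration.

```python
IGNORE_INDEX = -100   # default ignore_index of PyTorch's cross_entropy
PAD_TOKEN_ID = 50256  # assumption: GPT-2 <|endoftext|> ID reused as padding


def collate_batch(batch, prompt_lengths=None, pad_token_id=PAD_TOKEN_ID,
                  ignore_index=IGNORE_INDEX, max_length=None):
    """Pad, shift by one and mask a batch of already tokenized samples."""
    # +1 so every sample keeps one end-of-text token after the shift
    batch_max = max(len(item) for item in batch) + 1
    inputs_batch, targets_batch = [], []
    for i, item in enumerate(batch):
        padded = list(item) + [pad_token_id] * (batch_max - len(item))
        inputs = padded[:-1]   # everything except the last token
        targets = padded[1:]   # same sequence shifted left by one
        # Keep the first padding token as a target, mask all the later ones
        for j in range(len(item), len(targets)):
            targets[j] = ignore_index
        # Optional: mask the targets that still belong to the prompt
        if prompt_lengths is not None:
            for j in range(min(prompt_lengths[i] - 1, len(targets))):
                targets[j] = ignore_index
        # Optional: truncate to the model's context length
        if max_length is not None:
            inputs, targets = inputs[:max_length], targets[:max_length]
        inputs_batch.append(inputs)
        targets_batch.append(targets)
    return inputs_batch, targets_batch


if __name__ == "__main__":
    inputs, targets = collate_batch([[1, 2, 3], [4, 5, 6, 7, 8]], pad_token_id=0)
    print(inputs)
    print(targets)
```

Masking by position, rather than by searching for the padding ID, keeps the sketch correct even if a sample happens to contain the padding token itself.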

Once this is in place, it's time to create the data loaders for each dataset (training, validation and test).
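Before feeding these batches to a training loop, it can help to see what the -100 targets actually do to the loss. The following is a pure-Python sketch of a mean cross-entropy that skips ignored targets, written only to illustrate the idea; it is not PyTorch's implementation, and returning 0.0 when every target is ignored is a choice made for this sketch.

```python
import math

IGNORE_INDEX = -100


def cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over the targets that are not ignored."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # this position contributes nothing to the loss
        # log-sum-exp with the max subtracted for numerical stability
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[target]
        count += 1
    # Choice made for this sketch: 0.0 when every target is ignored
    return total / count if count else 0.0


if __name__ == "__main__":
    logits = [[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]]
    two_rows = cross_entropy(logits[:2], [0, 1])
    third_ignored = cross_entropy(logits, [0, 1, IGNORE_INDEX])
    print(two_rows, third_ignored)
```

The two printed losses are equal: the third position is averaged out entirely, which is why masked padding and masked prompt tokens do not influence the gradient.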

Load pre-trained LLM & Fine tune & Loss Checking

To fine-tune an LLM, a pre-trained one must be loaded first. This was already discussed in other pages. After that, the training function used previously can be reused to fine-tune the LLM.

During training it's also possible to track how the training loss and the validation loss vary across epochs, to check whether the loss is decreasing and whether overfitting is occurring. Remember that overfitting occurs when the training loss keeps decreasing while the validation loss stops decreasing or even increases. The simplest way to avoid this is to stop training at the epoch where this behaviour starts.
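As a rough sketch of that stopping rule, the helper below scans a list of per-epoch validation losses and reports the best epoch and the epoch at which training would be stopped. The function name `find_stop_epoch` and the `patience` and `min_delta` parameters are assumptions added for illustration; with `patience=1` it reduces to stopping as soon as the validation loss fails to improve.

```python
def find_stop_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch); stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Validation loss improved: remember it and reset the counter
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, None


if __name__ == "__main__":
    # Made-up losses: training keeps improving while validation turns around
    train_losses = [2.1, 1.6, 1.2, 0.9, 0.7, 0.5]
    val_losses = [2.0, 1.5, 1.2, 1.25, 1.3, 1.4]
    best, stop = find_stop_epoch(val_losses, patience=2)
    print(f"best epoch: {best}, stop at epoch: {stop}")
```

In practice the weights from the best epoch would be the ones to keep, which requires saving a checkpoint whenever the validation loss improves.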

Response Quality

Because this is not a classification fine-tuning, where loss variations can be trusted more, it's also important to check the quality of the responses on the test set. It's therefore recommended to gather the generated responses for the whole test set and check their quality manually to see if there are wrong answers. Note that the LLM can produce the correct format and syntax for a response and still give a completely wrong answer, and the loss won't reflect this behaviour. It's also possible to perform this review by passing the generated responses and the expected responses to another LLM and asking it to evaluate them.
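The LLM-based review can be sketched as two small pieces: building the judging prompt and parsing the numeric score from the judge's reply. The prompt wording, the 0 to 100 scale and the function names below are assumptions for illustration, and the call to the judge model itself is deliberately left out because it depends on the API or local runtime you use.

```python
import re


def build_judge_prompt(instruction, expected, generated):
    """Prompt asking a judge LLM to score one generated response."""
    return (
        f"Given the input `{instruction}` and the correct output `{expected}`, "
        f"score the model response `{generated}` on a scale from 0 to 100, "
        f"where 100 is the best score. Respond with the integer number only."
    )


def parse_score(reply):
    """Extract the first integer from the reply; None if missing or out of range."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    score = int(match.group())
    return score if 0 <= score <= 100 else None


def average_score(replies):
    """Average the valid scores, silently skipping replies that cannot be parsed."""
    scores = [s for s in map(parse_score, replies) if s is not None]
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    prompt = build_judge_prompt(
        "Calculate the area of a circle with a radius of 5 units.",
        "25\u03c0 square units.",
        "The area is 10\u03c0 square units.",
    )
    print(prompt)
    # Hypothetical judge replies, hard-coded here instead of calling a model
    print(average_score(["80", "Score: 60", "not sure"]))
```

Judge models do not always follow the "integer only" request, so counting how many replies failed to parse is a sensible sanity check alongside the average.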

Other tests to run to verify the quality of the responses:

Measuring Massive Multitask Language Understanding (MMLU): MMLU evaluates a model's knowledge and problem-solving abilities across 57 subjects, including humanities, sciences, and more. It uses multiple-choice questions to assess understanding at various difficulty levels, from elementary to advanced professional.
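LMSYS Chatbot Arena: This platform lets users compare the responses of different chatbots side by side. Users enter a prompt and directly compare the responses generated by multiple chatbots.
AlpacaEval: AlpacaEval is an automated evaluation framework in which an advanced LLM such as GPT-4 evaluates the responses of other models to a variety of prompts.
General Language Understanding Evaluation (GLUE): GLUE is a collection of nine natural language understanding tasks, including sentiment analysis, textual entailment and question answering.
SuperGLUE: Building on GLUE, SuperGLUE includes more challenging tasks designed to be difficult for current models.
Beyond the Imitation Game Benchmark (BIG-bench): BIG-bench is a large-scale benchmark with over 200 tasks that test a model's abilities in areas such as reasoning, translation and question answering.
Holistic Evaluation of Language Models (HELM): HELM provides a comprehensive evaluation across a range of metrics such as accuracy, robustness and fairness.
OpenAI Evals: An open-source evaluation framework by OpenAI that allows AI models to be tested on custom and standardized tasks.
HumanEval: A collection of programming problems used to evaluate the code generation abilities of language models.
Stanford Question Answering Dataset (SQuAD): SQuAD consists of questions about Wikipedia articles, where the model must comprehend the text to answer accurately.
TriviaQA: A large-scale dataset of trivia questions and answers that also includes evidence documents.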

And many more.

Follow instructions fine-tuning code

You can find an example of the code to perform this fine tuning in https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/gpt_instruction_finetuning.py

References

https://www.manning.com/books/build-a-large-language-model-from-scratch

