7.2. Fine-Tuning to follow instructions
The goal of this section is to show how to fine-tune an already pre-trained model to follow instructions rather than just generate text, for example, responding to tasks as a chatbot.
Dataset
To fine-tune an LLM to follow instructions, you need a dataset of instructions and responses. There are several formats for training an LLM to follow instructions, for example the Apply Alpaca prompt style and the Phi-3 prompt style.
Training an LLM on this kind of dataset instead of raw text helps the LLM understand that it needs to give specific responses to the questions it receives.
Therefore, one of the first things to do with a dataset that contains requests and answers is to convert that data into the desired prompt format.
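The conversion into a prompt format can be sketched as follows. This is a minimal illustration, not a definitive implementation: the entry keys `instruction`, `input` and `output`, the helper names, and the simplified Phi-3-like template are assumptions made for the example.

```python
def format_alpaca(entry):
    """Build an Alpaca-style training text from one dataset entry."""
    text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # The input section is only added when the entry has extra context
    if entry.get("input"):
        text += f"\n\n### Input:\n{entry['input']}"
    return text + f"\n\n### Response:\n{entry['output']}"


def format_phi3(entry):
    """Build a simplified Phi-3-like training text (assumed template)."""
    user_text = entry["instruction"]
    if entry.get("input"):
        user_text += "\n" + entry["input"]
    return f"<|user|>\n{user_text}\n\n<|assistant|>\n{entry['output']}"


# Hypothetical dataset entry used only for illustration
example_entry = {
    "instruction": "Identify the correct spelling of the following word.",
    "input": "Ocassion",
    "output": "The correct spelling is 'Occasion'.",
}

if __name__ == "__main__":
    print(format_alpaca(example_entry))
    print("-" * 40)
    print(format_phi3(example_entry))
```

Keeping the formatting in one function per style makes it easy to apply the same template at training time and at inference time, where the response part is left for the model to generate.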
Then, as always, the dataset needs to be split into training, validation and test sets.
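A minimal sketch of such a split is shown below. The 85% / 10% / 5% proportions, the fixed seed and the function name `split_dataset` are assumptions chosen for illustration; other ratios are equally valid.

```python
import random


def split_dataset(data, train_frac=0.85, test_frac=0.10, seed=123):
    """Shuffle the entries and split them into train, validation and test lists."""
    data = list(data)                    # copy so the caller's list is not modified
    random.Random(seed).shuffle(data)    # seeded shuffle keeps the split reproducible
    n_train = int(len(data) * train_frac)
    n_test = int(len(data) * test_frac)
    train = data[:n_train]
    test = data[n_train:n_train + n_test]
    val = data[n_train + n_test:]        # whatever remains is the validation set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_dataset(range(100))
    print(len(train), len(val), len(test))
```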
Batching & Data Loaders
Next, all the inputs and expected outputs need to be batched for training. This requires the following steps:
Tokenize the texts
Pad all the samples to the same length (usually the length will be as big as the context length used to pre-train the LLM)
Create the expected tokens by shifting the input by one position in a custom collate function
Replace some padding tokens with -100 to exclude them from the training loss: after the first endoftext token, replace all the other endoftext tokens with -100 (because cross_entropy(..., ignore_index=-100) ignores targets with the value -100)
[Optional] Also mask with -100 all the tokens belonging to the question so the LLM learns only how to generate the answer. In the Apply Alpaca style this means masking everything up to ### Response:
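The padding, shifting and masking steps above can be sketched as a collate function. This is a simplified version under stated assumptions: it works on plain Python lists of already tokenized samples, pads to the longest sample in the batch rather than to the full context length, and uses 50256 (the GPT-2 endoftext id) as the padding token. The names `collate_batch` and `prompt_lengths` are hypothetical.

```python
IGNORE_INDEX = -100   # targets with this value are skipped by cross_entropy(..., ignore_index=-100)
PAD_TOKEN_ID = 50256  # assumption: GPT-2 endoftext token id, reused for padding


def collate_batch(batch, prompt_lengths=None, pad_token_id=PAD_TOKEN_ID,
                  ignore_index=IGNORE_INDEX):
    """Pad token-id lists, build targets shifted by one, and mask them."""
    # +1 leaves room for one endoftext token after the longest sample
    padded_len = max(len(ids) for ids in batch) + 1
    inputs_batch, targets_batch = [], []
    for i, ids in enumerate(batch):
        padded = list(ids) + [pad_token_id] * (padded_len - len(ids))
        inputs = padded[:-1]   # everything except the last token
        targets = padded[1:]   # the same sequence shifted by one position
        # Keep the first endoftext token as a real target, ignore the rest
        for j in range(len(ids), len(targets)):
            targets[j] = ignore_index
        # Optional: also ignore the targets that belong to the prompt
        if prompt_lengths is not None:
            for j in range(max(prompt_lengths[i] - 1, 0)):
                targets[j] = ignore_index
        inputs_batch.append(inputs)
        targets_batch.append(targets)
    return inputs_batch, targets_batch


if __name__ == "__main__":
    inputs, targets = collate_batch([[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]])
    print(inputs)
    print(targets)
```

In a PyTorch setup, the two returned lists would typically be converted to tensors and the function passed to the data loader through its collate_fn argument.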
With this in place, it's time to create the data loaders for each dataset (training, validation and test).
Load pre-trained LLM & Fine tune & Loss Checking
A pre-trained LLM needs to be loaded in order to fine-tune it. This was already discussed in other pages. Then, the training function used previously can be reused to fine-tune the LLM.
During training, it's also possible to track how the training loss and validation loss vary across epochs to see whether the loss is decreasing and whether overfitting is occurring. Remember that overfitting occurs when the training loss keeps decreasing while the validation loss stops decreasing or even increases. To avoid this, the simplest approach is to stop the training at the epoch where this behaviour starts.
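One way to pick that epoch from recorded validation losses is sketched below. The `patience` and `min_delta` parameters and the function name are assumptions for illustration, and the loss values in the usage example are made up.

```python
def find_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the index of the best validation epoch seen before patience runs out."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Validation loss improved: remember this epoch
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: treat it as overfitting
            break
    return best_epoch


if __name__ == "__main__":
    # Made-up validation losses that start rising after epoch 3
    val_losses = [2.10, 1.60, 1.30, 1.25, 1.30, 1.40, 1.50]
    print(find_stopping_epoch(val_losses))
```

A checkpoint saved at the returned epoch would then be the one to keep.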
Response Quality
Because this is not a classification fine-tune, where loss variations can be trusted more, it's also important to check the quality of the responses on the test set. It's therefore recommended to collect the generated responses for the whole test set and check their quality manually to see whether there are wrong answers (note that the LLM can produce the correct format and syntax for a response and still give a completely wrong answer; the loss variation won't reflect this behaviour). This review can also be performed by passing the generated responses and the expected responses to another LLM and asking it to evaluate them.
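The LLM-based review can be sketched as a prompt builder plus a score parser. The prompt wording, the 0 to 100 scale and all function names are assumptions for illustration; the call to the judging LLM itself is left out because it depends on the service used.

```python
import re


def build_judge_prompt(instruction, expected, generated):
    """Compose the text sent to the judging LLM for one test entry."""
    return (
        f"Given the input `{instruction}` and the correct output `{expected}`, "
        f"score the model response `{generated}` on a scale from 0 to 100, "
        "where 100 is the best score. Respond with the integer number only."
    )


def parse_score(reply):
    """Extract the first integer from the judge's reply, or None if unusable."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    score = int(match.group())
    return score if 0 <= score <= 100 else None


def average_score(replies):
    """Average the valid scores, skipping replies that could not be parsed."""
    scores = [s for s in map(parse_score, replies) if s is not None]
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    print(build_judge_prompt("What is 2 + 2?", "4", "The answer is 4."))
    print(average_score(["80", "60", "n/a"]))
```

An automatic score like this is a complement to manual review, since the judging model can itself be wrong.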
Other tests to run to verify the quality of the responses:
Measuring Massive Multitask Language Understanding (MMLU): MMLU evaluates a model's knowledge and problem-solving abilities across 57 subjects, including humanities, sciences, and more. It uses multiple-choice questions to assess understanding at various difficulty levels, from elementary to advanced professional.
LMSYS Chatbot Arena: This platform allows users to compare responses from different chatbots side by side. Users input a prompt, and multiple chatbots generate responses that can be directly compared.
AlpacaEval: AlpacaEval is an automated evaluation framework where an advanced LLM like GPT-4 assesses the responses of other models to various prompts.
General Language Understanding Evaluation (GLUE): GLUE is a collection of nine natural language understanding tasks, including sentiment analysis, textual entailment, and question answering.
SuperGLUE: Building upon GLUE, SuperGLUE includes more challenging tasks designed to be difficult for current models.
Beyond the Imitation Game Benchmark (BIG-bench): BIG-bench is a large-scale benchmark with over 200 tasks that test a model's abilities in areas like reasoning, translation, and question answering.
Holistic Evaluation of Language Models (HELM): HELM provides a comprehensive evaluation across various metrics like accuracy, robustness, and fairness.
OpenAI Evals: An open-source evaluation framework by OpenAI that allows for the testing of AI models on custom and standardized tasks.
HumanEval: A collection of programming problems used to evaluate code generation abilities of language models.
Stanford Question Answering Dataset (SQuAD): SQuAD consists of questions about Wikipedia articles, where models must comprehend the text to answer accurately.
TriviaQA: A large-scale dataset of trivia questions and answers, along with evidence documents.
And many more
Follow instructions fine-tuning code
You can find an example of the code to perform this fine-tuning at https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/gpt_instruction_finetuning.py