To train a model, that model needs to be able to generate new tokens. The generated tokens are then compared with the expected ones in order to teach the model which tokens it should generate.
As a token-prediction function was already implemented in the previous examples, it's possible to reuse it for this purpose.
The goal of this sixth phase is very simple: train the model from scratch. For this, the previous LLM architecture will be used, with some loops going over the datasets, using the defined loss function and optimizer to train all the parameters of the model.
Text Evaluation
In order to perform a correct training, it's necessary to measure the predictions obtained for the expected token. The goal of the training is to maximize the likelihood of the correct token, which involves increasing its probability relative to other tokens.
In order to maximize the probability of the correct token, the weights of the model must be modified so that its probability is maximised. The weight updates are done via backpropagation, which requires a loss function to minimize. In this case, the function will be based on the difference between the performed prediction and the desired one.
However, instead of working with the raw predictions, it works with the natural logarithm (base e) of the probabilities. So if the current predicted probability of the expected token is 7.4541e-05, its natural logarithm is approximately -9.5042.
Then, for an entry with a context length of 5 tokens, for example, the model needs to predict 5 tokens: the first 4 predictions target tokens that are already in the input (shifted by one position) and the fifth targets the new token. Therefore, each entry yields 5 predictions (even if the first 4 targets were part of the input, the model doesn't know this), 5 expected tokens, and therefore 5 probabilities to maximize.
Therefore, after taking the natural logarithm of each prediction, the average is calculated and the minus sign removed (this is called the cross entropy loss). That's the number to reduce as close to 0 as possible, because the natural logarithm of 1 is 0:
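A minimal sketch of that calculation (the probability values below are hypothetical, apart from the 7.4541e-05 mentioned above):

```python
import torch

# Hypothetical probabilities the model assigned to the 5 expected tokens of one entry
probas = torch.tensor([7.4541e-05, 3.1061e-05, 1.1563e-03, 4.7062e-04, 2.0531e-04])

log_probas = torch.log(probas)           # natural logarithm of each prediction
avg_log_probas = torch.mean(log_probas)  # average of the logarithms
cross_entropy = -avg_log_probas          # remove the minus sign -> cross entropy loss
print(cross_entropy)                     # the closer to 0, the better the predictions
```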
Another way to measure how good the model is, is called perplexity. Perplexity is a metric used to evaluate how well a probability model predicts a sample. In language modelling, it represents the model's uncertainty when predicting the next token in a sequence.
For example, a perplexity value of 48,725 means that, when it needs to predict a token, the model is as unsure as if it had to choose uniformly among 48,725 tokens of the vocabulary.
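Perplexity is simply the exponential of the cross entropy loss, so it can be derived from it directly (the loss value here is hypothetical):

```python
import torch

cross_entropy = torch.tensor(10.7940)  # hypothetical cross entropy loss
perplexity = torch.exp(cross_entropy)  # exponential of the loss
print(perplexity)                      # ~48725 -> as unsure as choosing among ~48,725 tokens
```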
Previous code used here but already explained in previous sections
"""This is code explained before so it won't be exaplained"""import tiktokenimport torchimport torch.nn as nnfrom torch.utils.data import Dataset, DataLoaderclassGPTDatasetV1(Dataset):def__init__(self,txt,tokenizer,max_length,stride): self.input_ids = [] self.target_ids = []# Tokenize the entire text token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})# Use a sliding window to chunk the book into overlapping sequences of max_lengthfor i inrange(0, len(token_ids) - max_length, stride): input_chunk = token_ids[i:i + max_length] target_chunk = token_ids[i +1: i + max_length +1] self.input_ids.append(torch.tensor(input_chunk)) self.target_ids.append(torch.tensor(target_chunk))def__len__(self):returnlen(self.input_ids)def__getitem__(self,idx):return self.input_ids[idx], self.target_ids[idx]defcreate_dataloader_v1(txt,batch_size=4,max_length=256,stride=128,shuffle=True,drop_last=True,num_workers=0):# Initialize the tokenizer tokenizer = tiktoken.get_encoding("gpt2")# Create dataset dataset =GPTDatasetV1(txt, tokenizer, max_length, stride)# Create dataloader dataloader =DataLoader( dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)return dataloaderclassMultiHeadAttention(nn.Module):def__init__(self,d_in,d_out,context_length,dropout,num_heads,qkv_bias=False):super().__init__()assert d_out % num_heads ==0,"d_out must be divisible by n_heads" self.d_out = d_out self.num_heads = num_heads self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) self.out_proj = nn.Linear(d_out, d_out)# Linear layer to combine head outputs self.dropout = nn.Dropout(dropout) self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))defforward(self,x): b, num_tokens, d_in = x.shape keys = self.W_key(x)# Shape: (b, num_tokens, d_out) queries = self.W_query(x) values = self.W_value(x)# We implicitly split the matrix by adding a `num_heads` dimension# Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim) keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) values = values.view(b, num_tokens, self.num_heads, self.head_dim) queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)# Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim) keys = keys.transpose(1, 2) queries = queries.transpose(1, 2) values = values.transpose(1, 2)# Compute scaled dot-product attention (aka self-attention) with a causal mask attn_scores = queries @ keys.transpose(2, 3)# Dot product for each head# Original mask truncated to the number of tokens and converted to boolean mask_bool = self.mask.bool()[:num_tokens,:num_tokens]# Use the mask to fill attention scores attn_scores.masked_fill_(mask_bool, -torch.inf) attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) attn_weights = self.dropout(attn_weights)# Shape: (b, num_tokens, num_heads, head_dim) context_vec = (attn_weights @ values).transpose(1, 2)# Combine heads, where self.d_out = self.num_heads * self.head_dim context_vec = context_vec.reshape(b, num_tokens, self.d_out) context_vec = self.out_proj(context_vec)# optional projectionreturn context_vecclassLayerNorm(nn.Module):def__init__(self,emb_dim):super().__init__() self.eps =1e-5 self.scale = nn.Parameter(torch.ones(emb_dim)) self.shift = 
nn.Parameter(torch.zeros(emb_dim))defforward(self,x): mean = x.mean(dim=-1, keepdim=True) var = x.var(dim=-1, keepdim=True, unbiased=False) norm_x = (x - mean) / torch.sqrt(var + self.eps)return self.scale * norm_x + self.shiftclassGELU(nn.Module):def__init__(self):super().__init__()defforward(self,x):return0.5* x * (1+ torch.tanh( torch.sqrt(torch.tensor(2.0/ torch.pi)) * (x +0.044715* torch.pow(x, 3)) ))classFeedForward(nn.Module):def__init__(self,cfg):super().__init__() self.layers = nn.Sequential( nn.Linear(cfg["emb_dim"], 4* cfg["emb_dim"]),GELU(), nn.Linear(4* cfg["emb_dim"], cfg["emb_dim"]), )defforward(self,x):return self.layers(x)classTransformerBlock(nn.Module):def__init__(self,cfg):super().__init__() self.att =MultiHeadAttention( d_in=cfg["emb_dim"], d_out=cfg["emb_dim"], context_length=cfg["context_length"], num_heads=cfg["n_heads"], dropout=cfg["drop_rate"], qkv_bias=cfg["qkv_bias"]) self.ff =FeedForward(cfg) self.norm1 =LayerNorm(cfg["emb_dim"]) self.norm2 =LayerNorm(cfg["emb_dim"]) self.drop_shortcut = nn.Dropout(cfg["drop_rate"])defforward(self,x):# Shortcut connection for attention block shortcut = x x = self.norm1(x) x = self.att(x)# Shape [batch_size, num_tokens, emb_size] x = self.drop_shortcut(x) x = x + shortcut # Add the original input back# Shortcut connection for feed-forward block shortcut = x x = self.norm2(x) x = self.ff(x) x = self.drop_shortcut(x) x = x + shortcut # Add the original input backreturn xclassGPTModel(nn.Module):def__init__(self,cfg):super().__init__() self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"]) self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"]) self.drop_emb = nn.Dropout(cfg["drop_rate"]) self.trf_blocks = nn.Sequential(*[TransformerBlock(cfg) for _ inrange(cfg["n_layers"])]) self.final_norm =LayerNorm(cfg["emb_dim"]) self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)defforward(self,in_idx): batch_size, seq_len = in_idx.shape tok_embeds = self.tok_emb(in_idx) pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device)) x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size] x = self.drop_emb(x) x = self.trf_blocks(x) x = self.final_norm(x) logits = self.out_head(x)return logits
# Download contents to train the data withimport osimport urllib.requestfile_path ="the-verdict.txt"url ="https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"ifnot os.path.exists(file_path):with urllib.request.urlopen(url)as response: text_data = response.read().decode('utf-8')withopen(file_path, "w", encoding="utf-8")as file: file.write(text_data)else:withopen(file_path, "r", encoding="utf-8")as file: text_data = file.read()total_characters =len(text_data)tokenizer = tiktoken.get_encoding("gpt2")total_tokens =len(tokenizer.encode(text_data))print("Data downloaded")print("Characters:", total_characters)print("Tokens:", total_tokens)# Model initializationGPT_CONFIG_124M ={"vocab_size":50257,# Vocabulary size"context_length":256,# Shortened context length (orig: 1024)"emb_dim":768,# Embedding dimension"n_heads":12,# Number of attention heads"n_layers":12,# Number of layers"drop_rate":0.1,# Dropout rate"qkv_bias":False# Query-key-value bias}torch.manual_seed(123)model =GPTModel(GPT_CONFIG_124M)model.eval()print ("Model initialized")# Functions to transform from tokens to ids and from to ids to tokensdeftext_to_token_ids(text,tokenizer): encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'}) encoded_tensor = torch.tensor(encoded).unsqueeze(0)# add batch dimensionreturn encoded_tensordeftoken_ids_to_text(token_ids,tokenizer): flat = token_ids.squeeze(0)# remove batch dimensionreturn tokenizer.decode(flat.tolist())# Define loss functionsdefcalc_loss_batch(input_batch,target_batch,model,device): input_batch, target_batch = input_batch.to(device), target_batch.to(device) logits =model(input_batch) loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())return lossdefcalc_loss_loader(data_loader,model,device,num_batches=None): total_loss =0.iflen(data_loader)==0:returnfloat("nan")elif num_batches isNone: num_batches =len(data_loader)else:# Reduce the number of batches to match the total number of batches in the data loader# if num_batches exceeds the number of batches in the data loader num_batches =min(num_batches, len(data_loader))for i, (input_batch, target_batch) inenumerate(data_loader):if i < num_batches: loss =calc_loss_batch(input_batch, target_batch, model, device) total_loss += loss.item()else:breakreturn total_loss / num_batches# Apply Train/validation ratio and create dataloaderstrain_ratio =0.90split_idx =int(train_ratio *len(text_data))train_data = text_data[:split_idx]val_data = text_data[split_idx:]torch.manual_seed(123)train_loader =create_dataloader_v1( train_data, batch_size=2, max_length=GPT_CONFIG_124M["context_length"], stride=GPT_CONFIG_124M["context_length"], drop_last=True, shuffle=True, num_workers=0)val_loader =create_dataloader_v1( val_data, batch_size=2, max_length=GPT_CONFIG_124M["context_length"], stride=GPT_CONFIG_124M["context_length"], drop_last=False, shuffle=False, num_workers=0)# Sanity checksif total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:print("Not enough tokens for the training loader. ""Try to lower the `GPT_CONFIG_124M['context_length']` or ""increase the `training_ratio`")if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:print("Not enough tokens for the validation loader. 
""Try to lower the `GPT_CONFIG_124M['context_length']` or ""decrease the `training_ratio`")print("Train loader:")for x, y in train_loader:print(x.shape, y.shape)print("\nValidation loader:")for x, y in val_loader:print(x.shape, y.shape)train_tokens =0for input_batch, target_batch in train_loader: train_tokens += input_batch.numel()val_tokens =0for input_batch, target_batch in val_loader: val_tokens += input_batch.numel()print("Training tokens:", train_tokens)print("Validation tokens:", val_tokens)print("All tokens:", train_tokens + val_tokens)# Indicate the device to useif torch.cuda.is_available(): device = torch.device("cuda")elif torch.backends.mps.is_available(): device = torch.device("mps")else: device = torch.device("cpu")print(f"Using {device} device.")model.to(device)# no assignment model = model.to(device) necessary for nn.Module classes# Pre-calculate losses without starting yettorch.manual_seed(123)# For reproducibility due to the shuffling in the data loaderwith torch.no_grad():# Disable gradient tracking for efficiency because we are not training, yet train_loss =calc_loss_loader(train_loader, model, device) val_loss =calc_loss_loader(val_loader, model, device)print("Training loss:", train_loss)print("Validation loss:", val_loss)# Functions to train the datadeftrain_model_simple(model,train_loader,val_loader,optimizer,device,num_epochs,eval_freq,eval_iter,start_context,tokenizer):# Initialize lists to track losses and tokens seen train_losses, val_losses, track_tokens_seen = [], [], [] tokens_seen, global_step =0,-1# Main training loopfor epoch inrange(num_epochs): model.train()# Set model to training modefor input_batch, target_batch in train_loader: optimizer.zero_grad()# Reset loss gradients from previous batch iteration loss =calc_loss_batch(input_batch, target_batch, model, device) loss.backward()# Calculate loss gradients optimizer.step()# Update model weights using loss gradients tokens_seen += input_batch.numel() global_step +=1# Optional evaluation stepif global_step % eval_freq ==0: train_loss, val_loss =evaluate_model( model, train_loader, val_loader, device, eval_iter) train_losses.append(train_loss) val_losses.append(val_loss) track_tokens_seen.append(tokens_seen)print(f"Ep {epoch+1} (Step {global_step:06d}): "f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")# Print a sample text after each epochgenerate_and_print_sample( model, tokenizer, device, start_context )return train_losses, val_losses, track_tokens_seendefevaluate_model(model,train_loader,val_loader,device,eval_iter): model.eval()with torch.no_grad(): train_loss =calc_loss_loader(train_loader, model, device, num_batches=eval_iter) val_loss =calc_loss_loader(val_loader, model, device, num_batches=eval_iter) model.train()return train_loss, val_lossdefgenerate_and_print_sample(model,tokenizer,device,start_context): model.eval() context_size = model.pos_emb.weight.shape[0] encoded =text_to_token_ids(start_context, tokenizer).to(device)with torch.no_grad(): token_ids =generate_text( model=model, idx=encoded, max_new_tokens=50, context_size=context_size ) decoded_text =token_ids_to_text(token_ids, tokenizer)print(decoded_text.replace("\n", " "))# Compact print format model.train()# Start training!import timestart_time = time.time()torch.manual_seed(123)model =GPTModel(GPT_CONFIG_124M)model.to(device)optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)num_epochs =10train_losses, val_losses, tokens_seen =train_model_simple( model, train_loader, val_loader, optimizer, device, 
num_epochs=num_epochs, eval_freq=5, eval_iter=5, start_context="Every effort moves you", tokenizer=tokenizer)end_time = time.time()execution_time_minutes = (end_time - start_time) /60print(f"Training completed in {execution_time_minutes:.2f} minutes.")# Show graphics with the training processimport matplotlib.pyplot as pltfrom matplotlib.ticker import MaxNLocatorimport mathdefplot_losses(epochs_seen,tokens_seen,train_losses,val_losses): fig, ax1 = plt.subplots(figsize=(5, 3)) ax1.plot(epochs_seen, train_losses, label="Training loss") ax1.plot( epochs_seen, val_losses, linestyle="-.", label="Validation loss" ) ax1.set_xlabel("Epochs") ax1.set_ylabel("Loss") ax1.legend(loc="upper right") ax1.xaxis.set_major_locator(MaxNLocator(integer=True)) ax2 = ax1.twiny() ax2.plot(tokens_seen, train_losses, alpha=0) ax2.set_xlabel("Tokens seen") fig.tight_layout() plt.show()# Compute perplexity from the loss values train_ppls = [math.exp(loss)for loss in train_losses] val_ppls = [math.exp(loss)for loss in val_losses]# Plot perplexity over tokens seen plt.figure() plt.plot(tokens_seen, train_ppls, label='Training Perplexity') plt.plot(tokens_seen, val_ppls, label='Validation Perplexity') plt.xlabel('Tokens Seen') plt.ylabel('Perplexity') plt.title('Perplexity over Training') plt.legend() plt.show()epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)torch.save({"model_state_dict": model.state_dict(),"optimizer_state_dict": optimizer.state_dict(), },"/tmp/model_and_optimizer.pth")
Let's see an explanation step by step
Functions to transform text <--> ids
These are some simple functions that can be used to transform texts from the vocabulary into ids and backwards. This is needed at the beginning of the handling of the text and at the end of the predictions:
```python
# Functions to transform from tokens to ids and from ids to tokens
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # remove batch dimension
    return tokenizer.decode(flat.tolist())
```
Generate text functions
In a previous section, a function just picked the most probable token after getting the logits. However, this means that for each input the same output will always be generated, which makes it very deterministic.
The following generate_text function will apply the top-k, temperature and multinomial concepts.
The top-k means that the logits of all the tokens except the top k tokens are reduced to -inf. So, if k=3, only the 3 most probable tokens will have a logit different from -inf before making a decision.
The temperature means that every logit is divided by the temperature value. A value of 0.1 will boost the highest probability compared with the lowest one, while a temperature of 5, for example, will make the distribution flatter. This helps add the variation in responses we would like the LLM to have.
After applying the temperature, a softmax function is applied again so that all the remaining tokens have a total probability of 1.
Finally, instead of choosing the token with the biggest probability, the multinomial function is applied to sample the next token according to the final probabilities. So if token 1 has a 70% probability, token 2 a 20% and token 3 a 10%, 70% of the time token 1 will be selected, 20% of the time it will be token 2 and 10% of the time it will be token 3.
```python
# Generate text function
def generate_text(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):

    # For-loop is the same as before: Get logits, and only focus on last time step
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]

        # New: Filter logits with top_k sampling
        if top_k is not None:
            # Keep only top_k values
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(logits < min_val, torch.tensor(float("-inf")).to(logits.device), logits)

        # New: Apply temperature scaling
        if temperature > 0.0:
            logits = logits / temperature

            # Apply softmax to get probabilities
            probs = torch.softmax(logits, dim=-1)  # (batch_size, context_len)

            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (batch_size, 1)

        # Otherwise same as before: get idx of the vocab entry with the highest logits value
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch_size, 1)

        if idx_next == eos_id:  # Stop generating early if end-of-sequence token is encountered and eos_id is specified
            break

        # Same as before: append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch_size, num_tokens+1)

    return idx
```
There is a common alternative to top-k called top-p, also known as nucleus sampling, which, instead of keeping the k most probable samples, sorts the whole resulting vocabulary by probability and sums the probabilities from the highest to the lowest until a threshold is reached.
Then, only those tokens of the vocabulary will be considered, sampled according to their relative probabilities.
This removes the need to select a number k of samples, as the optimal k might be different in each case; only a probability threshold is needed.
Note that this improvement isn't included in the previous code.
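Still, a minimal sketch of how top-p could be implemented, assuming logits of shape (batch_size, vocab_size); the function name sample_top_p and the default top_p=0.9 are just illustrative:

```python
import torch

def sample_top_p(logits, top_p=0.9):
    probs = torch.softmax(logits, dim=-1)
    # Sort tokens from the highest to the lowest probability
    sorted_probs, sorted_indices = torch.sort(probs, descending=True, dim=-1)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
    # Discard tokens once the cumulative probability (excluding the current token)
    # already exceeds the threshold, keeping the minimal "nucleus" that reaches top_p
    mask = cumulative_probs - sorted_probs > top_p
    sorted_probs[mask] = 0.0
    # Re-normalize the remaining probabilities so they sum to 1
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    # Sample among the kept tokens and map back to vocabulary ids
    next_sorted = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_indices.gather(-1, next_sorted)
```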
Another way to improve the generated text is by using Beam search instead of the greedy search used in this example.
Unlike greedy search, which selects the most probable next word at each step and builds a single sequence, beam search keeps track of the top k highest-scoring partial sequences (called "beams") at each step. By exploring multiple possibilities simultaneously, it balances efficiency and quality, increasing the chances of finding a better overall sequence that might be missed by the greedy approach due to early, suboptimal choices.
Note that this improvement isn't included in the previous code.
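For reference, a minimal sketch of beam search, assuming a model with the same interface as GPTModel (a batch of 1; the function name generate_beam and the default beam_width=3 are illustrative):

```python
import torch

def generate_beam(model, idx, max_new_tokens, context_size, beam_width=3):
    beams = [(idx, 0.0)]  # (sequence of shape (1, seq_len), cumulative log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            with torch.no_grad():
                logits = model(seq[:, -context_size:])[:, -1, :]
            log_probs = torch.log_softmax(logits, dim=-1)
            # Expand each beam with its beam_width most probable next tokens
            top_lp, top_ids = torch.topk(log_probs, beam_width)
            for lp, tok in zip(top_lp[0], top_ids[0]):
                new_seq = torch.cat((seq, tok.view(1, 1)), dim=1)
                candidates.append((new_seq, score + lp.item()))
        # Keep only the beam_width best partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # highest-scoring full sequence
```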
Loss functions
The calc_loss_batch function calculates the cross entropy of the prediction of a single batch.
The calc_loss_loader function gets the cross entropy of all the batches and calculates the average cross entropy.
```python
# Define loss functions
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss

def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches
```
Gradient clipping is a technique used to enhance training stability in large neural networks by setting a maximum threshold for gradient magnitudes. When gradients exceed this predefined max_norm, they are scaled down proportionally to ensure that updates to the model’s parameters remain within a manageable range, preventing issues like exploding gradients and ensuring more controlled and stable training.
Note that this improvement isn't included in the previous code.
Check the following example:
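This sketch shows where clipping would go in the training loop shown earlier (the max_norm value of 1.0 is a typical but hypothetical choice):

```python
import torch

# Inside the training loop, clipping goes between backward() and step():
for input_batch, target_batch in train_loader:
    optimizer.zero_grad()
    loss = calc_loss_batch(input_batch, target_batch, model, device)
    loss.backward()  # compute gradients
    # Scale gradients down proportionally if their global norm exceeds max_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()  # update weights using the clipped gradients
```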
Loading Data
The function create_dataloader_v1 and the GPTDatasetV1 class it uses were already discussed in a previous section.
From here, note how it's defined that 90% of the text is going to be used for training while the remaining 10% will be used for validation, and the two sets are stored in 2 different data loaders.
Note that sometimes part of the data set is also left out as a test set, to better evaluate the performance of the model.
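A minimal sketch of what such a three-way split could look like (the 80/10/10 ratios are just an example; the code in this section only uses train/validation):

```python
# Hypothetical 80/10/10 train/validation/test split of the raw text
train_ratio, val_ratio = 0.80, 0.10
split_1 = int(train_ratio * len(text_data))
split_2 = int((train_ratio + val_ratio) * len(text_data))

train_data = text_data[:split_1]       # used to update the weights
val_data = text_data[split_1:split_2]  # used to monitor overfitting during training
test_data = text_data[split_2:]        # used only for the final evaluation
```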
Both data loaders use the same batch size, maximum length, stride and number of workers (0 in this case).
The main differences are the data used by each, and that the validation loader neither drops the last batch nor shuffles the data, as that isn't needed for validation purposes.
Also, the fact that the stride is as big as the context length means that there won't be any overlap between the contexts used to train the model (this reduces overfitting, but also shrinks the training data set).
Moreover, note that the batch size in this case is 2, so the data is processed in batches of 2 sequences; the main goal of this is to allow parallel processing and reduce the memory consumption per batch.
The goal is to check that there are enough tokens for training, that shapes are the expected ones, and to get some info about the number of tokens used for training and for validation:
```python
# Sanity checks
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the training loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "increase the `training_ratio`")

if total_tokens * (1 - train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the validation loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "decrease the `training_ratio`")

print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)

train_tokens = 0
for input_batch, target_batch in train_loader:
    train_tokens += input_batch.numel()

val_tokens = 0
for input_batch, target_batch in val_loader:
    val_tokens += input_batch.numel()

print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)
```
Select device for training & pre-calculations
The following code just selects the device to use and calculates a training loss and validation loss (without having trained anything yet) as a starting point.
```python
# Indicate the device to use
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using {device} device.")

model.to(device)  # no assignment model = model.to(device) necessary for nn.Module classes

# Pre-calculate losses without starting yet
torch.manual_seed(123)  # For reproducibility due to the shuffling in the data loader

with torch.no_grad():  # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)
```
Training functions
The function generate_and_print_sample will just take a context and generate some tokens in order to get a feeling of how good the model is at that point. This is called by train_model_simple after each epoch.
The function evaluate_model is called as frequently as indicated to the training function, and it's used to measure the train loss and the validation loss at that point of the model training.
Then, the big function train_model_simple is the one that actually trains the model. It expects:
The train data loader (with the data already separated and prepared for training)
The validation loader
The optimizer to use during training: This is the function that will use the gradients and will update the parameters to reduce the loss. In this case, as you will see, AdamW is used, but there are many more.
optimizer.zero_grad() is called to reset the gradients on each round to not accumulate them.
The lr param is the learning rate which determines the size of the steps taken during the optimization process when updating the model's parameters. A smaller learning rate means the optimizer makes smaller updates to the weights, which can lead to more precise convergence but might slow down training. A larger learning rate can speed up training but risks overshooting the minimum of the loss function (jump over the point where the loss function is minimized).
Weight decay modifies the loss calculation step by adding an extra term that penalizes large weights. This encourages the optimizer to find solutions with smaller weights, balancing between fitting the data well and keeping the model simple, and prevents overfitting by discouraging the model from assigning too much importance to any single feature.
Traditional optimizers like SGD with L2 regularization couple weight decay with the gradient of the loss function. However, AdamW (a variant of Adam optimizer) decouples weight decay from the gradient update, leading to more effective regularization.
The device to use for training
The number of epochs: Number of times to go over the training data
The evaluation frequency: The frequency to call evaluate_model
The evaluation iterations: The number of batches to use when evaluating the current state of the model when calling evaluate_model
The start context: The starting sentence to use when calling generate_and_print_sample
The tokenizer
```python
# Functions to train the data
def train_model_simple(model, train_loader, val_loader, optimizer, device,
                       num_epochs, eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()   # Calculate loss gradients
            optimizer.step()  # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        generate_and_print_sample(
            model, tokenizer, device, start_context
        )

    return train_losses, val_losses, track_tokens_seen


def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()  # Set in eval mode to avoid dropout
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()  # Back to training mode applying all the configurations
    return train_loss, val_loss


def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()  # Set in eval mode to avoid dropout
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train()  # Back to training mode applying all the configurations
```
To improve the learning rate there are a couple of relevant techniques called linear warmup and cosine decay.
Linear warmup consists of defining an initial learning rate and a maximum one, and progressively increasing the learning rate during the first training iterations. This is because starting the training with smaller weight updates decreases the risk of the model encountering large, destabilizing updates during its training phase.
Cosine decay is a technique that gradually reduces the learning rate following a half-cosine curve after the warmup phase, slowing weight updates to minimize the risk of overshooting the loss minima and ensure training stability in later phases.
Note that these improvements aren't included in the previous code.
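A minimal sketch of how such a schedule could be computed, with hypothetical values for the initial, peak and minimum learning rates and for the step counts:

```python
import math

# Hypothetical schedule parameters
initial_lr = 0.0001   # learning rate at step 0
peak_lr = 0.001       # learning rate reached at the end of the warmup
min_lr = 0.00001      # floor that the cosine decay approaches
warmup_steps = 20
total_steps = 1000

def lr_at_step(step):
    if step < warmup_steps:
        # Linear warmup: grow from initial_lr to peak_lr
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    # Cosine decay: follow a half-cosine from peak_lr down to min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

# Inside the training loop it would be applied with something like:
# for param_group in optimizer.param_groups:
#     param_group["lr"] = lr_at_step(global_step)
```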
With the following function it's possible to plot the evolution of the model while it was being trained:
```python
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import math

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(
        epochs_seen, val_losses, linestyle="-.", label="Validation loss"
    )
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax2 = ax1.twiny()
    ax2.plot(tokens_seen, train_losses, alpha=0)
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

    # Compute perplexity from the loss values
    train_ppls = [math.exp(loss) for loss in train_losses]
    val_ppls = [math.exp(loss) for loss in val_losses]

    # Plot perplexity over tokens seen
    plt.figure()
    plt.plot(tokens_seen, train_ppls, label='Training Perplexity')
    plt.plot(tokens_seen, val_ppls, label='Validation Perplexity')
    plt.xlabel('Tokens Seen')
    plt.ylabel('Perplexity')
    plt.title('Perplexity over Training')
    plt.legend()
    plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
```
Save the model
It's possible to save the model + optimizer if you want to continue training later:
```python
# Save the model and the optimizer for later training
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    },
    "/tmp/model_and_optimizer.pth")
# Note that this model with the optimizer occupied close to 2GB

# Restore model and optimizer for training
checkpoint = torch.load("/tmp/model_and_optimizer.pth", map_location=device)

model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train();  # Put in training mode
```
Or just the model if you are planning just to use it:
```python
# Save the model
torch.save(model.state_dict(), "model.pth")

# Load it
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth", map_location=device))
model.eval()  # Put in eval mode
```