In order to train a model, it needs to be able to generate new tokens. The generated tokens are then compared with the expected ones in order to teach the model which tokens it should generate.
Since some tokens were already predicted in the previous examples, that function can be reused for this purpose.
The goal of this sixth phase is very simple: train the model from scratch. For this, the previous LLM architecture is used, looping over the data sets with the defined loss function and optimizer in order to train all the parameters of the model.
Text Evaluation
In order to perform a correct training, it is necessary to check the predictions obtained for the expected token. The goal of the training is to maximize the likelihood of the correct token, which involves increasing its probability relative to other tokens.
In order to maximize the probability of the correct token, the weights of the model must be modified so that this probability increases. The weight updates are done via backpropagation, which requires a loss function to minimize. In this case, the loss function is based on the difference between the performed prediction and the desired one.
However, instead of working with the raw probabilities, the loss works with their natural logarithm (base e). So if the current predicted probability of the expected token was 7.4541e-05, its natural logarithm is approximately -9.5042.
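This can be checked quickly with PyTorch:

```python
import torch

# Natural logarithm of the example probability assigned to the expected token
print(torch.log(torch.tensor(7.4541e-05)))  # tensor(-9.5042)
```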
Then, for each entry with a context length of 5 tokens, for example, the model has to make 5 predictions, one per position: for the first 4 positions the expected token is simply the next token of the input, and for the fifth position it is the newly generated token. Therefore, for each entry we will have 5 predictions in that case (even if the first 4 targets were already part of the input, the model doesn't know this), 5 expected tokens and therefore 5 probabilities to maximize.
Therefore, after taking the natural logarithm of each predicted probability, the average is calculated and the sign is flipped (this is called the cross entropy loss). That is the number to reduce as close to 0 as possible, because the natural logarithm of 1 is 0:
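The following is a minimal sketch of this computation with a toy vocabulary of 10 tokens and random logits (all values here are made up, not taken from the real model). It shows that averaging the negative natural logarithms of the probabilities assigned to the expected tokens gives the same number as PyTorch's built-in cross entropy:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 5, 10)             # model output: (batch, seq_len, vocab_size)
targets = torch.tensor([[1, 4, 7, 2, 9]])  # expected token id at each of the 5 positions

# Manual computation: probability assigned to the expected token at each position
probs = torch.softmax(logits, dim=-1)
target_probs = probs[0, torch.arange(5), targets[0]]
manual_loss = -torch.log(target_probs).mean()   # average negative log-likelihood

# PyTorch's built-in cross entropy (same flattening used later by calc_loss_batch)
builtin_loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

print(manual_loss.item(), builtin_loss.item())  # both values match
```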
Another way to measure how good the model is, is a metric called perplexity. Perplexity evaluates how well a probability model predicts a sample. In language modelling, it represents the model's uncertainty when predicting the next token in a sequence.
For example, a perplexity value of 48,725 means that, when the model needs to predict a token, it is as unsure as if it had to pick among 48,725 tokens of the vocabulary which one is the right one.
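Perplexity is simply the exponential of the cross entropy loss, so it can be computed directly from it. A minimal sketch, using a hypothetical loss value chosen to match the 48,725 example above:

```python
import torch

# Hypothetical cross entropy loss (assumed value, roughly ln(48725))
loss = torch.tensor(10.7939)
perplexity = torch.exp(loss)   # ~48,700: as unsure as choosing among that many tokens
print(perplexity)

# Conversely, a perplexity of 48,725 corresponds to a loss of ln(48725) ≈ 10.79
print(torch.log(torch.tensor(48725.0)))
```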
Previous code used here but already explained in previous sections
"""This is code explained before so it won't be exaplained"""import tiktokenimport torchimport torch.nn as nnfrom torch.utils.data import Dataset, DataLoaderclassGPTDatasetV1(Dataset):def__init__(self,txt,tokenizer,max_length,stride): self.input_ids = [] self.target_ids = []# Tokenize the entire text token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})# Use a sliding window to chunk the book into overlapping sequences of max_lengthfor i inrange(0, len(token_ids) - max_length, stride): input_chunk = token_ids[i:i + max_length] target_chunk = token_ids[i +1: i + max_length +1] self.input_ids.append(torch.tensor(input_chunk)) self.target_ids.append(torch.tensor(target_chunk))def__len__(self):returnlen(self.input_ids)def__getitem__(self,idx):return self.input_ids[idx], self.target_ids[idx]defcreate_dataloader_v1(txt,batch_size=4,max_length=256,stride=128,shuffle=True,drop_last=True,num_workers=0):# Initialize the tokenizer tokenizer = tiktoken.get_encoding("gpt2")# Create dataset dataset =GPTDatasetV1(txt, tokenizer, max_length, stride)# Create dataloader dataloader =DataLoader( dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)return dataloaderclassMultiHeadAttention(nn.Module):def__init__(self,d_in,d_out,context_length,dropout,num_heads,qkv_bias=False):super().__init__()assert d_out % num_heads ==0,"d_out must be divisible by n_heads" self.d_out = d_out self.num_heads = num_heads self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) self.out_proj = nn.Linear(d_out, d_out)# Linear layer to combine head outputs self.dropout = nn.Dropout(dropout) self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))defforward(self,x): b, num_tokens, d_in = x.shape keys = self.W_key(x)# Shape: (b, num_tokens, d_out) queries = self.W_query(x) values = self.W_value(x)# We implicitly split the matrix by adding a `num_heads` dimension# Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim) keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) values = values.view(b, num_tokens, self.num_heads, self.head_dim) queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)# Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim) keys = keys.transpose(1, 2) queries = queries.transpose(1, 2) values = values.transpose(1, 2)# Compute scaled dot-product attention (aka self-attention) with a causal mask attn_scores = queries @ keys.transpose(2, 3)# Dot product for each head# Original mask truncated to the number of tokens and converted to boolean mask_bool = self.mask.bool()[:num_tokens,:num_tokens]# Use the mask to fill attention scores attn_scores.masked_fill_(mask_bool, -torch.inf) attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) attn_weights = self.dropout(attn_weights)# Shape: (b, num_tokens, num_heads, head_dim) context_vec = (attn_weights @ values).transpose(1, 2)# Combine heads, where self.d_out = self.num_heads * self.head_dim context_vec = context_vec.reshape(b, num_tokens, self.d_out) context_vec = self.out_proj(context_vec)# optional projectionreturn context_vecclassLayerNorm(nn.Module):def__init__(self,emb_dim):super().__init__() self.eps =1e-5 self.scale = nn.Parameter(torch.ones(emb_dim)) self.shift = 
nn.Parameter(torch.zeros(emb_dim))defforward(self,x): mean = x.mean(dim=-1, keepdim=True) var = x.var(dim=-1, keepdim=True, unbiased=False) norm_x = (x - mean) / torch.sqrt(var + self.eps)return self.scale * norm_x + self.shiftclassGELU(nn.Module):def__init__(self):super().__init__()defforward(self,x):return0.5* x * (1+ torch.tanh( torch.sqrt(torch.tensor(2.0/ torch.pi)) * (x +0.044715* torch.pow(x, 3)) ))classFeedForward(nn.Module):def__init__(self,cfg):super().__init__() self.layers = nn.Sequential( nn.Linear(cfg["emb_dim"], 4* cfg["emb_dim"]),GELU(), nn.Linear(4* cfg["emb_dim"], cfg["emb_dim"]), )defforward(self,x):return self.layers(x)classTransformerBlock(nn.Module):def__init__(self,cfg):super().__init__() self.att =MultiHeadAttention( d_in=cfg["emb_dim"], d_out=cfg["emb_dim"], context_length=cfg["context_length"], num_heads=cfg["n_heads"], dropout=cfg["drop_rate"], qkv_bias=cfg["qkv_bias"]) self.ff =FeedForward(cfg) self.norm1 =LayerNorm(cfg["emb_dim"]) self.norm2 =LayerNorm(cfg["emb_dim"]) self.drop_shortcut = nn.Dropout(cfg["drop_rate"])defforward(self,x):# Shortcut connection for attention block shortcut = x x = self.norm1(x) x = self.att(x)# Shape [batch_size, num_tokens, emb_size] x = self.drop_shortcut(x) x = x + shortcut # Add the original input back# Shortcut connection for feed-forward block shortcut = x x = self.norm2(x) x = self.ff(x) x = self.drop_shortcut(x) x = x + shortcut # Add the original input backreturn xclassGPTModel(nn.Module):def__init__(self,cfg):super().__init__() self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"]) self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"]) self.drop_emb = nn.Dropout(cfg["drop_rate"]) self.trf_blocks = nn.Sequential(*[TransformerBlock(cfg) for _ inrange(cfg["n_layers"])]) self.final_norm =LayerNorm(cfg["emb_dim"]) self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)defforward(self,in_idx): batch_size, seq_len = in_idx.shape tok_embeds = self.tok_emb(in_idx) pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device)) x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size] x = self.drop_emb(x) x = self.trf_blocks(x) x = self.final_norm(x) logits = self.out_head(x)return logits
```python
# Download contents to train the data with
import os
import urllib.request

file_path = "the-verdict.txt"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"

if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()

total_characters = len(text_data)
tokenizer = tiktoken.get_encoding("gpt2")
total_tokens = len(tokenizer.encode(text_data))

print("Data downloaded")
print("Characters:", total_characters)
print("Tokens:", total_tokens)

# Model initialization
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 256,  # Shortened context length (orig: 1024)
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-key-value bias
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()
print("Model initialized")


# Functions to transform from tokens to ids and from ids to tokens
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # remove batch dimension
    return tokenizer.decode(flat.tolist())


# Define loss functions
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss

def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches


# Apply Train/validation ratio and create dataloaders
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)

# Sanity checks
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the training loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "increase the `training_ratio`")

if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the validation loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "decrease the `training_ratio`")

print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)

train_tokens = 0
for input_batch, target_batch in train_loader:
    train_tokens += input_batch.numel()

val_tokens = 0
for input_batch, target_batch in val_loader:
    val_tokens += input_batch.numel()

print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)


# Indicate the device to use
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using {device} device.")

model.to(device)  # no assignment model = model.to(device) necessary for nn.Module classes


# Pre-calculate losses without starting yet
torch.manual_seed(123)  # For reproducibility due to the shuffling in the data loader

with torch.no_grad():  # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)


# Functions to train the data
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
```