4. Attention Mechanisms

Attention Mechanisms and Self-Attention in Neural Networks

Attention mechanisms allow neural networks to focus on specific parts of the input when generating each part of the output. They assign different weights to different inputs, helping the model decide which inputs are most relevant to the task at hand. This is crucial in tasks like machine translation, where understanding the context of the entire sentence is necessary for accurate translation.

The goal of this fourth phase is very simple: Apply some attetion mechanisms. These are going to be a lot of repeated layers that are going to capture the relation of a word in the vocabulary with its neighbours in the current sentence being used to train the LLM. A lot of layers are used for this, so a lot of trainable parameters are going to be capturing this information.

Understanding Attention Mechanisms

In traditional sequence-to-sequence models used for language translation, the model encodes an input sequence into a fixed-size context vector. However, this approach struggles with long sentences because the fixed-size context vector may not capture all necessary information. Attention mechanisms address this limitation by allowing the model to consider all input tokens when generating each output token.

Example: Machine Translation

Consider translating the German sentence "Kannst du mir helfen diesen Satz zu übersetzen" into English. A word-by-word translation would not produce a grammatically correct English sentence due to differences in grammatical structures between languages. An attention mechanism enables the model to focus on relevant parts of the input sentence when generating each word of the output sentence, leading to a more accurate and coherent translation.

Introduction to Self-Attention

Self-attention, or intra-attention, is a mechanism where attention is applied within a single sequence to compute a representation of that sequence. It allows each token in the sequence to attend to all other tokens, helping the model capture dependencies between tokens regardless of their distance in the sequence.

Key Concepts

  • Tokens: Vipengele vya kibinafsi vya mfuatano wa ingizo (e.g., maneno katika sentensi).

  • Embeddings: Uwiano wa vektori wa tokens, ukichukua taarifa za maana.

  • Attention Weights: Thamani zinazotathmini umuhimu wa kila token kulingana na wengine.

Calculating Attention Weights: A Step-by-Step Example

Let's consider the sentence "Hello shiny sun!" and represent each word with a 3-dimensional embedding:

  • Hello: [0.34, 0.22, 0.54]

  • shiny: [0.53, 0.34, 0.98]

  • sun: [0.29, 0.54, 0.93]

Our goal is to compute the context vector for the word "shiny" using self-attention.

Step 1: Compute Attention Scores

Just multiply each dimension value of the query with the relevant one of each token and add the results. You get 1 value per pair of tokens.

For each word in the sentence, compute the attention score with respect to "shiny" by calculating the dot product of their embeddings.

Attention Score between "Hello" and "shiny"

Attention Score between "shiny" and "shiny"

Attention Score between "sun" and "shiny"

Step 2: Normalize Attention Scores to Obtain Attention Weights

Don't get lost in the mathematical terms, the goal of this function is simple, normalize all the weights so they sum 1 in total.

Moreover, softmax function is used because it accentuates differences due to the exponential part, making easier to detect useful values.

Apply the softmax function to the attention scores to convert them into attention weights that sum to 1.

Calculating the exponentials:

Calculating the sum:

Calculating attention weights:

Step 3: Compute the Context Vector

Just get each attention weight and multiply it to the related token dimensions and then sum all the dimensions to get just 1 vector (the context vector)

The context vector is computed as the weighted sum of the embeddings of all words, using the attention weights.

Calculating each component:

  • Weighted Embedding of "Hello":

* **Weighted Embedding of "shiny"**:

* **Weighted Embedding of "sun"**:

Summing the weighted embeddings:

context vector=[0.0779+0.2156+0.1057, 0.0504+0.1382+0.1972, 0.1237+0.3983+0.3390]=[0.3992,0.3858,0.8610]

This context vector represents the enriched embedding for the word "shiny," incorporating information from all words in the sentence.

Summary of the Process

  1. Compute Attention Scores: Use the dot product between the embedding of the target word and the embeddings of all words in the sequence.

  2. Normalize Scores to Get Attention Weights: Apply the softmax function to the attention scores to obtain weights that sum to 1.

  3. Compute Context Vector: Multiply each word's embedding by its attention weight and sum the results.

Self-Attention with Trainable Weights

In practice, self-attention mechanisms use trainable weights to learn the best representations for queries, keys, and values. This involves introducing three weight matrices:

The query is the data to use like before, while the keys and values matrices are just random-trainable matrices.

Step 1: Compute Queries, Keys, and Values

Each token will have its own query, key and value matrix by multiplying its dimension values by the defined matrices:

These matrices transform the original embeddings into a new space suitable for computing attention.

Example

Assuming:

  • Input dimension din=3 (embedding size)

  • Output dimension dout=2 (desired dimension for queries, keys, and values)

Initialize the weight matrices:

import torch.nn as nn

d_in = 3
d_out = 2

W_query = nn.Parameter(torch.rand(d_in, d_out))
W_key = nn.Parameter(torch.rand(d_in, d_out))
W_value = nn.Parameter(torch.rand(d_in, d_out))

Hesabu maswali, funguo, na thamani:

queries = torch.matmul(inputs, W_query)
keys = torch.matmul(inputs, W_key)
values = torch.matmul(inputs, W_value)

Step 2: Compute Scaled Dot-Product Attention

Compute Attention Scores

Kama ilivyo katika mfano wa awali, lakini wakati huu, badala ya kutumia thamani za vipimo vya tokens, tunatumia matrix ya funguo ya token (iliyohesabiwa tayari kwa kutumia vipimo):. Hivyo, kwa kila query qi​ na funguo kj​:

Scale the Scores

Ili kuzuia dot products kuwa kubwa sana, ziongeze kwa mzizi wa mraba wa kipimo cha funguo dk​:

Alama inagawanywa kwa mzizi wa mraba wa vipimo kwa sababu dot products zinaweza kuwa kubwa sana na hii inasaidia kudhibiti.

Apply Softmax to Obtain Attention Weights: Kama katika mfano wa awali, sanifisha thamani zote ili zijumuishe 1.

Step 3: Compute Context Vectors

Kama katika mfano wa awali, jumlisha tu matrices zote za thamani ukizidisha kila moja kwa uzito wake wa umakini:

Code Example

Kuchukua mfano kutoka https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb unaweza kuangalia darasa hili linalotekeleza kazi ya kujitunza tuliyozungumzia:

import torch

inputs = torch.tensor(
[[0.43, 0.15, 0.89], # Your     (x^1)
[0.55, 0.87, 0.66], # journey  (x^2)
[0.57, 0.85, 0.64], # starts   (x^3)
[0.22, 0.58, 0.33], # with     (x^4)
[0.77, 0.25, 0.10], # one      (x^5)
[0.05, 0.80, 0.55]] # step     (x^6)
)

import torch.nn as nn
class SelfAttention_v2(nn.Module):

def __init__(self, d_in, d_out, qkv_bias=False):
super().__init__()
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

def forward(self, x):
keys = self.W_key(x)
queries = self.W_query(x)
values = self.W_value(x)

attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

context_vec = attn_weights @ values
return context_vec

d_in=3
d_out=2
torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))

Kumbuka kwamba badala ya kuanzisha matrices na thamani za nasibu, nn.Linear inatumika kuashiria uzito wote kama vigezo vya kufundisha.

Causal Attention: Kuficha Maneno ya Baadaye

Kwa LLMs tunataka mfano uzingatie tu tokens ambazo zinaonekana kabla ya nafasi ya sasa ili kutabiri token inayofuata. Causal attention, pia inajulikana kama masked attention, inafanikiwa kwa kubadilisha mekanizma ya attention ili kuzuia ufikiaji wa tokens za baadaye.

Kutumia Mask ya Causal Attention

Ili kutekeleza causal attention, tunatumia mask kwa alama za attention kabla ya operesheni ya softmax ili zile zilizobaki bado zikusanye 1. Mask hii inaweka alama za attention za tokens za baadaye kuwa negative infinity, kuhakikisha kwamba baada ya softmax, uzito wao wa attention ni sifuri.

Hatua

  1. Hesabu Alama za Attention: Kama ilivyokuwa hapo awali.

  2. Tumia Mask: Tumia matrix ya juu ya pembeni iliyojaa negative infinity juu ya diagonal.

mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1) * float('-inf')
masked_scores = attention_scores + mask
  1. Tumia Softmax: Hesabu uzito wa attention kwa kutumia alama zilizofichwa.

attention_weights = torch.softmax(masked_scores, dim=-1)

Kuficha Uzito wa Ziada wa Attention kwa Kutumia Dropout

Ili kuzuia overfitting, tunaweza kutumia dropout kwa uzito wa attention baada ya operesheni ya softmax. Dropout hufanya sifuri kwa nasibu baadhi ya uzito wa attention wakati wa mafunzo.

dropout = nn.Dropout(p=0.5)
attention_weights = dropout(attention_weights)

A regular dropout ni takriban 10-20%.

Code Example

Code example from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb:

import torch
import torch.nn as nn

inputs = torch.tensor(
[[0.43, 0.15, 0.89], # Your     (x^1)
[0.55, 0.87, 0.66], # journey  (x^2)
[0.57, 0.85, 0.64], # starts   (x^3)
[0.22, 0.58, 0.33], # with     (x^4)
[0.77, 0.25, 0.10], # one      (x^5)
[0.05, 0.80, 0.55]] # step     (x^6)
)

batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape)

class CausalAttention(nn.Module):

def __init__(self, d_in, d_out, context_length,
dropout, qkv_bias=False):
super().__init__()
self.d_out = d_out
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
self.dropout = nn.Dropout(dropout)
self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New

def forward(self, x):
b, num_tokens, d_in = x.shape
# b is the num of batches
# num_tokens is the number of tokens per batch
# d_in is the dimensions er token

keys = self.W_key(x) # This generates the keys of the tokens
queries = self.W_query(x)
values = self.W_value(x)

attn_scores = queries @ keys.transpose(1, 2) # Moves the third dimension to the second one and the second one to the third one to be able to multiply
attn_scores.masked_fill_(  # New, _ ops are in-place
self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)  # `:num_tokens` to account for cases where the number of tokens in the batch is smaller than the supported context_size
attn_weights = torch.softmax(
attn_scores / keys.shape[-1]**0.5, dim=-1
)
attn_weights = self.dropout(attn_weights)

context_vec = attn_weights @ values
return context_vec

torch.manual_seed(123)

context_length = batch.shape[1]
d_in = 3
d_out = 2
ca = CausalAttention(d_in, d_out, context_length, 0.0)

context_vecs = ca(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

Kupanua Umakini wa Kichwa Kimoja hadi Umakini wa Vichwa Vingi

Umakini wa vichwa vingi kwa maneno ya vitendo unajumuisha kutekeleza matukio mengi ya kazi ya umakini wa ndani kila moja ikiwa na uzito wake mwenyewe ili vektori tofauti za mwisho zihesabiwe.

Mfano wa Kanuni

Inaweza kuwa inawezekana kutumia tena kanuni ya awali na kuongeza tu kifuniko kinachokifanya kiendeshe mara kadhaa, lakini hii ni toleo lililoimarishwa zaidi kutoka https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb ambayo inashughulikia vichwa vyote kwa wakati mmoja (ikiwapunguza idadi ya mizunguko ya gharama kubwa). Kama unavyoona katika kanuni, vipimo vya kila token vinagawanywa katika vipimo tofauti kulingana na idadi ya vichwa. Kwa njia hii, ikiwa token ina vipimo 8 na tunataka kutumia vichwa 3, vipimo vitagawanywa katika arrays 2 za vipimo 4 na kila kichwa kitatumia moja yao:

class MultiHeadAttention(nn.Module):
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
super().__init__()
assert (d_out % num_heads == 0), \
"d_out must be divisible by num_heads"

self.d_out = d_out
self.num_heads = num_heads
self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
self.dropout = nn.Dropout(dropout)
self.register_buffer(
"mask",
torch.triu(torch.ones(context_length, context_length),
diagonal=1)
)

def forward(self, x):
b, num_tokens, d_in = x.shape
# b is the num of batches
# num_tokens is the number of tokens per batch
# d_in is the dimensions er token

keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
queries = self.W_query(x)
values = self.W_value(x)

# We implicitly split the matrix by adding a `num_heads` dimension
# Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

# Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
keys = keys.transpose(1, 2)
queries = queries.transpose(1, 2)
values = values.transpose(1, 2)

# Compute scaled dot-product attention (aka self-attention) with a causal mask
attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

# Original mask truncated to the number of tokens and converted to boolean
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

# Use the mask to fill attention scores
attn_scores.masked_fill_(mask_bool, -torch.inf)

attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
attn_weights = self.dropout(attn_weights)

# Shape: (b, num_tokens, num_heads, head_dim)
context_vec = (attn_weights @ values).transpose(1, 2)

# Combine heads, where self.d_out = self.num_heads * self.head_dim
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
context_vec = self.out_proj(context_vec) # optional projection

return context_vec

torch.manual_seed(123)

batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

Kwa utekelezaji mwingine wa kompakt na mzuri unaweza kutumia torch.nn.MultiheadAttention darasa katika PyTorch.

Jibu fupi la ChatGPT kuhusu kwa nini ni bora kugawanya vipimo vya tokens kati ya vichwa badala ya kuwa na kila kichwa kinachunguza vipimo vyote vya tokens zote:

Ingawa kuruhusu kila kichwa kushughulikia vipimo vyote vya embedding kunaweza kuonekana kuwa na faida kwa sababu kila kichwa kitakuwa na ufikiaji wa taarifa kamili, mazoea ya kawaida ni kugawanya vipimo vya embedding kati ya vichwa. Njia hii inalinganisha ufanisi wa kompyuta na utendaji wa mfano na inahimiza kila kichwa kujifunza uwakilishi tofauti. Hivyo, kugawanya vipimo vya embedding kwa ujumla kunapendekezwa kuliko kuwa na kila kichwa kinachunguza vipimo vyote.

Marejeo

Last updated