2. Data Sampling

Data Sampling

Data Sampling is a crucial process in preparing data for training large language models (LLMs) like GPT. It involves organizing text data into input and target sequences that the model uses to learn how to predict the next word (or token) based on the preceding words. Proper data sampling ensures that the model effectively captures language patterns and dependencies.

The goal of this second phase is very simple: Sample the input data and prepare it for the training phase usually by separating the dataset into sentences of a specific length and generating also the expected response.

Why Data Sampling Matters

LLMs such as GPT are trained to generate or predict text by understanding the context provided by previous words. To achieve this, the training data must be structured in a way that the model can learn the relationship between sequences of words and their subsequent words. This structured approach allows the model to generalize and generate coherent and contextually relevant text.

Key Concepts in Data Sampling

Tokenization: Breaking down text into smaller units called tokens (e.g., words, subwords, or characters).
Sequence Length (max_length): The number of tokens in each input sequence.
Sliding Window: A method to create overlapping input sequences by moving a window over the tokenized text.
Stride: The number of tokens the sliding window moves forward to create the next sequence.

Step-by-Step Example

Let's walk through an example to illustrate data sampling.

Example Text

"Lorem ipsum dolor sit amet, consectetur adipiscing elit."

Tokenization

Assume we use a basic tokenizer that splits the text into words and punctuation marks:

Tokens: ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]

Parameters

Max Sequence Length (max_length): 4 tokens
Sliding Window Stride: 1 token

Creating Input and Target Sequences

Sliding Window Approach:
- Input Sequences: Each input sequence consists of max_length tokens.
- Target Sequences: Each target sequence consists of the tokens that immediately follow the corresponding input sequence.

Generating Sequences:

Window Position	Input Sequence	Target Sequence
1	["Lorem", "ipsum", "dolor", "sit"]	["ipsum", "dolor", "sit", "amet,"]
2	["ipsum", "dolor", "sit", "amet,"]	["dolor", "sit", "amet,", "consectetur"]
3	["dolor", "sit", "amet,", "consectetur"]	["sit", "amet,", "consectetur", "adipiscing"]
4	["sit", "amet,", "consectetur", "adipiscing"]	["amet,", "consectetur", "adipiscing", "elit."]

Window Position

Input Sequence

Target Sequence

["Lorem", "ipsum", "dolor", "sit"]

["ipsum", "dolor", "sit", "amet,"]

["dolor", "sit", "amet,", "consectetur"]

["sit", "amet,", "consectetur", "adipiscing"]

["amet,", "consectetur", "adipiscing", "elit."]

Resulting Input and Target Arrays:

Input:

[
  ["Lorem", "ipsum", "dolor", "sit"],
  ["ipsum", "dolor", "sit", "amet,"],
  ["dolor", "sit", "amet,", "consectetur"],
  ["sit", "amet,", "consectetur", "adipiscing"],
]

Target:

[
  ["ipsum", "dolor", "sit", "amet,"],
  ["dolor", "sit", "amet,", "consectetur"],
  ["sit", "amet,", "consectetur", "adipiscing"],
  ["amet,", "consectetur", "adipiscing", "elit."],
]

Visual Representation

Token Position	Token
1	Lorem
2	ipsum
3	dolor
4	sit
5	amet,
6	consectetur
7	adipiscing
8	elit.

Sliding Window with Stride 1:

First Window (Positions 1-4): ["Lorem", "ipsum", "dolor", "sit"] → Target: ["ipsum", "dolor", "sit", "amet,"]
Second Window (Positions 2-5): ["ipsum", "dolor", "sit", "amet,"] → Target: ["dolor", "sit", "amet,", "consectetur"]
Third Window (Positions 3-6): ["dolor", "sit", "amet,", "consectetur"] → Target: ["sit", "amet,", "consectetur", "adipiscing"]
Fourth Window (Positions 4-7): ["sit", "amet,", "consectetur", "adipiscing"] → Target: ["amet,", "consectetur", "adipiscing", "elit."]

Understanding Stride

Stride of 1: The window moves forward by one token each time, resulting in highly overlapping sequences. This can lead to better learning of contextual relationships but may increase the risk of overfitting since similar data points are repeated.
Stride of 2: The window moves forward by two tokens each time, reducing overlap. This decreases redundancy and computational load but might miss some contextual nuances.
Stride Equal to max_length: The window moves forward by the entire window size, resulting in non-overlapping sequences. This minimizes data redundancy but may limit the model's ability to learn dependencies across sequences.

Example with Stride of 2:

Using the same tokenized text and max_length of 4:

First Window (Positions 1-4): ["Lorem", "ipsum", "dolor", "sit"] → Target: ["ipsum", "dolor", "sit", "amet,"]
Second Window (Positions 3-6): ["dolor", "sit", "amet,", "consectetur"] → Target: ["sit", "amet,", "consectetur", "adipiscing"]
Third Window (Positions 5-8): ["amet,", "consectetur", "adipiscing", "elit."] → Target: ["consectetur", "adipiscing", "elit.", "sed"] (Assuming continuation)

Code Example

Let's understand this better from a code example from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb:

# Download the text to pre-train the LLM
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

"""
Create a class that will receive some params lie tokenizer and text
and will prepare the input chunks and the target chunks to prepare
the LLM to learn which next token to generate
"""
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


"""
Create a data loader which given the text and some params will
prepare the inputs and targets with the previous class and
then create a torch DataLoader with the info
"""

import tiktoken

def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader


"""
Finally, create the data loader with the params we want:
- The used text for training
- batch_size: The size of each batch
- max_length: The size of each entry on each batch
- stride: The sliding window (how many tokens should the next entry advance compared to the previous one). The smaller the more overfitting, usually this is equals to the max_length so the same tokens aren't repeated.
- shuffle: Re-order randomly
"""
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

# Note the batch_size of 8, the max_length of 4 and the stride of 1
[
# Input
tensor([[   40,   367,  2885,  1464],
        [  367,  2885,  1464,  1807],
        [ 2885,  1464,  1807,  3619],
        [ 1464,  1807,  3619,   402],
        [ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257]]),
# Target
tensor([[  367,  2885,  1464,  1807],
        [ 2885,  1464,  1807,  3619],
        [ 1464,  1807,  3619,   402],
        [ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257],
        [10899,  2138,   257,  7026]])
]

# With stride=4 this will be the result:
[
# Input
tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]]),
# Target
tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
]

References

https://www.manning.com/books/build-a-large-language-model-from-scratch

Previous1. Tokenizing Next3. Token Embeddings

Last updated 2 months ago