2. Data Sampling
Data Sampling is a crucial process in preparing data for training large language models (LLMs) like GPT. It involves organizing text data into input and target sequences that the model uses to learn how to predict the next word (or token) based on the preceding words. Proper data sampling ensures that the model effectively captures language patterns and dependencies.
The goal of this second phase is simple: sample the input data and prepare it for training, usually by splitting the dataset into sequences of a fixed length and generating the expected target for each one.
Why Data Sampling Matters
LLMs such as GPT are trained to generate or predict text by understanding the context provided by previous words. To achieve this, the training data must be structured in a way that the model can learn the relationship between sequences of words and their subsequent words. This structured approach allows the model to generalize and generate coherent and contextually relevant text.
Key Concepts in Data Sampling
Tokenization: Breaking down text into smaller units called tokens (e.g., words, subwords, or characters).
Sequence Length (max_length): The number of tokens in each input sequence.
Sliding Window: A method to create overlapping input sequences by moving a window over the tokenized text.
Stride: The number of tokens the sliding window moves forward to create the next sequence.
Step-by-Step Example
Let's walk through an example to illustrate data sampling.
Example Text
"Lorem ipsum dolor sit amet, consectetur adipiscing elit."
Tokenization
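Assume we use a basic tokenizer that splits the text on whitespace, so punctuation marks stay attached to the preceding word:
Tokens: ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]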
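The tokens used in this walkthrough keep punctuation attached to words ("amet,", "elit."), which is what a plain whitespace split produces. A minimal sketch, assuming the example text is the Lorem ipsum sentence implied by those tokens:

```python
# Whitespace tokenization sketch. The sample text is inferred from the
# tokens used in this walkthrough.
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

# str.split() with no arguments splits on any run of whitespace,
# so punctuation stays attached to the neighboring word.
tokens = text.split()

print(tokens)
print(len(tokens))  # 8
```

Real LLM pipelines typically use subword tokenizers instead, which map text to integer IDs; the whitespace split is only a stand-in to keep the example readable.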
Parameters
Max Sequence Length (max_length): 4 tokens
Sliding Window Stride: 1 token
Creating Input and Target Sequences
Sliding Window Approach:
Input Sequences: Each input sequence consists of max_length tokens.
Target Sequences: Each target sequence is the corresponding input sequence shifted one token to the right, so it ends with the token that immediately follows the input.
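As a small illustration of this input/target relationship, the sketch below builds a single window and its target by slicing the token list at an offset of one. The variable names are chosen for illustration only. Pairing the two position by position shows why the shift is useful: each prefix of the input is matched with the token that comes next.

```python
tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,",
          "consectetur", "adipiscing", "elit."]
max_length = 4
start = 0  # index of the first token in the window

input_seq = tokens[start : start + max_length]
target_seq = tokens[start + 1 : start + max_length + 1]

# Each prefix of the input is paired with the token that follows it.
pairs = [(input_seq[: i + 1], target_seq[i]) for i in range(max_length)]
for context, next_token in pairs:
    print(context, "->", next_token)
```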
Generating Sequences:
Window Position | Input Sequence | Target Sequence
1 | ["Lorem", "ipsum", "dolor", "sit"] | ["ipsum", "dolor", "sit", "amet,"]
2 | ["ipsum", "dolor", "sit", "amet,"] | ["dolor", "sit", "amet,", "consectetur"]
3 | ["dolor", "sit", "amet,", "consectetur"] | ["sit", "amet,", "consectetur", "adipiscing"]
4 | ["sit", "amet,", "consectetur", "adipiscing"] | ["amet,", "consectetur", "adipiscing", "elit."]
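Resulting Input and Target Arrays:
Input:
[["Lorem", "ipsum", "dolor", "sit"],
 ["ipsum", "dolor", "sit", "amet,"],
 ["dolor", "sit", "amet,", "consectetur"],
 ["sit", "amet,", "consectetur", "adipiscing"]]
Target:
[["ipsum", "dolor", "sit", "amet,"],
 ["dolor", "sit", "amet,", "consectetur"],
 ["sit", "amet,", "consectetur", "adipiscing"],
 ["amet,", "consectetur", "adipiscing", "elit."]]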
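The four windows listed in this section can be generated with a short loop. This is a minimal sketch, not a reference implementation: the function name sliding_windows is hypothetical, and it assumes a window is kept only when its full shifted target exists.

```python
def sliding_windows(tokens, max_length, stride):
    """Return (inputs, targets); each target is its input shifted by one token."""
    inputs, targets = [], []
    # Stop early enough that every target window is complete.
    for i in range(0, len(tokens) - max_length, stride):
        inputs.append(tokens[i : i + max_length])
        targets.append(tokens[i + 1 : i + max_length + 1])
    return inputs, targets


tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,",
          "consectetur", "adipiscing", "elit."]
inputs, targets = sliding_windows(tokens, max_length=4, stride=1)

for x, y in zip(inputs, targets):
    print(x, "->", y)
```

With 8 tokens, max_length 4, and stride 1, the loop visits start indices 0 through 3 and yields four input/target pairs.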
Visual Representation
Position | Token
1 | Lorem
2 | ipsum
3 | dolor
4 | sit
5 | amet,
6 | consectetur
7 | adipiscing
8 | elit.
Sliding Window with Stride 1:
First Window (Positions 1-4): ["Lorem", "ipsum", "dolor", "sit"] → Target: ["ipsum", "dolor", "sit", "amet,"]
Second Window (Positions 2-5): ["ipsum", "dolor", "sit", "amet,"] → Target: ["dolor", "sit", "amet,", "consectetur"]
Third Window (Positions 3-6): ["dolor", "sit", "amet,", "consectetur"] → Target: ["sit", "amet,", "consectetur", "adipiscing"]
Fourth Window (Positions 4-7): ["sit", "amet,", "consectetur", "adipiscing"] → Target: ["amet,", "consectetur", "adipiscing", "elit."]
Understanding Stride
Stride of 1: The window moves forward by one token each time, resulting in highly overlapping sequences. This can lead to better learning of contextual relationships but may increase the risk of overfitting since similar data points are repeated.
Stride of 2: The window moves forward by two tokens each time, reducing overlap. This decreases redundancy and computational load but might miss some contextual nuances.
Stride Equal to max_length: The window moves forward by the entire window size, resulting in non-overlapping sequences. This minimizes data redundancy but may limit the model's ability to learn dependencies across sequences.
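The trade-off between these stride choices can be made concrete by counting windows and overlap. The sketch below is an illustration under one assumption: a window is kept only if its complete one-token-shifted target exists, which matches the four windows of the stride-1 example. The helper names window_count and overlap are hypothetical.

```python
def window_count(num_tokens, max_length, stride):
    """Number of (input, target) pairs a sliding window produces."""
    usable = num_tokens - max_length  # valid start positions at stride 1
    if usable <= 0:
        return 0
    return (usable - 1) // stride + 1


def overlap(max_length, stride):
    """Tokens shared by two consecutive input windows."""
    return max(max_length - stride, 0)


num_tokens, max_length = 1000, 4
for stride in (1, 2, max_length):
    print(
        f"stride={stride}: "
        f"{window_count(num_tokens, max_length, stride)} windows, "
        f"{overlap(max_length, stride)} shared tokens per step"
    )
```

For 1000 tokens this gives 996 windows at stride 1, 498 at stride 2, and 249 at stride 4, so a larger stride shrinks the dataset roughly in proportion while removing the overlap.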
Example with Stride of 2:
Using the same tokenized text and max_length of 4:
First Window (Positions 1-4): ["Lorem", "ipsum", "dolor", "sit"] → Target: ["ipsum", "dolor", "sit", "amet,"]
Second Window (Positions 3-6): ["dolor", "sit", "amet,", "consectetur"] → Target: ["sit", "amet,", "consectetur", "adipiscing"]
Third Window (Positions 5-8): ["amet,", "consectetur", "adipiscing", "elit."] → Target: ["consectetur", "adipiscing", "elit.", "sed"] (Assuming continuation)
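The stride-2 windows can be reproduced as follows. This is a sketch under the same assumption the example makes: a ninth token, "sed", is appended so that the third window has a complete target. Positions are reported 1-based and inclusive to match the listing.

```python
tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,",
          "consectetur", "adipiscing", "elit.",
          "sed"]  # assumed continuation of the text
max_length, stride = 4, 2

windows = []
for start in range(0, len(tokens) - max_length, stride):
    end = start + max_length
    windows.append({
        "positions": (start + 1, end),  # 1-based, inclusive
        "input": tokens[start:end],
        "target": tokens[start + 1 : end + 1],
    })

for w in windows:
    print(w["positions"], w["input"], "->", w["target"])
```

Without the extra token, the loop yields only the first two windows, because the third would have an incomplete target.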
Code Example
For a full implementation of this sampling step, see the code in https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb.
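To show the general shape of such a pipeline, here is a minimal, dependency-free sketch. It is not code from the notebook, which builds on PyTorch and a BPE tokenizer; the class SlidingWindowDataset, the helper batches, and the toy integer token IDs are all assumptions made for illustration. The class stores precomputed input/target pairs and exposes them through `__len__` and `__getitem__`, and the helper groups them into fixed-size batches.

```python
class SlidingWindowDataset:
    """Hypothetical dataset of (input_ids, target_ids) pairs built from token IDs."""

    def __init__(self, token_ids, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        # Slide a window of max_length over the IDs; the target is the
        # same window shifted one position to the right.
        for i in range(0, len(token_ids) - max_length, stride):
            self.input_ids.append(token_ids[i : i + max_length])
            self.target_ids.append(token_ids[i + 1 : i + max_length + 1])

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def batches(dataset, batch_size):
    """Yield lists of (input, target) pairs; a final short batch is dropped."""
    for start in range(0, len(dataset) - batch_size + 1, batch_size):
        yield [dataset[i] for i in range(start, start + batch_size)]


token_ids = list(range(10, 30))  # toy IDs standing in for tokenizer output
dataset = SlidingWindowDataset(token_ids, max_length=4, stride=4)
first_batch = next(batches(dataset, batch_size=2))

print(len(dataset))
for input_ids, target_ids in first_batch:
    print(input_ids, "->", target_ids)
```

Here the stride equals max_length, so the input windows do not overlap. Lowering the stride produces more, overlapping samples from the same token stream.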