Data Preparation for Large Language Models
I will discuss the high-level concepts behind the data preparation steps for LLMs. The context is from the book "Build a Large Language Model (From Scratch)" by Sebastian Raschka, whose figure of the LLM data preparation pipeline outlines the steps covered here.
1. Tokenizer
The first step in building an LLM is data preparation. We feed the LLM textual data (for generative models), but the text must be preprocessed before it is fed into the model.
It begins with tokenization. Tokenization can be of different types:
- 1) character based: ['I', 'a', 'm', 'a', 'n', ...];
- 2) word based: ['I', 'am', 'an', 'amateur']; or
- 3) subword based: ['I', 'am', 'an', 'am', 'ateur'].
These examples are just for demonstration and do not represent the actual tokenization process.
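As a rough illustration (plain Python string operations, not a real tokenizer), the first two splits could look like this:

text = "I am an amateur"
char_tokens = list(text)        # character-based: every character (including spaces) becomes a token
word_tokens = text.split()      # word-based: split on whitespace
print(char_tokens)
print(word_tokens)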
From our statistics or machine learning background, we know that these models can only process numeric data. Therefore, we need to convert the tokens into numeric values.
We will use the "gpt2" tokenizer, which uses Byte-Pair Encoding (BPE). We import the tiktoken library to get the gpt2 tokenizer.
This tokenizer converts the entire text into tokens and their corresponding numeric representations (token IDs).
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
The text file 'the-verdict.txt' will be used as the data source for this example. We need to load the data first.
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    texts = f.read()
print(texts[:50])
I HAD always thought Jack Gisburn rather a cheap g
Let us see how the gpt2 tokenizer converts the text into token IDs using the .encode method.
# encoding creates token ids
token_ids50 = tokenizer.encode(texts[:50])
print(token_ids50)
[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 308]
Let's now see the token IDs and the corresponding words encoded by the gpt2 tokenizer. The .decode method converts token IDs back into the corresponding words or subwords.
# decoding retrieves corresponding words or subwords from token ids
tokens50 = tokenizer.decode(token_ids50)
print(tokens50)
I HAD always thought Jack Gisburn rather a cheap g
for token_id in token_ids50:
    token = tokenizer.decode([token_id])
    print(token, '----->', token_id)
I -----> 40
 H -----> 367
AD -----> 2885
 always -----> 1464
 thought -----> 1807
 Jack -----> 3619
 G -----> 402
is -----> 271
burn -----> 10899
 rather -----> 2138
 a -----> 257
 cheap -----> 7026
 g -----> 308
From the above we can see how tokenization works. It is worth mentioning that the tokenization process also captures the leading space of a token, so " always" (with a space) and "always" map to different token IDs.
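As a small check of this behaviour (no particular IDs assumed), encoding the same word with and without a leading space gives different results:

print(tokenizer.encode(" always"))   # token id for " always" including the leading space
print(tokenizer.encode("always"))    # different token id(s) for "always" without it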
2. A Dataset and DataLoader
An LLM is essentially a next-word predictor. If we feed the LLM [The capital of Bangladesh is ], we can hope to get [Dhaka]. We need to create a dataset of input and target pairs. This approach is similar to time-series forecasting with a sliding window.
Suppose we have 13 data points: [1,3,9,0,2,4,6,8,7,5,6,4,0]. In the sliding window approach, with a window size of 4 and a stride of 4, our input:target pairs may look like the following.
input ------------> target
[1,3,9,0]------------>[3,9,0,2]
[2,4,6,8]------------>[4,6,8,7]
[7,5,6,4]------------>[5,6,4,0]
We can observe that each target is the input shifted by one position. The stride determines how far the window start moves between rows; here it moves by 4 data points: [1,3,9,0], then [2,4,6,8], and so on.
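A minimal sketch that reproduces the table above with plain Python lists:

data = [1, 3, 9, 0, 2, 4, 6, 8, 7, 5, 6, 4, 0]
window_size, stride = 4, 4
for i in range(0, len(data) - window_size, stride):
    # target window starts one position after the input window
    print(data[i:i + window_size], '------------>', data[i + 1:i + window_size + 1])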
Similarly, the token IDs of our dataset (the text of any documents) are fed to the LLM as input:target pairs. During pretraining, the LLM is trained to predict the next token ID.
PyTorch provides Dataset and DataLoader utilities. We will use them and create a GPTDatasetV1 class that builds the input:target pairs. The token IDs are converted into PyTorch tensors so they can be batched and processed efficiently.
First, let's create a dataset class that builds the input:target pairs from our data. max_length plays the role of the window size from the time-series example.
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)  #1 tokenize the entire text

        #2 slide a window of size max_length over the token ids
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):  #3 number of input:target pairs
        return len(self.input_ids)

    def __getitem__(self, idx):  #4 return one input:target pair
        return self.input_ids[idx], self.target_ids[idx]
# max_length = number of token_ids considered in each sliding window
# stride = number of token_ids the window shifts between consecutive pairs
# for example, with max_length=4 and stride=1: [tensor([[ 40, 367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
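As a quick sanity check (sample_ds is just an illustrative name), we can build the dataset directly with max_length=4 and stride=1 and look at the first input:target pair:

sample_ds = GPTDatasetV1(texts, tokenizer, max_length=4, stride=1)
print(len(sample_ds))   # number of input:target pairs
print(sample_ds[0])     # first pair: input tensor and target tensor shifted by one token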
print(torch.cuda.is_available())
True
We also need a DataLoader for feeding the dataset to the LLM. We can feed the whole dataset (all input:target pairs) at once or, alternatively, load it in batches.
Let's create a function that wraps the dataset in a DataLoader.
After creating the input:target pairs, the last batch may contain fewer pairs than batch_size. drop_last=True excludes that shorter batch, which avoids shape mismatches during training. num_workers controls how many subprocesses are used for data loading; 0 keeps loading in the main process and is a reasonable default here.
def create_dataloader_v1(txt, batch_size=2, max_length=256,
                         stride=256, shuffle=True, drop_last=True,
                         num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")  #1 initialize the BPE tokenizer
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)  #2 build the input:target pairs
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,  #3 drop the last incomplete batch
        num_workers=num_workers  #4 number of subprocesses for data loading
    )
    return dataloader
Let's see the first and second input:target pairs of our data.
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

max_length = 6
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=max_length, stride=max_length,
    shuffle=False)

data_iter = iter(dataloader)  #1 create a Python iterator over the batches
first_batch = next(data_iter)
print('input:', first_batch[0], "\n", 'target:', first_batch[-1])
input: tensor([[ 40, 367, 2885, 1464, 1807, 3619]])
 target: tensor([[ 367, 2885, 1464, 1807, 3619, 402]])
second_batch = next(data_iter)
print('input:',second_batch[0],"\n",'target:',second_batch[-1])
input: tensor([[ 402, 271, 10899, 2138, 257, 7026]])
 target: tensor([[ 271, 10899, 2138, 257, 7026, 15632]])
The inputs and targets are 2-dimensional tensors with batch_size rows (here 1) and max_length columns.
second_batch[0].shape
torch.Size([1, 6])
3. Token Embedding
Although the tokenizer converts raw text into numeric values, these IDs do not carry any semantic relationship among words. For example, dog is more closely related to cat than to mango, while mango is more closely related to apple than to cat. To capture this kind of semantic relationship among tokens, we use embeddings, which are a better approach than one-hot encoding. An embedding converts each token ID into a higher-dimensional vector and is a great way to capture semantic relations among different words or tokens.
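A small sketch with toy sizes (a vocabulary of 5 and 3-dimensional vectors, not the gpt2 vocabulary) showing that an embedding lookup is equivalent to multiplying a one-hot vector by the embedding weight matrix:

toy_emb = torch.nn.Embedding(5, 3)                                    # 5 tokens, 3-dim vectors
token_id = torch.tensor([2])
one_hot = torch.nn.functional.one_hot(token_id, num_classes=5).float()
print(torch.allclose(toy_emb(token_id), one_hot @ toy_emb.weight))    # True: same vector either way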
The gpt2 tokenizer has a vocabulary of 50,257 unique tokens. We will convert each token ID into a 256-dimensional vector. The original GPT-2 model used 768 dimensions, and larger models such as GPT-3 use 12,288.
These embedding layers are part of the LLM and are updated (trained) during model training.
# embedding is like a Look-Up table
vocab_size = 50257
output_dim = 256
torch.manual_seed(123)
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(token_embedding_layer.weight)
Parameter containing:
tensor([[ 0.3374, -0.1778, -0.3035,  ...,  1.3337,  0.0771, -0.0522],
        [ 0.2386,  0.1411, -1.3354,  ..., -0.0315, -1.0640,  0.9417],
        [-1.3152, -0.0677, -0.1350,  ..., -0.3181, -1.3936,  0.5226],
        ...,
        [ 0.5871, -0.0572, -1.1628,  ..., -0.6887, -0.7364,  0.4479],
        [ 0.4438,  0.7411,  1.1263,  ...,  1.2091,  0.6781,  0.3331],
        [-0.2537,  0.1446,  0.7203,  ..., -0.2134,  0.2144,  0.3006]],
       requires_grad=True)
torch.manual_seed(123)
# Get the embedding vector for token id 1
first_em = token_embedding_layer(torch.tensor([1]))
# Turn off scientific notation
torch.set_printoptions(sci_mode=False)
# Print the tensor
print(first_em)
tensor([[ 0.2386, 0.1411, -1.3354, -2.9340, 0.1141, -1.2072, -0.3008, 0.1427, -1.3027, -0.4919, -2.1429, 0.9488, -0.5684, -0.0646, 0.6647, -2.7836, 1.1366, 0.9089, 0.9494, 0.0266, -0.9221, 0.7034, -0.3659, -0.1965, -0.9207, 0.3154, -0.0217, 0.3441, 0.2271, -0.4597, -0.6183, 0.2461, -0.4055, -0.8368, 1.2277, -0.4297, -2.2121, -0.3780, 0.9838, -1.0895, 0.2017, 0.0221, -1.7753, -0.7490, 0.2781, -0.9621, -0.4223, -1.1036, 0.2473, 1.4549, -0.2835, -0.3767, -0.0306, -0.0894, -0.1965, -0.9713, 0.9005, -0.2523, 1.0669, -0.2985, 0.8558, 1.6098, -1.1893, 1.1677, 0.3277, -0.8331, -1.6179, 0.2265, -0.4382, 0.3265, -1.5786, -1.3995, 0.5446, -0.0830, -1.1753, 1.7825, 1.7524, -0.2135, 0.4095, 0.0465, 0.6367, -0.1943, -0.8614, 0.5338, 0.9376, -0.9225, 0.7047, -0.2722, 0.0144, -0.6411, 2.3902, -1.4256, -0.4619, -1.5539, -0.3338, 0.2405, 2.1065, 0.5509, -0.2936, -1.8027, -0.6933, 1.7409, 0.2698, 0.9595, -1.0253, -0.5505, 1.0264, -0.5670, -0.2658, -1.1116, -1.3696, -0.6534, -1.6125, -0.2284, 1.8388, -0.9473, 0.1419, 0.3696, -0.0174, -0.9575, -0.8169, -0.2866, 0.4343, -0.1340, -2.1467, -1.7984, -0.6822, -0.5191, 0.0093, -1.8110, -0.2443, 0.1327, 1.0875, -0.1029, 0.8604, 0.2078, 0.2027, 0.5021, -0.4063, 0.6664, 0.4765, -1.4498, 1.5446, 1.0394, 2.1681, 0.4884, 0.3359, -1.2282, -0.1200, 0.4884, 1.9431, 0.2169, -0.4743, -0.3679, -0.2918, -1.6531, 0.7692, -1.1323, 2.9590, 0.8171, 0.7668, 1.3258, 0.2103, 1.7876, -1.2128, 0.2045, 1.1051, -0.5454, 0.1073, 0.8727, -1.2800, -0.4619, 1.4342, -1.2103, 1.3834, 0.0324, 0.5421, 0.8796, 0.2713, 1.6067, -1.0004, 0.7392, -0.4931, 0.4073, -1.0394, -0.3226, 0.7226, 0.2674, -0.4673, 0.6916, -1.8752, 0.3008, -0.1468, 1.3672, 0.7074, 0.3276, 1.0658, 1.4130, -1.2445, 0.2227, 0.4593, -0.3845, 0.6554, -0.1045, -1.1134, 0.5110, 0.3566, 1.8591, -0.9300, 1.1186, 1.7495, 2.3058, 0.3734, 0.3314, -0.1871, 0.1770, 2.9641, 0.2307, 0.3228, 0.2610, 0.3219, 1.7745, 0.3155, -0.9364, 0.5687, -0.0959, 0.0046, -1.4321, -0.1535, -0.1925, -0.3115, -0.1812, -0.8745, -0.0270, 0.5424, 1.3656, -0.0284, -0.7411, -0.0169, 1.7024, 0.4206, 0.9317, 0.9884, -0.3948, 0.6919, 1.2310, -0.5126, -1.2635, 1.1440, 0.7619, 0.6543, -1.5402, -0.5176, -0.0315, -1.0640, 0.9417]], grad_fn=<EmbeddingBackward0>)
From the above, we can see that an embedding is just a higher-dimensional representation of a token ID, and that the embedding layer works as a lookup table.
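As a quick lookup-table check, the embedding returned for token id 1 should simply match row 1 of the weight matrix:

print(torch.allclose(first_em, token_embedding_layer.weight[1]))   # True: the lookup just returns a row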
4. Position Embedding
In order to capture the position of words in a sentence, absolute (or relative) position embeddings are used.
For example: I saw a saw to saw.
A word's position in a sentence carries meaning, and this information is captured by the position embedding.
max_length = 1024
context_length = max_length
torch.manual_seed(123)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
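The positional embedding layer provides one vector per position in the context window; looking up all positions at once gives a (context_length, output_dim) tensor (pos_check is just an illustrative name):

pos_check = pos_embedding_layer(torch.arange(context_length))
print(pos_check.shape)   # torch.Size([1024, 256])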
5. Input Embedding
The final step of data preprocessing is creating the input embeddings by summing the token embeddings and the position embeddings.
The token embeddings have shape (batch_size, max_length, output_dim) and the position embeddings have shape (max_length, output_dim), so the two can be added through broadcasting.
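A tiny sketch of the broadcasting rule being relied on here (toy shapes, not the real ones):

a = torch.zeros(2, 4, 3)    # (batch_size, max_length, output_dim)
b = torch.ones(4, 3)        # (max_length, output_dim)
print((a + b).shape)        # torch.Size([2, 4, 3]): b is broadcast across the batch dimension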
Let's load the entire dataset in batches of two and apply both the token embedding and the position embedding. Summing the two gives the input embeddings.
batch_size = 2
dataloader = create_dataloader_v1(
    raw_text,
    batch_size=batch_size,
    max_length=max_length,
    stride=max_length
)

for batch in dataloader:
    x, y = batch
    token_embeddings = token_embedding_layer(x)
    pos_embeddings = pos_embedding_layer(torch.arange(max_length))
    input_embeddings = token_embeddings + pos_embeddings
    break
input_embeddings.shape
torch.Size([2, 1024, 256])