Data Preparation for Large Language Model
In this post I will discuss the high-level concepts of the data preparation steps for an LLM. The content is based on the book "Build a Large Language Model (From Scratch)" by Sebastian Raschka. The following figure represents the data preparation pipeline of an LLM.
1. Tokenizer
The first step in building an LLM is data preparation. We need to feed textual data to the LLM (for generative models), but the data must be preprocessed before it is fed into the model.
Preprocessing begins with tokenization. Tokenization can be of different types:
- 1) character based: ['I', 'a', 'm', 'a', 'n', ...];
- 2) word based: ['I', 'am', 'an', 'amateur']; or
- 3) subword based: ['I', 'am', 'an', 'am', 'e', 'teur'].
These examples are just for demonstration and do not represent the actual tokenization process.
From our statistics or ML knowledge, we know that computers can only process numeric data. Therefore, we need to convert the tokens into numeric values.
We will use the "gpt2" tokenizer, which uses Byte-Pair Encoding (BPE). We first import the 'tiktoken' library to get the gpt2 tokenizer.
This tokenizer converts the entire text into tokens and their corresponding numeric representations (token ids).
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
The text file 'the-verdict.txt' will be used as the text data source for this example. We need to load the data first.
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    texts = f.read()
print(texts[:50])
I HAD always thought Jack Gisburn rather a cheap g
Let us see how the gpt2 tokenizer converts the text into token ids using the .encode function.
# encoding creates token ids
token_ids50 = tokenizer.encode(texts[:50])
print(token_ids50)
[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 308]
Let us now see the token ids and the corresponding words encoded by the gpt2 tokenizer. The .decode function converts token ids back into the corresponding words or subwords.
# decoding retrieves the corresponding words or subwords from token ids
tokens50 = tokenizer.decode(token_ids50)
print(tokens50)
I HAD always thought Jack Gisburn rather a cheap g
for token_id in token_ids50:
    token = tokenizer.decode([token_id])
    print(token, '----->', token_id)
I -----> 40
 H -----> 367
AD -----> 2885
 always -----> 1464
 thought -----> 1807
 Jack -----> 3619
 G -----> 402
is -----> 271
burn -----> 10899
 rather -----> 2138
 a -----> 257
 cheap -----> 7026
 g -----> 308
From the output above, we can see how tokenization works. It is worth mentioning that the tokenization process also captures the leading space of a token as part of that token (for example, ' always' and ' Jack').
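As a quick sanity check (a minimal sketch of my own, using an arbitrary sample sentence), we can verify that encoding and then decoding round-trips the text losslessly and that the space before a word shows up as part of the decoded token.
# minimal sketch: BPE encode/decode round-trips text, and spaces are kept with the following token
sample = "Hello, do you like tea?"
sample_ids = tokenizer.encode(sample)
assert tokenizer.decode(sample_ids) == sample  # lossless round trip
print([tokenizer.decode([i]) for i in sample_ids])  # tokens such as ' do' and ' you' carry a leading space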
2. A Dataset and DataLoader
An LLM is essentially a next-word predictor. If we feed the LLM the sentence "The capital of Bangladesh is", we hope the next predicted word will be "Dhaka". We therefore need to create a dataset of input and target pairs. This approach is similar to time-series forecasting with a sliding window.
Suppose we have 13 data points: [1,3,9,0,2,4,6,8,7,5,6,4,0]. With a sliding-window approach, using a window size of 4 and a stride of 4, our input:target pairs may look like the following.
input ------------> target
[1,3,9,0]------------>[3,9,0,2]
[2,4,6,8]------------>[4,6,8,7]
[7,5,6,4]------------>[5,6,4,0]
We can observe that each target is its input shifted by one position. Row-wise, consecutive inputs are shifted by 4 data points: [1,3,9,0], then [2,4,6,8], and so on. The stride is how far the window moves between consecutive input sequences, as the small sketch below demonstrates.
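Here is a minimal pure-Python sketch (my own illustration, not code from the book) that reproduces the pairing above with a window size of 4 and a stride of 4.
# sliding-window pairing of the 13 example data points
data = [1, 3, 9, 0, 2, 4, 6, 8, 7, 5, 6, 4, 0]
window_size, stride = 4, 4
for i in range(0, len(data) - window_size, stride):
    inputs = data[i:i + window_size]           # current window
    targets = data[i + 1:i + window_size + 1]  # same window shifted by one position
    print(inputs, '------------>', targets)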
Similarly, the token ids of our dataset (the text of any document) are fed to the LLM as input:target pairs, and the LLM is trained to predict the next token id.
PyTorch provides Dataset and DataLoader utilities. We will use them and create a GPTDatasetV1 class for building the input:target pairs. The token ids are converted into PyTorch tensors so they can be batched and fed to the model.
First, let's create a dataset class for building input:target pairs from our data. The max_length parameter is simply the window size, analogous to the window used in time-series forecasting with an RNN or LSTM.
import torch
from torch.utils.data import Dataset, DataLoader
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(txt)  #1 tokenize the entire text
        for i in range(0, len(token_ids) - max_length, stride):  #2 slide a window of max_length tokens over the text
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):  #3 total number of input:target pairs
        return len(self.input_ids)

    def __getitem__(self, idx):  #4 return one input:target pair
        return self.input_ids[idx], self.target_ids[idx]
# max_length = number of token ids in each sliding window
# stride = number of token ids the window shifts between consecutive samples
# for example, with stride=1: [tensor([[ 40, 367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
print(torch.cuda.is_available())
True
We also need a DataLoader to load the dataset into the LLM. We can feed the whole dataset (all input:target pairs) at once or, alternatively, load the data in batches.
Let's create a helper function that builds the DataLoader.
When the data is split into batches, the last batch may contain fewer samples than batch_size; drop_last=True excludes that incomplete batch. num_workers sets the number of CPU worker processes used for data loading; 0 (loading in the main process) is the recommendation for this example.
def create_dataloader_v1(txt, batch_size=2, max_length=256,
                         stride=256, shuffle=True, drop_last=True,
                         num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")  #1 initialize the gpt2 tokenizer
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)  #2 build the input:target pairs
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,  #3 drop the last incomplete batch
        num_workers=num_workers  #4 number of CPU processes for data loading
    )
    return dataloader
Let's see the first and second input:target pairs of our data.
with open("the-verdict.txt", "r", encoding="utf-8") as f:
raw_text = f.read()
max_length = 6
dataloader = create_dataloader_v1(
raw_text, batch_size = 1, max_length = max_length, stride = max_length,
shuffle = False)
data_iter = iter(dataloader) #1
first_batch = next(data_iter)
print('input:',first_batch[0],"\n",'target:', first_batch[-1])
input: tensor([[ 40, 367, 2885, 1464, 1807, 3619]])
 target: tensor([[ 367, 2885, 1464, 1807, 3619, 402]])
second_batch = next(data_iter)
print('input:',second_batch[0],"\n",'target:',second_batch[-1])
input: tensor([[ 402, 271, 10899, 2138, 257, 7026]])
 target: tensor([[ 271, 10899, 2138, 257, 7026, 15632]])
The inputs and targets are 2-dimensional tensors with batch_size rows (here 1) and max_length columns.
second_batch[0].shape
torch.Size([1, 6])
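To see the effect of the stride, here is a small check of my own (not from the book) with max_length=4 and stride=1: consecutive inputs now overlap and are shifted by only one token, matching the stride=1 example mentioned in the comments above.
dataloader_s1 = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter_s1 = iter(dataloader_s1)
print(next(data_iter_s1))  # [tensor([[ 40, 367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
print(next(data_iter_s1))  # the next inputs start just one token later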
3. Token Embedding
Although the tokenizer converts raw text into numeric values, it does not store any semantic relationship among words. For example, 'dog' and 'bitch' are more closely related to each other than either is to 'mango', while 'mango' is more related to 'apple' than to 'cat'. To capture this kind of semantic relationship among tokens, embeddings are used. Embedding is a better approach than one-hot encoding: it converts each token id into a higher-dimensional vector and is a great way to capture semantic relations among different words or tokens.
The gpt2 tokenizer has a vocabulary of 50,257 unique tokens. We will convert each token id into a 256-dimensional vector. The original GPT-2 model used an embedding dimension of 768; larger GPT models use much higher-dimensional embeddings.
These embedding layers are part of the LLM and are updated (trained) during model training.
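Before building the full-size embedding layer, a tiny illustrative example (a sketch of my own with a made-up vocabulary of 6 tokens and 3 dimensions) shows how an embedding layer acts as a lookup table.
torch.manual_seed(123)
toy_embedding_layer = torch.nn.Embedding(6, 3)        # toy example: 6 tokens, 3 dimensions
print(toy_embedding_layer.weight)                     # the full 6 x 3 lookup table
print(toy_embedding_layer(torch.tensor([2, 3, 5])))   # returns rows 2, 3, and 5 of that table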
# embedding is like a Look-Up table
vocab_size = 50257
output_dim = 256
torch.manual_seed(123)
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(token_embedding_layer.weight)
Parameter containing:
tensor([[ 0.3374, -0.1778, -0.3035,  ...,  1.3337,  0.0771, -0.0522],
        [ 0.2386,  0.1411, -1.3354,  ..., -0.0315, -1.0640,  0.9417],
        [-1.3152, -0.0677, -0.1350,  ..., -0.3181, -1.3936,  0.5226],
        ...,
        [ 0.5871, -0.0572, -1.1628,  ..., -0.6887, -0.7364,  0.4479],
        [ 0.4438,  0.7411,  1.1263,  ...,  1.2091,  0.6781,  0.3331],
        [-0.2537,  0.1446,  0.7203,  ..., -0.2134,  0.2144,  0.3006]],
       requires_grad=True)
torch.manual_seed(123)
# Get the embedding vector for token id 1
first_em = token_embedding_layer(torch.tensor([1]))
# Turn off scientific notation
torch.set_printoptions(sci_mode=False)
# Print the tensor
print(first_em)
tensor([[ 0.2386, 0.1411, -1.3354, -2.9340, 0.1141, -1.2072, -0.3008, 0.1427, -1.3027, -0.4919, -2.1429, 0.9488, -0.5684, -0.0646, 0.6647, -2.7836, 1.1366, 0.9089, 0.9494, 0.0266, -0.9221, 0.7034, -0.3659, -0.1965, -0.9207, 0.3154, -0.0217, 0.3441, 0.2271, -0.4597, -0.6183, 0.2461, -0.4055, -0.8368, 1.2277, -0.4297, -2.2121, -0.3780, 0.9838, -1.0895, 0.2017, 0.0221, -1.7753, -0.7490, 0.2781, -0.9621, -0.4223, -1.1036, 0.2473, 1.4549, -0.2835, -0.3767, -0.0306, -0.0894, -0.1965, -0.9713, 0.9005, -0.2523, 1.0669, -0.2985, 0.8558, 1.6098, -1.1893, 1.1677, 0.3277, -0.8331, -1.6179, 0.2265, -0.4382, 0.3265, -1.5786, -1.3995, 0.5446, -0.0830, -1.1753, 1.7825, 1.7524, -0.2135, 0.4095, 0.0465, 0.6367, -0.1943, -0.8614, 0.5338, 0.9376, -0.9225, 0.7047, -0.2722, 0.0144, -0.6411, 2.3902, -1.4256, -0.4619, -1.5539, -0.3338, 0.2405, 2.1065, 0.5509, -0.2936, -1.8027, -0.6933, 1.7409, 0.2698, 0.9595, -1.0253, -0.5505, 1.0264, -0.5670, -0.2658, -1.1116, -1.3696, -0.6534, -1.6125, -0.2284, 1.8388, -0.9473, 0.1419, 0.3696, -0.0174, -0.9575, -0.8169, -0.2866, 0.4343, -0.1340, -2.1467, -1.7984, -0.6822, -0.5191, 0.0093, -1.8110, -0.2443, 0.1327, 1.0875, -0.1029, 0.8604, 0.2078, 0.2027, 0.5021, -0.4063, 0.6664, 0.4765, -1.4498, 1.5446, 1.0394, 2.1681, 0.4884, 0.3359, -1.2282, -0.1200, 0.4884, 1.9431, 0.2169, -0.4743, -0.3679, -0.2918, -1.6531, 0.7692, -1.1323, 2.9590, 0.8171, 0.7668, 1.3258, 0.2103, 1.7876, -1.2128, 0.2045, 1.1051, -0.5454, 0.1073, 0.8727, -1.2800, -0.4619, 1.4342, -1.2103, 1.3834, 0.0324, 0.5421, 0.8796, 0.2713, 1.6067, -1.0004, 0.7392, -0.4931, 0.4073, -1.0394, -0.3226, 0.7226, 0.2674, -0.4673, 0.6916, -1.8752, 0.3008, -0.1468, 1.3672, 0.7074, 0.3276, 1.0658, 1.4130, -1.2445, 0.2227, 0.4593, -0.3845, 0.6554, -0.1045, -1.1134, 0.5110, 0.3566, 1.8591, -0.9300, 1.1186, 1.7495, 2.3058, 0.3734, 0.3314, -0.1871, 0.1770, 2.9641, 0.2307, 0.3228, 0.2610, 0.3219, 1.7745, 0.3155, -0.9364, 0.5687, -0.0959, 0.0046, -1.4321, -0.1535, -0.1925, -0.3115, -0.1812, -0.8745, -0.0270, 0.5424, 1.3656, -0.0284, -0.7411, -0.0169, 1.7024, 0.4206, 0.9317, 0.9884, -0.3948, 0.6919, 1.2310, -0.5126, -1.2635, 1.1440, 0.7619, 0.6543, -1.5402, -0.5176, -0.0315, -1.0640, 0.9417]], grad_fn=<EmbeddingBackward0>)
From the above, we can see that an embedding is just a higher-dimensional representation of the tokens and works as a lookup table.
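We can confirm the lookup-table behaviour directly (a quick sanity check of my own): the embedding returned for token id 1 is exactly row 1 of the embedding layer's weight matrix.
# the forward pass of an embedding layer is just row selection from the weight matrix
print(torch.equal(first_em, token_embedding_layer.weight[1].unsqueeze(0)))  # True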
4. Position Embedding
In order to capture the positions of words in a sentence, absolute (or relative) position embeddings are used.
For example: I saw a saw to saw.
A word's position in a sentence carries meaning (each occurrence of 'saw' above plays a different role), and this information is stored by the position embedding.
max_length = 1024
context_length = max_length
torch.manual_seed(123)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
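A quick look at the position embedding layer (a small check of my own): it holds one output_dim-dimensional vector for each of the context_length positions.
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)  # torch.Size([1024, 256])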
5. Input Embedding
The final data preprocessing step is to create the input embedding by summing the token embedding and the position embedding.
The token embeddings have dimensions (batch_size, max_length, output_dim) and the position embeddings have dimensions (max_length, output_dim). PyTorch broadcasts the position embeddings across the batch dimension, so the two can be added directly, as the small shape check below illustrates.
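Here is a toy shape check (a sketch with made-up tensors, unrelated to the real data) showing that PyTorch broadcasts the position embeddings across the batch dimension.
a = torch.zeros(2, 4, 3)  # shaped like token embeddings: (batch_size, max_length, output_dim)
b = torch.ones(4, 3)      # shaped like position embeddings: (max_length, output_dim)
print((a + b).shape)      # torch.Size([2, 4, 3])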
Let's load the data with a batch size of two and apply both the token embedding and the position embedding to the first batch. By summing the two tensors, we get the input embeddings.
batch_size = 2
dataloader = create_dataloader_v1(
    raw_text,
    batch_size=batch_size,
    max_length=max_length,
    stride=max_length
)
for batch in dataloader:
    x, y = batch
    token_embeddings = token_embedding_layer(x)
    pos_embeddings = pos_embedding_layer(torch.arange(max_length))
    input_embeddings = token_embeddings + pos_embeddings
    break
input_embeddings.shape
torch.Size([2, 1024, 256])