Fine-Tune Large Language Model
Chap-12: Fine-Tuning Generation Models
import torch
print(torch.cuda.is_available()) # Should print True
print(torch.version.cuda) # Check CUDA version
print(torch.__version__) # PyTorch version
True
12.1
2.3.1+cu121
From the above output, we can see that CUDA is available, the CUDA version is 12.1, and the PyTorch version is 2.3.1.
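Since the rest of this walkthrough relies on several Hugging Face libraries, it can also help to print their versions as a quick optional sanity check (the exact versions are not strict requirements):
import transformers
import datasets
import peft
import trl
import bitsandbytes

# Print the versions of the main libraries used for fine-tuning
for lib in (transformers, datasets, peft, trl, bitsandbytes):
    print(lib.__name__, lib.__version__)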
We now conduct our instruction fine-tuning using QLoRA, where the Q stands for quantization of the model's weight parameters. Quantization is a compression technique that reduces the size of the model by storing its weights with fewer bytes. A high-level intuition is that it is like rounding the value of pi from 3.14159265358979323846264338 down to 3.1416. This loss of precision can affect the model, but in most cases the effect is negligible, while the memory savings are substantial.
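As a minimal illustration of this idea (not the NF4 scheme used later), casting a value to a lower-precision dtype in PyTorch shows the small rounding error that quantization accepts in exchange for memory savings:
import math
import torch

pi32 = torch.tensor(math.pi, dtype=torch.float32)  # 4 bytes per value
pi16 = pi32.to(torch.float16)                      # 2 bytes per value

print(pi32.item())  # ~3.1415927
print(pi16.item())  # ~3.140625, slightly less precise but half the memory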
Sebastian Raschka mentioned in his blog post that QLoRA presents a trade-off that might be worthwhile if you are constrained by GPU memory: it offers roughly 33% memory savings at the cost of a 39% increase in runtime. https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
LoRA stands for Low-Rank Adaptation. It builds on the idea of adapters introduced in the paper "Parameter-efficient Transfer Learning for NLP", a technique that fine-tunes only a small fraction of a model’s parameters while maintaining strong performance. The authors demonstrated that updating just 3.6% of BERT’s parameters for a given task can achieve results comparable to full fine-tuning; on the GLUE benchmark, they reported performance within 0.4% of traditional fine-tuning.
Fine-tuning an instruction-tuned model is generally preferable to fine-tuning a base (pretrained) model, because the instruction model has already been trained to follow instruction prompts from the user.
We will use a TinyLlama model that was fine-tuned on a specific prompt template, so the dataset used for fine-tuning needs to be formatted in the same way. The format_prompt function converts each dataset example into this model-specific template.
The chat template is taken from the tokenizer of "TinyLlama/TinyLlama-1.1B-Chat-v1.0", while the model fine-tuned later is the TinyLlama 1.1B base checkpoint. The "HuggingFaceH4/ultrachat_200k" dataset is downloaded from Hugging Face, and only its test_sft split is used (the dataset provides four splits). A sample of 3,000 rows, shuffled with a fixed seed, is taken for this study.
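If you want to check the available splits before downloading, the datasets library can list them; the split names in the comment are what the dataset provided at the time of writing:
from datasets import get_dataset_split_names

# Expected to include train_sft, test_sft, train_gen, and test_gen
print(get_dataset_split_names("HuggingFaceH4/ultrachat_200k"))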
1. Data preparation
from transformers import AutoTokenizer
from datasets import load_dataset
# Load a tokenizer to use its chat template
template_tokenizer = AutoTokenizer.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
)

def format_prompt(example):
    """Format the prompt using the <|user|> template TinyLlama is using"""
    # Format answers
    chat = example["messages"]
    prompt = template_tokenizer.apply_chat_template(chat, tokenize=False)
    return {"text": prompt}

# Load and format the data using the template TinyLlama is using
dataset = (
    load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft")
    .shuffle(seed=42)
    .select(range(3_000))
)
dataset = dataset.map(format_prompt)
# Example of formatted prompt
print(dataset["text"][2576])
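To see what the template does, you can also apply it to a small hand-written conversation; the toy_chat messages below are purely illustrative and not part of the dataset:
# A toy conversation in the same format as the dataset's "messages" field
toy_chat = [
    {"role": "user", "content": "What is QLoRA?"},
    {"role": "assistant", "content": "QLoRA fine-tunes a quantized model by training small LoRA adapters."},
]

# Produces one string wrapped in TinyLlama's <|user|>/<|assistant|> tags
print(template_tokenizer.apply_chat_template(toy_chat, tokenize=False))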
2. Quantized model loading
To load the model in 4-bit precision, the BitsAndBytesConfig class (backed by the bitsandbytes library) is used. The model is loaded in 4-bit with double (nested) quantization for further compression. The compute dtype is set to float16 (16-bit) so that the fine-tuned adapter can later be merged with the base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
# 4-bit quantization configuration - Q in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                 # Use 4-bit precision model loading
    bnb_4bit_quant_type="nf4",         # Quantization type
    bnb_4bit_compute_dtype="float16",  # Compute dtype
    bnb_4bit_use_double_quant=True,    # Apply nested quantization
)

# Load the model to train on the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # Leave this out for regular SFT
    quantization_config=bnb_config,
)
model.config.use_cache = False
model.config.pretraining_tp = 1
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"
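To confirm that 4-bit loading really shrinks the model, you can print its memory footprint; this is an optional check, and the exact number depends on your environment, but a 1.1B-parameter model in 4-bit should come in well under 1 GB:
# Approximate memory used by the quantized model
print(f"Memory footprint: {model.get_memory_footprint() / 1024**2:.0f} MB")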
3. LoRA using parameter-efficient fine-tuning (PEFT)
To proceed with fine-tuning using LoRA (Low-Rank Adaptation), we need to define the hyperparameters that control the adaptation process. This is done using the PEFT (Parameter-Efficient Fine-Tuning) library, which enables efficient fine-tuning of LLMs by updating only a fraction of their parameters.
LoRA works by introducing trainable low-rank matrices into the transformer layers while keeping the original model parameters frozen. This significantly reduces computational requirements and memory usage while still allowing effective model adaptation.
To set up the LoRA configuration, we need to specify the main hyperparameters (a short sketch after this list shows how r and lora_alpha enter the weight update):
r (Rank): Determines the size of the low-rank update matrices. A lower value reduces memory usage but may limit fine-tuning flexibility. Values typically range between 4 and 64.
lora_alpha (Scaling Factor): Controls how much influence the LoRA weights have on the model’s outputs. A rule of thumb is to choose a value twice the size of r.
lora_dropout (Dropout Rate): Introduces regularization to prevent overfitting during fine-tuning.
bias (Bias Training Strategy): Specifies whether bias parameters should also be updated (none, all, or lora_only).
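The sketch below uses plain torch tensors with illustrative sizes to show how r and lora_alpha enter the LoRA update for a single frozen weight matrix. It is only a conceptual illustration; the actual adapters are created by PEFT in the next code block, and the names W, A, B, and x are hypothetical.
import torch

d, r, lora_alpha = 2048, 64, 32   # hidden size, rank, scaling (illustrative values)
W = torch.randn(d, d)             # frozen pretrained weight, never updated
A = torch.randn(r, d) * 0.01      # trainable low-rank matrix A
B = torch.zeros(d, r)             # trainable low-rank matrix B, initialized to zero

x = torch.randn(d)                # an input activation
h = W @ x + (lora_alpha / r) * (B @ (A @ x))  # LoRA-adapted output

# Only A and B are trained: 2*r*d parameters instead of d*d for a full update of W
print(2 * r * d, "trainable parameters vs", d * d, "for a full update")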
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
# Prepare LoRA Configuration
peft_config = LoraConfig(
    lora_alpha=32,      # LoRA scaling
    lora_dropout=0.1,   # Dropout for LoRA layers
    r=64,               # Rank
    bias="none",
    task_type="CAUSAL_LM",
    # Layers to target
    target_modules=["k_proj", "gate_proj", "v_proj", "up_proj", "q_proj", "o_proj", "down_proj"],
)
# Prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
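After wrapping the model with get_peft_model, a quick check shows how small the trainable portion is; the exact percentage printed depends on the configuration above:
# Report trainable (LoRA) parameters vs. total parameters
model.print_trainable_parameters()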
4. Supervised fine-tuning
When fine-tuning a language model, choosing the right training settings makes a big difference. The TrainingArguments class in Hugging Face’s Transformers library helps define these settings easily.
Key settings explained:
Batch size & gradient accumulation: Since big models use a lot of memory, we set a small batch size (per_device_train_batch_size=2). To compensate, we use gradient accumulation (gradient_accumulation_steps=4), which means the model updates its weights after every four steps instead of every step, for an effective batch size of 8.
Optimizer & learning rate: "paged_adamw_32bit" is a memory-efficient optimizer. The learning rate is set to 2e-4 with a cosine schedule, so it starts high and gradually decreases.
Memory optimization: fp16=True uses mixed precision to speed up training and save memory; gradient_checkpointing=True saves memory by recomputing activations instead of storing them.
from transformers import TrainingArguments
output_dir = "./results"
# Training arguments
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True,
)
Finally, we fine-tune the model using the SFTTrainer (supervised fine-tuning trainer) from the trl library. max_seq_length is set to 512 but can be changed depending on the model being used. The peft_config is passed because we want to fine-tune the quantized model while training only the LoRA adapter parameters.
The fine-tuned adapter is saved as TinyLlama-1.1B-qlora.
from trl import SFTTrainer
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    max_seq_length=512,
    # Leave this out for regular SFT
    peft_config=peft_config,
)
# Train model
trainer.train()
# Save QLoRA weights
trainer.model.save_pretrained("TinyLlama-1.1B-qlora")
5. Merge weights
Finally, the fine-tuned adapter is merged with the base model. The model is reloaded in 16-bit instead of the quantized 4-bit form, and no peft_config is passed when loading.
from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map="auto",
    torch_dtype=torch.float16,  # reload in 16-bit rather than 4-bit, as described above
)
# Merge LoRA and base model
merged_model = model.merge_and_unload()
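If you want to reuse the merged model later without repeating the merge, you can save it together with the tokenizer; the directory name TinyLlama-1.1B-qlora-merged is just an example:
# Persist the merged model and tokenizer (directory name is arbitrary)
merged_model.save_pretrained("TinyLlama-1.1B-qlora-merged")
tokenizer.save_pretrained("TinyLlama-1.1B-qlora-merged")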
The fine-tuned model is then used in a text-generation pipeline to follow instructions.
from transformers import pipeline
# Use our predefined prompt template
prompt = """<|user|>
Create an energyplus IDF material with thickness 0.005 m.</s>
<|assistant|>
"""
# Run our instruction-tuned model
pipe = pipeline(task="text-generation", model=merged_model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])
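By default the pipeline only generates a short continuation; standard generation arguments can be passed to control the output (the values below are reasonable starting points, not tuned settings):
# Longer, sampled generation (argument values are illustrative)
output = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(output[0]["generated_text"])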