We start by importing the required modules from the various libraries:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import transformers
```
Before these imports will work, we may also need to install a few dependencies:
```python
!pip install auto-gptq
!pip install optimum
!pip install bitsandbytes
```
To load the base model and tokenizer, we can use the following code:
```python
model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=False, revision="main")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
```
To use the base model for text generation, we can follow these steps:
```python
model.eval()
comment = "Great content, thank you!"
prompt = f'''[INST] {comment} [/INST]'''
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=140)
print(tokenizer.batch_decode(outputs)[0])
```
Prompt engineering can improve model responses, as shown in the code snippet below:
```python
instructions_string = f"""Instructions here…Please respond to the following comment."""
prompt_template = lambda comment: f'''[INST] {instructions_string}\n{comment}\n[/INST]'''
prompt = prompt_template(comment)
```
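For example, we can regenerate a response with the engineered prompt, reusing the same generation call shown above:
```python
# Generate with the engineered prompt (same settings as the earlier example)
prompt = prompt_template("Great content, thank you!")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=140)
print(tokenizer.batch_decode(outputs)[0])
```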
To prepare the model for training, we enable gradient checkpointing and quantized training:
```python
model.train()
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
```
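The fine-tuning step itself is not reproduced here. A minimal sketch of the LoRA configuration and Trainer setup might look like the following; the target modules, hyperparameters, dataset name, and column name are illustrative assumptions, not the exact values from the original run:
```python
# Sketch only: LoRA config, dataset prep, and Trainer setup.
# Dataset name, column name, and hyperparameters are illustrative assumptions.
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj"],   # assumed target module(s)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

data = load_dataset("your-username/your-comment-dataset")          # hypothetical dataset
data = data.map(lambda x: tokenizer(x["example"]), batched=True)   # assumes an "example" text column

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence cache warnings during training; re-enable for inference
trainer.train()
```
After training, the resulting adapter can be pushed to the Hugging Face Hub so it can be reloaded later, as in the next snippet.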
Finally, after fine-tuning the model and preparing the dataset, we can load and use the fine-tuned model for inference:
```python
from peft import PeftModel, PeftConfig

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
config = PeftConfig.from_pretrained("shawhin/shawgpt-ft")
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=False, revision="main")
model = PeftModel.from_pretrained(model, "shawhin/shawgpt-ft")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
```
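Inference with the fine-tuned model then follows the same pattern as before, reusing the prompt template defined earlier:
```python
# Run inference with the fine-tuned adapter, reusing the earlier prompt template
model.eval()
prompt = prompt_template("Great content, thank you!")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=140)
print(tokenizer.batch_decode(outputs)[0])
```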
These snippets demonstrate how to work with the Hugging Face libraries, load models, engineer prompts, prepare models for training, and use fine-tuned models for inference.