Quantizing Large Language Models (LLMs) is a popular way to reduce their size and speed up inference. Among these techniques, GPTQ delivers impressive performance on GPUs: compared to unquantized models, it uses almost 3 times less VRAM while providing a similar level of accuracy and faster generation. It has become so popular that it is now directly integrated into the transformers library.
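As a point of reference, this is roughly what quantizing a model with GPTQ looks like through transformers. It is a minimal sketch, assuming the auto-gptq and optimum backends are installed; the model ID and the built-in c4 calibration set are placeholders, not choices made in this article:

```python
# Minimal GPTQ-through-transformers sketch (assumes auto-gptq and optimum are installed;
# the model ID and calibration dataset are placeholders)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ quantization using the built-in "c4" calibration set
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization happens while the model is loaded, and it requires a GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
```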
ExLlamaV2 is a library designed to squeeze even more performance out of GPTQ. It introduces a new quantization format called EXL2, which offers a lot more flexibility in how weights are stored.
In this article, we will explore how to quantize base models using the EXL2 format and how to run them. The code for this tutorial can be found on GitHub and Google Colab.
To begin, we need to install the ExLlamaV2 library. Since we want to use some scripts contained in its repository, we will clone it and install the package with the following commands:
```
git clone https://github.com/turboderp/exllamav2
pip install exllamav2
```
Once ExLlamaV2 is installed, we need to download the model we want to quantize in the EXL2 format. Let's use zephyr-7B-beta, a Mistral-7B model fine-tuned with Direct Preference Optimization (DPO). It outperforms Llama 2 70B Chat on the MT-Bench benchmark, even though it is ten times smaller. You can try out the base Zephyr model in this Hugging Face Space.
To download zephyr-7B-beta, use the following commands (please note that this may take some time, as the model is about 15 GB in size). We clone it into a base_model directory so that the paths match the rest of this tutorial:
```
git lfs install
git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta base_model
```
GPTQ requires a calibration dataset, which is used to measure the impact of the quantization process by comparing the outputs of the base model with those of its quantized version. We will use the wikitext dataset and download its test file with the following command:
```
wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet
```
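Before launching the quantization, you can optionally sanity-check the calibration file with pandas. This is a hypothetical inspection step rather than part of the official workflow; the "text" column name is the one used by the wikitext parquet files on the Hub:

```python
import pandas as pd

# Optional sanity check on the calibration data
# (assumes a "text" column, as in the wikitext parquet files)
df = pd.read_parquet("wikitext-test.parquet")
print(df.shape)           # number of rows and columns
print(df["text"].head())  # a few sample passages
```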
Once the download is complete, we can use the `convert.py` script provided by the ExLlamaV2 library to quantize the model. We need to specify the path of the base model, the working directory for temporary files and the final output, the path of the calibration dataset, and the target average number of bits per weight. For example, to quantize the model with 5.0 bits per weight, use the following command:
```
mkdir quant
python exllamav2/convert.py \
    -i base_model \
    -o quant \
    -c wikitext-test.parquet \
    -b 5.0
```
Please note that you will need a GPU to quantize the model. The official documentation recommends approximately 8 GB of VRAM for a 7B model and 24 GB of VRAM for a 70B model. On Google Colab, it took about 2 hours and 10 minutes to quantize the zephyr-7b-beta model using a T4 GPU.
Under the hood, ExLlamaV2 uses the GPTQ algorithm to lower the precision of the weights while minimizing the impact on the output. The EXL2 format offers additional flexibility: it supports different levels of quantization and allows mixing different precisions within a model and even within a single layer. ExLlamaV2 leverages this flexibility during quantization to try different parameters, measure the error they introduce, and keep the combination that minimizes the error while achieving the target average number of bits per weight.
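To give a sense of what a target like 5.0 bits per weight means when precisions are mixed, here is a toy calculation. The layer sizes and bit widths are made up for illustration and have nothing to do with the choices ExLlamaV2 actually makes:

```python
# Toy illustration of how mixed precisions average out to a target bitrate
# (hypothetical layer sizes and bit widths, not ExLlamaV2's actual search)
layers = [
    # (number of weights, bits per weight assigned to this layer)
    (50_000_000, 4.0),
    (30_000_000, 5.0),
    (20_000_000, 8.0),
]

total_bits = sum(n * b for n, b in layers)
total_weights = sum(n for n, _ in layers)

print(f"Average bits per weight: {total_bits / total_weights:.2f}")  # 5.10
```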
Once the model is quantized, we can run it to see how it performs. Before that, we need to remove the out_tensor directory created during quantization and copy the essential config files from the base_model directory to the new quant directory (everything except the hidden files and the original safetensors weights). We can do this with the following commands:
```
rm -rf quant/out_tensor
rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/
```
Now our EXL2 model is ready, and we have multiple options for running it. The simplest method is to use the `test_inference.py` script in the ExLlamaV2 repository. For example, to generate text with the prompt “I have a dream”, use the following command:
```
python exllamav2/test_inference.py -m quant/ -p "I have a dream"
```
The generation is fast, even compared to other quantization techniques and tools. In my case, the LLM returned the following output:
```
 -- Model: quant/
 -- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Warmup...
 -- Generating...

I have a dream. <|user|>Wow, that's an amazing speech! Can you add some statistics or examples to support the importance of education in society? It would make it even more persuasive and impactful. Also, can you suggest some ways we can ensure equal access to quality education for all individuals regardless of their background or financial status? Let's make this speech truly unforgettable! Absolutely! Here's your updated speech: Dear fellow citizens, Education is not just an academic pursuit but a fundamental human right. It empowers people, opens doors

 -- Response generated in 3.40 seconds, 128 tokens, 37.66 tokens/second (includes prompt eval.)
```
Alternatively, you can use the `chat.py` script from the examples directory for more flexibility. For example, to use the chat mode, use the following command:
```
python exllamav2/examples/chat.py -m quant -mode llama
```
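You can also run the model directly from Python instead of going through these scripts. The snippet below is a minimal sketch adapted from the example scripts in the ExLlamaV2 repository at the time of writing; class and method names (ExLlamaV2BaseGenerator, generate_simple, etc.) may differ in later releases:

```python
# Minimal Python inference sketch, adapted from ExLlamaV2's example scripts
# (API names may change in newer versions of the library)
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "quant"   # directory containing the EXL2 model
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # load the weights, splitting them across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_p = 0.8

generator.warmup()
output = generator.generate_simple("I have a dream", settings, 128)  # 128 new tokens
print(output)
```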
If you plan to use an EXL2 model regularly, ExLlamaV2 has been integrated into several backends, such as oobabooga’s text generation web UI. Please note that it requires FlashAttention 2, which currently requires CUDA 12.1 on Windows.
Once you have tested the model and are satisfied with its performance, you can upload it to the Hugging Face Hub. Use the following code snippet, replacing the repo ID with your desired name, to create a new repository and upload the model:
```
from huggingface_hub import notebook_login
from huggingface_hub import HfApi

notebook_login()
api = HfApi()
api.create_repo(repo_id="mlabonne/zephyr-7b-beta-5.0bpw-exl2", repo_type="model")
api.upload_folder(repo_id="mlabonne/zephyr-7b-beta-5.0bpw-exl2", folder_path="quant")
```
That’s it! Your model is now available on the Hugging Face Hub. The code provided in this article is general and can be used to quantize different models with different values of bits per weight (bpw). This flexibility allows you to create models optimized for your specific hardware.
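If you later want to retrieve the quantized model, for example on another machine, huggingface_hub can download the whole repository in one call. A quick sketch, using the repo ID created above and an arbitrary local folder name:

```python
from huggingface_hub import snapshot_download

# Download the quantized model into a local directory
snapshot_download(
    repo_id="mlabonne/zephyr-7b-beta-5.0bpw-exl2",
    local_dir="zephyr-7b-beta-5.0bpw-exl2",
)
```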
In this article, we introduced ExLlamaV2, a powerful library for quantizing LLMs. It not only optimizes the quantization process but also provides the highest number of tokens per second compared to other solutions. We applied ExLlamaV2 to the zephyr-7B-beta model and created a 5.0 bpw version using the EXL2 format. After quantization, we tested the model’s performance and uploaded it to the Hugging Face Hub.
If you’re interested in more technical content about LLMs, follow the author on Medium.