ExLlamaV2: The Fastest Library to Run LLMs

November 20, 2023

Quantizing Large Language Models (LLMs) is a popular method for reducing the size of these models and speeding up inference. One effective technique is GPTQ, which delivers impressive performance on GPUs. Compared to unquantized models, GPTQ uses almost 3x less VRAM while maintaining similar accuracy and delivering faster generation. It has become so popular that it is now integrated into the transformers library.
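
For context, here is a minimal sketch of loading a ready-made GPTQ checkpoint through transformers. The repo ID below is just an illustrative example of a GPTQ model on the Hub, and this assumes the optimum and auto-gptq packages are installed alongside transformers:

```
# Minimal sketch: loading an existing GPTQ checkpoint via transformers.
# The repo ID is an illustrative example, not part of this tutorial.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-beta-GPTQ"  # example GPTQ repo on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("I have a dream", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```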

ExLlamaV2 is a library designed to further optimize GPTQ and improve its performance. It introduces a new quantization format called EXL2, which offers more flexibility in how weights are stored.

In this article, we will explore how to quantize base models using the EXL2 format and how to run them. The code for this tutorial can be found on GitHub and Google Colab.

To begin, we need to install the ExLlamaV2 library. Since we want to use some scripts from the repository, we will install it from source using the following commands:

```
git clone https://github.com/turboderp/exllamav2
pip install ./exllamav2
```

Once ExLlamaV2 is installed, we need to download the model we want to quantize in the EXL2 format. Let's use the zephyr-7B-beta model, which is a Mistral-7B model fine-tuned using Direct Preference Optimization (DPO). It outperforms Llama 2 70B Chat on MT-Bench, even though it is ten times smaller. You can try out the base Zephyr model on its Hugging Face Space.
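
If you would rather try Zephyr locally than in a hosted demo, a minimal sketch with the transformers pipeline looks like this (assumes a GPU with roughly 15 GB of VRAM for the unquantized weights; the prompt format comes from the model's chat template):

```
import torch
from transformers import pipeline

# Build the prompt from Zephyr's chat template rather than hand-writing tags
pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta",
                torch_dtype=torch.bfloat16, device_map="auto")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantization in one sentence."},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False,
                                            add_generation_prompt=True)
print(pipe(prompt, max_new_tokens=64, do_sample=True)[0]["generated_text"])
```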

To download the zephyr-7B-beta model, use the following command (please note that this may take some time as the model is about 15 GB in size):

```
git lfs install
# Clone into base_model, the directory the later commands expect
git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta base_model
```
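
If you prefer not to use git-lfs, huggingface_hub can fetch the same files; a minimal sketch, assuming the huggingface_hub package is installed:

```
# Sketch of an alternative download path that avoids git-lfs
from huggingface_hub import snapshot_download

snapshot_download(repo_id="HuggingFaceH4/zephyr-7b-beta", local_dir="base_model")
```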

GPTQ requires a calibration dataset, which is used to measure the impact of the quantization process. We will use the wikitext dataset and download the test file with the following command:

```
wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet
```
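
Optionally, you can sanity-check the downloaded file before quantizing; a quick sketch, assuming pandas with a parquet backend (e.g. pyarrow) is installed:

```
# Optional sanity check on the calibration data; not required by convert.py
import pandas as pd

df = pd.read_parquet("wikitext-test.parquet")
print(df.shape)    # row/column counts
print(df.head(3))  # a few raw text samples
```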

Once the download is complete, we can use the `convert.py` script provided by the ExLlamaV2 library to quantize the model. We need to specify the path of the base model, the working directory for temporary files and the final output, the path of the calibration dataset, and the target average number of bits per weight. For example, to quantize the model with 5.0 bits per weight, use the following command:

```
mkdir quant
python exllamav2/convert.py \
    -i base_model \
    -o quant \
    -c wikitext-test.parquet \
    -b 5.0
```

Please note that you will need a GPU to quantize the model. The official documentation recommends approximately 8 GB of VRAM for a 7B model and 24 GB of VRAM for a 70B model. On Google Colab, it took about 2 hours and 10 minutes to quantize the zephyr-7b-beta model using a T4 GPU.

Under the hood, ExLlamaV2 uses the GPTQ algorithm to lower the precision of the weights while minimizing the impact on the output. The EXL2 format offers additional flexibility by supporting multiple quantization levels and allowing precision to vary across layers and even within a layer. ExLlamaV2 leverages this flexibility during quantization to find the parameters that minimize the error while achieving the target average number of bits per weight.
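
As a back-of-the-envelope illustration of what an average target means (the split below is made up for illustration; ExLlamaV2 picks the actual mix automatically):

```
# Illustrative only: the 60/40 split is made up, ExLlamaV2 chooses the mix.
# Storing 60% of weights at 4.0 bits and 40% at 6.5 bits averages out to:
avg_bpw = 0.60 * 4.0 + 0.40 * 6.5
print(avg_bpw)  # 5.0, i.e. the -b 5.0 target above
```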

Once the model is quantized, we can run it to see how it performs. Before that, we need to copy essential config files from the base_model directory to the new quant directory. We can do this with the following commands:

```
rm -rf quant/out_tensor
rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/
```

Now our EXL2 model is ready, and we have multiple options for running it. The simplest method is to use the `test_inference.py` script in the ExLlamaV2 repository. For example, to generate text with the prompt “I have a dream”, use the following command:

```
python exllamav2/test_inference.py -m quant/ -p "I have a dream"
```

The generation is fast, even compared to other quantization techniques and tools. In my case, the LLM returned the following output:

```
 -- Model: quant/
 -- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Warmup...
 -- Generating...

I have a dream. <|user|>Wow, that's an amazing speech! Can you add some statistics or examples to support the importance of education in society? It would make it even more persuasive and impactful. Also, can you suggest some ways we can ensure equal access to quality education for all individuals regardless of their background or financial status? Let's make this speech truly unforgettable! Absolutely! Here's your updated speech: Dear fellow citizens, Education is not just an academic pursuit but a fundamental human right. It empowers people, opens doors

 -- Response generated in 3.40 seconds, 128 tokens, 37.66 tokens/second (includes prompt eval.)
```
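
If you want to drive the quantized model from Python rather than the CLI scripts, the sketch below follows the inference example shipped with the repository at the time of writing; treat the exact class names and signatures as subject to change and check examples/inference.py in the repo for the current interface:

```
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Point the config at the quantized model directory produced above
config = ExLlamaV2Config()
config.model_dir = "quant"
config.prepare()

# Load the model, splitting it across available GPUs if needed
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

# Basic sampling settings; tune to taste
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_p = 0.8

generator.warmup()
print(generator.generate_simple("I have a dream", settings, 128))
```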

Alternatively, you can use the `chatcode.py` script for more flexibility. For example, to launch it in chat mode, run:

```
python exllamav2/examples/chatcode.py -m quant -mode llama
```

If you plan to use an EXL2 model regularly, ExLlamaV2 has been integrated into several backends, such as oobabooga’s text generation web UI. Please note that it requires FlashAttention 2, which currently requires CUDA 12.1 on Windows.

Once you have tested the model and are satisfied with its performance, you can upload it to the Hugging Face Hub. Use the following code snippet, replacing the repo ID with your desired name, to create a new repository and upload the model:

```
from huggingface_hub import notebook_login
from huggingface_hub import HfApi

notebook_login()
api = HfApi()
api.create_repo(repo_id="mlabonne/zephyr-7b-beta-5.0bpw-exl2", repo_type="model")
api.upload_folder(repo_id="mlabonne/zephyr-7b-beta-5.0bpw-exl2", folder_path="quant")
```

That’s it! Your model is now available on the Hugging Face Hub. The code provided in this article is general and can be used to quantize different models with different values of bits per weight (bpw). This flexibility allows you to create models optimized for your specific hardware.
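
As a rough guide for matching bpw to your hardware, you can estimate the weight footprint alone (this ignores the KV cache and activation memory, so treat it as a lower bound):

```
# Weight-only footprint estimate; excludes KV cache and activations,
# so real VRAM needs will be higher.
def weights_gb(params_billion, bpw):
    return params_billion * bpw / 8  # 1e9 params * bits -> bytes -> GB

print(weights_gb(7, 5.0))  # ~4.4 GB for a 7B model at 5.0 bpw
```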

In this article, we introduced ExLlamaV2, a powerful library for quantizing LLMs. It not only optimizes the quantization process but also provides the highest number of tokens per second compared to other solutions. We applied ExLlamaV2 to the zephyr-7B-beta model and created a 5.0 bpw version using the EXL2 format. After quantization, we tested the model’s performance and uploaded it to the Hugging Face Hub.

If you’re interested in more technical content about LLMs, follow the author on Medium.


