There is a new quantization algorithm in town! The Additive Quantization of Language Models (AQLM) [1] quantization procedure was released in early February 2024 and has already been integrated into HuggingFace Transformers (as of version 4.38.0, released 21/02/2024) and HuggingFace PEFT (as of version 0.9.0, released 28/02/2024). This means that checkpoints quantized using AQLM can be loaded with these libraries, and that HuggingFace Transformers can be used to quantize compatible checkpoints using AQLM.
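For example, once the `aqlm` inference kernels are installed alongside a recent Transformers release, an AQLM-quantized checkpoint loads like any other model. The snippet below is a minimal sketch; the repository id refers to one of the 2-bit Llama-2-7B checkpoints published by the AQLM authors and is shown purely for illustration.

```python
# pip install "transformers>=4.38.0" accelerate aqlm[gpu]
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative repository id: a 2-bit (1 codebook x 16 bits) Llama-2-7B checkpoint
# published by the AQLM authors on the HF Hub. Swap in any AQLM-quantized model.
model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization makes it possible to", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```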
In this blog post, we will first review the key results presented in the AQLM paper [1]. Next, we will examine the motivations for quantizing large language models for inference. We will then dive into the details of Multi-Codebook Quantization (MCQ), the technique AQLM uniquely leverages for weight quantization. After breaking down the memory footprint of AQLM models and examining the key quantization parameters, we will explain the AQLM quantization procedure step by step. Finally, we will discuss the concept of Pareto efficiency as it relates to model quantization, providing perspective on how AQLM pushes the boundary of Pareto-optimal quantization.
Existing weight-only quantization algorithms could technically quantize model weights down to the 2-bit range, but they failed to effectively preserve model accuracy. AQLM is a new weight-only post-training quantization (PTQ) algorithm that sets a new state of the art for the 2-bit-per-parameter range. It also improves on existing methods in the 3-bit and 4-bit ranges, although by smaller margins (Table 1). Specifically, AQLM outperforms popular algorithms like GPTQ [2] as well as more recent but lesser-known methods such as QuIP [3] and QuIP# [4]. The AQLM authors also claim that their quantization algorithm pushes the Pareto frontier of the tradeoff between model accuracy and memory footprint below 3 bits per parameter for the first time.
The table below summarizes the performance of AQLM when compressing the Llama-2–70B model to 4-bit, 3-bit, and 2-bit per parameter. Performance is measured by perplexity on the WikiText2 [5] and C4 [6] datasets (lower is better) as well as zero-shot accuracy on the WinoGrande [7] and HellaSwag [8] benchmarks (higher is better). For comparison, the performance of QuIP#, the top competing method, is shown for 4-bit and 2-bit compression. Since the available QuIP# implementation does not support 3-bit compression, SpQR [9] is included as the comparison method for AQLM at 3 bits.
Table 1 — AQLM vs. top competitor on Llama-2–70B compressed at 2, 3 and 4 bits per parameter
While quantization can sometimes reduce inference latency compared to FP16, this is not guaranteed. In benchmarks, AQLM-quantized models showed moderate latency improvements, with speedups ranging from 1.2x to 2x in most cases, and up to 3.05x in the best case. However, latency reduction was not the focus of AQLM’s designers. Their priority was maximizing accuracy within a target model size, rather than optimizing for speed. Consequently, the latency gains from AQLM quantization are noticeable but not as dramatic as the improvements from other existing quantization algorithms.
Nevertheless, AQLM marks an important step towards making large language models more accessible on consumer hardware and mobile devices. For example, when quantizing a 7B model from 16-bit half precision formats like FP16 (16 bits or 2 bytes per parameter) down to just 2 bits per parameter (0.25 bytes per parameter), the memory footprint is reduced by a factor of 8x — decreasing from 14GB down to only 1.75GB.
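That reduction is easy to sanity-check with a few lines of Python; the figures below ignore the small quantization overhead (codebooks and scaling factors), which is covered later in this post.

```python
# Quick sanity check of the weight-memory reduction for a 7B-parameter model.
# This ignores the small quantization overhead (codebooks, scaling factors).
n_params = 7e9

def weight_memory_gb(bits_per_param: float) -> float:
    """Weight memory footprint in gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

print(f"FP16 : {weight_memory_gb(16):.2f} GB")  # ~14.00 GB
print(f"2-bit: {weight_memory_gb(2):.2f} GB")   # ~1.75 GB
```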
PTQ methods fall into two categories: those that quantize just the model weights, and those that quantize both weights and activations. AQLM falls into the first category, only quantizing weights. Model weights are static by definition, so they can be quantized offline before deployment and even distributed on platforms such as the HuggingFace Model Hub. Activations encompass everything else, including the key-value (KV) cache, and are only known at runtime during inference.
The first checkpoints quantized (mostly to 2 bits) using AQLM have started to appear on the HF Hub. However, TheBloke, a popular model quantizer, has not yet included this quantization technique in his set of quantization methods.
When quantizing LLM weights, not all of the weights are actually quantized. Only the parameters that make up the bulk of the parameter count, such as the large projection matrices of the attention and feed-forward layers, are typically quantized; the remaining parameters are usually kept in native precision.
When opting for weight-only quantization, efficient mixed precision kernels for matrix multiplications are usually not available. As a result, quantized weights are dequantized at runtime after being fetched from memory. Depending on the overhead of dequantization, the latency reductions from lower data transfer can be partially preserved or completely offset.
There are four main benefits associated with the reduced weight memory footprint of quantized models for LLM inference:
- Reduced hardware requirements for model serving: A quantized model can be served using less expensive GPUs or even made accessible on consumer devices or mobile platforms.
- Increased space for the KV cache to enable larger batch sizes and/or sequence lengths.
- Lower decoding latency. The decoding process is memory-bandwidth bound, so moving less weight data directly reduces latency, unless the gain is offset by dequantization overhead.
- A higher compute-to-memory-access ratio, known as arithmetic intensity, thanks to the reduced data movement. This allows for fuller utilization of the available compute resources during decoding (see the short sketch after this list).
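To illustrate the last point, the back-of-the-envelope sketch below estimates the arithmetic intensity of a single weight-matrix multiplication during decoding at batch size 1. It only counts weight traffic (activations and the KV cache are ignored) and is not a kernel benchmark.

```python
# Back-of-the-envelope arithmetic intensity of a (d_out, d_in) matrix-vector
# product during decoding (batch size 1). Only weight traffic is counted;
# activations and KV-cache reads are ignored for simplicity.
d_out, d_in = 4096, 4096
flops = 2 * d_out * d_in  # one multiply and one add per weight

for name, bits_per_weight in [("FP16", 16), ("4-bit", 4), ("2-bit", 2)]:
    bytes_read = d_out * d_in * bits_per_weight / 8
    print(f"{name:>5}: {flops / bytes_read:.1f} FLOPs per byte of weights read")
# FP16 -> 1.0, 4-bit -> 4.0, 2-bit -> 8.0: the fewer bits per weight,
# the more useful work is done per byte fetched from memory.
```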
AQLM applies Multi-Codebook Quantization (MCQ) to compress the weights of LLMs. Originally, MCQ was developed to enable efficient nearest neighbor search on vector databases. It works by splitting each vector of the database into subgroups (sub-vectors), which are in turn approximated using learned vectors named codewords. A codebook is a set of such codewords. This allows similarity computations to be performed efficiently using the finite set of codewords instead of the full vector database.
In AQLM, the vectors that are quantized correspond to the rows of the weight matrices. That is, AQLM quantizes the output channels of each weight matrix using MCQ.
Note: AQLM uses the W.X notation convention (W and X being the weight and activation matrices, respectively), whereas some other quantization papers use the reverse X.W convention. This means the output channels that AQLM quantizes correspond to the rows of the weight matrix; in the X.W notation, they would be the columns.
Each row of the weight matrix of shape (d_out, d_in) is divided into sub-vectors called groups of size (1, g). Assuming the codebooks have already been learned, AQLM approximates each group as the sum of M same-size codewords that are stored at native precision. Each codeword belongs to a different codebook, each codebook containing 2^B codewords. To reconstruct a group using the learned codebooks, we actually only need to store the index of each constituent codeword in its codebook. This index can be represented as a 2^B-dimensional one-hot vector called a code. So each group is represented by M one-hot code vectors of size 2^B. Storing such a one-hot vector requires B bits. Therefore, the total memory footprint to store the compressed representation of each group is M x B bits.
The process of building the quantized representation in AQLM is summarized in Figure 1. It should be noted that before splitting each output channel into groups, the output channels are scaled by a learned scaling factor.
Figure 1 — Multi-codebook encoding of a parameter group (d_in=9, d_out=4, g=3, M=3, B=2) — Figure by author
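To make the layout concrete, here is a toy NumPy sketch of encoding a single group once the codebooks are known. It is only an illustration: the greedy residual pass below stands in for AQLM's actual code search (the paper optimizes all codes jointly against a calibration objective), and the sizes are the toy values from Figure 1.

```python
import numpy as np

# Toy sketch of the multi-codebook encoding of ONE group, assuming the codebooks
# are already learned, with the same toy sizes as Figure 1 (g=3, M=3, B=2).
g, M, B = 3, 3, 2
rng = np.random.default_rng(0)

codebooks = rng.normal(size=(M, 2**B, g))  # M codebooks of 2^B codewords each
group = rng.normal(size=g)                 # one (1, g) slice of a weight row

codes = np.zeros(M, dtype=np.int64)        # B bits each -> M x B bits stored per group
residual = group.copy()
for m in range(M):
    # Pick the codeword closest to the current residual (greedy, not AQLM's search).
    codes[m] = np.argmin(np.linalg.norm(codebooks[m] - residual, axis=1))
    residual -= codebooks[m, codes[m]]

approximation = codebooks[np.arange(M), codes].sum(axis=0)
print("stored codes:", codes)              # all that is kept per group, besides the codebooks
print("reconstruction error:", np.linalg.norm(group - approximation))
```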
As mentioned previously, at inference time, the matrix multiplication with activations X uses dequantized, native-precision parameters rather than the quantized code vectors. As shown in Figure 2, the dequantization process works by decompressing the code vectors back into one-hot index vectors to retrieve the corresponding codewords from each codebook. These codewords are summed together, then scaled to reproduce the original, half-precision weight values for computation.
Figure 2 — Decoding of a parameter group from codebook indices (codes) (d_in=9, d_out=4, g=3, M=3, B=2) — Figure by author
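Below is a matching NumPy sketch of this decoding step, with the same toy dimensions as the figures; it only illustrates the lookup, sum, and scale logic, not AQLM's optimized inference kernels.

```python
import numpy as np

# Toy sketch of decoding (dequantizing) a full weight matrix from its stored
# codes, following the description above: look up codewords, sum, then scale.
# Shapes match the figures: d_out=4, d_in=9, g=3, M=3, B=2.
d_out, d_in, g, M, B = 4, 9, 3, 3, 2
rng = np.random.default_rng(0)

codebooks = rng.normal(size=(M, 2**B, g)).astype(np.float16)   # learned codebooks (native precision)
scales = rng.normal(size=(d_out, 1)).astype(np.float16)        # one learned scale per output channel
codes = rng.integers(0, 2**B, size=(d_out, d_in // g, M))      # stored B-bit indices per group

codewords = codebooks[np.arange(M), codes]      # (d_out, d_in//g, M, g): gather each group's codewords
groups = codewords.sum(axis=2)                  # (d_out, d_in//g, g): sum the M codewords per group
weights = scales * groups.reshape(d_out, d_in)  # stitch groups back into rows and apply the scales

print(weights.shape, weights.dtype)             # (4, 9) float16, ready for the matmul with X
```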
Most importantly, what average number of bits per parameter does AQLM actually achieve? To store an AQLM-quantized weight matrix, the following information needs to be kept:
- M codebooks, each containing 2^B codewords stored at native 16-bit precision. Each codeword has size (1, g).
- d_out scaling factors, each stored as a 16-bit float
- M codes of B bits each for every group, of which there are d_out x d_in / g in total.
Therefore, the average number of bits per parameter can be calculated with the following formula:
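Spelling this out from the three components listed above:

$$\text{bits per parameter} \;=\; \frac{16 \cdot M \cdot 2^{B} \cdot g \;+\; 16 \cdot d_{out} \;+\; \frac{d_{out} \cdot d_{in}}{g} \cdot M \cdot B}{d_{out} \cdot d_{in}}$$

For large matrices, the codebook and scaling-factor terms become negligible, so the cost is dominated by the codes themselves and approaches M x B / g bits per parameter; for example, M = 1, B = 16 and g = 8 give roughly 2 bits per parameter.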
It should…