To ground our investigation into quantization, it is worth reflecting on exactly what we mean by “quantizing” numbers. So far we have discussed that quantization takes a set of high-precision values and maps them to a lower precision in a way that best preserves their relationships, but we have not zoomed in on the mechanics of this operation. Unsurprisingly, there are nuances and design choices to be made concerning how we remap values into the quantized space, and these vary depending on the use case. In this section, we will seek to understand the knobs and levers that guide the quantization process, so that we can better understand the research and make educated decisions in our own deployments.
Bit Width
Throughout our discussion on quantization, we will refer to the bit widths of the quantized values, which represent the number of bits available to express each value. A single bit can only store a binary value of 0 or 1, but sets of bits can have their combinations interpreted as incremental integers. For instance, having 2 bits allows for 4 total combinations ({0, 0}, {0, 1}, {1, 0}, {1, 1}), which can represent integers in the range [0, 3]. With N bits, we get 2 to the power of N possible combinations, so an 8-bit integer can represent 256 numbers. While unsigned integers count from zero up to the maximum value, signed integers place zero near the center of the range by interpreting the leading bit as the +/- sign (in practice, via two’s complement). Therefore, an unsigned 8-bit integer has a range of [0, 255], and a signed 8-bit integer has a range of [-128, 127].
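To make these ranges concrete, we can compute them for any bit width in a few lines of Python. The helper below is purely illustrative and not taken from any particular library:

```python
def integer_range(bits: int, signed: bool = False) -> tuple[int, int]:
    """Return the (min, max) values representable with the given bit width."""
    if signed:
        # Two's complement: the leading bit indicates sign, giving one more
        # negative value than positive.
        return (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (0, 2 ** bits - 1)

print(integer_range(8))               # (0, 255)
print(integer_range(8, signed=True))  # (-128, 127)
print(integer_range(4, signed=True))  # (-8, 7)
```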
This fundamental knowledge of how bits represent information will help us contextualize the numeric spaces that the floating-point values get mapped to in the techniques we study: when we hear that a network layer is quantized to 4 bits, we understand that the destination space has 2 to the power of 4 (16) discrete values. In quantization, these values do not necessarily represent integer values for the quantized weights; often they are the indices of the quantization levels – the “buckets” into which the values of the input distribution are mapped. Each index corresponds to a codeword that represents a specific quantized value within the predefined numeric space. Together, these codewords form a codebook, and the values obtained from the codebook can be either floating-point or integer values, depending on the type of arithmetic to be performed. The thresholds that define the buckets depend on the chosen quantization function, as we will see. Note that codeword and codebook are general terms, and in most cases the codeword will be the same as the value returned from the codebook.
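To make the distinction between indices, codewords, and the codebook concrete, here is a small illustrative sketch of a hypothetical 2-bit quantizer. The bucket thresholds and codebook values are invented for the example and are not drawn from any real network:

```python
import numpy as np

# A hypothetical 2-bit quantizer: 4 buckets, each index points to a codeword.
codebook = np.array([-0.6, -0.2, 0.2, 0.6])   # codewords (the quantization levels)
thresholds = np.array([-0.4, 0.0, 0.4])       # bucket edges between the levels

values = np.array([-0.55, -0.1, 0.05, 0.7])
indices = np.digitize(values, thresholds)      # which bucket each value falls into
print(indices)            # [0 1 2 3]  -- the stored low-bit indices
print(codebook[indices])  # [-0.6 -0.2  0.2  0.6]  -- the values they decode to
```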
Floating-Point, Fixed-Point, and Integer-Only Quantization
Now that we understand bit widths, we should take a moment to touch on the distinctions between floating-point, fixed-point, and integer-only quantization, so that we are clear on their meaning. While representing integers with binary bits is straightforward, operating on numbers with fractional components is a bit more complex. Both floating-point and fixed-point data types have been designed to do this, and selecting between them depends both on the deployment hardware and on the desired accuracy-efficiency tradeoff: not all hardware supports floating-point operations, and fixed-point arithmetic can offer more power efficiency at the cost of reduced numeric range and precision.
Floating-point numbers allocate their bits to represent three pieces of information: the sign, the exponent, and the mantissa, which together enable efficient bitwise operations on the values they represent. The number of bits in the exponent defines the magnitude of the numeric range, and the number of mantissa bits defines the level of precision. As one example, the IEEE 754 standard for a 32-bit floating point (FP32) gives the first bit to the sign, 8 bits to the exponent, and the remaining 23 bits to the mantissa. Floating-point values are “floating” because they store an exponent for each individual number, allowing the position of the radix point to “float,” akin to how scientific notation moves the decimal in base 10, except that computers operate in base 2 (binary). This flexibility enables precise representation of a wide range of values, especially near zero, which underscores the importance of normalization in various applications.
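If you would like to see this layout for yourself, the bits of an FP32 value can be pulled apart with Python’s standard library. The helper below is just an illustrative decomposition of the IEEE 754 fields described above:

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Split an IEEE 754 single-precision float into sign, exponent, and mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8 exponent bits (biased by 127)
    mantissa = bits & 0x7FFFFF          # 23 mantissa (fraction) bits
    return sign, exponent, mantissa

sign, exp, mant = fp32_fields(-6.25)
print(sign, exp, mant)   # 1 129 4718592
# Reconstruction for normal numbers: (-1)^sign * 2^(exp - 127) * (1 + mant / 2^23)
print((-1) ** sign * 2 ** (exp - 127) * (1 + mant / 2 ** 23))  # -6.25
```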
In contrast, “fixed” point precision does not use a dynamic scaling factor, and instead allocates its bits into sign, integer, and fractional (often still referred to as the mantissa) components. While this means higher efficiency and more power-saving operations, dynamic range and precision will suffer. To understand this, imagine that you want to represent a number that is as close to zero as possible. To do so, you would carry the decimal place out as far as you could. Floating-point values are free to use increasingly negative exponents to push the decimal further to the left and provide extra resolution in this situation, but a fixed-point value is stuck with the precision offered by its fixed number of fractional bits.
Integers can be considered an extreme case of fixed-point in which no bits are given to the fractional component. In fact, fixed-point bits can be operated on directly as if they were an integer, and the result can be rescaled in software to achieve the correct fixed-point result. Since integer arithmetic is more power-efficient on hardware, neural network quantization research favors integer-only quantization, converting the original float values into integers rather than into a fixed-point representation: the calculations are ultimately equivalent, but the integer-only math can be performed more efficiently with less power. This is particularly important for deployment on battery-powered devices, which also often contain hardware that only supports integer arithmetic.
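As a sketch of that equivalence, consider a hypothetical Q4.4 fixed-point format (4 integer bits, 4 fractional bits): values are stored as plain integers, multiplied with ordinary integer arithmetic, and then rescaled in software. The format and helper names here are chosen purely for illustration:

```python
FRAC_BITS = 4  # hypothetical Q4.4 format: 4 integer bits, 4 fractional bits

def to_fixed(x: float) -> int:
    """Encode a float as a Q4.4 integer (no overflow or clipping handling here)."""
    return int(round(x * (1 << FRAC_BITS)))

def to_float(q: int) -> float:
    return q / (1 << FRAC_BITS)

a, b = to_fixed(2.5), to_fixed(1.75)   # stored as the integers 40 and 28
product = (a * b) >> FRAC_BITS          # pure integer multiply, then rescale
print(to_float(product))                # 4.375  (2.5 * 1.75 = 4.375)
```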
Uniform Quantization
To quantize a set of numbers, we must first define a quantization function Q(r), where r is the real number (weight or activation) to be quantized. The most common quantization function is shown below:
Q(r) = Int(r / S) - Z
Typical quantization function. Image by author.
In this formula, Z represents an integer zero-point, and S is the scaling factor. In symmetric quantization, Z is simply set to zero and cancels out of the equation, while in asymmetric quantization, Z is used to offset the zero point, allowing more of the quantization range to be devoted to either the positive or the negative side of the input distribution. This asymmetry can be extremely useful in certain cases, for example when quantizing post-ReLU activation signals, which contain only positive numbers. The Int(·) function maps a scaled continuous value to an integer, typically through rounding, but in some cases following more complex procedures, as we will encounter later.
Choosing the correct scaling factor (S) is non-trivial, and requires careful consideration of the distribution of values to be quantized. Because the quantized output space has a finite range of values (or quantization levels) to map the inputs to, a clipping range [α, β] must be established that provides a good fit for the incoming value distribution. The chosen clipping range must strike a balance between not over-clamping extreme input values and not oversaturating the quantization levels by allocating too many of them to the long tails. For now, we consider uniform quantization, where the bucketing thresholds, or quantization steps, are evenly spaced. The calculation of the scaling factor is as follows:
S = (β - α) / (2^b - 1)
Formula for calculating the quantization function’s scaling factor (S) based on the clipping range ([α, β]) and desired bit width (b). Image by author.
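Putting the quantization function and the scaling factor together, a minimal NumPy sketch of asymmetric uniform quantization might look like the following. The function and variable names are my own, and using the raw min/max of the data as the clipping range is only one possible choice:

```python
import numpy as np

def uniform_quantize(r: np.ndarray, alpha: float, beta: float, bits: int = 8):
    """Asymmetric uniform quantization of real values r to unsigned `bits`-bit integers."""
    levels = 2 ** bits - 1
    S = (beta - alpha) / levels        # scaling factor from the clipping range
    Z = int(round(alpha / S))          # zero-point offset (Z = 0 gives symmetric quantization)
    q = np.round(r / S) - Z            # Q(r) = Int(r / S) - Z
    return np.clip(q, 0, levels).astype(np.int32), S, Z

weights = np.array([-0.9, -0.1, 0.0, 0.4, 1.3])
q, S, Z = uniform_quantize(weights, alpha=weights.min(), beta=weights.max(), bits=8)
print(q, S, Z)   # alpha maps to index 0, beta maps to index 255
```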
The shapes of trained parameter distributions can vary widely between networks and are influenced by a number of factors. The activation signals generated by those weights are even more dynamic and unpredictable, making any assumptions about the correct clipping ranges difficult. This is why we must calibrate the clipping range based on our model and data. For best accuracy, practitioners may choose to calibrate the clipping range for activations online during inference, known as dynamic quantization. As one might expect, this comes with extra computational overhead, and it is therefore far less popular than static quantization, where the clipping range is calibrated ahead of time and fixed during inference.
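To picture the difference, here is a deliberately naive sketch: static quantization fixes the clipping range from calibration data ahead of time, while dynamic quantization recomputes it from each incoming activation tensor. The min/max rule used here is purely illustrative; real calibrators often rely on percentiles or other statistics:

```python
import numpy as np

# Static calibration: clipping range fixed ahead of time from calibration batches.
calibration_batches = [np.random.randn(64, 128) for _ in range(10)]
alpha_static = min(batch.min() for batch in calibration_batches)
beta_static = max(batch.max() for batch in calibration_batches)

def clipping_range_dynamic(activations: np.ndarray):
    # Dynamic calibration: recomputed per tensor, at inference time.
    return activations.min(), activations.max()

x = np.random.randn(64, 128)
print("static: ", (alpha_static, beta_static))
print("dynamic:", clipping_range_dynamic(x))
```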
Dequantization
Here we establish the reverse of the uniform quantization operation, which decodes the quantized values back into the original numeric space, albeit imperfectly, since the rounding operation is not reversible. We can decode our approximate values using the following formula:
r̃ = S(Q(r) + Z)
Dequantization operation. Image by author.
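Continuing the earlier sketch, we can decode the quantized integers and inspect the rounding error they carry. The values below are simply those produced by the illustrative uniform_quantize example above:

```python
import numpy as np

def dequantize(q: np.ndarray, S: float, Z: int) -> np.ndarray:
    """Approximate reconstruction: r_tilde = S * (Q(r) + Z)."""
    return S * (q.astype(np.float64) + Z)

# Outputs of the earlier 8-bit example with alpha = -0.9, beta = 1.3.
q = np.array([0, 92, 104, 150, 255])
S, Z = 2.2 / 255, -104

r_tilde = dequantize(q, S, Z)
print(r_tilde)  # close to the original [-0.9, -0.1, 0.0, 0.4, 1.3], but not exact
```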
Non-Uniform Quantization
The astute reader will probably have noticed that enacting uniformly-spaced bucketing thresholds on an input distribution that is any shape other than uniform will lead to some…