A Visual Guide to LLM Quantization

llm_trw
4 replies
1d3h

This is a very misleading article.

Floats are not distributed evenly across the number line. The number of floats between 0 and 1 is the same as the number of floats between 1 and 3, then between 3 and 7 and so on. Quantising well to integers means that you take this sensitivity into account since the spacing between integers is always the same.

a1369209993
3 replies
23h49m

The number of floats between 0 and 1 is the same as the number of floats between 1 and 3

No, the number of floats between 0 and 1 is (approximately) the same as the number of floats between 1 and positive infinity. And this is the correct way for it to work: 1/x has roughly the same range and precision as x, so you don't need (as many) stupid obfuscatory algebraic transforms in your formulas to keep your intermediate values from over- or under-flowing.
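
A quick way to check this claim, as a minimal sketch assuming numpy: for non-negative finite floats, the IEEE 754 bit patterns are ordered the same way as the values they encode, so the number of floats in [a, b) is just the difference of the two integer bit patterns.

```python
import numpy as np

def f32_bits(x):
    # Reinterpret a float32 as its raw 32-bit pattern.
    return int(np.float32(x).view(np.uint32))

floats_in_0_1 = f32_bits(1.0) - f32_bits(0.0)        # values in [0, 1)
floats_1_to_inf = f32_bits(np.inf) - f32_bits(1.0)   # finite values in [1, +inf)

print(floats_in_0_1)    # 1065353216 = 127 * 2**23 (including subnormals)
print(floats_1_to_inf)  # 1073741824 = 128 * 2**23
```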

llm_trw
2 replies
22h47m

Floating point numbers have a fixed-precision mantissa and a fixed-precision exponent.

So you have xxxxx E xxx as an example of a 5-bit mantissa and a 3-bit exponent.

You have 2^5 floating point numbers for each possible exponent.

So no, you're wrong. For exponent 0 you have 2^5 numbers, and for exponents 1, 10 and 11 you then have the same. The exponent 0b (0d) then contains the same number of possible mantissas as does 1b (1d), 10b (2d) and 11b (3d). Which means that there are as many mantissas between [0,1) as there are between [1,3).

nh23423fefe
1 replies
22h35m

why do you think the range [0,1) is represented by one exponent?

llm_trw
0 replies
22h29m

Because it is half the range expressed in 1 bit of exponent, the same way that [1,3) is half the range expressed in 2 bits of exponent. I'd have used [0,2) and [2,4), but that would confuse people used to thinking in base 10, which apparently includes the OP author.

jillesvangurp
2 replies
1d10h

Fairly helpful overview. One thing that probably has a good answer: why use floats at all, even at 32 bits? Is there an advantage relative to using just 32-bit ints? It seems integer math is a lot easier to do in hardware. Back when I was young, you had to pay extra to get floating point hardware support in your PC; it required a co-processor. I'm assuming that is still somewhat true in terms of the number of transistors needed on chips.

Intuitively, I like the idea of asymmetric scales as well. Treating all values as equal seems like it's probably wasteful in terms of memory. It would be interesting to see where typical values fall statistically in an LLM. I bet it's nowhere near a random distribution of values.
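
For anyone who wants to check that intuition empirically, here is a rough sketch (it assumes the transformers library and the small gpt2 checkpoint, chosen arbitrarily as an example):

```python
import numpy as np
import torch
from transformers import AutoModel

# Load a small pretrained transformer and pool all weight values together.
model = AutoModel.from_pretrained("gpt2")
w = torch.cat([p.detach().flatten() for p in model.parameters()]).numpy()

# Quantiles of |w|: typically most weights sit in a narrow band around zero,
# with a long, sparse tail of outliers -- far from a uniform distribution.
print(np.quantile(np.abs(w), [0.5, 0.9, 0.99, 0.999, 1.0]))
```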

jsjohnst
0 replies
1d6h

One thing that probably has a good answer: why use floats at all, even at 32 bits? Is there an advantage relative to using just 32-bit ints?

Sibling commenter gave a better detailed answer, but I will share a succinct tl;dr in case that’s more your desire.

INT32 maximum value: 2,147,483,647

FP32 maximum value: 3.4028235 x 10^38

If you need to exactly represent every integer between 10,000,000 and 1,000,000,000, then INT32 will handle it fine, but FP32 won't. If instead you need to represent a range of values from 1.00 to 35,003,986,674,493.00 and it's ok to be just directionally accurate, FP32 has you covered.
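
A small numpy sketch of the first half of that point: FP32 has a 24-bit significand, so it stops being able to represent every integer just above 2^24.

```python
import numpy as np

# INT32 stores every integer up to 2,147,483,647 exactly.
# FP32 starts skipping integers above 2**24 = 16,777,216.
x = 16_777_217                      # 2**24 + 1
print(np.int32(x))                  # 16777217 -> exact
print(int(np.float32(x)))           # 16777216 -> rounded, the odd value is lost
print(np.float32(3.4e38))           # 3.4e+38  -> far beyond INT32's range
```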

adrian_b
0 replies
1d7h

At any given number of bits used for representation, using floating-point numbers instead of fixed-point numbers (integers are a special case of the latter) increases the so-called dynamic range, i.e. the ratio between the greatest and the smallest representable numbers.

This advantage is paid for by increased distances between neighboring numbers inside the subranges, because the number of representable numbers is the same for floating-point and fixed-point, but the floating-point numbers are spread over their wider dynamic range.

Depending on the application, either the disadvantages or the advantages of a greater dynamic range are more important, which determines the choice of floating-point or integers (actually fixed-point), and when floating-point numbers are chosen, one can allocate more or less bits for the exponent depending on whether the dynamic range or the rounding errors are more important.

For ML/AI applications, it appears that the dynamic range is much more important than the rounding errors. This has led to the use of the Google BF16 format, which has a great dynamic range and large rounding errors, instead of the IEEE FP16 format, which has a smaller dynamic range and smaller rounding errors, and which is preferable for other applications, like graphics (mainly for color component encoding), where the rounding errors of BF16 would be unacceptable.
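
The BF16/FP16 trade-off is easy to see directly; a short sketch, assuming PyTorch:

```python
import torch

# BF16 keeps FP32's 8-bit exponent: huge dynamic range, coarse 8-bit significand.
# FP16 has only a 5-bit exponent, but a finer 11-bit significand.
print(torch.finfo(torch.bfloat16))  # max ~3.39e38, eps 0.0078125
print(torch.finfo(torch.float16))   # max 65504,    eps 0.000977

x = torch.tensor(1e5)
print(x.to(torch.float16))          # inf   -> overflows FP16's range
print(x.to(torch.bfloat16))         # 99840 -> in range, but coarsely rounded
```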

In the parent article, there is a figure that is confusing, because in it the dynamic range appears to be the difference between the positive number and the negative number with the greatest absolute values.

This is very wrong. The dynamic range is the ratio between the (strictly) positive numbers with the greatest and the smallest absolute values. The dynamic range can be computed by subtraction only on a logarithmic scale, which is why in practice it is frequently expressed in decibels.

For instance, for INT8, the dynamic range is not (+127)-(-127)=254 as it appears in that figure, but it is 127 divided by 1, i.e. 127. Similarly, for FP16, the dynamic range is not (+65504)-(-65504)=131008 as it appears in that figure, but it is 65504 divided by 2^(-14), i.e. 1073217536, a much larger value, which demonstrates the advantage in dynamic range of FP16 over INT16 (the dynamic range of the latter is 32767).
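
The same arithmetic, spelled out (2^-14 is the smallest normal FP16 value, ignoring subnormals as above; the ratio is often quoted in decibels):

```python
import math

int8_dr = 127 / 1           # 127
fp16_dr = 65504 / 2**-14    # 1,073,217,536

print(int8_dr, round(20 * math.log10(int8_dr), 1))   # ~42.1 dB
print(fp16_dr, round(20 * math.log10(fp16_dr), 1))   # ~180.6 dB
```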

With a dynamic range defined like in that figure, there would be no advantage for floating-point or for BF16, because with an implicit scale factor taken into account, one could make that "dynamic range" as great as desired, for any integer format, including INT8. Nothing would prevent the use of an implicit scale factor of one billion, making the "dynamic range" of INT8 254 billion, or of an implicit scale factor of 10^100, resulting in a "dynamic range" of INT8 much larger than that of FP32.

torginus
1 replies
1d8h

I've long held the assumption that neurons in networks are just logic functions: you can write out their truth tables by taking all the combinations of their input activations and design a logic network that matches them 100%. Thus 1-bit 'quantization' should be enough to perfectly recreate any neural network for inference.

amitport
0 replies
1d6h

1-bit 'quantization' is enough to create ANY function you'd like...

See also: Hadamard transform, Walsh functions.
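
A tiny illustration of that point, as a sketch assuming scipy: Walsh/Hadamard basis functions take only the values +1 and -1, yet any function sampled at 2^n points is an exact linear combination of them.

```python
import numpy as np
from scipy.linalg import hadamard

H = hadamard(8)                    # 8x8 matrix with entries in {+1, -1}
f = np.random.randn(8)             # an arbitrary "function" on 8 sample points
coeffs = H @ f / 8                 # Walsh-Hadamard coefficients (H @ H == 8 * I)
print(np.allclose(H @ coeffs, f))  # True: reconstructed exactly from the ±1 basis
```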

danieldk
1 replies
1d12h

This is really an awesome introduction to quantization! One small comment about the GPTQ section:

It uses asymmetric quantization and does so layer by layer such that each layer is processed independently before continuing to the next

GPTQ also supports symmetric quantization and almost everyone uses it. The problem with GPTQ asymmetric quantization is that all popular implementations have a bug [1] where all zero/bias values of 0 are reset to 1 during packing (out of 16 possible biases in 4-bit quantization), leading to quite a large loss in quality. Interestingly, it seems that people initially observed that symmetric quantization worked better than asymmetric quantization (which is very counter-intuitive, but made GPTQ symmetric quantization far more popular) and only discovered later that it is due to a bug.

[1] https://notes.danieldk.eu/ML/Formats/GPTQ#Packing+integers
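
For readers unfamiliar with the terms, a generic (not GPTQ-specific) sketch of symmetric vs. asymmetric 4-bit quantization; the zero point below is the zero/bias value that, per the comment above, gets reset from 0 to 1 during packing.

```python
import numpy as np

def quantize_asymmetric(x, bits=4):
    # Map [x.min(), x.max()] onto the unsigned grid [0, 2**bits - 1]
    # using a scale plus a zero point.
    qmax = 2**bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    return np.clip(np.round(x / scale) + zero_point, 0, qmax), scale, zero_point

def quantize_symmetric(x, bits=4):
    # Map [-absmax, +absmax] onto a signed grid centered on zero; no zero point.
    qmax = 2**(bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax), scale

x = np.random.randn(8).astype(np.float32)
q, scale, zp = quantize_asymmetric(x)
print(x)
print((q - zp) * scale)  # dequantized: matches x up to 4-bit rounding error
```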

denali53
0 replies
1d2h

Agree - great intro! Could someone with much more knowledge point more to BitNet and other 1-bit models... seems like developments here could lead to a step change in small/local models? What is the theoretical limit to the power of such models?

woodson
0 replies
1d4h

It’s a shame that the article didn’t mention AWQ 4-bit quantization, which is quite widely supported in libraries and deployment tools (e.g. vLLM).

hazrmard
0 replies
1d

I've read the huggingface blog on quantization, and a plethora of papers such as `bitsandbytes`. This was an approachable agglomeration of a lot of activity in this space with just the right references at the end. Bookmarked!

dleeftink
0 replies
1d3h

What an awesome collection of visual mappings between process and output, immediately gripping, visually striking and thoughtfully laid out. I'd love to hear more about the process behind them, a hallmark in exploratory visualisation.

cheptsov
0 replies
21h46m

I wonder why AWQ is not mentioned. It's pretty popular, and I was always curious how it differs from GPTQ.