
Towards 1-bit Machine Learning Models

londons_explore
37 replies
23h19m

I believe the future is 1 bit models - for both training and inference.

When people make custom silicon for 1 bit models, they'll find that it is sooooo much more power and silicon-space efficient to do 1 bit math than 16 bit floating point - like 100x or more.

That extra model size will vastly overshadow any worse performance of the models.

cma
12 replies
21h5m

For training how do you get any kind of meaningful derivative with it?

twelfthnight
8 replies
20h27m

Maybe evolutionary algorithms instead? Hasn't proven super useful historically, but maybe at the scale of enormous LLMs it will be?

VHRanger
3 replies
17h34m

Nope, they're orders of magnitude less efficient because they don't leverage gradient descent.

Rule of thumb in optimization: real numbers are easy, integers are hard

markisus
2 replies
14h12m

This may be the status quo because of the so called "hardware lottery" which has historically been optimized for floating point. I'm speculating, but if hardware designers were instead only concerned about raw xnor density and throughput, we might end up with chips powerful enough that giant 1-bit nets could be trained purely through evolution.

imtringued
0 replies
8h1m

How do you optimize memory for floating point?

VHRanger
0 replies
6h27m

No, it's a fact at the mathematical level, one you can express in big-O terms if you want to.

bionhoward
2 replies
19h52m

Evolutionary algorithms made you, didn’t they?

bick_nyers
0 replies
3h2m

It took a lot of human brain flops to get to this point in time though, I wonder how many orders of magnitude more than it took to train ChatGPT...

TheDudeMan
0 replies
18h28m

That does not prove that they can beat gradient descent.

bick_nyers
0 replies
2h41m

Gradient-directed evolutionary algorithm sounds kinda interesting.

chalst
1 replies
20h12m

The OP explicitly excludes training.

cma
0 replies
18h0m

The one I replied to said 1-bit for both training and inference.

scotty79
0 replies
10h37m

Maybe something probabilistic?

mikewarot
5 replies
18h40m

I believe the future is 4x4-bit look-up tables with output latches, with a bit to/from each Cartesian neighbor. Clock them like the colors of a chessboard, in 2 phases, and you don't have to worry about timing dependencies.

All of your code gets converted to a directed acyclic graph (DAG), executing at GHz rates if you want.

Imagine a machine that can output a million parallel GPT-4 streams at 1000 tokens per second each.

If the architecture changes it's just a different DAG. Unlike with FPGAs and their custom blocks that have to be optimally used, you can compile a DAG almost instantly.

smusamashah
3 replies
14h14m

Is this something from a research finding or is it your idea?

mikewarot
2 replies
11h37m

It's my idea... I've been writing about it for over a decade. I hope to get a small version of it made via Tiny Tapeout later this year once I'm out of the precariat.

The idea is that you can decompose any non-cyclic code, for example a matrix multiply, FFT, etc., into a directed graph of bitwise operations, then map those operations out onto the grid of LUTs. Pipe inputs in one side of the grid, get the outputs out the other side, and all of the bitwise operations happen in lock-step parallelism. (Unlike an FPGA, which is asynchronous.)

If you can decompose one stage of an LLM down into a bitwise graph of computation, you can easily map it into the grid. If there's a bad cell in the grid, you can map around it.

Because all of the high speed logic lines are short (to the neighbors) you don't have the long lines / high capacitance issues driving signals all the way across an FPGA or ASIC, thus you can really crank up the clock rate (and/or it's very power efficient).

It should be trivial, design wise, to scale from a 16x16 grid up through chips that could crank out Petaflops.

Here's a simulator for the thing, written in Pascal.

https://github.com/mikewarot/Bitgrid
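For the curious, a minimal Python sketch of the two-phase LUT-grid idea (my own toy rendering with hypothetical simplifications: one output bit per cell rather than one per neighbor, random LUT contents; not the linked Pascal simulator):

  import random

  W, H = 8, 8
  luts    = [[random.randrange(1 << 16) for _ in range(W)] for _ in range(H)]  # one 4-input LUT per cell
  outputs = [[0] * W for _ in range(H)]                                        # latched output bits

  def neighbor_bits(x, y):
      # Inputs come from the four Cartesian neighbors; edges read 0.
      n = outputs[y - 1][x] if y > 0 else 0
      s = outputs[y + 1][x] if y < H - 1 else 0
      w = outputs[y][x - 1] if x > 0 else 0
      e = outputs[y][x + 1] if x < W - 1 else 0
      return (n << 3) | (s << 2) | (w << 1) | e

  def tick(phase):
      # Checkerboard clocking: a cell only reads neighbors latched in the other phase.
      for y in range(H):
          for x in range(W):
              if (x + y) % 2 == phase:
                  outputs[y][x] = (luts[y][x] >> neighbor_bits(x, y)) & 1

  for _ in range(16):   # one full clock = both phases
      tick(0)
      tick(1)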

sitkack
1 replies
10h38m

You can simulate this on a GPU now. This is like running a CA on a GPU with slightly different rules.

I think you would also be delighted to know that many of the non-GPU ai accelerators are in a mesh topology.

Cerebras, Tenstorrent, Esperanto

https://www.esperanto.ai/wp-content/uploads/2021/08/HC2021.E...

mikewarot
0 replies
1h18m

I've been intrigued by this approach. A more highly optimized (but harder to program) take is the GA144 [1] from Chuck Moore, the inventor of Forth. It's a Cartesian grid of 144 F18 Forth-based processors. These processors are far more limited, but then again they take far less power as well.

[1] https://www.greenarraychips.com/

Dylan16807
0 replies
13h57m

1. If you write FPGA code as a grid of lookup tables then I would expect it to be easy to compile instantly.

2. In what way is this "acyclic"?

3. Won't putting your code into this form be the hard part? Even if you start with a DAG, 99.99% of them won't fit this form without intense restructuring. So you just moved the hard step over by one.

bionhoward
4 replies
19h55m

Isn’t 1bit too low for optimal radix economy (Euler’s number) though?

I want to see somebody try "imbalanced quaternary": {-1, 0, +1, +2}

twelfthnight
0 replies
19h27m

Haven't heard this argument before. But from the Wikipedia article it seems base 3 has the best asymptotic radix economy, but isn't much better than base 2, and base 2 is seemingly easier to program and optimize.

Since this is a new argument I've not heard, would be curious if you had links or some more explanation.

johnmorrison
0 replies
17h29m

people are indeed working on -1,0,1,2 Q2 models, I read something about it the other day but don't recall the title.

they mentioned decomposition of Q2 into ternary+binary i.e. [[1,2],[-1,0]] -> [[1,1],[0,0]] + [[0,1],[-1,0]]

hervature
0 replies
8h49m

Radix economy is all about which base is the most efficient to represent a given number. It is simple to show that, for large numbers, this is equivalent to how efficient a base can represent itself, b/ln(b). Simple calculus shows this is minimized at e (Euler's number) or 3 if integer (closely followed by 2).

It sounds like you have something to add, but you are already dictating the base by saying "bit" - literally from "binary digit". Anyway, quantization is not about which number system is best - virtually all computer systems we use today represent numbers in base 2. Quantization, at its core, is lossy compression. How do you go from a large model trained at high precision to a smaller model without hindering performance? This can be studied without needing to know the base.

Suppose you are using a decimal computer. You can ask yourself: I have 128-decimal-precision numbers, do I need that much precision? What happens if I just use 1-decimal precision by chopping off the 127 digits after the first decimal? You realize that there are two parts of an operation: the numbers involved (the operands) and the operation itself. You then ask yourself, if I keep one of the operands fixed (the original input), can I represent my 128-decimal-precision neural network simply as a series of operations without the other operand? Perhaps only the most basic ones? Like: no-ops (add 0 or multiply by 1), increments (add 1), decrements (subtract 1), negations (multiply by -1), and clears (multiply by 0)? You count those numbers (-1, 0, and 1). There are 3, so you proudly proclaim you've made a neural network that only uses 0.477 dits. People get excited and confused because that is less than 1 dit, which seems like a fundamental point. You further surprise the scientific field by finding a clever trick for getting rid of negations. You beat your previous record and now you only need 0.301 dits to represent your network.

You are about to accept your Turing Award when the ghost of Claude Shannon appears and says "Why are you using a unit that measures entropy to mean how many symbols you have? If you insist, at least realize 0.301 dits is 1 bit." You are shocked when you realize 10^0.301 = 2^1. Reviewing Shannon's seminal paper [1], you are awestruck by Shannon's prescient comment: "Change from the base a to base b merely requires multiplication by log_b(a)." You humbly give your award to Shannon. You keep the $1M since ghosts aren't as fast as a new NVIDIA DGX. No matter how quantized the ghost is.

[1] - https://people.math.harvard.edu/~ctm/home/text/others/shanno...
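A quick sanity check of those numbers in Python:

  import math

  # Asymptotic radix economy: representing large numbers in base b costs ~ b / ln(b),
  # minimized at b = e, so 3 is the best integer base (2 is a close second).
  for b in (2, 3, 4, 10):
      print(b, b / math.log(b))    # 2 -> 2.885, 3 -> 2.731, 4 -> 2.885, 10 -> 4.343

  # Shannon's change of base: 0.301 dits is 1 bit.
  print(math.log10(2))             # 0.3010...
  print(10 ** 0.301)               # ~2.0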

Dylan16807
0 replies
13h55m

I bet the optimal "large" value is bigger than 2.

wongarsu
2 replies
20h32m

Probably more than 100x for inference. Not only are you drastically reducing the number of bits and replacing float math with integer math, you can also do matrix multiplication with only addition (as pointed out in the BitNet b1.58 paper). Additions require a lot less hardware to implement than multiplications, and adding one-bit or two-bit numbers requires barely any hardware at all. A traditional two-bit adder without a carry-out is three XOR gates and an AND gate.
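A rough sketch of what "matmul with only addition" means for ternary weights (illustrative numpy, not the BitNet kernel):

  import numpy as np

  def ternary_matvec(W, x):
      # W has entries in {-1, 0, +1}: every "multiply" is an add, a subtract, or a skip.
      y = np.zeros(W.shape[0], dtype=x.dtype)
      for i in range(W.shape[0]):
          y[i] = x[W[i] == 1].sum() - x[W[i] == -1].sum()
      return y

  W = np.random.choice([-1, 0, 1], size=(4, 8))
  x = np.random.randn(8).astype(np.float32)
  print(np.allclose(ternary_matvec(W, x), W @ x))   # True: same result, no multiplications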

fasa99
1 replies
9h57m

To me the most exciting thing is that if it is training that is sped up on the order of 100x-1000x, a large cluster would be well suited to tuning hyperparameters by training LLMs again and again at scale -- this is the first foot in the door towards an AI that can iteratively improve itself.

mobicham
0 replies
7h20m

LoRA training should benefit from the same speed-up, because the 1-bit weights will be frozen and all you need for both the forward and backward pass is a binary matmul, then maybe cast after to get more stable gradients.

api
2 replies
18h57m

Doesn’t training need higher precision to avoid getting stuck at local minima, at least with back propagation style learning?

Maybe something alternate like evolutionary algorithms could work in a domain like this, but so far those have proven to be less efficient.

sp332
1 replies
18h54m

A recent paper used a single ternary "trit" (~1.58 bits) per parameter for training. https://news.ycombinator.com/item?id=39535800 They said it would be difficult to apply this technique to pre-trained models, and that the models had to be trained with trits from scratch.

gliptic
0 replies
1h45m

But it's using larger weights during training, not just the trits.

JHonaker
2 replies
22h19m

That extra model size will vastly overshadow any worse performance of the models.

...What?

sroussey
1 replies
22h14m

I think OP was referring to parameter size. You can make up for quantization by having more parameters.

JHonaker
0 replies
5h3m

I know. I was being a bit mean, but they seem to be implying that more parameters even with worse performance is preferable. That seems backward to me. There is no reason for parameter size to be intrinsically valuable, and I think people are a bit infatuated with it.

imtringued
1 replies
8h17m

The 1-bit ternary models only compress the weights. You still add and subtract using bfloat16 for better accuracy. Dedicated silicon is mostly a waste, because you are only processing two operations per parameter during inference. Loading the parameters from slow DDR, GDDR or HBM memory is the bottleneck in practice, and the only solution is PIM (processing-in-memory). I was honestly disappointed by Nvidia's Blackwell since it is just barely competitive with GDDR PIM.

mobicham
0 replies
7h17m

At least you can copy 16 times more data to the shared memory with binary weights.

programjames
0 replies
19h56m

I believe the future is 1 bit models - for both training and inference.

1 bit's nothin'. The future is training directly on electronic/photonic circuits.

bgnn
0 replies
10h42m

This! Maybe just integer, but not floating point. That's a ridiculous way to do computation when you don't really need the precision.

fabmilo
15 replies
1d1h

I like how it goes from 13.5 GB to 1.76 GB and gets comparable results. There is definitely some underlying reason why this works so well that we are missing. Are the bits selecting the different subspaces? Can we improve this process by using better normalization layers / gating units?

nobodyandproud
13 replies
1d1h

I wonder what information theory can tell us here.

rapatel0
11 replies
1d

The entropy of English is ~1 bit per letter (Measured in a funny way by Shannon - https://cs.stanford.edu/people/eroberts/courses/soco/project...)

In general, it's a bit funny how we build ML models. You take a bucket of matrices, fill them with random numbers, and then start to introduce "biases" through back-propagation. A model can converge to a low loss, but most of the weights are still filled with random noise.

Binary weights are somewhat obvious in retrospect. Weights indicate the strength of an association between two neurons. Intuitively, most of the value is probably just that an association between two neurons exists, and whether it's positively or negatively associated.

Animats
7 replies
23h54m

I'm surprised that 1-bit works. Trinary (-1, 0, 1) makes sense, but if you only have 0 and 1, the asymmetry ought to be a problem. Or is multiplying an XOR, so 1x1 = 0?

rapatel0
3 replies
19h44m

Yeah basically. In binary, multiplication is an XNOR operation.

  0 XNOR 0 = 1
  0 XNOR 1 = 0
  1 XNOR 0 = 0
  1 XNOR 1 = 1

zamadatix
1 replies
19h22m

0*0 = 1? I've always heard it as being the output of AND.

overengineer
0 replies
7h40m

For balanced binary [-1, 1] yes. -1*-1=1
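A quick check of that mapping in Python (bits standing in for -1/+1):

  # Encode -1 as bit 0 and +1 as bit 1; then XNOR of the bits is multiplication of the signs.
  def xnor(a, b):
      return 1 - (a ^ b)

  for a in (0, 1):
      for b in (0, 1):
          product = (2 * a - 1) * (2 * b - 1)        # multiply in {-1, +1}
          assert 2 * xnor(a, b) - 1 == product
          print(a, b, "->", xnor(a, b), product)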

scarmig
0 replies
19h15m

XNOR does not distribute over AND or any other binary operator (try 0 XNOR (0 AND 1)), nor does it have an absorbing zero (nothing behaves like multiplication by 0), so it's not really multiplication in a way that's useful.

scarmig
0 replies
19h57m

It's more like: addition is an XOR. (In fact, AND as multiplication and XOR as addition give you GF(2).) Throw in NOT (or, really, just a constant 1) and you can compute any circuit.

Biologically, inhibitory neurons are every bit as important as excitatory ones, so if you squint just right, XOR looks like a neuron's activation being inhibited by another presynaptic neuron.

gpm
0 replies
19h18m

It seems like they have learned floating point parameters x and y so that dequantized(bit 0) = x and dequantized(bit 1) = y. Thus there is no built in asymmetry. Or more precisely they learned a zero point and a scale but it's equivalent to this simpler model in the 1 bit case.

It still seems like there would be a problem because either [x, y] looks like [0ish, 1ish] and you can't have negative weights, or [x, y] looks like [-1ish, 1ish] and you can't have "don't care" weights. But if you have some redundancy in your neurons I guess this is acceptable because you can just cancel out the positive contribution from a neuron you don't care about with a negative contribution from a very similar neuron that you also don't care about.

edflsafoiewq
0 replies
22h37m

There's a scale and bias afterwards, so it's not necessarily asymmetric.

nobodyandproud
1 replies
18h59m

Down to one bit but that’s taking the 2.62 bits and then applying the redundancy factor.

What's cool is that the differentiable activation function is important (to avoid the linear, perceptron limitation) but the weight scaling can be so simple, at least for LLMs.

It makes me wonder whether the extra layers are effectively compensating; in other words, can the number of layers or hidden neurons be trimmed down if we then add more bits to each weight and still see equivalent effectiveness?

imtringued
0 replies
7h49m

You can just delete whole layers, because the point of residual layers is to make the model learn the hyperparameter "layer count" automatically. Compare this to the absence of residual layers, where the model must use all layers. Then you would have to get the layer count perfect, but this isn't possible, since each data point might benefit from a different layer count. The extra layers therefore exist primarily so the model becomes robust to poorly chosen hyperparameters. You still need a minimum number of layers, but that isn't the problem.

https://arxiv.org/abs/2403.17887

mlsu
0 replies
22h25m

Wow, fascinating read. 1999

Would a monkey who knew the n-gram frequencies of the letters in English where n is large be able to produce credible English text? Furthermore, does this monkey "know" English? If the N-gram monkey is behind one door and a human is behind the other, could a third-party observer tell which was the monkey? This question rings of the Turing test for artificial intelligence, to which there is no easy answer.

No easy answer indeed!

ttul
0 replies
1d

Neural networks work, but why they work is not well known. Researchers continue to find new “free lunches” all the time.

convolvatron
0 replies
1d1h

idk that we need to think that hard. We have shown that we can build arithmetic functions from networks of logic gates given sufficient connectivity, so in some sense this just drops the learning network below the level of words and asks it to make its own math.

Since this generalizability thing seems to be real, this kind of directly folds in quantization-aware training, doesn't it?

vladf
13 replies
17h17m

Really strong binary results. So strong it was fishy. I hope someone can explain my confusion below.

We compared the performance of the Llama2-7B model in three configurations: FP16 (full precision), HQQ (without fine-tuning), and HQQ+ (with adapter layers) using a group-size of 8.

Interesting, what is "group-size of 8"?

From their HQQ post (https://mobiusml.github.io/hqq_blog/), it's the block size at which they add scales (presumably 16-bit) and shifts (in that post, it's 8-bit).

So for every 8 binary weights we have a 16-bit scale and 8-bit shift?

Fine-tuning with Low-Rank Adapters

They say they inline the shift into the LoRA but how can you do this, block-wise, without increasing your LoRA rank by num-blocks (they claim to only use 1 additional rank)?

Then, the reported 7B sizes, in GB:

  13.5 (fp16)
  1.76 (HQQ 1-bit)
  1.85 (HQQ+ 1-bit)
  2.72 (QuIP# 2-bit)

those numbers would make sense if it was _actually_ 1 bit. But if you include the overhead of 16-bit scales (and why is the shift inlineable into lora? still unexplained) it'd be more like 3-bit.

From their HF page:

This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.

Interesting, so we have to go back to the CPU to rescale? Is this how they counted GB? This should have been clearly caveated in the table. I'm also amazed they got latency lower than QuIP# if they ping-pong to the CPU.

danielhanchen
8 replies
15h9m

When one does quantization, it's done in blocks. Bitsandbytes uses a blocksize of 64, I think. Dequantization is W * scale + zero_point, and each block needs its own scale and zero_point, so you need 2 extra fp16 numbers for each group of 64 numbers. For BnB you get approx 4.5 bits, since (64*4bit + 16bit + 16bit) / 64 = 288 / 64 = 4.5. So 4-bit is actually 4.5-bit.

For HQQ 1-bit, each group of 8 needs a scale and a zero_point (you mentioned fp16 scale and 8-bit shift). So you need 8*1bit + 16bit + 8bit = 32 bits for each group of 8, or 4 bits per param.

If the scale and zero_point are both moved to 8-bit, then (8*1bit + 8bit + 8bit) / 8 = 24 / 8 = 3 bits per param?

"This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.", so the 8+8 scale / zero_point moves to the CPU. So GPU memory holds 1 bit per param, and the CPU holds the rest of the meta-data - very smart!

Dylan16807
5 replies
14h5m

"This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.", so the 8+8 scale / zero_point moves to the CPU. So GPU memory 1bit, but CPU meta data is the rest - very smart!

Doesn't it need all the weight metadata for a layer to use that layer?

* If yes, then can't any algorithm offload x% of its data as a balancing act between speed and RAM?

* If no, then what's it for and when does it get used?

danielhanchen
4 replies
13h30m

Oh yes, you need all the metadata, but because it's just 2 numbers per group (the scale and zero_point), I think the movement of individual scalars to the GPU registers is super fast - cannot confirm though.

It's like in cuBLAS where you do alpha*A*B + beta*C, and alpha and beta are both scalars which can live on the CPU and be moved to the GPU in nanoseconds.

I'm unsure though since I haven't tested it

Dylan16807
3 replies
12h36m

It still has to go through the entire memory system. It's hard for me to imagine that transferring a number from the CPU to the GPU is faster than transferring a byte, and if you have 2 CPU-resident numbers per GPU-resident byte that's a lot of transferring.

danielhanchen
2 replies
11h29m

I don't disagree - fair point, there definitely is a latency overhead on the transfer. I would suspect one could prefetch it by calling `.to("cuda", non_blocking = True)` say 2 layers ahead, so you can in theory hide the movement.

I think the blog did mention somewhere that HQQ at 1 bit is slower for now, maybe due to the transfer overhead, although I couldn't remember exactly where.

Dylan16807
1 replies
11h10m

My point is more that if it's that many bytes flowing around on demand, you're basically swapping layers in and out of the GPU as you use them (or x% of each layer).

Which is fine, and it's a valid feature, but you don't need to split those bytes into "data" and "metadata" to make that happen.

Is there actually something they gain from this particular method of splitting?

danielhanchen
0 replies
10h4m

I guess it's mainly to reduce VRAM usage. Assuming we don't do this, then a 7b model with 1bit group size 8 will use 3GB or something of VRAM, whilst a 4bit group size of 64 will use 4GB approx.

Assume we have a 100b model - with 4bit, VRAM is around 57GB or so. With 1bit, 43GB VRAM, but by moving the scalars and zero point to RAM, VRAM use is like 15GB or something, at the cost of like 28GB RAM usage.

I guess maybe a valid approach is to dynamically select which ones to move to RAM or VRAM, given your VRAM budget. Say you have a 40GB GPU, clearly move more of the scalars to GPU. But if you have a weak GPU, then you need to use more RAM.

vladf
1 replies
14h55m

Err, you are just restating what I’m saying, without addressing the concerns.

1 - Is it fair to use RAM in two places and report only one of them without any asterisk? (If you think this is fair - oh boy, wait till you hear about my 0GB-HBM inference algorithm.)

2 - I know how subchannel quantization works. Are they hitting those reported latency numbers with a per-layer CPU ping-pong to rescale?

danielhanchen
0 replies
13h24m

Oh sorry - did not expect to restate what you said whoops - all train of thought!

You can see from https://huggingface.co/mobiuslabsgmbh/Llama-2-7b-chat-hf_1bi... that the model disk space is 3GB + 100MB of LoRA weights. I also uploaded a 4bit one to https://huggingface.co/unsloth/llama-2-7b-bnb-4bit/tree/main which uses 3.87GB.

So because of the offloading trick the GPU VRAM is less, but in actuality 3GB is still needed.

Unsure on latency sadly

mobicham
3 replies
7h50m

Hello, I am the main author, would love to clarify a couple of things:

All the linear-quantization methods have meta-data, including the 1.58-bit paper. You can control the quality vs. memory usage by reducing the group-size. However, the meta-data is not the same thing as the quantized weights, for many reasons:

The meta-data size doesn't change the fact that you can do binary/ternary matmul, which is the most important thing in this story.

The meta-data size doesn't increase the actual compute: these are point-wise operations, and even if you have a single scalar you still need to multiply the same number of weights.

Meta-data is offloaded to the CPU with pinned memory, which allows non-blocking transfers. Technically, you can trigger the copy in the layer before and synchronize later, which makes it almost seamless. I did some experiments with CUDA streams that worked very well on an older machine, but then I tried a better machine and the transfer was much faster. Obviously if you are trying it on Google Colab it's very slow for this reason.

Smaller models like Llama2-7B are very hard to directly quantize at very low bits, so they need a lower group-size to function well. Larger models (like what we showed for Mixtral), can be quantized to 2-bit on the fly, without any data, and still work very well. So basically larger models are less sensitive to extreme quantization and you could use a much larger group-size. I still think that the meta-data size is really not a big deal for the reasons I have explained above.

There are many other ways to increase the group-size or even get rid of it altogether; many ideas are available but they need lots of experimentation.

Binary/ternary CUDA matmul kernels don't exist yet. The current code implements the dequantization step in CUDA but then uses torch.matmul in fp16. I tried doing matmul at low bits in CUDA, but it is very difficult to even beat cuBLAS at fp16, especially for a novice CUDA coder like me :)
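A rough PyTorch sketch of that forward path (unpack and dequantize, then a plain dense matmul; the shapes and names are made up rather than taken from the HQQ code, and float32 stands in for fp16 so the CPU demo runs):

  import torch

  def forward_1bit(x, W_packed, scale, zero, out_features, group_size=8):
      # Unpack 1-bit weights (stored 8 per byte) into a {0, 1} tensor.
      shifts = torch.arange(8, device=W_packed.device)
      W_q = ((W_packed.long().unsqueeze(-1) >> shifts) & 1).reshape(out_features, -1).to(x.dtype)
      # Group-wise dequantization: W ~= (W_q - zero) * scale.
      W = ((W_q.reshape(out_features, -1, group_size) - zero) * scale).reshape(out_features, -1)
      return x @ W.T   # the matmul itself is just a regular dense GEMM

  out_f, in_f = 16, 32
  W_packed = torch.randint(0, 256, (out_f, in_f // 8), dtype=torch.uint8)
  scale = torch.rand(out_f, in_f // 8, 1)    # one scale and zero per group of 8
  zero  = torch.rand(out_f, in_f // 8, 1)
  x = torch.randn(2, in_f)
  print(forward_1bit(x, W_packed, scale, zero, out_f).shape)   # torch.Size([2, 16])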

Please note: this is early experimental work. Since it showed promising results, we wanted to share it with the community first as we progress. There's still a lot of things to be done and we are actively working on it, despite the very limited resources we have.

Happy to answer any questions here!

mikeravkine
1 replies
4h52m

Thank you for your efforts on behalf of the GPU poor!

It's getting tougher to use older, cheaper GPUs (Pascal/Maxwell) with modern quantization schemes so anything you can do to keep kernels compatible with SM52 and SM61 would be greatly appreciated.

mobicham
0 replies
4h30m

Thank you, very glad to hear that!

vladf
0 replies
2h43m

Thanks for the reply. I’m quite familiar with subchannel quant, but still feel like my questions did not get addressed.

1 - Could you post the full memory use of the methods? E.g. you include QuIP# metadata in its GB but not HQQ metadata in its GB.

2 - If you have to go to the CPU to shift and scale, how did you get latency lower than pure on-device? Was this batch size 1? No speculative decoding?

3 - How can LoRA absorb the shifts while only increasing the rank by 1 if you have a shift per group?

JKCalhoun
11 replies
1d1h

As an AI layman, I am for some reason excited by the low-bit learning models.

Apple, now leaving behind the albatross that was the Apple car, could be the one to roll their own silicon and put an LLM in our back pockets.

Goodbye Siri, hello Seriously.

talldayo
3 replies
1d1h

Or we could end up getting another Google integration, with iOS being too rigid and locked-down to feasibly support third-party private AI alternatives. Fate has a cruel sense of humor, sometimes.

JKCalhoun
0 replies
21h22m

Could be a way for Apple to buy time as well until they do have something competitive.

lawlessone
3 replies
1d

could be the one to roll their own silicon and put an LLM in our back pockets.

Wouldn't they have to do a lot of catching up with everyone else already working on it?

declaredapple
2 replies
1d

They've been designing their own chips for a while now, including an NPU.

Also because of their unified memory design, they actually have insane bandwidth which is incredibly useful for LLMs. IMO they may have a head-start in that respect for on-device inference of large models (e.g. 1B+ params).

talldayo
1 replies
19h34m

I don't think people are running 1B+ models on the Neural Engine these days. The high-performance models I've seen all rely on Metal Performance Shaders, which scales with how powerful your GPU is. It's not terribly slow on iPhone, but I think some people get the wrong idea and correlate an ambient processor like the Neural Engine with LLMs.

The bigger bottleneck seems like memory, to me. iPhones have traditionally skimped on RAM moreso than even cheap and midrange Android counterparts. I can imagine running an LLM in the background on my S10 - it's a bit harder to envision iOS swapping everything smoothly on a similarly-aged iPhone.

JKCalhoun
0 replies
14h30m

Sure, but we're discussing 1.8-bit models that (again, I'm a layman) I assume have over an order of magnitude less memory overhead.

quentinp
2 replies
1d

Actually they plan to put an LLM in your back pocket using flash memory, not silicon: https://arxiv.org/abs/2312.11514

declaredapple
1 replies
1d

The flash doesn't do the computations though, that's just a method of getting it to the processor

sroussey
0 replies
22h6m

It would be better to have eeprom or some such directly attached as memory. No loading.

jstmm
8 replies
20h38m

In both the '1-bit Model' and '2-bit Model' tables, the forward time (sec) for Llama2-7B with FP16 (full precision) is 0.1 s, whereas it's ~0.231, ~0.257 and ~0.353 s respectively for HQQ (1-bit) / HQQ+ (1-bit) / QuIP# (2-bit), meaning the FP16 model has ~3x lower inference time.

On the contrary, in the BitNet b1.58 paper [0] the authors report their 7B model has 2.9x reduced inference latency.

It's not clear to me what's happening here. Can someone explain why the 1/2-bit HQQ/HQQ+ models are so much slower than the BitNet b1.58 models?

[0] https://arxiv.org/pdf/2402.17764.pdf

londons_explore
4 replies
19h56m

GPUs aren't really designed for 1-bit math, so it doesn't run much faster than floating-point math on them.

Whereas a custom ASIC or an updated GPU design could give massive speedups with 1-bit math.

imtringued
1 replies
7h58m

You're telling me GPUs aren't designed for additions and subtractions? Where did you hear that?

bick_nyers
0 replies
2h38m

I think they are more so saying that GPUs are not optimized for those operations. CPUs aren't "designed" for matrix multiplies, yet we can still run them, albeit at a slower rate than on a GPU.

bee_rider
0 replies
19h44m

For 1 bit math, at least it should be possible to populate every other bit of an integer type, right? Surely one could do better with a dedicated type for this, but at least we could pack 16 single-bit weights into a 32 bit int for addition, right?
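A toy illustration of that packing in plain Python (16 one-bit weights in a 32-bit word, with a spare guard bit per lane so a single add can't carry into the neighboring lane):

  def pack(bits):
      # Put each 1-bit value in its own 2-bit lane: value bit plus an empty guard bit.
      word = 0
      for i, b in enumerate(bits):
          word |= (b & 1) << (2 * i)
      return word

  def add_lanes(a, b):
      # One 32-bit addition performs 16 independent lane-wise adds (each sum is 0, 1 or 2).
      s = (a + b) & 0xFFFFFFFF
      return [(s >> (2 * i)) & 3 for i in range(16)]

  a = pack([1, 0, 1, 1] * 4)
  b = pack([1, 1, 0, 1] * 4)
  print(add_lanes(a, b))   # [2, 1, 1, 2, 2, 1, 1, 2, ...]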

UncleOxidant
0 replies
19h34m

Yes, exactly. Neither GPUs nor CPUs are set up for 1-bit math. Pulling 1 or 2 bits out of a word isn't all that straightforward on a CPU or GPU - lots of shifting and masking. I wonder how long it's going to be before we see custom hardware for bitnets? I suspect we'll see it on FPGAs first.

thatguysaguy
1 replies
20h30m

Sounds like these guys didn't use custom kernels, but BitNet did.

mobicham
0 replies
4h27m

That's correct. Only the dequantization is done in CUDA; the matmul is done with PyTorch. If they make their kernels open source, we could re-use them!

brucethemoose2
0 replies
19h20m

Real world GPU performance is hugely influenced by hand optimization of the CUDA kernels.

buildbot
0 replies
22h26m

Oh nice, good catch!

thechao
0 replies
23h41m

We're just speed-running the NN pendulum, at this point.

WithinReason
0 replies
10h24m

I don't think that paper's methods could be applied to LLMs

grungegun
6 replies
1d

Does anyone know if this works on vanilla deep networks? These quantization articles always seem to target LLM's which leads me to wonder if there's something special about the LLM architecture vs a vanilla deep architecture.

zaptrem
3 replies
21h25m

Transformer LLMs are just a bunch of MLPs (linear layers) where you sometimes multiply/softmax the output in a funny way (attention). In other words, they're arguably more "vanilla deep net" than most architectures (e.g., conv nets).

(There are also positional/token embeddings and normalization but those are a tiny minority of the parameters)

amelius
1 replies
19h50m

Ok, but what does a perceptron look like in 1-bit? Would it be just some simple logic gate, like an OR-gate?

grungegun
0 replies
18h51m

So there's no performance gain for quantization enabled by the transformer architecture? It seems very strange that quantization works so well since in most of my experiments, the internal model weights of mlps look random.

alephxyz
0 replies
1d

LLMs have been trending towards obscenely large numbers of parameters (314B for Grok), which makes quantization crucial if you want to run them without a Meta-sized budget.

Y_Y
0 replies
1d

Certainly does, people have been doing this in computer vision for years.

corysama
5 replies
1d2h

I look forward to the inevitable probabilistic sub-bit machine learning models :)

paul_mk1
1 replies
1d

Sub 1-bit has been done at least as far back as 2016 for VGG style networks (my work).

I was able to get 0.68 "effective" bits.

The idea is that in each forward pass you add noise to each weight, independently drawn from a normal distribution, and when you calculate the SNR, it's sub-1-bit. It points to the idea that a stochastic memory element can be used.

https://arxiv.org/abs/1606.01981

overengineer
0 replies
7h32m

If noise is desirable in this way, why not just switch back to analog computing, which comes with free noise?

Cheer2171
1 replies
1d1h

My 0-bit model can predict if you have cancer with 99.5% accuracy. It doesn't even need input data! Don't ask about the false negative rate though.

asah
0 replies
1d

My 0-bit, no-input model can predict if you have cancer with 99.5% accuracy and 0.5% false negative rate. Don't ask about whether the cancer cell(s) are benign.

Cieric
0 replies
1d1h

I assume this is meant to be a joke, but isn't this actually being worked on (minus the probabilistic portion)? https://arxiv.org/abs/2310.16795

selecsosi
1 replies
18h39m

I think the problem is that you are also trying to represent the biases of directionality in the connection e.g.: [-1,0,1]. In the adjacency matrix, if you store the sign (direction) of the edge, then I think that would map fairly isomorphically.

ganzuul
0 replies
10h45m

Basically factorized into a giant state machine. Would be logical.

robrenaud
1 replies
23h50m

It would have predictable structure. There would be group-wise, regular sparsity. You could exploit these patterns to infer properties like "there is no connection from this vertex to this entire span of vertices".

Just think about representing a compressed version of a matrix. You want a short encoding for matrices that are more ordered in some real, exploitable way. With some deterministic computation on the compressed representation, you could produce the full adjacency matrix.

littlestymaar
3 replies
22h59m

Stupid question: how can you do deep learning on 1 bit? DL works by calculating a gradient and following it to minimize a loss function, but if all your parameters are 1 bit, there's no such thing as a gradient as you don't have differentiation at all. So how does it work?

edflsafoiewq
1 replies
22h42m

Not sure what you mean, why wouldn't there be a gradient? There is a problem that small changes to weights may have no effect (eg ((0 + 0.3) + 0.3) + 0.4 = 0 if you have to round the result to 0 or 1 after every step), but to fix this I believe everyone maintains high precision weights for the backwards pass, from which the quantized weights are derived.
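A minimal sketch of that trick in PyTorch (a straight-through estimator over high-precision "latent" weights; illustrative only, not any particular paper's recipe):

  import torch

  class BinarizeSTE(torch.autograd.Function):
      @staticmethod
      def forward(ctx, w_latent):
          # The forward pass only ever sees the sign (-1/+1) of the latent weights.
          return torch.sign(w_latent)

      @staticmethod
      def backward(ctx, grad_output):
          # The backward pass pretends the quantizer was the identity,
          # so gradients accumulate in the high-precision latent weights.
          return grad_output

  w = torch.randn(4, 4, requires_grad=True)   # high-precision latent weights
  x = torch.randn(2, 4)
  loss = (x @ BinarizeSTE.apply(w).T).sum()   # compute with the binarized weights
  loss.backward()
  print(w.grad)                               # gradient lands on the latent weights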

scotty79
0 replies
10h18m

You could treat fractional addition as a probabilistic operation.

Centigonal
0 replies
22h40m

Not a stupid question at all.

Early quantization approaches just quantized a high-bit-precision pre-trained model - no need to calculate gradients on the quantized weights. BitNet[1] changed the game by training a low-precision model from scratch. It achieves this by keeping high precision in the gradients, optimizer state, and in "latent weights" which are then quantized on the fly. I don't really understand the finer details of how this works, so check out the paper if you're interested.

This article's approach is interesting. They start by quantizing a pre-trained high-precision model, and then they fine-tune the quantized model using LoRA (which dramatically improves the performance of the quantized model). They don't talk about the bit depth of the values in the LoRA matrices, so it may be that those are higher bit-depth.

[1] https://arxiv.org/pdf/2310.11453.pdf

iamgopal
3 replies
1d1h

So will new AI chips be coming, optimised for a large-scale, bit-only (non-byte) AI world?

thesz
2 replies
1d1h

Every CPU and GPU now has a popcount instruction [1]. You can thank US three-letter agencies for that [2].

  [1] https://en.wikipedia.org/wiki/Hamming_weight#Processor_support
  [2] https://vaibhavsagar.com/blog/2019/09/08/popcount/
Multiplication then is just bitwise AND and a popcount.
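For example, a dot product of two packed {0,1} vectors (using Python's int.bit_count, available in 3.10+, as a stand-in for the hardware popcount):

  # "Multiply" the packed bit-vectors with AND, then "sum" with popcount.
  a = 0b1011_0110_1110_0001
  b = 0b1101_0100_0110_1011
  print((a & b).bit_count())   # count of positions where both bits are 1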

ttul
1 replies
1d

Makes you wonder what the agencies know that we don’t…

rapatel0
0 replies
1d

It's just cryptography and encoding. Popcount is useful for understanding entropy.

redskyluan
2 replies
1d1h

What will be the next step after 1 bit? Could it be 0.5 bit? Any kind of quantization is not gonna last long.

pixelpoet
0 replies
1d

The way things are going we'll have -1 bit before long :)

CamperBob2
0 replies
1d1h

Not using traditional numbers at all, I imagine. It's possible that trying to build neural nets out of vectors that are themselves built with quantized scalars is a suboptimal approach.

mmoskal
2 replies
1d

It seems the trick here is they first quantize it to 1- or 2-bit, and then they fine-tune the quantization bias parameters (the parameters that dequantize from 1-2 to 16 bit) via LoRA. Then they have specialized kernels to do matrix multiplication at the bit level.

Also, the 2-bit model seems much better than the 1-bit model - they use [-1, 0, 1, 2] - I wonder if '2' is needed in light of the 1.58b paper (which claims -1 is def. needed).
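A rough sketch of that structure (illustrative PyTorch with made-up shapes and a per-row scale/zero instead of group-wise; not the HQQ+ code):

  import torch

  def effective_weight(W_q, scale, zero, A, B):
      # W_q: frozen 1-bit codes; scale/zero: trainable dequantization parameters;
      # A, B: trainable low-rank (LoRA-style) correction.
      return (W_q.float() - zero) * scale + A @ B

  out_f, in_f, rank = 16, 32, 2
  W_q   = torch.randint(0, 2, (out_f, in_f))             # frozen binary codes
  scale = torch.randn(out_f, 1, requires_grad=True)
  zero  = torch.randn(out_f, 1, requires_grad=True)
  A = torch.randn(out_f, rank, requires_grad=True)
  B = torch.zeros(rank, in_f, requires_grad=True)        # LoRA-style init: correction starts at zero

  x = torch.randn(4, in_f)
  loss = (x @ effective_weight(W_q, scale, zero, A, B).T).sum()
  loss.backward()                                        # only scale/zero/A/B get gradients
  print(scale.grad.shape, A.grad.shape)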

jimmySixDOF
0 replies
15h11m

Which is itself a little counterintuitive as the arxiv papers they cite say models need to be pretrained from the ground up using 1- or 2-bit (or 1.58bit). It definitely adds some interesting data points for the open source community who are experimenting in every possible direction.

andy_xor_andrew
0 replies
21h17m

Interesting, and it kinda makes sense. You quantize, which invariably means you lose some precision, but then you can finetune post-quantization to recover at least some of it. Neat idea.

kromem
2 replies
21h21m

The most exciting part about ternary or binary weights is the inevitable hardware revolution for AI dedicated chips that's going to result from it.

imtringued
0 replies
7h55m

Your hardware already supports addition and subtraction and the tensor cores of NVIDIA GPUs are already fast enough to keep up. The only benefit is reducing memory capacity and bandwidth requirements.

Matumio
0 replies
4h54m

You mean we'll have hardware accelerated ternary instructions? https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

(Okay probably those are not ready to be used as NN weights if the activations are not binary too, but... the gap to what CPUs already can do is getting smaller.)

ianbicking
2 replies
1d

Reduction to decision tree!

But I'm unclear how it actually runs, and the article talks about the conversion and training but doesn't describe how it runs... I suppose because it's obvious to someone who has followed quantization.

Thinking out loud... if you have a model of just 1 and 0, my first thought is that the outputs are 1's and 0's but I think that's wrong. Instead it's a bunch of floats, and you multiply them by 1 or 0 (in a sense you are sampling the output of the previous layer?), add them up, and put the result through some activation function. And two-bit quantization sounds kind of similar, just with a _little_ scale, going from -1 to 2 instead of 0 to 1.

It seems kind of interesting that you now have a bunch of weights that are exactly 0, meaning you can assert something about what parameters and weights affect what parts of the output. Though in some sense the way they compress the weights down to one bit is also a heuristic you could use to interpret the original model... this just makes it clearer that in totality you are making a defensible simplification, because the end result is still a working model.

It also seems like you could make a lot of mathematical assertions about a one bit model that would be harder to make otherwise. Like maybe you could start thinking of a model as an equation, a _particular_ (though giant) equation, and look at its properties and consider symbolic transformations to that equation.

fabmilo
0 replies
22h12m

I like the decision tree analogy

bick_nyers
0 replies
2h35m

A comment I really liked on a previous post about ternary highlighted that what you are actually measuring with {-1, 0, 1} is inverse correlation, no correlation, and correlation.

r1chardnl
1 replies
23h50m

Is this similar to binary search or could it be a binary search where you're just excluding all possibilities that you know can't be the probable result that you're looking for?

nyrikki
0 replies
22h29m

Feed forward networks are effectively DAGs

With the smaller model size, quantized models have less accuracy and less stable training, but large models take advantage of increased parallelism.

Binary search is more linear; circuits are a better lens than Turing machines.

While it still needs to be verified, if you look into what a uniform constant-depth threshold circuit is, that will help.

Although I guess binary weights may be in AC0 and not TC0, but that may not hold with billions of parameters.

codedokode
1 replies
4h6m

I was thinking about this idea (using only additions for inference) before too. The article talks about using 1-bit values for weights, but there is also another option: use integers for weights, but allow the output of a neuron to take only 0 and 1 (for example, 1 if the output is > 0 and 0 otherwise). In this case we do not need multiplications either, but we cannot save memory on storing weights. Maybe it has some other advantages, who knows. I never had time to research this further.
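A tiny numpy sketch of that variant (hypothetical, just to make the "no multiplications" point concrete): integer weights, binary-step outputs, so each layer only sums the weight columns whose inputs are 1.

  import numpy as np

  def layer(W, x_bits):
      # x_bits is a {0, 1} vector and W holds integers: the "matmul" is just summing
      # the columns of W selected by the active inputs, followed by a binary step.
      pre = W[:, x_bits.astype(bool)].sum(axis=1)
      return (pre > 0).astype(np.int8)

  W1 = np.random.randint(-3, 4, size=(8, 6))
  W2 = np.random.randint(-3, 4, size=(4, 8))
  x  = np.array([1, 0, 1, 1, 0, 1], dtype=np.int8)
  print(layer(W2, layer(W1, x)))   # additions and comparisons only, no multiplies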

Greenpants
0 replies
3h50m

You may be interested in the "binary step" activation function. This does what you're suggesting. In general, though, complex behaviour really takes a hit when this is used as a neuron's activation function (and I'm also not sure which papers show metrics on it being used for transformer models).

shrubble
0 replies
13h23m

I wonder if there is a correspondence to Sony's 1-bit DSD audio stream? Could hardware developed for this purpose be used for some part of the processing?

danielhanchen
0 replies
15h33m

Highly respect the HQQ team's work - same accuracy as GPTQ / AWQ but with no activation-aware calibration step, i.e. no more 3-hour calibration runs! A big fan of the 4-bit ones especially, and the 3-, 2-, and now 1-bit ones!

Also super cool idea of 1bit needing some calibration like AWQ - no calibration data shows very bad results, but with some LoRA adapters and finetuning, a great recovery of performance is possible.

Planning to add support for this inside Unsloth to make all low bit finetuning 2x faster and save tonnes of VRAM!