The Era of 1-bit LLMs: ternary parameters for cost-effective computing

cs702
119 replies
1d4h

There are two findings I find shocking in this work:

* In existing LLMs, we can replace all parameter floating-point values representing real numbers with ternary values representing (-1, 0, 1).

* In matrix multiplications (e.g., weights by vectors), we can replace elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with elementwise additions (a₁+b₁ + a₂+b₂ ...), in which signs depend on each value. See the paper for exact details.

On existing hardware, the gains in compute and memory efficiency are significant, without performance degradation (as tested by the authors).

If the proposed methods are implemented in hardware, we will see even greater gains in compute and memory efficiency.

Wow.

vessenes
26 replies
23h49m

I'd be VERY cautious about being excited here.

My priors are like this:

1. Initial training of a neural network moves all weights around a large amount at first.

2. Later training of the network adjusts them a small amount.

3. An undertrained network will therefore look a lot like figuring out "positive, negative, or 0?" for each node during early training.

If all these things are true, then

1. Early training of an fp16 network and a bitnet with 0 added will be roughly similar in results

2. Later training will yield different / worse results, as the network gets into the 'fine tuning' part of the training.

I think the paper's stats back these priors up -- they say "this works on (3B+) large networks, but not small ones." They then imply there's something about the structure of a large network that allows a bitnet to do well. It seems more likely to me it works on large networks because they have not put the compute into 3B+ networks to get past the 'gross tuning' phase.

The networks they do have the compute to get 'fully' trained -- those networks don't show the same results.

Also, a quick reminder that Perplexity 12 is really terrible. You would not want to use such a network. Hopefully I'm wrong and we can get something for free here! But I'm cautious-to-skeptical.

vessenes
9 replies
21h44m

Update - I'm still cautious about this paper, but I had the table numbers inverted in my head while thinking about it. The paper shows better perplexity results than competing models at larger parameter sizes, so I was wrong.

pclmulqdq
8 replies
18h47m

I was pretty unhappy and suspicious for the same reason. Not reporting perplexity for a 70B network while reporting its efficiency means that someone did something and the result wasn't good enough to put in the paper.

GaggiX
6 replies
18h40m

According to the author, the 70B model is not fully trained.

pclmulqdq
5 replies
17h19m

"Is not fully trained" can also mean "we did not figure out how to reach an acceptable loss" or "training was unstable," both of which are common for ML systems.

GaggiX
4 replies
16h16m

It probably means the model is not fully trained, because it is very expensive to train a 70B model; not even Mamba or RWKV have a model that comes close to that size. The leeriness is just kinda silly, honestly.

bick_nyers
3 replies
4h45m

Extraordinary claims require extraordinary evidence.

That's not to say that a 70B model is necessary, but surely something larger than 3B is doable, especially given that the results of the paper directly imply a significant reduction in memory requirements for training such a model.

edflsafoiewq
1 replies
4h6m

> results of the paper directly imply a significant reduction in memory requirements for training such a model

Isn't memory use in training higher, since they maintain high precision latent weights in addition to the binarized weights used in the forward pass?

pclmulqdq
0 replies
3h47m

Most research universities have the resources to train a ~10B parameter model, at least.

GaggiX
0 replies
4h33m

For sure bigger models are needed to compete with transformer LLM, same thing for Mamba, I was just bothered by the distrust about something very reasonable like not being able to fully train a 70B model.

kristjansson
0 replies
1h46m

One can forgive the lack of quality results for the 70B model, but apparently they trained 7B and 13B versions of their model, and don't report those either.

mise_en_place
8 replies
23h38m

Intuitively I've always been a bit skeptical of quantization. Wouldn't there be a tiny loss in precision by doing this type of quantization? I could imagine the error function increasing by utilizing these types of techniques.

thesz
4 replies
19h32m

John Carmack pointed out (and I learned it here on HN) that what training really needs is the *sign* of each individual gradient component. I.e., you can quantize the gradient to -1, 0 and 1 and still have the neural network learn much of the dataset.
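
For intuition, a minimal NumPy sketch of a sign-only update in the spirit of signSGD -- an illustration of the idea, not Carmack's experiment:

    import numpy as np

    def sign_sgd_step(params, grads, lr=0.01):
        # Update using only the sign of each gradient component: the gradient
        # is effectively quantized to {-1, 0, +1} and its magnitude discarded.
        return params - lr * np.sign(grads)

    # Toy usage: fit y = 3x with a single weight w.
    w = np.array([0.0])
    for _ in range(1000):
        x = np.random.randn(16)
        grad = np.mean(2 * (w * x - 3 * x) * x)   # d/dw of the mean squared error
        w = sign_sgd_step(w, grad)
    # w ends up oscillating near 3 despite never seeing a gradient magnitude.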

farhanhubble
1 replies
16h4m

Wow! Is there a link to read up more on this?

Solvency
1 replies
17h51m

Why isn't John Carmack working for OpenAI? Hell, why did he waste years at Meta to work on a VR headset and NOT AI? He even announced he wants to focus on AGI but he missed out on literally all the action.

spencerchubb
0 replies
20h37m

Yes each weight will not be able to "learn" as much if it has less bits of precision. But the idea is that you can use more weights, and the big question is whether these low-precision weights can make the model more accurate, as a whole.

int_19h
0 replies
22h27m

Quantization does reduce quality of the outputs. But the point is that you save enough memory doing so that you can cram a larger model into the same hardware, and this more than compensates for lost precision.

eightysixfour
0 replies
22h31m

It does increase the “error” (meaning it is less likely to predict the next word when compared against a dataset) but the losses are lower than your intuition would guide you to believe.

cs702
2 replies
21h50m

Thank you. Your key point -- that so far all models with the proposed methods may have been only "grossly trained" -- is compelling. If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to make the comparisons apples-to-apples. That seems sensible to me, and makes replication easier, but I agree we still need to see more extensive testing, after more extensive pretraining, on models of larger sizes.

gliptic
1 replies
21h34m

They also trained 3B with 2 trillion tokens.

> The number of training tokens is a crucial factor for LLMs. To test the scalability of BitNet b1.58 in terms of tokens, we trained a BitNet b1.58 model with 2T tokens following the data recipe of StableLM-3B [ TBMR], which is the state-of-the-art open-source 3B model.

> [..]

> Our findings shows that BitNet b1.58 achieves a superior performance on all end tasks, indicating that 1.58-bit LLMs also have strong generalization capabilities.

cs702
0 replies
20h27m

You're right. Thank you for pointing that out!

svantana
1 replies
22h27m

Wait, are we reading the same paper? What I'm seeing is comparable accuracy to unquantized models for <4B params, and nothing reported for larger models except resource consumption.

vessenes
0 replies
21h45m

Nope, you're right, I got the table inverted in my head. I'm updating my top comment.

gradascent
0 replies
18h59m

Then perhaps a method emerges out of this to make training faster (but not inference) - do early training on highly quantized (even ternary) weights, and then swap out the weights for fp16 or something and fine-tune? Might save $$$ in training large models.

gliptic
0 replies
22h18m

> Also, a quick reminder that Perplexity 12 is really terrible.

The 3B model had a perplexity of 9.91, less than LLaMa 1 in fp16.

creshal
20 replies
1d3h

> * In existing LLMs, we can replace all parameter floating-point values representing real numbers with ternary values representing (-1, 0, 1).

Why is this so shocking? Quantization has been widely explored, driving that to its extreme (and blowing up parameter count to make up for it) just seems like a natural extension of that.

Easier said than done, of course, and very impressive that they pulled it off.

> In matrix multiplications (e.g., weights by vectors), we can replace elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with elementwise additions (a₁+b₁ + a₂+b₂ ...), in which signs depend on each value

I feel like this follows naturally from having only ternary values, multiplication doesn't really bring much to the table here. It's a bit surprising that it's performing so well on existing hardware, usually multiplication hardware sees more optimization, especially for GPGPU hardware.

cs702
9 replies
1d3h

> Why is this so shocking? Quantization has been widely explored, driving that to its extreme (and blowing up parameter count to make up for it) just seems like a natural extension of that.

I find it shocking that we don't even need lower floating-point precision. We don't need precision at all. We only need three symbols to represent every value.

> I feel like this follows naturally from having only ternary values, multiplication doesn't really bring much to the table here. It's a bit surprising that it's performing so well on existing hardware, usually multiplication hardware sees more optimization, especially for GPGPU hardware.

I find it shocking. Consider that associative addition over ternary digits, or trits, represented by three symbols (a,b,c) has only three possible input pairs, (a,b), (a,c), or (b,c) (within each pair, order doesn't matter), and only three possible outputs, a, b, or c. Matrix multiplications could be executed via crazy-cheap tritwise operations in hardware. Maybe ternary hardware[a] will become a thing in AI?

---

[a] https://en.wikipedia.org/wiki/Ternary_computer

jerf
3 replies
1d2h

An integer is just a concatenation of bits. Floating point appears more complicated but from an information theory perspective it is also just a concatenation of bits. If, for the sake of argument, one replaced a 64-bit int with 64 individual bits, that's really the same amount of information and a structure could hypothetically then either choose to recreate the original 64-bit int, or use the 64-bits more efficiently by choosing from the much larger set of possibilities of ways to use such resources.

Trits are helpful for neural nets, though, since they really love signs and they need a 0.

So from the perspective that it's all just bits in the end the only thing that is interesting is how useful it is to arrange those bits into trits for this particular algorithm, and that the algorithm seems to be able to use things more effectively that way than with raw bits.

This may seem an absolutely bizarre zigzag, but I am reminded of Busy Beavers, because of the way they take the very small primitives of a Turing Machine, break them down to the smallest pieces, then combine them in ways that almost immediately cease to be humanly comprehensible. It's a completely different selection mechanism for what appears, but it turns out Turing Machine states can do a lot "more" than you might think simply by looking at human-designed TMs. We humans have very stereotypical design methodologies and they have their advantages, but sometimes just letting algorithms rip can result in much better things than we could ever hope to design with the same resources.

eru
0 replies
16h28m

> We humans have very stereotypical design methodologies and they have their advantages, but sometimes just letting algorithms rip can result in much better things than we could ever hope to design with the same resources.

Yes. Though here the interesting point is not so much that these structures exist, but that 'stupid' back-propagation is smart enough to find them.

You can't find busy beavers like that.

cs702
0 replies
1d1h

> So from the perspective that it's all just bits in the end the only thing that is interesting is how useful it is to arrange those bits into trits for this particular algorithm, and that the algorithm seems to be able to use things more effectively that way than with raw bits.

Thank you. I find many other things interesting here, including the potential implications for hardware, but otherwise, yes, I agree with you, that is interesting.

SkyBelow
0 replies
22h10m

This sort of breakdown also reminds me of the explanation of why busy beavers grow faster than anything humans can ever define. Anything a human can define is a finite number of steps that can be represented by some Turing machine of size M. A Turing machine of size N > M can then use M as a subset of it, growing faster than the Turing machine of size M. Either it is the busy beaver for size N, or it grows slower than the busy beaver for size N. Either way, the busy beaver for size N grows faster than whatever the human defined that was captured by the Turing machine of size M. This explanation was what helped me understand why the busy beaver function grows faster than any operator that can be formally defined (obviously you can define an operator that references busy beaver itself, but busy beaver can be considered to not be formally defined, and thus any operator defined using it isn't formally defined either).

The bit about floating point numbers just being a collection of bits interpreted in a certain way helps make sense why a bigger model doesn't need floating points at all.

jxy
2 replies
1d2h

The matrices (weights) are ternary.

The vectors are not.

cs702
1 replies
1d2h

The activations are in (-1, 1), so they're also representable by (-1, 0, 1).

rfoo
0 replies
10h16m

This is wrong. The paper described that their activation is in int8 during inference.

That being said, before-LLM-era deep learning already had low bit quantization down to 1w2f [0] working back in 2016 [1]. So it's certainly possible it would work for LLM too.

[0] 1-bit weights, 2-bit activations; though practically people deployed 2w4f instead.

[1] https://arxiv.org/abs/1606.06160

p1esk
0 replies
16h45m

If you find three symbols per weight shocking, this paper should completely blow your mind: https://arxiv.org/abs/1803.03764

I admit it did shock me when it came out.

cs702
0 replies
3h21m

EDIT: Embarrassingly, on the last paragraph I got the number of possible input pairs wrong:

> only three possible input pairs, (a,b), (a,c), or (b,c) (within each pair, order doesn't matter)

The correct number, ignoring order, is six pairs, because we have to include (a,a), (b,b), and (c,c).
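
A two-line check of the counts, for the record:

    from itertools import combinations_with_replacement, product

    trits = (-1, 0, 1)
    print(len(list(combinations_with_replacement(trits, 2))))  # 6 unordered pairs
    print(len(list(product(trits, repeat=2))))                 # 9 ordered pairs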

ncruces
4 replies
23h37m

Well I guess it's the “blowing up parameter count to make up for it” that confuses me, but maybe it's just ignorance.

Like what would be the expected factor of this blow up to make up the difference between ternary and whatever 16 bits encoding they were using?

I mean intuitively I'd expect to need ~10× the symbols to encode the same information? Are they using an order of magnitude more parameters, or is that not how it works?

int_19h
3 replies
22h25m

With existing common quantization techniques, a 70b model quantized to 3-bit still drastically outperforms an unquantized 35b model.

p1esk
2 replies
16h47m

Are you sure? I was under impression that 3b quantization still results in a significant degradation. Which quantization method are you talking about?

int_19h
1 replies
14h46m

It does result in a significant degradation relative to unquantized model of the same size, but even with simple llama.cpp K-quantization, it's still worth it all the way down to 2-bit. The chart in this llama.cpp PR speaks for itself:

https://github.com/ggerganov/llama.cpp/pull/1684#issue-17396...

p1esk
0 replies
13h30m

Oh wow, you’re right. Though it seems that they are using very small weight group sizes: either 16 or 32 (fp16 scaling factor per group). In this paper it seems there’s no weights grouping, so it’s a bit apples to oranges.

satellite2
2 replies
1d3h

Because it's no longer a linear optimization or curve fitting problem. It becomes a voting or combinatorial problem. Which at least in my mind are two completely different areas of research.

HPsquared
1 replies
1d2h

With enough parameters, it probably starts looking continuous again. Like how in physics everything is quantised at the smallest scale but if you put enough atoms together it all smooths out and behaves "classically".

amelius
0 replies
21h6m

Yes, but we can simulate classical physics using mathematical shortcuts. Simulating every little atom would take a lot more work.

gemeral
0 replies
11h45m

> and blowing up parameter count to make up for it

based on (an admittedly rapid and indulgent reading of the paper), it seems like they're not increasing the parameter size. Do you mind pointing out where the blowup is occurring?

SuchAnonMuchWow
0 replies
6h19m

> I feel like this follows naturally from having only ternary values, multiplication doesn't really bring much to the table here. It's a bit surprising that it's performing so well on existing hardware, usually multiplication hardware sees more optimization, especially for GPGPU hardware.

No, unless I'm mistaken it's a huge impact: it would mean the matrix product is separable: basically an O(n²) algorithm, not O(n³): add together all the c_j = sum(a_i_j) and d_i = sum(b_i_j), and the final results are all the combinations c_j + d_i. And even then, half of that is unnecessary, because the d_i can all be pre-computed before inference since they are weights.

But I skimmed the paper and didn't find the part where it explains how they replace the product by additions: from what I understand, they replace multiplication by b_i with selecting +a_i, 0, or -a_i. So the final matrix multiplication can be implemented with only additions, but only because the weights are 1, 0, -1 do they avoid multiplications altogether. This is really different from what the GP said (replacing a₀b₀+... by a₀+b₀+...).
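
To make that reading concrete, here is a tiny NumPy sketch with ternary weights and float activations -- an illustration of the principle, not the paper's kernel: each output is just the sum of the activations where the weight is +1 minus the sum where it is -1.

    import numpy as np

    def ternary_matvec(W, a):
        # W has entries in {-1, 0, +1}; a is a float activation vector.
        # No multiplications: just sign-selected additions per output row.
        out = np.empty(W.shape[0], dtype=a.dtype)
        for i, row in enumerate(W):
            out[i] = a[row == 1].sum() - a[row == -1].sum()
        return out

    W = np.random.choice([-1, 0, 1], size=(4, 8)).astype(np.int8)
    a = np.random.randn(8).astype(np.float32)
    assert np.allclose(ternary_matvec(W, a), W.astype(np.float32) @ a, atol=1e-5)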

paul_mk1
14 replies
19h37m

Fun to see ternary weights making a comeback. This was hot back in 2016 with BinaryConnect and TrueNorth chip from IBM research (disclosure, I was one of the lead chip architects there).

The authors seem to have missed the history. They should at least cite BinaryConnect or Straight-Through Estimators (not my work).

Helpful hint to authors: you can get down to 0.68 bits / weight using a similar technique, good chance this will work for LLMs too.

https://arxiv.org/abs/1606.01981

This was a passion project of mine in my last few months at IBM research :).

I am convinced there is a deep connection to understanding why backprop is unreasonably effective, and the result that you can train low precision DNNs; for those not familiar, the technique is to compute the loss wrt the low precision parameters (eg project to ternary) but apply the gradient to a high precision copy of the parameters (known as the straight through estimator). This is a biased estimator and there is no theoretical underpinning for why this should work, but in practice it works well.

My best guess is that it is encouraging the network to choose good underlying subnetworks to solve the problem, similar to Lottery Ticket Hypothesis. With ternary weights it is just about who connects to who (ie a graph), and not about the individual weight values anymore.
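
For readers who haven't seen the trick, a minimal PyTorch sketch of that straight-through recipe; the per-tensor absmean scaling is an illustrative choice, not the exact BitNet formulation:

    import torch

    def ternary_ste(w_fp: torch.Tensor) -> torch.Tensor:
        # Project full-precision weights to {-1, 0, +1} for the forward pass.
        scale = w_fp.abs().mean().clamp(min=1e-8)
        w_q = torch.round(w_fp / scale).clamp(-1, 1)
        # (w_q - w_fp).detach() makes the forward value equal w_q while the
        # backward pass sees the identity: the straight-through estimator.
        return w_fp + (w_q - w_fp).detach()

    w = torch.randn(16, 16, requires_grad=True)   # high-precision master weights
    x = torch.randn(4, 16)
    out = x @ ternary_ste(w)                      # forward uses ternary weights
    out.sum().backward()                          # gradient lands on the fp copy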

mjcohen
3 replies
13h34m

IIRC, Hamming's book "Digital Filters" (1989) has a section on FFTs with only the sign of the coefficient being used. It performed surprisingly well.

mitthrowaway2
1 replies
10h49m

What is the sign of a complex number? Do you mean the phase?

nine_k
0 replies
3h40m

AFAICT, both the real and imaginary components are from (-1, 0, +1) only. No single sign, but only 8 directions and the center.

thomasahle
0 replies
12h4m

You mean Fast Hadamard Transform?

fabmilo
1 replies
15h13m

They train using the Straight Through Estimator, but it is cited in the previous BitNet paper. What happened to the TrueNorth chip? I think investing in specialized hardware for AI is a good bet.

paul_mk1
0 replies
13h54m

Nice to know there is a trail to relevant citations. I missed the BitNet paper and need to catch up.

Btw TrueNorth project evolved into "NorthPole" chip by the same group, and was recently in the press. From afar NorthPole looks like an interesting design point and leverages on-chip memory (SRAM)--so it's targeting speed and efficiency at the expense of memory density (so perhaps like Groq in some respects). Tbh I haven't followed the field closely after leaving the group.

WiSaGaN
1 replies
3h22m

Could the reason that 3 states are more efficient than 2 states in this case be that 3 is closer to 2.718... (Euler's number) than 2 is?

altruios
0 replies
2h55m

Why not have some layers/nodes/systems be 2 states and have others be 3... couldn't you get arbitrarily close to Euler's number that way?

WhitneyLand
1 replies
17h29m

It's really interesting to see that the breadcrumb trail goes back that far.

So what are the most important insights in this paper compared to what was previously done?

I assume there’s more context to the story and it’s not just that no one thought to apply the concepts to LLM’s until now?

paul_mk1
0 replies
13h45m

I don't think there is anything conceptually new in this work, other than it is applied to LLMs.

But in fairness, getting these techniques to work at scale is no small feat. In my experience quantization aware training at these low bit depths was always finicky and required a very careful hand. I'd be interested to know if it has become easier to do, now that there are so many more parameters in LLMs.

In any case full kudos to the authors and I'm glad to see people continuing this work.

nxobject
0 replies
18h4m

As aside, I'm curious: what was it like to work at IBM research, especially as a legacy industrial research org?

eru
0 replies
16h44m

You can probably apply the same techniques 'Deep neural networks are robust to weight binarization and other non-linear distortions' used to get to 0.68 bits / weight to get your ternary weights below one bit; so you can claim they are still one-bit networks.
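
For anyone puzzled by the fractional bit counts: a uniform trit carries log2(3) ≈ 1.58 bits, and a skewed distribution over {-1, 0, +1} has lower entropy, which is how figures below 1 bit per weight become reachable. A quick check (the skewed splits below are illustrative assumptions, not numbers from either paper):

    import math

    def entropy_bits(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy_bits([1/3, 1/3, 1/3]))        # ~1.585 bits: uniform ternary
    print(entropy_bits([0.75, 0.125, 0.125]))   # ~1.06 bits: mostly-zero weights
    print(entropy_bits([0.86, 0.07, 0.07]))     # ~0.72 bits: near the 0.68 figure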

cs702
0 replies
4h6m

Thank you. Others on this thread have addressed the citation-trail issues you raise. I just want to tell you how helpful I find your comment about why ternary weights ought to work at all without degrading performance:

> My best guess is that it is encouraging the network to choose good underlying subnetworks to solve the problem, similar to Lottery Ticket Hypothesis. With ternary weights it is just about who connects to who (ie a graph), and not about the individual weight values anymore.

Your guess sounds and feels right to me, even if currently there's no way to express it formally, with the rigor it deserves.

Thank you again for your comment!

antimatter15
0 replies
15h15m

They cite straight through estimators in the previous work with many of the same authors on (actual binary) BitNet

nutanc
13 replies
1d3h

We have been experimenting with the paper(https://www.researchgate.net/publication/372834606_ON_NON-IT...).

There is a mathematical proof that binary representation is enough to capture the latent space. And in fact we don't even need to do "training" to get that representation.

The practical application we tried out for this algorithm was to create an alternate space for mpnet embeddings of Wikipedia paragraphs. Using Bit embedding we are able to represent 36 million passages of Wikipedia in 2GB.(https://gpt3experiments.substack.com/p/building-a-vector-dat...)
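
(For readers unfamiliar with bit embeddings in general: the construction above isn't public, but the usual baseline -- which this sketch shows, not their method -- is to keep one bit per dimension and search by Hamming distance. Sizes below are made up.)

    import numpy as np

    def binarize(embeddings):
        # Keep only the sign of each dimension, packed 8 dimensions per byte.
        return np.packbits(embeddings > 0, axis=-1)

    def hamming_search(query_bits, db_bits, k=5):
        # Indices of the k nearest passages by Hamming distance.
        dist = np.unpackbits(query_bits ^ db_bits, axis=-1).sum(axis=-1)
        return np.argsort(dist)[:k]

    db = np.random.randn(10_000, 768).astype(np.float32)   # stand-in for mpnet embeddings
    db_bits = binarize(db)                                  # 768 dims -> 96 bytes per passage
    print(hamming_search(binarize(db[42:43]), db_bits))     # row 42 should come back first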

m3kw9
4 replies
23h47m

How is this not lossy compression?

sandyarmstrong
1 replies
23h24m

LLMs and vector embeddings are always lossy compression, yes?

eru
0 replies
16h32m

Almost always. Though you can use them in a lossless compression system, too, with a few tricks.

rf15
0 replies
23h27m

It kind of is!

cs702
3 replies
1d1h

You're talking about mapping floating-point vector representations, i.e., embeddings, computed by a pretrained LLM to binary vector representations, right? And you're talking about doing this by first having someone else's pretrained LLM compute the embeddings, right? Sorry, but that seems only minimally, tangentially related to the topic of running LLMs in ternary space. I don't see how your comment is relevant to the discussion here.

nutanc
2 replies
1d1h

Yeah, sorry, I needed a much bigger canvas than a comment to explain. Let me try again. The example I took was to show mapping from one space to another space, and it may have just come across as not learning anything. Yes, you are right, it was someone else's pretrained LLM. But this new space learnt the latent representations of the original embedding space. Now, instead of the original embedding space it could also have been some image representation or some audio representation. Even neural networks take input in X space and learn a representation in Y space. The paper shows that any layer of a neural network can in fact be replaced with a set of planes, that we can represent a space using those planes, and that those planes can be created in a non-iterative way. Not sure if I am being clear, but I have written a small blog post to show, for MNIST, how an NN creates the planes (https://gpt3experiments.substack.com/p/understanding-neural-...). Will write more on how, once these planes are drawn, we can use a bit representation instead of floating point values to get similar accuracy in prediction, and next how we can draw those planes without the iterative training process.

pests
1 replies
18h57m

> how we can draw those planes without the iterative training process.

Sounds interesting, but this is the part I would need more explanation on.

Just started reading your linked blog, I see it goes into some details there.

nutanc
0 replies
14h40m

Will add a lot more details next week. Have been postponing it for a long time.

fabmilo
2 replies
20h59m

I find this extremely interesting. Do you share the source code of the process? any more references?

nutanc
1 replies
14h36m

Unfortunately the source code is currently not open sourced. Some more details at (https://www.researchgate.net/publication/370980395_A_NEURAL_...), the source code is built on top of this.

The approach is used to solve other problems and papers have been published under https://www.researchgate.net/profile/K-Eswaran

We are currently trying to build a full-fledged LLM using just this approach (no LLM training etc.) and also an ASR. We should have something to share in a couple of months.

licnep
0 replies
6h34m

Am I missing something or is this just a linear transformation?

It says here ( https://www.researchgate.net/publication/370980395_A_NEURAL_... ) that each layer can be represented as a matrix multiplication (equation 3): Ax = s

So concatenating multiple layers could just be reduced to a single matrix multiplication?

If there is no non-linearity I don't see how this could replace neural networks, or am I missing something?

SushiHippie
0 replies
23h56m

Wow, this works better than I would've thought.

Who moderates Hacker News?

First result:

Hacker News

At the end of March 2014, Graham stepped away from his leadership role at Y Combinator, leaving Hacker News administration in the hands of other staff members. The site is currently moderated by Daniel Gackle who posts under the username "dang".

Noe2097
5 replies
20h9m

There is another _shocking_ realization in this work: there are 11 types of people: those who know what binary means, those who don't, and those who say they do but actually don't.

"The era of 1-bit LLMs"

Representing { -1, 0, 1 } can't be done with 1-bit, I'm sorry -- and sad, please let's all get back to something vaguely sound and rigorous.

esrauch
1 replies
16h33m

One trit, but that's not a word anyone knows.

baq
0 replies
12h8m

That used to be true yesterday…

npunt
0 replies
20h5m

Ternary supporters are always bitter about this

(I'll let myself out)

hk__2
0 replies
1h18m

> please let's all get back to something vaguely sound and rigorous

Something rigorous would be to actually read the paper rather than stop at the first part of its title. The authors are not claiming their LLM is 1-bit.

gpderetta
0 replies
2h21m

There are 10 types of people, those who don't know binary, those who do and those who know ternary.

phkahler
3 replies
3h47m

> we can replace elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with elementwise additions (a₁+b₁ + a₂+b₂ ...), in which signs depend on each value

Thinking out loud here. If you encode 64 weights in 2 64-bit words you can have the bits in one word indicating +1 if they're 1, and the bits in the other word indicating -1 if they are 1. You should be able to do the "products" with a few boolean operations on these 2 words to get a pair of 64 bit words for the result. Then summing becomes a matter of using a count-of-1's instruction on each word and subtracting the "negative" count from the positive. If AVX instructions can do this too, it seems like equivalent of 10-100 TOPS might be possible on a multi-core CPU.

cs702
2 replies
3h25m

Yes. More generally, this will enable implementation via crazy-cheap bit-wise ops in binary hardware, and possibly, maybe, via crazy-cheap trit-wise ops in ternary hardware that manipulates ternary digits, or trits. Note that any binary op over trits has only nine possible (trit, trit) input pairs and only three possible trit outputs. Maybe ternary hardware for AI will become a thing?

phkahler
1 replies
3h5m

Fleshing out my thought above. If we want to multiply A*B = C and all operands are stored in 2 separate bits Ap and An (Ap = 1 if A = +1 while An = 1 if A = -1). We can do a product with:

Cp = (Ap & Bp) | (An & Bn)

Cn = (An & Bp) | (Ap & Bn)

So 64 products in 6 instructions, or 256 in 6 instructions with AVX2, or 512 in six instructions using AVX512. If you can execute 2 instructions at a time on different words, this becomes 1024 "products" in 6 cycles or between 0.5 and 1 TOP per core.

The summing still involves using popcount on the positive and negative bits - I doubt AVX supports that but it's still a fast way to "sum" individual bits. I don't see custom hardware for this as a short term thing - they need to prove out the quantization concept more first.
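
A quick Python model of those two bit-planes, with arbitrary-precision ints and bit_count() (Python 3.10+) standing in for 64-bit registers and popcount. It models a ternary-times-ternary dot product, as in the packing above; in the paper itself the activations are int8, as noted elsewhere in the thread.

    import random

    def pack(ternary):
        # Pack a list of {-1, 0, +1} values into two bit-planes (pos, neg).
        pos = neg = 0
        for i, t in enumerate(ternary):
            if t == 1:
                pos |= 1 << i
            elif t == -1:
                neg |= 1 << i
        return pos, neg

    def ternary_dot(a, b):
        # Dot product via boolean ops plus popcount, no multiplications.
        ap, an = pack(a)
        bp, bn = pack(b)
        cp = (ap & bp) | (an & bn)        # positions contributing +1
        cn = (an & bp) | (ap & bn)        # positions contributing -1
        return cp.bit_count() - cn.bit_count()

    a = [random.choice((-1, 0, 1)) for _ in range(64)]
    b = [random.choice((-1, 0, 1)) for _ in range(64)]
    assert ternary_dot(a, b) == sum(x * y for x, y in zip(a, b))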

cs702
0 replies
3h1m

> I don't see custom hardware for this as a short term thing - they need to prove out the quantization concept more first.

Yes, I agree. This still needs to be more extensively tested.

lr1970
3 replies
1d2h

The authors reported perplexity only for small models, up to 3B weights. On the other hand, they reported throughput for the 70B model, but not its performance (perplexity, end-to-end tasks). A very unfortunate omission. Overall, the paper is rather poorly written.

cs702
2 replies
1d2h

If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to make the comparisons apples-to-apples. That's sensible. It allows for easier replication of the results. Otherwise, I agree with you that more extensive testing, after more extensive pretraining, at larger model sizes, is still necessary.

lr1970
1 replies
1d1h

towards the end of the paper they mentioned training on 2T tokens.

cs702
0 replies
21h19m

You're right. Thank you for pointing that out.

jandrese
3 replies
1d2h

It seems like the AI space is slowly coming back around to the old Thinking Machines CM-1 architecture. It's not too often in computing where you see ideas a full 40 years ahead of their time make it into production.

theendisney
1 replies
15h15m

Memristors any moment now

TMWNN
0 replies
3h37m

I'm holding out for Josephson junctions

giantrobot
0 replies
19h28m

IIUC the main issue with the CM-1 architecture was feeding the processor cluster with data. That required a heftier front end system than was practical/affordable at the time. With modern CPUs and memory subsystems the GPUs can be saturated pretty easily. So going back to huge clusters of super narrow cores won't starve them for work.

fzliu
2 replies
19h51m

This will be big for FPGAs - adders are extremely cheap compared to multipliers and other DSP blocks.

eru
1 replies
16h27m

Multipliers for eg 8 bit or 4 bit floating point values should also be pretty cheap? (I assume multipliers have a cost that grows quadratically with the number of bits?)

imtringued
0 replies
3h50m

You use DSPs for that. Effinix has direct bfloat16 support in their FPGAs. The real game changer is using the carry chain with your LUT based adders. Assuming 16 LUTs, you could be getting 11 teraops out of a Ti180 using a few watts. Of course that is just a theoretical number, but I could imagine using four FPGAs for speech recognition, synthesis, and vision-based LLMs operating in real time.

verytrivial
1 replies
16h49m

> If the proposed methods are implemented in hardware

... and the paper is _true_, of course. Indeed, this sort of compounding quantum leap in efficiency due to representational change starts to get towards the Black Mirror / sci-fi foundational-mythology level of acceleration. Wild (if true!)

eru
0 replies
16h26m

Slight tangent: in physics a quantum leap is the smallest possible change.

sva_
1 replies
20h47m

I'm also curious about the potential speed gains in automatic differentiation, as there are way fewer branches to 'go up'. Or am I wrong here?

lumost
0 replies
20h44m

They actually use a relu to represent the model weights. But I'm not convinced that this can't be avoided. We do gradient boosted decision tree training without this trick.

rhaps0dy
1 replies
1d3h

I think you need more evidence than this paper (which is very short and light on actual numbers) to be this shocked.

For example, most of the plots in the paper are actually of throughput, memory, etc. all performance characteristics that are better on the ternary version. Which, of course.

The only things that contain perplexities are Tables 1 and 2. There, they compare "BitNet b1.58 to our reproduced FP16 LLaMA LLM in various sizes" on the RedPajama data set. The first thing to note is that the perplexities are very high: they're all at least ~9.9; compare that, for example, with quantized Llama on wikitext-2, which is 6.15 (https://www.xzh.me/2023/09/a-perplexity-benchmark-of-llamacp...). Maybe RedPajama is a lot harder than wikitext-2, but that's a big gap.

I think probably their benchmark (their "reproduced FP16 LLaMA LLM") is just not very good. They didn't invest much in training their baseline and so they handily beat it.

cs702
0 replies
1d3h

Thank you. I think the paper as it is provides enough evidence to support the claims. If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to make the comparisons apples-to-apples. That's sensible. It allows for easier replication of the results. Otherwise, I agree with you that more extensive testing, after more extensive pretraining, is still necessary.

flockonus
1 replies
22h2m

Considering how much faster additions are processed, and how a particular silicon chip could be optimized for this very specific case, all parts added together could perhaps show a >100x speed-up vs current systems.

I must concur, "wow".

Nevermark
0 replies
15h42m

For hardware, 2-argument ternary additions and multiplications should be very close in terms of the tiny circuit required for either.

If you are doing ternary calculations on 32/16-bit hardware, then the additions would be simpler.

abeppu
1 replies
1d1h

> On existing hardware, the gains in compute and memory efficiency are significant, without performance degradation (as tested by the authors).

Did they actually show absence of performance degradation?

I think it's conspicuous that Table 1 and Table 2 in the paper, which show perplexity and accuracy results respectively, are only for small model sizes, whereas Figure 2, Figure 3 (latency, memory, energy consumption) and Table 3 (throughput) all show larger model sizes. So it seems like they had every opportunity to show the perplexity/accuracy comparisons at the larger model sizes, but did not include them.

tbalsam
0 replies
1h54m

It's not too surprising, honestly! I've poked around with similar in the past and am of the perspective that ternary is a very good thing for a lot of neural networks.

Training CIFAR-10 speedily w/ ternary weights on an fp16 interface (using fp16 buffers, and norm params unchanged): https://gist.github.com/tysam-code/a43c0fab332e50163b74141bc...

p1esk
0 replies
20h34m

Ternary networks have been used since 2015. There are hundreds of papers. They all require full QAT (training from scratch). Not sure why you’re shocked.

jsnelgro
0 replies
15m

It all seems too good to be true but your comment helped me develop a mental model for how this could work.

The most inspiring aspect to me here is just realizing how much potential low-hanging fruit there is in this space! What other seemingly naïve optimizations are there to try out?

fragmede
0 replies
18h15m

> In existing LLMs, we can replace all parameter floating-point values representing real numbers with ternary values representing (-1, 0, 1).

does that mean we can do integer instead of floating point math for some parts of the training? that seems like a really big win

chrsw
0 replies
2h41m

It almost seems too good to be true

bjornsing
0 replies
11h10m

> * In matrix multiplications (e.g., weights by vectors), we can replace elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with elementwise additions (a₁+b₁ + a₂+b₂ ...), in which signs depend on each value. See the paper for exact details.

Aren’t you over complicating it a bit here? A dot product between a vector of activations (a₁, a₂, …) and a vector of ternary weights (b₁, b₂, …) can of course be computed as the sum of all activations for which the weight is 1, minus the sum of all activations for which the weight is -1.

It can’t however be computed as (a₁+b₁ + a₂+b₂ ...). You must have gotten that wrong.

beagle3
0 replies
1d2h

I haven't been keeping tabs, but this seems very much like the RIP / Achlioptas version of the Johnson-Lindenstrauss lemma.

Perhaps the rest of the JL lemma promise applies as well - compressing the number of parameters by a few orders of magnitude as well.

api
0 replies
3h51m

Question is whether you can train in this domain or whether you need increased precision to properly represent gradients.

If we could train in this domain it would be an even bigger game changer.

acchow
0 replies
20h33m

Conversely, this also implies our current model sizes can still embed a ton more “understanding”

PaulHoule
0 replies
20h41m

I am not startled at all. Dense vector representations are pretty silly, they can’t really be the road to knowledge representation.

AaronFriel
0 replies
20h4m

In undergrad, some of us math majors would joke that there's really only three quantities: 0, 1, infinity.

So, do we need the -1, and/or would a 2.32 bit (5 state, or 6 with +/-0) LLM perform better than a 1.58 bit LLM?

lucubratory
47 replies
1d6h

After reading the results I skipped back to the comment section to ask if this was real, because it looks a little too good to be true, but I figured I should check the authors, and it's Microsoft Research and UCAS, so yeah, real. This is going to change a lot of things: obviously the edge computing applications they point out, but it's also going to bottom out the cost of providing high-performance LLMs in the cloud. I don't know what that means for the economics long term; naively, much lower costs might mean new entrants without an entire cloud available can compete more easily? I do wonder if something like this has already been found and implemented by either OpenAI or Google.

aurareturn
35 replies
1d5h

After playing with OpenAI's GPT4 API, I'm quite convinced that LLMs would be in everything and everywhere today if inference cost is as low as loading a website and context size is 100x higher.

In other words, only inference cost is holding it back from completely changing everything.

So if we have a shortcut to getting something like GPT4 to run locally on a small device, watch out.

jart
21 replies
1d5h

LLMs will give normal people a firmer standing in technological society. That's a good thing. But will it change everything? Not a chance. Even if LLMs did change everything, that probably would not be a good thing. Dijkstra says Muslim algebra died when it returned to the rhetoric style, and the modern civilized world could only emerge —for better or for worse— when Western Europe could free itself from the fetters of medieval scholasticism —a vain attempt at verbal precision!—thanks to the carefully, or at least consciously designed formal symbolisms that we owe to people like Vieta, Descartes, Leibniz, and (later) Boole. So don't be so proud of these graphics cards you've made, because the ability to understand the human tongue is insignificant compared to the power of math.

rafaelero
16 replies
1d4h

LLMs can do math as well.

dns_snek
10 replies
1d4h

Last time I checked, GPT-4 couldn't reliably add 2 numbers, never mind anything more complex.

vidarh
8 replies
1d3h

Last I checked (and confirmed by repeating it just now) GPT-4 did just fine at adding 2 numbers up, because it knows better now than to do that manually and will express it as Python. It does worse if you try to force it to do it step by step like a child and don't reinforce adherence to the rules every step, because just like humans it gets "sloppy" when you try to get it to repeat the same steps over and over.

If you want to measure its ability to do mindlessly repetitive tasks without diverging from instructions, you should compare it to humans doing the same, not expect it to act like a calculator.

If you want to measure its ability to solve problems that involve many such steps that are simple to express but tedious to carry out, ask it to write and evaluate code to do it instead.

imtringued
3 replies
1d2h

You do realize that arithmetic is a very simple symbolic manipulation task? All you have to do is keep track of the carry. I haven't seen an LLM that couldn't get digit by digit addition done, but they always mess up the carry.

vidarh
2 replies
23h43m

Just like humans. Try to get regular people to e.g. add 15-16 digit numbers (which is typically where I'd see GPT4 start to get "sloppy", unless you prompt it the way you would a child who's learning and is still prone to get annoyed and wonder why the hell you make them do it manually), and see how many start making mistakes.

I find it really comical that this is what people complain about GPT over - there's zero benefit to getting LLMs good at this over other tasks. To the extent we get it "for free" as a benefit of other learning, sure. But when we make kids practice this over and over again to drill doing it without getting sloppy, it has traditionally been out of some belief that it's important; a computer will always have a "calculator" at its disposal that is far more efficient than the LLM, and it's idiocy to care whether it does that part well the tedious and hard way or knows how to describe the problem to a more efficient tool.

I also find it comical that people use tasks where LLM behaviour is, if anything, most human-like - in its tendency to lose focus and start taking shortcuts when presented with stupidly repetitive tasks (before GPT4 started writing Python instead, it'd for a while try really hard not to give you a step by step breakdown and instead clearly take shortcuts even if you prompted it heavily to reason through it step by step) - as examples of how they're not good enough.

imtringued
0 replies
16m

Is this rant really necessary? Most models, especially ChatGPT4, can perform carry-based addition and there is zero reason for them to fail at it, but the moment you start using quantized models such as the 5-bit Mixtral 8x7B, the quality drops annoyingly. Is it really too much to ask? It's possible and it has been done. Now I'm supposed to whip out a Python interpreter for this stuff, because the LLM is literally pretending to be a stupid human, really?

emmender2
0 replies
2h59m

This goes to the heart of what it means to "know".

All human knowledge is "symbolic". That is, knowledge is a set of abstractions (concepts) along with relations between concepts. As an example, to "know" addition is to understand the "algorithm" or operations involved in adding two numbers. Reasoning is the act of traversing concept chains.

LLMs don't yet operate at the symbolic level, and hence it could be argued that they don't know anything. The LLM is a modern sophist, excelling at language but not at reasoning.

dns_snek
3 replies
1d3h

The claim was that "LLMs can do math". Below they linked a model from Google that might be capable of that, but as a general rule (and with OpenAI's models specifically) LLMs can't "do math" by any reasonable definition.

vidarh
1 replies
1d3h

I've had it do plenty of math. Some it does badly at, some it does fine. Generally it's not "disciplined" enough to do things that require lots of rote repetitive steps, but neither are most humans, and that has improved drastically as they've adjusted it to instead do what most humans do and use tools. Would it be nice if it also got more willing to "stick to it" when given rote tasks? Sure.

But whether or not it can "do maths" to your definition depends very much on what you want it to do, and how you define "do maths". To me it's irrelevant if it's doing the low-level calculations as long as it knows how to express them as code. If I wanted a calculator I'd use a calculator. And I don't consider a calculator able to "do math" just because it can precisely add numbers.

Meanwhile I've had lengthy discussions with GPT about subjects like orbital mechanics and calculating atmospheric effects where it correctly used maths that I had to double-check, not because I didn't trust GPT (though I also wanted to verify for that reason) but because I didn't know the maths (not that it was anything particularly advanced, but I lost interest in maths during my CS degree and picked the minimum amount of maths I could get away with).

By my definition it can "do maths" just fine. I guess you don't consider my view of that "reasonable". I can live with that, as meanwhile, it will keep doing maths for me when I need it.

Of course this was also a case of moving the goalposts to set up a strawman - in the comment of yours I replied to, you claimed it couldn't reliably add two numbers.

dns_snek
0 replies
19h59m

It often fails at basic 3-4 digit arithmetic. If you're stretching that definition far enough to claim that GPT4 can "do math" then I should be able to call myself a commercial pilot because I can land a plane in a sim 20% of the time.

I'm not moving goalposts, the original claim was that LLMs can "do math". Primary school arithmetic is math.

GPT-4 can't do math and that's okay, I don't understand why so many of you are so touchy and defensive about this. It's a limitation that exists, nothing more, nothing less.

int_19h
0 replies
22h19m

GPT-4 is a tiny subset of "LLMs".

If you train a model to do math (and optimize representation for that), it'll do math. GPT-4 just isn't, and, generally speaking, they aren't, because it's much more efficient to train them to "use a calculator". Same as with humans.

mikewarot
0 replies
20h51m

GPT-x can't add, or subtract, or do anything else of the type... it can APPEAR to do so, because that's what it was built to do.... act like the text it's seen previously and predict what the next text would be.

If you include a large amount of properly solved math in its training text, it gets MUCH better at that kind of math.

It has a very deep set of intelligences that are alien to us, that allow it to predict and ACT LIKE us, when it comes to generating the next word. You're only seeing the output of those intelligences through a very lossy channel.

As a side note, there are structures in human language that apparently encode much more information than you might think at first glance. The fact that Word2Vec had such mathematical properties, despite its relative simplicity, astounds me to this day. Throwing a bunch of sine/cosine values on top of that to represent position in a sentence to enable LLMs is also amazing in that it works.

jart
1 replies
1d4h

What makes you think that? Which LLMs?

lovasoa
0 replies
1d

- Hey ChatGPT! What is 69*94?

- The result of 69*94 is 6466.

ekianjo
0 replies
1d4h

most open models do it poorly though. ChatGPT is better at it.

cooper_ganglia
0 replies
22h31m

This comment reminded me of that scene in Indiana Jones where the guy is spinning the sword around about to attack Indy, and then Indy just pulls out his pistol and shoots him.

ordu
2 replies
20h53m

> the modern civilized world could only emerge —for better or for worse— when Western Europe could free itself from the fetters of medieval scholasticism

I can propose an alternate view of things. Not that I'm going to argue that it is the only true statement in the world, but I think it is necessary for a thought to progress to have an alternative hypothesis.

So the proposition is: formal symbolisms can deal only with those problems that were already solved in imprecise human languages.

To invent calculus and orbital mechanics you first need to talk for several centuries (or thousands of years?) about what position and velocity are, you need to talk your way up to acceleration, and then you need to find a way to measure them and to define them in strict geometric terms. Ah, and infinity: it was a very counter-intuitive idea; Zeno invented some of his paradoxes specifically to point at that counter-intuitiveness. By the time Newton came along, all these talks and debates had done most of the work for him.

> the ability to understand the human tongue is insignificant compared to the power of math.

But the fun part is: you cannot know whether someone understands math if they do not understand human language too. You cannot teach math to those who cannot speak a human language.

Math is a cream on top with limited applicability. What can math say about love? I do not like to sound like Dumbledore, but really, behind all we do there are emotions motivating us. Math cannot deal with emotions, because it was built that way and because non-math talk about emotions hasn't produced a good model for emotions that math could express in a formalized language.

> Dijkstra says

I wonder when he said it? Before expert systems based on logic were acknowledged to be a failure, or after?

eru
1 replies
16h5m

> So the proposition is: formal symbolisms can deal only with those problems that were already solved in imprecise human languages.

> To invent calculus and orbital mechanics you first need to talk for several centuries (or thousands of years?) about what position and velocity are, you need to talk your way up to acceleration, and then you need to find a way to measure them and to define them in strict geometric terms. Ah, and infinity: it was a very counter-intuitive idea; Zeno invented some of his paradoxes specifically to point at that counter-intuitiveness. By the time Newton came along, all these talks and debates had done most of the work for him.

For the sake of argument, let's grant your story about what you need to invent calculus.

But once you invented calculus, you can then use it to solve all kinds of problems that you would never in a thousand years be able to handle with mere talk.

ordu
0 replies
2h59m

> all kinds of problems that you would never in a thousand years be able to handle with mere talk

Not "all kinds of problems" but very specific kinds of problems which is possible to formalize into a math language. How would you go about inventing thermodynamics if you didn't know words "temperature" and "pressure"? You'd need to start for your senses that can tell you "this is a hot surface", or "this is a cold one", or "this one is colder than that", you need to decide that "coldness" is a "negative heat" (it is not the most obvious idea for an animal, because animals have as receptors for a cold, so receptors for a heat, you could feel hot and cold at the same time, if you managed to stimulate both kinds of receptors at the same time). Then you need to notice that some materials change volume when heated, then you need to come up with an idea to use measurements of a volume to measure a temperature, and only then you can try to invent pV=nRT, which becomes almost tautological at that point, because your operational definition of a temperature makes it equivalent to a volume.

After that you really can use calculus and make all sorts of quantitative statements about thermodynamic systems. But before all that "mere talk" was finished thermodynamics was not a kind of a problem calculus can deal with.

samatman
0 replies
1d4h

I agree with your basic thesis here: in retrospect, LLMs will be viewed as a transitional architecture.

However, this paper is evidence that the field is figuring out how to build what's actually needed, which is a good thing.

rvnx
8 replies
1d5h

It's coming in October with the new Apple chip

sigmoid10
7 replies
1d5h

I'd be very surprised if Apple can put something on the level of GPT4 on a handheld. Remember, GPT4 is estimated to be around 1.7 trillion parameters. That's 3.4 TB at 16 bits and it would still be ~340 GB at 1.58 bits. The best we can hope for is a low-ish few-billion-parameter model. Which would still be cool on a phone, but as of today these models are nowhere near GPT4.
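
The arithmetic, for reference (the parameter count is the commenter's estimate, not a confirmed figure):

    params = 1.7e12                        # estimated GPT4 parameter count
    print(params * 16 / 8 / 1e12, "TB")    # 3.4 TB at 16 bits per parameter
    print(params * 1.58 / 8 / 1e9, "GB")   # ~336 GB at 1.58 bits per parameter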

ynniv
4 replies
1d4h

You don't need "GPT4" though. Mixtral 8x7B is robust and can be run in 36 GB, 24 GB if you're willing to compromise. A 1.5-bit quantization should bring it down to 16 GB. That's still a lot compared to the iPhone 15's 6 GB, but it's close enough to imagine it happening soon. With some kind of streaming-from-flash architecture you might be in the realm already.

creshal
3 replies
1d3h

> With some kind of streaming-from-flash architecture you might be in the realm already.

I thought mmap'ing models to only keep the currently needed pieces in RAM was something that was figured out ~6 months ago? Performance wasn't terribly great iirc, but with how much faster 1.58B is, it should still be okay-ish.

liuliu
0 replies
23h45m

There is a more detailed paper from Apple on this. Basically, you can do a little bit better than only keeping current weights in RAM with mmap.

For an LLM, you are mostly dealing with b = W @ a where a and b are vectors and only W is a matrix. If a is sparse (i.e. has a few 0s), you don't need all the columns of W to do the matrix-vector multiplication. A cleverly arranged W can make sure that during inference only the relevant columns are loaded from flash. Furthermore, if you apply the "One Weird Trick" paper to this matrix-vector multiplication, you can shard W by rows, i.e. `b[i:i+n] = W[i:i+n,:] @ a` for i in range(0, N, n), such that while the previous b[i:i+n] is still computing, you already have visibility on which columns of the next matrix need to be loaded.
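
A rough sketch of that column-selection idea -- an illustration, not Apple's implementation: when the activation vector is sparse, only the matching columns of W ever need to be resident.

    import numpy as np

    def sparse_matvec(load_column, a, n_rows):
        # Compute W @ a while only "loading" the columns of W where a is nonzero;
        # load_column(j) stands in for fetching column j of W from flash.
        b = np.zeros(n_rows, dtype=np.float32)
        for j in np.nonzero(a)[0]:
            b += load_column(j) * a[j]
        return b

    W = np.random.randn(8, 16).astype(np.float32)
    a = np.random.randn(16).astype(np.float32)
    a[np.abs(a) < 1.0] = 0.0                        # make the activations sparse
    b = sparse_matvec(lambda j: W[:, j], a, W.shape[0])
    assert np.allclose(b, W @ a, atol=1e-5)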

imtringued
0 replies
1d2h

I'm not sure what use that is, other than to maintain the KV cache across requests.

cjbprime
0 replies
21h10m

You need all of the model in RAM to perform the matmult that gets you the next token from it. There's no shortcut.

jairuhme
1 replies
1d5h

They won't have something at that size because as you pointed out, it is still huge. But depending on how they are used, smaller parameter models may be better for specific on-phone tasks that start to make the size of the model not a problem. GPT4 is so large because it is very general purpose with the goal seeming to be to answer anything. You could have a smaller model focused solely on Siri or something that wouldn't require the parameter size of GPT4

sigmoid10
0 replies
21h24m

The thing about GPT4 that matters so much is not just raw knowledge retention, but complex, abstract reasoning and even knowing what it doesn't know. We haven't seen that yet in smaller models and it's unclear if it is even possible. The best we could hope for right now is a better natural language interface than Siri for calling OS functions.

declaredapple
2 replies
1d5h

I'll agree with you, and add that inference speed is a big factor too.

SDXL-Lightning/Cascade can generate images in 200 ms, which is fast enough to fit in a web request, and paradoxically makes it even cheaper to generate.

And using groq at 500 t/s is wild compared to any of the other platforms.

pennomi
1 replies
1d1h

500 t/s is uncomfortably fast to me. Generating high quality answers at speeds faster than I can read is the point at which I feel like LLMs are magic.

I’m glad people are doing it though, and I’ll happily adapt to accessing inference at that speed.

azinman2
0 replies
23h42m

That's important for new applications to emerge where this happens on lots of data. You can't run LLMs at scale on tasks like Google might (every webpage) when the cost of each document is so high to process. Interactive chatbots are just the tip.

gitfan86
0 replies
1d4h

That is the plan. Even if these independent software improvements don't create 10x gains, NVDA and others are making huge improvements of their own.

btbuildem
4 replies
1d5h

If this dethrones Nvidia, it would be a wonderful side effect

rafaelero
3 replies
1d4h

It's more likely that Nvidia will offer support for INT2 in the next generation and keep their dominance.

Klipper3
1 replies
20h6m

INT2 ternary is equivalent to INT1 + binary mask. Nvidia supported INT1 matrix multiply in the RTX20 and RTX30 generations, nobody used it, so they removed INT1 support from the RTX40 generation.

ByThyGrace
0 replies
5h2m

What I get from your comment is now older RTX gens are going to be in high demand soon.

ActionHank
0 replies
6h27m

"next generation" those two words mean a whole lot.

Intel and AMD could also implement support in their "next generation" and that would be huge.

anon373839
3 replies
1d5h

It also means the largest models can be scaled up significantly with the same inference budget.

llm_trw
2 replies
1d5h

Depends. The only paper they cite for training: https://arxiv.org/pdf/2310.11453.pdf doesn't improve training costs much and most models are already training constrained. Not everyone has $200m to throw at training another model from scratch.

arunk47
1 replies
20h43m

Is there any scope for indie builders?

llm_trw
0 replies
17h57m

Not really. These are slightly better for memory during pre-training and fine-tuning, but not enough to make a 4090 usable even for a 7B model.

wongarsu
1 replies
1d5h

I wouldn't be surprised if this causes hardware startups to pop up that build accelerator cards tuned for this architecture. It seems stupidly simple to do inference in hardware, and with most of the training being quantized as well you might even be able to provide speedups (and energy savings) for training with reasonable investment and on cheaper processor nodes than what Nvidia is using.

Sure, Nvidia might eat their lunch in a couple of years, but bitcoin ASICs prove that you can have a niche producing specialized processors, and VCs would probably jump at the thought of disrupting Nvidia's high margin business.

anon291
0 replies
1d

There's like a million startups promising analog / bit-level computation, inference-only, cheap computation.

There's rain.ai, d-matrix, etc.

osigurdson
40 replies
1d5h

I have often mused that, in some ways, it seems like the transistor is really being wasted in AI applications. We use binary states in normal computing to reduce entropy. In AI this is less of a concern, so why not use more of the available voltage range? Basically, re-think the role of the transistor and re-design from the ground up - maybe NAND gates are not the ideal fundamental building block here?

sigmoid10
10 replies
1d5h

People are working on that [1]. In some sense, it's a step back to analog computing. Add/multiply is possible to do directly in memory with voltages, but it's less versatile (and stable) than digital computing. So you can't do all calculations in a neural network that way, meaning some digital components will always be necessary. But I'm pretty sure analog will make a comeback for AI chips sooner or later.

[1] https://www.nature.com/articles/s41586-023-06337-5

trebligdivad
5 replies
1d

Trinary however is an interesting middle; people have built trinary hardware long ago; it feels like you could make natively trinary hardware for something like this; it might even be quite a win.

int_19h
2 replies
22h23m

People haven't built reliable ternary electronics, though. Soviets tried with Setun, but they eventually had to resort to emulating each trit with two hardware bits (and wasting one state out of the possible four).

eru
1 replies
16h14m

If you are using two bits anyway, you might as well represent (-2, -1, 0, 1) instead of ternary?

int_19h
0 replies
14h55m

Sure, but then you lose the symmetry that makes trits so convenient for many things.

thsksbd
1 replies
1d

Can you make a "CMOS" three-voltage-level circuit though? One where current only flows when the state changes?

I'm not in this field, but that's a question that's been bugging me for a while. If you can't do this, wouldn't energy consumption balloon?

neomantra
0 replies
5h44m

My friend was working on this in the mid-90s at Texas Instruments. Not sure what the underlying semiconductors were, but it did involve making ternary logic via voltage levels. Just searched a bit and found this TI datasheet which might be an example of it (high logic, low logic, high impedance), but maybe not: https://www.ti.com/lit/ds/symlink/sn74act534.pdf

zcw100
2 replies
1d4h

Reminds me of my father saying something about how vacuum tubes are great integrators.

monocasa
1 replies
1d2h

Chips are too. Opamps can add, multiply, subtract, divide, integrate and differentiate depending on how they're plugged in.

klysm
0 replies
23h39m

Hence the name 'operational' amplifier

irrelative
0 replies
1d1h

Hadn't thought about it this way before, but given that LLMs are auto regressive (use their own data for next data), they're sensitive to error drift in ways that are rather similar to analog computers.

gryn
3 replies
1d5h

The reason digital/numeric processing won is power loss in the analog world. When you design an analog circuit, the next processing stage you add at the end has an impact on the ones before it.

This then requires higher skill from the engineers/consumers.

If you want to avoid that, you need to add op-amps with a gain of 1 at the boundary of each stage; this also takes care of the power loss at each stage.

The other part is that there's a limit to the amount of useful information/computation you can do with analog processing once you take voltage noise into account. When you do the comparison, there are places where analog wins but also places where digital wins.

I'll edit this later with a link to some papers that discuss these topics if I manage to find them in my mess.

im3w1l
0 replies
1d

For the specific case of neural networks they seem to be very resistant to noise. That's why quantization works in the first place.

dazed_confused
0 replies
1d

Good explanation. When I was working at a semiconductor manufacturer, our thresholds were like 0 - 0.2V to 0.8 - 1.0V. Additionally, if you look at QLC SSDs, their longevity is hugely degraded. Analog computing is non-trivial, to say the least.

StableAlkyne
3 replies
1d3h

It would be something of a full circle, I feel, if we went back to dedicated circuits for NNs - that's how they began life when Rosenblatt built his Perceptron.

I remember reading a review on the history in grad school (can't remember the paper) where the author stated that one of the initial interests in NNs by the military was their distributed nature. Even back then, people realized you could remove a neuron or break a connection and they would still work (and even today, dropout is a way of regularizing the network). The thinking was that being able to build a computer or automated device that could be damaged (radiation flipping bits, an impact destroying part of the circuit, etc) and still work would be an advantage given the perceived inevitability of nuclear war.

Compared to a normal von Neumann machine, which is very fault intolerant - remove the CPU and there's no processing, no memory means no useful calculation, etc. One reason people may have avoided further attempts at physical neural networks is that they're intrinsically more complex than von Neumann, since the processing and memory are intertwined (the NN is the processor, the program, and the memory at the same time).

kurisufag
1 replies
1d3h

von Braun machine

von neumann? though it is funny to imagine von braun inventing computer architecture as a side hustle to inventing rocket science.

StableAlkyne
0 replies
1d3h

Oh fuck, thanks for catching that!

intalentive
0 replies
3h17m

The US military’s interest in network robustness led to the internet if I’m not mistaken.

Also preceding the perceptron was the McCulloch & Pitts neuron, which is basically a digital gate. NNs and computing indeed have a long history together.

the8472
2 replies
1d3h

Bits are copyable without data loss. Analog properties of individual transistors are less so.

eru
1 replies
16h12m

Yes, but the whole point of the link submitted to HN here is that in some applications, like machine learning, precision doesn't matter too much.

(However, analog computing is still a bad fit for machine learning, because it requires a lot more power.)

the8472
0 replies
4h40m

Exact copies aren't just about precision but also about reproducibility.

mikewarot
2 replies
21h23m

maybe NAND gates are not the ideal fundamental building block here?

It's my long held opinion that LUTs (Look Up Tables) are the basis of computation for the future. I've been pondering this for a long time since George Gilder told us that wasting transistors was the winning strategy. What could be more wasteful than just making a huge grid of LUTs that all interconnect, with NO routing hardware?

As time goes by, the idea seems to have more and more merit. Imagine a grid of 4x4 bit look up tables, each connected to its neighbors, and clocked in 2 phases, to prevent race conditions. You eliminate the high speed long lines across chips that cause so much grief (except the clock signals, and bits to load the tables, which don't happen often).

What you lose in performance (in terms of latency), you make up for with the homogenous architecture that is easy to think about, can route around bad cells, and be compiled to almost instantly, thanks to the lack of special cases. You also don't ever have to worry about latency, it's constant.
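
A toy software sketch of the idea as I read it (purely illustrative; the grid size, bit routing, and random LUT contents are my own assumptions): each cell is a 4-in/4-out look-up table exchanging one bit with each neighbour, and only one checkerboard colour updates per clock phase so nothing races.

  import random

  N = 8  # toy grid size
  # luts[y][x] maps a 4-bit input nibble to a 4-bit output nibble.
  luts = [[[random.randrange(16) for _ in range(16)] for _ in range(N)]
          for _ in range(N)]
  # state[y][x] is the nibble a cell last emitted: bit 0 goes to the cell
  # above it, bit 1 below, bit 2 to the left, bit 3 to the right.
  state = [[0] * N for _ in range(N)]

  def inputs(state, x, y):
      # Gather the bit each neighbour is sending towards (x, y), wrapping around.
      up    = (state[(y - 1) % N][x] >> 1) & 1   # the bit it sends downward
      down  = (state[(y + 1) % N][x] >> 0) & 1   # the bit it sends upward
      left  = (state[y][(x - 1) % N] >> 3) & 1   # the bit it sends rightward
      right = (state[y][(x + 1) % N] >> 2) & 1   # the bit it sends leftward
      return up | (down << 1) | (left << 2) | (right << 3)

  def step(state, phase):
      # Two-phase clock: only one checkerboard colour updates at a time, so a
      # cell never reads a neighbour that is changing in the same tick.
      new = [row[:] for row in state]
      for y in range(N):
          for x in range(N):
              if (x + y) % 2 == phase:
                  new[y][x] = luts[y][x][inputs(state, x, y)]
      return new

  for t in range(4):
      state = step(state, t % 2)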

phdelightful
1 replies
20h56m

It’s been a long time since I worked on FPGAs, but it sounds like FPGAs! What do you see as the main differences?

mikewarot
0 replies
20h42m

No routing, no fast lines that cut across the chip, which cut way down on latency, but make FPGAs harder to build, and especially hard to compile to once you want to use them.

All that routing hardware, and the special function units featured in many FPGAs are something you have to optimize the usage of, and route to. You end up with using solvers, simulated annealing, etc... instead of a straight compile to binary expressions, and mapping to the grid.

Latency minimization is the key to getting a design to run fast in an FPGA. In a BitGrid, you know the clock speed, you know the latency by just counting the steps in the graph. BitGrid performance is determined by how many answers/second you can get from a given chip. If you had a 1 Ghz rack of BitGrid chips that could run GPT-4, with a latency of 1 mSec per token, you'd think that was horrible, but you could run a million such streams in parallel.

wakawaka28
1 replies
1d5h

I have heard of people trying to build analog AI devices, but that seems like years ago, and no news has come out about it in recent times. Maybe it is harder than it seems. I bet it is expensive to regulate voltage so precisely, and it's not a flexible enough scheme to support training neural networks like we have now, which are highly reconfigurable. I've also heard of people trying to use analog computing for more mundane things. But no devices have hit the market after so many years, so I'm assuming it is a super hard problem, maybe even intractable.

osigurdson
0 replies
1d5h

Perhaps another variation on the idea is to allow a higher error rate. For example, if a 0.01% error rate was acceptable in AI, perhaps the voltage range between states could be lowered (which has a quadratic relationship to power consumption) and clock speed could increase.

seydor
1 replies
23h43m

let's use cells

Razengan
0 replies
22h37m

We already do.

loudmax
1 replies
1d5h

The Veritasium Youtube channel did a video about this about a year ago: https://www.youtube.com/watch?v=GVsUOuSjvcg

They visit Texas company Mythic AI to discuss how they use flash memory for machine learning. There's a California company named Syntiant doing something similar.

dwightboyyy
0 replies
16h19m

I was thinking of this exact video, crazy to think that the principle is gaining momentum

eru
1 replies
16h17m

Analog computing for neural networks is always very tempting.

We use binary states in normal computing to reduce entropy. In AI this is less of a concern, so why not use more of the available voltage range?

Transistors that are fully closed or fully open use basically no energy: they either have approximately zero current or approximately zero resistance.

Transistors that are partially open dissipate a lot of energy; because they have some current flowing at some resistance. They get hot.

In addition, modern transistors are so small and so fast that the number of electrons (or holes..) flowing through them in a clock cycle is perhaps in the range of a few dozen to a hundred. So that gives you at most 7 bits (~log_2(128)) of precision to work with in an analog setting. In practice, quite a bit less because there's a lot of thermal noise. Say perhaps 4 bits.

Going from 1 bit per transistor to 4 bits (of analog precision) is not worth the drastically higher energy consumption nor the deviation from the mainstream of semi-conductor technological advances.

baq
0 replies
10h23m

As someone who knows almost nothing about electronics I assume you’d want a transistor which can open in two ways: with positive and negative voltage. I’ve seen TNAND built out of normal transistors, not sure if such exotic ones would help even if they were physically possible.

drexlspivey
0 replies
23h38m

Next Up: Quantum AI

barrenko
0 replies
1d4h

Hmm, maybe some (signaling) inspiration from biology other than neural signaling.

adrianN
0 replies
1d5h

You could call them connection machine and perhaps have an llm trained on Feynman help with the design.

MagicMoonlight
0 replies
6h5m

It’s going to be funny if it turns out biology was right all along and we end up just copying it.

BlueTemplar
0 replies
1d4h

I have heard that the first commercial neural network chip (by Intel, in the 90s) was analog ?

gojomo
16 replies
1d

That's not a 'bit' ("Binary digIT"). It's closer to a 'trit' ("TeRnary-digIT"). Specifically, ternary digits spanning {-1, 0, 1} (rather than the usual {0, 1, 2} in a base-3 numbering system) are 'balanced ternary'.

A great intro to the theoretical reasons ternary might have some promise in computing is this 2001 article from 'American Scientist', "Third Base", which quotes Knuth calling balanced-ternary "perhaps the prettiest numbering system of all" and also discusses an abortive Soviet effort in the direction of ternary computing:

http://web.archive.org/web/20011205185830/http://americansci...

In an aside, the article hints that e-nary digits (base 2.718…) if somehow made practical/meaningful, might actually be better than ternary (or perhaps even optimal?).

So maybe this paper's observation that ~"1.58 bits" (log2(3) binary digits) is a sweet spot could be further refined into some method for representing the state of an e-nary-modeled algorithm in log2(e) binary digits (~"1.44 bits") per underlying e-it.

(As it may be of renewed interest, I've also put this 2001 "American Scientist" base-3 intro as a new HN submission for discussion: https://news.ycombinator.com/item?id=39541756)
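
For the curious, the numbers behind the "base e would be optimal" aside are easy to check; a quick sketch of radix economy (cost per represented range is proportional to b / ln b, minimised at b = e) alongside the bits-per-digit figures quoted above:

  from math import e, log, log2

  for b in (2, e, 3, 4, 10):
      # b / ln(b): relative cost to cover a given numeric range in base b
      # log2(b): information per digit, e.g. ~1.58 bits/trit, ~1.44 bits per "e-it"
      print(f"base {b:6.3f}: economy {b / log(b):.3f}, bits per digit {log2(b):.3f}")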

Razengan
4 replies
23h3m

Why not a tit?

esafak
2 replies
21h25m

They renamed the biggest ML conference (NIPS) over the same joke, so don't count on it.

Razengan
1 replies
15h21m

Why? Just because it's spelled identical to a human body part?

This kind of shit is one of the most bizarre things about human society (or the prude cultures of it at least), to consider the most natural things so taboo and a "joke" to mention.

jdiff
0 replies
22h57m

Because bi- is two, tri- is three. Ti- is meaningless, and not good enough of a joke to make up for it.

o11c
1 replies
22h9m

FSVO "optimal". In practice, both physical reality and algorithm design strongly favors base 2.

nighthawk454
0 replies
22h2m

Yeah, specifically, by the definition of optimal provided - radix economy. There are plenty of other considerations one could make in other contexts. Practically, a transcendental base seems... rather impractical. And base 3 is probably not so much "more optimal" than base 2 as to warrant the extra electrical complexity, for example.

eru
0 replies
16h3m

Negative bases are probably even better, because you can represent negative numbers without worrying about extra sign handling.

dekhn
3 replies
23h7m

How useful are -0 and 0? You could splurge on two bits per value which gives you { -1, -0, 0, 1 }

schiffern
1 replies
18h8m

Rather than (indistinguishable?) 0 and -0, why not add back some magnitude in the positive direction?

  { -1, 0, 1, 2 }
is most obvious, but it's not clear whether it's better or worse than

  { -1, 0, 1/2, 1 }
Maybe theoretically (if not architecturally) it would best to "split the difference" between the two and choose

  { -1, 0, 1/phi, phi }
or perhaps the more implementable

  { -1, 0, 1, 3 }


EDIT: Of course you can also go the other way, with

  { -1, 1 }

dekhn
0 replies
18h3m

-0 is not indistinguishable from 0 in floating point math. Most ops return +0 and -0 can behave differently. I don't know of any examples where -0 is important for machine learning, though.

ant6n
0 replies
22h21m

3^5 = 243, so use a byte to represent a vector of 5 ternary values, leaving some possible signaling values.

no_identd
0 replies
23h36m

See also:

https://en.wikipedia.org/wiki/Nat_(unit) (make sure to read the footnotes, too)

Edit: See also also, on the radix economy of balanced ternary (called "tristate") vs base 3: https://web.archive.org/web/20090312094241/http://abhijit.in... + a wild Marvin Minsky appears: https://archive.fo/gL2Bv

That page also brings up the whole "but division" problem with balanced ternary, however, I personally suspect that http://degiorgi.math.hr/aaa_sem/Div_Krishna/887-889.pdf ("A Division Algorithm for Signed-Digit Arithmetic" by Chin Tung, from 1968 !) might offer an overlooked path to a solution to that problem

And see also also², this quote from TAOCP:

"Cauchy pointed out that negative digits make it unneccesary for a person to memorize the multiplication table past 5x5."

The—INCREDIBLY ANNOYING TO LOCATE—source for which is "105. Calculs numériques. sur les moyens d'éviter les erreurs dans les calculs numériques." on Pdf page 445/document page 431 here:

https://www.e-rara.ch/download/pdf/5702285?name=Tome%2520V%4...

See also also³: https://pdfs.semanticscholar.org/5f77/b1cf105024b41b6824ba91... (Vince, Andrew - Radix Representation and Rep-Tiling)

( +a vaguely related paper here on quantum mechanics & radix economy, BUT it makes the mistake of using an overly specific formula applicable only to unsigned-digit representations thus drawing the wrong conclusions: https://www.researchgate.net/profile/Vladimir_Garcia-Morales... )

kleiba
0 replies
22h20m

Note that they're not claiming that their LLM is 1-bit - they're saying that there is a 1-bit era of LLMs. What they do say is that their approach is a 1-bit LLM variant, namely a ternary LLM (they state this explicitly in the abstract).

bee_rider
0 replies
23h41m

It is obviously pretty common to represent matrices with lots of zeros in a sparse format, like csr or something. I wonder if they could get away with 1-bit representation using a sparse matrix. Of course, it would be a little different from a typical sparse matrix because there’s no problem normally having a zero-value in a structurally non-zero location.

anon373839
15 replies
1d6h

BitNet b1.58 can match the performance of the full precision baseline starting from a 3B size. ... This demonstrates that BitNet b1.58 is a Pareto improvement over the state-of-the-art LLM models.

BitNet b1.58 is enabling a new scaling law with respect to model performance and inference cost. As a reference, we can have the following equivalence between different model sizes in 1.58-bit and 16-bit based on the results in Figure 2 and 3.

• 13B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 3B FP16 LLM.

• 30B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 7B FP16 LLM.

• 70B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 13B FP16 LLM.

This paper seems to represent a monumental breakthrough in LLM efficiency, as the efficiency gains come with zero (or negative) performance penalty.

Does it seem at all likely that existing models could be converted?

accurrent
6 replies
1d6h

They seem to be using LLAMA. Might be worth trying out. Their conversion formula seems stupidly simple.

wongarsu
4 replies
1d6h

However they trained their models from scratch, which is also why they only have meaningful numbers for 700M, 1.3B, 3B and 3.9B models. Apparently they are following BitNet's approach of replacing linear layers with quantized layers during training? If it was trivial to convert existing models without performance loss I would have expected them to include a benchmark of that somewhere in the paper to generate even more impact.

imjonse
3 replies
1d6h

They present numbers for 7B to 70B models as well.

sp332
1 replies
1d6h

They do not have perplexity numbers for the larger models (see Table 2), only speed and memory benchmarks.

imjonse
0 replies
1d5h

You're both right, I skimmed the paper, saw large model numbers but didn't notice it was for speed. On the HF page they say those models are being trained.

https://huggingface.co/papers/2402.17764

"We haven't finished the training of the models beyond 3B as it requires much much more resources. However, we're optimistic about the results because we have verified that BitNet follows a similar performance-parameter scaling law as the full-precision LLMs. We'll update the results on larger models once they're ready."

anon373839
0 replies
1d6h

Those numbers are for cost only, not performance. It’s not clear they actually trained a 70B vs. just using randomly initialized parameters.

FrustratedMonky
0 replies
1d5h

Yes. I wonder how long before someone that does have a lot of compute power, like OpenAI/MS or others, can rapidly pivot and try this out on even larger models.

Doesn't this mean that current big players can rapidly expand by huge multiples in size?

btbuildem
4 replies
1d5h

Discussion on HF [1] implies that no, conversion is not helpful. It would take training the model from scratch.

1: https://huggingface.co/papers/2402.17764

anon373839
3 replies
1d3h

It’s a pity if realizing these gains absolutely requires full pre-training from scratch. I imagine more than a few people will at least try to find a way to repurpose the knowledge contained in existing models.

cooljoseph
2 replies
20h56m

You can also have another model "mentor" a new model you are teaching, to speed up training. You don't have to start from scratch with zero knowledge. This is done a lot in what is called distillation.

fnordpiglet
0 replies
17h20m

This came out a little bit ago, my open question is if this approach can be used to port weights between architectures like this.

https://arxiv.org/abs/2402.13144

eru
0 replies
16h24m

You can also re-use a lot of the infrastructure. Eg you can re-use your training data.

ignoramous
2 replies
1d5h

I wonder if 1bit quantization is the main reason why pplx.ai is faster than any other RAG or chatbot. For instance, Gemini in comparison is a turtle, though it is better at explanations, while pplx is concise.

vitorgrs
0 replies
12h42m

Nope. The model on Perplexity is a finetuned GPT-3.5 (the free one). And for the paid versions, well, you can choose between GPT-4 (not Turbo), Gemini Pro, Claude, etc.

You can choose their model ("Experimental"), but it is not faster than the other models.

All of these proprietary models are fast on Perplexity. I do guess they are using some insane cache system and better API infrastructure...

refulgentis
0 replies
17h21m

Absolutely not, 1-bit isn't even real yet. Perplexity does a ton of precaching. TL;DR: every novel query is an opportunity to cache each web page response, the response turned into embeddings, and the LLM response. That's also why I hate it; it's just a rushed version of RAG with roughly the same privacy guarantees any incumbent would have given you in the last 15 years (read: none, and they'll gleefully exploit yours while saying "whoops!")

w-m
13 replies
1d3h

I was reading Exposing Floating Point today (as Airfoil is on the HN front page and I was perusing the archive of the author). It's a blog explaining the inner workings of floating point representations. About zero values it says [0]:

Yes, the floating point standard specifies both +0.0 and −0.0. This concept is actually useful because it tells us from which “direction” the 0 was approached as a result of storing value too small to be represented in a float. For instance -10e-30f / 10e30f won’t fit in a float, however, it will produce the value of -0.0.

The authors of the LLM paper use the values {-1, 0, 1}. Connecting the two ideas, I'm now wondering whether having a 2-bit {-1, -0, 0, 1} representation might have any benefit over the proposed 1.58 bits. Could the additional -0 carry some pseudo-gradient information ("the 0 leaning towards the negative side")?

Also, I've seen 2-bit quantizations being proposed in other LLM quantization papers. What values are they using?

[0] https://ciechanow.ski/exposing-floating-point/#zero
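
The signed-zero behaviour quoted from that blog is easy to reproduce; a quick check (numpy used only to force single precision):

  import numpy as np

  x = np.float32(-10e-30) / np.float32(10e30)  # result underflows in float32
  print(x, np.signbit(x))                      # prints: -0.0 True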

creshal
5 replies
1d3h

Could the additional -0 carry some pseudo-gradient information, ("the 0 leaning towards the negative side")?

Probably, but is it worth the cost? One of the goals behind BitNet and this paper is to find a way to implement LLMs as efficiently in hardware as possible, and foregoing floating point semantics is a big part of it. I'm not sure if there's a way to encode -0 that doesn't throw out half the performance gains.

SushiHippie
4 replies
23h39m

But if I understand it correctly, they already need to use 2 bits, one for the sign and another one for the value, so there is already one wasted state, which could be used for -0.

pennomi
3 replies
20h17m

You can pack two trits into three bits, however. So one byte could hold 5 values instead of 4.

threatripper
1 replies
19h31m

How exactly would you do that? 3 states need 1.58 bits which is a tad more than 1.5. Two 3-states have 3²=9 states while three bits only give you 2³=8 states.

creshal
0 replies
7h52m

I wonder if there's some encoding tricks you can use to reduce it to 8 (or less?) effective states, given that you're only using them with a reduced set of mathematical operations. E.g., can you automatically convert all (-1, 1) to (1, -1) and save one encoded state, since they add up to the same result anyway?

para_parolu
0 replies
19h53m

Can a processor perform addition on them efficiently?

rfoo
3 replies
1d3h

Interesting, how do you use -0 in the add, then? Is -0+1-1 a 0 or a -0?

Could the additional -0 carry some pseudo-gradient information

It looks like training was done on fp32 or bf16. Low-bit quantization is approximated with STE during training. I'd expect training itself to cause each weight to "polarize" towards 1 or -1.

2-bit quantizations being proposed

Symmetric (i.e. without 0) exponential values were pretty popular IIRC.

w-m
1 replies
1d3h

how do you use -0 in the add

In my mind the two zero values would represent a tiny epsilon around 0, let's say -0.01 and +0.01. Looking at them like this, it would mean

  +0 +0 -0 = +0
  +0 -0 -0 = -0
  +1 * +0 = +0
  -1 * +0 = -0
Performing addition with the same sign count in each group would be problematic. How to decide on the sign of +0-0 or +1-1, other than flipping a coin?

npunt
0 replies
19h27m

maybe they could be stored together in two words until they're operated on and lose their pairing?

paipa
0 replies
7h29m

Or use -1, 0, 1/2, 1 where the new half-weight is still a cheap bit shift.

fabiospampinato
1 replies
16h27m

I would guess that having 2 zeros is not that useful for NNs, but in general with 2 bits we could encode 4 states, so are there 4 possible states that would be useful to encode? Sure, but would this be better than encoding 3 states? That's the entire question imo. I would guess that 3 states are probably better, because negative/neutral/positive seems the minimal signal that we need these weights to provide.

eru
0 replies
16h8m

You could use a negative-two base, and encode {-2, -1, 0, 1}. See https://en.wikipedia.org/wiki/Negative_base

Or you could use the regular positive-two base and encode {-2, -1, 0, 1} the normal way with two's complement.

raghavtoshniwal
12 replies
1d5h

Sooo, short Nvidia?

sebzim4500
7 replies
1d5h

These still run on GPUs

londons_explore
3 replies
1d5h

GPUs aren't terribly efficient at 1-bit math yet.

I could imagine FPGA designs might be competitive.

And dedicated ASICs would almost certainly beat both by a decent margin.

sebzim4500
1 replies
1d5h

I'm very unconvinced that ASICs are better suited for this than for FP16/FP8 models that are being used today.

londons_explore
0 replies
1d3h

BF16 is a pretty big unit in an ASIC - You need at least 9 * 5 gates to calculate the exponent of the result, a 10 bit barrel shifter (10*10 + 10*ceil(log2(10)) gates), and a 10 bit multiplier (approximately 10 * 10 * 9 gates)

Total = 1085 gates. The reality is probably far more, because you're going to want to use carry-look-ahead and pipelining.

Whereas 1-bit multiplies and adds into, say, a 16-bit accumulator use... 16 gates! (and probably half that, since you can probably use scheduling tricks to skip past the zeros, at the expense of variable latency...)

So when 1 bit math uses only 1/100th of the silicon area of 16 bit math, and according to this paper gets the same results, the future is clearly silicon that can do 1 bit math.

int_19h
0 replies
22h12m

I don't think it would be difficult to make them efficient.

The main reason why we run this stuff on GPUs is their memory bandwidth, anyway.

leroman
2 replies
1d5h

- we have llama.cpp (could be enough or at least as mentioned in the paper a co-processor to accelerate the calc can be added, less need for large RAM / high end hardware)

- as most work is inference, might not need for as many GPUs

- consumer cards (24G) could possibly run the big models

sebzim4500
1 replies
1d5h

If consumer cards can run the big models, then datacenter cards will be able to efficiently run the really big models.

leroman
0 replies
1d4h

Some tasks we are using LLMs for are performing very close to GPT-4 levels using 7B models, so really depends on what value you are looking to get.

MadDemon
2 replies
1d5h

Depends if this results in more efficient models or simply larger, more capable models.

wongarsu
1 replies
1d5h

In both cases this is a prime opportunity for anyone to disrupt Nvidia. They are in this market position in large part because both video games and neural networks do a lot of highly parallel floating point math, especially matrix multiplication. This model architecture doesn't do any of that.

Of course it should be fairly simple for Nvidia to add special silicon and instructions for two-bit addition to a future generation of their cards. But it'll take a while because they already have a roadmap and preexisting commitments. And any competitor doesn't have to copy everything Nvidia does to make floating point numbers go fast, they can just focus on making two-bit data handling and addition go fast.

kromem
0 replies
18h34m

Yes, but with their current market cap, the more likely result is they acquire one of the several competitors poised to take advantage of this and throw massive resources behind them.

etiam
0 replies
1d4h

Hardly for this reason, but it does look suspiciously high doesn't it.

transfire
9 replies
1d8h

Shouldn’t that be “1-trit”?

bmacho
4 replies
1d5h

Read the pdf https://arxiv.org/pdf/2402.17764.pdf they call it 1-bit everywhere.

I don't know why they do this; 1-bit seems like a very wrong name for {-1, 0, 1}.

edflsafoiewq
2 replies
1d5h

I think 0 "doesn't count", since you don't have to add or subtract anything for it, just mask it out.

paipa
0 replies
1d4h

Would be cool to see what happens if you quantize towards zero preferentially. Sparsifying the matrix should improve inference speed directly, right?

FrustratedMonky
0 replies
1d5h

Yes, technically, but it is catchy for the masses. "1-bit" gets the idea across, even if it doesn't technically describe {-1, 0, 1}.

QuesnayJr
3 replies
1d6h

They call it 1.58-bit in the paper. (1.58 is roughly the base 2 logarithm of 3.)

jmmcd
2 replies
1d5h

So by “1-bit” they mean “less than 2 bits”. AI is an insufferable field at times like this.

riskable
1 replies
1d4h

What else are they going to call it? Nobody wants to say they wrote some two-bit paper about AI!

jmmcd
0 replies
1d3h

The whole thing is a bit of a scam

the8472
7 replies
1d6h

What does it mean for future hardware if it's not using floating point matrix multiplication units?

gpderetta
2 replies
1d5h

As per the answer, the reason float is faster than int is that a) hardware companies provide more float ALUs than integer ALUs and b) float FMA is a thing, while integer FMA isn't. Both are because currently most HPC-like loads use floats instead of integers, not because of intrinsic hardware reasons.

KeplerBoy
1 replies
1d5h

If desired, integer performance could far exceed float performance, since ALUs need less die area than FPUs.

If this paper holds, I'd expect that's where custom accelerators will be heading.

gpderetta
0 replies
1d5h

Oh, I agree, I'm just saying that there is no reason in principle for floats performance to be better than integer.

edit: also this might be implementable purely using bitwise vector operations. Would need to check the throughput of those.

KeplerBoy
1 replies
1d5h

Expect Nvidia to advertise with their TOPS numbers instead of their FLOPS.

rfoo
0 replies
1d5h

Already happened years ago. They advertised TOPS for int8/int4 [0], and with 50% sparsity [1].

[0] low-bit CNNs worked pretty well actually.

[1] Totally useless marketing snake oil.

kromem
0 replies
18h40m

This opens the door to very exciting hardware shifts, like to optical computing, where there's already been over a decade of research on ternary optical computing and other parallel research at using optical computing for more efficient neural networks.

If this really holds up, it likely means we'll be moving to new dedicated hardware for AI compute much faster than when it was FP.

londons_explore
7 replies
1d5h

Powers of 3 don't pack well into binary memory...

A 1 bit multiplier in silicon is a single logic gate, but a ternary decoder to decode a packed tri-state 'weight' is bigger.

I therefore suspect that this method will be extended to make all weights simple 1 or 0 (ie. Binary). Perhaps that will be done by having half the weights have 1 or 0 values, while the other half are -1 or 0.

tromp
2 replies
1d5h

5 trits fit into 1 byte pretty well, since 3^5 = 243 is just under 2^8 = 256.

That should be called an 8/5 = 1.6 bit model though, while the paper names it 1.58 bit, closer to log_2(3) ~ 1.5849625
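
A quick sketch of that packing (my own illustration): map each trit from {-1, 0, +1} to {0, 1, 2} and treat the five of them as one base-3 number, which always fits in a byte since 3^5 = 243 <= 256.

  def pack5(trits):                # 5 values, each in {-1, 0, +1}
      n = 0
      for t in reversed(trits):
          n = n * 3 + (t + 1)      # map {-1, 0, 1} -> {0, 1, 2}
      return n                     # 0..242, so 243..255 stay free for signalling

  def unpack5(byte):
      out = []
      for _ in range(5):
          out.append(byte % 3 - 1)
          byte //= 3
      return out

  assert unpack5(pack5([1, -1, 0, 1, 1])) == [1, -1, 0, 1, 1]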

londons_explore
0 replies
1d4h

But the decoder for that will be 25+ gates, which is huge compared to the handful of gates to use the resulting weights.

JKCalhoun
0 replies
1d4h

Would be nice to have hardware instructions that work on 5 trits natively.

samatman
0 replies
1d3h

It's optimal if your program is naturally ternary, which this one is. Using three voltage levels, rather than ternary gates, is less effective, because you need much more precision to distinguish three different voltage levels than just up and down.

fasa99
0 replies
16h46m

I think it's the right chain of thought. You could either have 0/1 and then have additional nodes with negative activation functions, or -1/1

-1/1 is appealing to me (0 = -1) because bit hackery could be used instead of the multiplication function, presumably on integral or fixed-point representations. The goal would be to eliminate any "if/then" like "if 0 do this if 1 do that" to avoid the need for branch prediction - there are bit-hackery ways to bypass this. That would lend itself well to all existing processors, ASICs, FPGAs, GPUs, etc.

fabmilo
0 replies
20h28m

Can't you have 2 bits? First bit for the sign, second bit for the value (1 or 0), so you can represent -1, +1, +0, and -0.

yieldcrv
5 replies
1d6h

This is great, my employer just gave me an M1 laptop with only 16GB RAM and I had to downgrade my 7B-parameter local LLMs to 3-bit quantization; they've been surprisingly okay!

On my personal machine with 64GB RAM, I usually use 8x7B at Q5 or 70B at Q4.

It's Mistral all the way down! Imagining a Q1.58 that does well makes me happy.

turnsout
1 replies
1d5h

Quantized 7B LLMs should work fine on your machine, though maybe you’re talking about speed?

yieldcrv
0 replies
1d2h

7B works fine

FergusArgyll
1 replies
1d4h

You shouldn't have to quantize it that much, maybe you're running a lot of other programs while running inference?

Also, try using pure llama.cpp, AFAIK it's the least possible overhead

regularfry
0 replies
1d4h

Getting more value out of phi-2-sized models is where you really want to be on lower-end M1's.

woadwarrior01
0 replies
1d4h

You can run 4 bit quantized versions of SOLAR-10.7B and Llama 2 13B based models quite well on 16GB M1 laptops.

imjonse
5 replies
1d6h

Too bad there seem to be no pretrained models to download. This is not a quantization method to apply on existing models, so having the pretrained weights is needed if one wants to test it.

UncleOxidant
1 replies
20h46m

link #2 appears to be broken.

bArray
0 replies
1h15m

Tested earlier, still seems to be working fine. I can only suggest to try a VPN/alternative DNS?

imjonse
0 replies
1d5h

Nothing there yet, but it's good to know they want to publish and just did not get around to it yet.

SushiHippie
0 replies
23h38m

From [2]:

We would definitely be happy to open-source the models for future research. Please stay tuned!
leroman
4 replies
1d5h

Can someone versed in the ways of math explain how this is different from previous quantization methods?

And specifically, seeing how going from fp16 to 8-bit mostly gives the same perplexity while anything further seems to lose quality / dumb down the model, how is this even less precise method able to achieve this?

IanCal
1 replies
1d5h

It's not quantising existing models, they're training new ones.

leroman
0 replies
1d5h

I understand this part, but it seemed that 16->8->4 etc. is similar to compressing the "net", and quality seemed to drop below 8 bits.

kromem
0 replies
18h25m

So modern NNs aren't really using the network nodes in the structure they physically are, but essentially build a virtual neural network using combinations of nodes (how you can model hundreds of parameters in only a dozen or so nodes).

So as the number of nodes scales up, the individual precision probably matters less and less. Which is what they found here - it reaches parity at 3B and then starts exceeding performance at larger sizes, up to the 2T tested.

Seemingly when trained from scratch the virtual network can find adequate precision from ternary physical nodes where needed. This is different from the information loss as an already trained floating point network has its weights quantized to smaller precision and sees a performance loss.

Not only is this approach more efficient, it seems to perform better too at larger network sizes, which is probably the most interesting part.

TheCoreh
0 replies
1d5h

If I understand it correctly, this seems to be more than just quantizing, the models are apparently trained in this format as well. So it's possible that the many layers adjust themselves in a way that "cancels out" the inaccuracies of the lower bit count

stormfather
3 replies
1d5h

How does backprop work here? I can't imagine flipping bits of everything upstream of an error is effective.

joelthelion
1 replies
1d4h

(haven't read the paper). Maybe you can flip bits with a probability distribution that depends on the gradient?

stormfather
0 replies
1d4h

That's an interesting idea! Would love to try that on MNIST one day.

spyder
0 replies
1d3h

From the BitNet paper:

"Straight-through estimator. To train our 1-bit model, we employ the straight-through estimator (STE)[BLC13] to approximate the gradient during backpropagation. This method bypasses the nondifferentiable functions, such as the Sign (Eq. 2) and Clip (Eq. 5) functions, during the backward pass. STE allows gradients to flow through the network without being affected by these non-differentiable functions, making it possible to train our quantized model."

also the author's (@shumingma) answer in the comments: https://huggingface.co/papers/2402.17764#65df17ed4d436404cdc...

rafaelero
3 replies
1d4h

Looks like we have finally rediscovered a biological neuron.

bilsbie
2 replies
1d

How so?

rafaelero
1 replies
22h50m

They propagate information in a binary way (either they activate or not).

PhunkyPhil
0 replies
19h26m

Neurons activate on a gradient

llm_trw
3 replies
1d5h

So are there any details on the algorithms they used for backprop? I'm not seeing any in the paper other than "we used a lot of tokens".

IanCal
1 replies
1d5h

Does this help? https://arxiv.org/abs/2310.11453

It seems to have more details (it's the paper before the linked one) about the actual training, but I'm scanning it and this isn't my field so maybe it's too light also.

llm_trw
0 replies
1d5h

Not really, that's for the binary version of the algorithm; the ternary version can propagate a lot more information in the backwards pass using the fact that the outputs are either -1, 0, or 1.

But I imagine they are using the same thing since a bunch of the authors are the same.

wongarsu
0 replies
1d5h

It's a fairly straightforward modification of BitNet, so I assume this quote from the BitNet paper applies:

To train our 1-bit model, we employ the straight-through estimator (STE)[BLC13 ] to approximate the gradient during backpropagation. This method bypasses the non-differentiable functions, such as the Sign (Eq. 2) and Clip (Eq. 5) functions, during the backward pass. STE allows gradients to flow through the network without being affected by these non-differentiable functions, making it possible to train our quantized model

kouru225
3 replies
21h4m

Ok can someone catch me up to speed on LLM hardware requirements? Last I looked I needed a 20 gb vram card to run a good one. Is that not true anymore?

SushiHippie
2 replies
20h52m

Not true anymore, but it also highly depends on what your definition of "a good one" is.

Many people find Mistral 7B to be excellent, around gpt-3.5 level of good.

Mistral 7B normally requires something like 20 GB of VRAM, but with llama.cpp and quantization you could even run it on your phone (albeit at bad quality).

Quantizations >= q4_K_M seem to provide nearly as good responses as the unquantized model, and q4_K_M only needs ~7GB of VRAM.

See the table here:

https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGU...

Using ollama you can get up and running even a bit faster than with llama.cpp directly (ollama uses llama.cpp under the hood).
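
For anyone who wants to try this locally from Python, a minimal sketch using the llama-cpp-python bindings (the model path is a placeholder; it assumes you have already downloaded a quantized GGUF file such as the q4_K_M build linked above):

  from llama_cpp import Llama

  # Load a quantized Mistral 7B GGUF; swap in whichever file you downloaded.
  llm = Llama(model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf")

  out = llm("[INST] Explain ternary weights in one sentence. [/INST]", max_tokens=64)
  print(out["choices"][0]["text"])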

kouru225
1 replies
19h45m

Oh Jesus so basically it’s very feasible for me to run my own local llm on a NAS or a server or something… well I guess it’s time for me to get on with the times…

Thanks!

anon373839
0 replies
5h13m

Can confirm. Mistral 7B is subjectively comparable to GPT 3.5-Turbo, and the Elo scores at lmsys.org support this.

dindobre
3 replies
1d6h

Refreshing paper in terms of machine learning papers, simple explanation, easy to replicate, no alchemy-tier interpretations. Can't wait to see this paper replicated or disproved when it comes to real-life production tasks.

imjonse
1 replies
1d6h

The presentation is simplified because it implies knowledge of its predecessor, BitNet https://arxiv.org/abs/2310.11453

dindobre
0 replies
1d6h

Makes sense!

wongarsu
0 replies
1d6h

The most glaring omission is that they only compared to fp16 models, not to quantized models. And of course the benchmarks might be misleading compared to the real experience.

But if you wanted to make LLM-specific hardware (or x64 instructions tuned for LLMs) this model architecture makes that extremely cheap. Multiplication requires a lot of transistors, this architecture requires only two-bit adders. You could make SIMD instructions that do thousands of these in parallel, for fairly little silicon cost.
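
To make the "no multiplication" point concrete, a tiny numpy sketch (my own illustration, not the paper's kernel): with ternary weights, a row dot product is just "add the activations where the weight is +1, subtract them where it is -1, skip the zeros".

  import numpy as np

  rng = np.random.default_rng(0)
  W = rng.integers(-1, 2, size=(4, 8)).astype(np.int8)   # ternary weights in {-1, 0, 1}
  x = rng.integers(-128, 128, size=8).astype(np.int32)   # integer activations

  y = np.where(W == 1, x, 0).sum(axis=1) - np.where(W == -1, x, 0).sum(axis=1)
  assert np.array_equal(y, W.astype(np.int32) @ x)        # same result, no multiplies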

bilsbie
3 replies
21h18m

This really just sounds absurd. How can ternary possibly encode enough information?

Anyone willing to explain it like I’m a Django developer who watched half a karpathy video?

Solvency
1 replies
17h30m

Because by making the model larger you don't need 64bit precision floats you only need 64 discrete bits.

gemeral
0 replies
10h47m

Do you mind pointing out where they make the model larger? The paper seems to suggest they are maintaining the same model sizes.

Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption
barbarr
0 replies
12h11m

The activations are still 8-bit, so a lot of complexity and nonlinearity is still expressible. Only the weights are 1.58-bit.

tuananh
2 replies
1d6h

Major breakthrough in the LLM scene. It achieves performance and perplexity equivalent to full FP16 models of the same parameter size.

And you can fit a 120B model on a single card with 24GB VRAM. This is mind blowing.

cyanydeez
1 replies
1d6h

I mean, it expands the hardware selection, but until there are models and leaderboards etc., we can't really say it's a breakthrough.

fnordpiglet
0 replies
17h14m

I would assume a GPU isn’t specifically optimized for ternary computation and specialized accelerators would whip the pants off a GPU

superdisk
2 replies
1d

Is there anything about this specific to LLMs, or could you use it for any transformer based model? It seems like they made a modified transformer.

kromem
1 replies
18h31m

It seems like it could be any transformer, which is exciting now that even in imaging gradient transformers are all the rage. But ideally we'd need to see this result in other transformers (but I have a hard time seeing why it wouldn't be the case).

riskable
0 replies
2h56m

At the very least it could be used to reduce the requirements and speed up the prompt recognition step(s) of image-based generative AI.

"Stable Diffusion 3 XS" will use ternary? Here's to hoping :)

singularity2001
2 replies
1d2h

So we almost go back full circle to human (animal) brain binary spikes?

concrete_head
1 replies
23h31m

It's not quite spikes, but it's getting closer to the idea. I'm amazed it has taken this long for this type of thing to reach HN, which gives next to no attention to spiking neural networks.

Simon Thorpe, a CNRS researcher has got some fascinating papers and lectures on YouTube on using binary weights on neuromorphic hardware which has had practical applications for over 20 years already.

I made an account just to drop his name somewhere on this forum.

singularity2001
0 replies
12h42m

why is his name so dangerous you can't drop it on your main account lel?

jdthedisciple
2 replies
3h59m

People were already doing this 6 years ago.

    https://github.com/yashkant/quantized-nets
    https://github.com/TropComplique/trained-ternary-quantization
    https://github.com/buaabai/Ternary-Weights-Network
I too find it very interesting.

But why this sudden, renewed fuss?

imtringued
0 replies
3h34m

Probably because despite the 1200 citations, they didn't have the ability to apply it to modern LLMs. Nobody cares about an image classifier using 50% fewer parameters, since most of them were small enough to fit in memory anyway.

gerash
0 replies
1h0m

I haven't read the paper but I clearly remember 1-bit quantization from at least 5-6 years ago

esha_manideep
2 replies
20h42m

These models will be compatible with llama.cpp out of the box. We (GigaML - https://gigaml.com) are planning to train a small model (3-4B, 1-bit, open-source) with the latest stack-v2 dataset released today. Let me know if anyone is interested in collaborating with us.

libertalia0
0 replies
2h11m

Highly interested in collaborating – got a bunch of proprietary legal data already pre-sorted and labeled for various scenarios. I've already benchmarked legal use-cases (i.e. legal speciality, a few logic-based questions, and specific document creation) with various LLMs – so would love to see what benchmarks this can produce compared to early Mistral or Llama.

Let me know what's the best way to reach out!

a2code
0 replies
6h6m

I'm interested in collaborating. For example, from the comments it occurred to me that a 128-bit SIMD register can contain 64 2-bit values. It seems straightforward that SIMD bitwise logical operations could be used in training such models.
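
A toy sketch of that direction (my own illustration, with a plain Python integer standing in for a wide register): keep one "+1" bitmask and one "-1" bitmask per weight row, and the dot product becomes two masked sums with no multiplies.

  weights = [1, 0, -1, 1, -1, 0, 1, 1]          # one ternary weight row
  acts    = [3, -2, 5, 7, 1, 4, -6, 2]          # integer activations

  plus  = sum(1 << i for i, w in enumerate(weights) if w == 1)
  minus = sum(1 << i for i, w in enumerate(weights) if w == -1)

  dot = sum(a for i, a in enumerate(acts) if (plus >> i) & 1) \
      - sum(a for i, a in enumerate(acts) if (minus >> i) & 1)

  assert dot == sum(w * a for w, a in zip(weights, acts))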

ein0p
2 replies
1d

How is it a 1 bit LLM if 2 bits are required for each weight (and one of the 4 possible states is wasted to be able to represent 0)

ricardobeat
1 replies
21h7m

As someone else pointed out here, you can store 5 ternary values in 1 byte, 3^5 == 243.

ein0p
0 replies
20h25m

That’s still not 1 bit, and that would basically destroy whatever perf advantage you might hope to get if you want to keep the model in memory in that format rather than unpack it on load.

eigenvalue
2 replies
19h39m

Is it really so surprising that something like this works given how human brain neurons work? My admittedly basic understanding is that these operate through an all-or-nothing principle for their action potentials (firing): they either fire or they don't, based on whether the input signals reach a certain threshold. So the output is already sort of binary in biological neurons. The inputs are more like continuous values, since they are the sum of many different neurons sending signals into each neuron, but in this paper the activations are 8-bit, not binary/ternary. Can any neuroscientists here comment?

m00x
0 replies
19h33m

This isn't really how neurons work.

First of all, they operate independently of a synchronized clock, and they can also accumulate signals instead of executing on an input. Neuromorphic chips are closer to how the brain works, but they're still super early. I believe Intel has the best one with the Loihi 2.

(Not a neuroscientist but my wife is and that's what I understand from our chats)

fasa99
0 replies
17h5m

Well I think it's an interesting idea, and to add to that, the "-1" values would correspond to an inhibitory neuron!

What neurons can do, though, is integrate over time, so your output can be one spike or 3 spikes in quick succession, and the same for your input; maybe 10 quick spikes in a row is a more powerful signal than a lone spike. We know this intuitively through vision: we don't see in Mac-Classic-style black/white images, we see shades of brightness and color, indicating that at least our optic nerve is sending what amounts to an analog signal (even if encoded as binary spikes - is the spike timing not analog?)

This is not to mention all the biochemical signaling that happens, and the multitude of local neurotransmitters and global physiological/hormonal factors at play. And all that weird stuff like glial cells and astrocytes is there in the mix too.

Klipper3
2 replies
1d5h

The theoretical capacity of a binary network is 69% of the capacity of a full-weight network, so it makes sense that LLM would converge to 1-bit networks in the long term.

It's nice to finally see practical networks reach the theoretical limits found in the statistical mechanics of Ising models. A good pointer to efficient 1-bit training, from the statistical mechanics point of view, is here:

https://www.pnas.org/doi/full/10.1073/pnas.0700324104

arunk47
1 replies
20h22m

What is stopping us right now from building these one-bit networks?

tarruda
0 replies
10h0m

I think no code was released yet

Blackthorn
2 replies
23h22m

Is there any rigorous way to answer the question of how much information (be it entropy or some other measurement) is contained in a model's weights?

riskable
1 replies
3h0m

Yes, actually: That's the entire point of the paper! The concept is that the amount of information contained in a weight like 0.00006103515625 is equivalent to 0. -0.99951172 is equivalent to -1, 1.26406236 equivalent to 1, etc. That there's no practical difference when actually utilizing the model (if trained in ternary from the start).

The paper posits (and provides evidence) that if you train a model using ternary values instead of floating point values you get equivalent (useful/practical) information. You can't take an existing model and round all the values down to `{-1,0,+1}` values but you can (re)train a model using ternary values to get the same end result (equivalent information/output).

Technically a model trained using FP16 values contains vastly more information than a model trained using ternary values. Practically though it seems to make no difference.

My prediction: Floating point models will still be used extensively by scientists and academics in their AI research but nearly all real-world, publicly-distributed AI models will be ternary. It's just too practical and enticing! Even if the ternary representation of a model is only 90% effective it's going to be so much faster and cheaper to use it in reality. We're talking about the difference between requiring a $500 GPU or a $5 microcontroller.

Blackthorn
0 replies
1h4m

I don't think you really answered my question. What's been done by the paper is show experimentally that networks don't have enough information to justify their weight precision, and that's really good and a very important result, but what I was asking was if there's a rigorous way to take an arbitrary network and determine its information content (either by itself, or compared to another network). Possibly that can be relative to its outputs.

yousif_123123
1 replies
1d4h

Any models published as well?

jonbaer
0 replies
22h2m

I really can't tell but it seems to be a continuation of this work if I read the To-Dos correctly, what do you think? Here it seems to be 1-bit on just the transformer, https://huggingface.co/shi3z/BitNetWikipedia110M

smaddox
1 replies
14h29m

Damn. Well, I guess I better hurry up and write and publish a paper on the Ternary Neural Network research that I've been doing (part-time) for the last several months, before it all gets scooped.

riskable
0 replies
3h25m

Modify your schedule, sure, but do not rush it (just to beat the other folks). The first paper on any given topic may garner some 15 minutes of fame, but the well-researched, boring paper is the one oft-cited, even if it isn't the first on its topic.

Be thorough and by golly, include some useful visuals! Even bad pictures and low-effort charts and graphs can vastly improve the grokability of a research paper.

Also, request assistance! Are you terrible at making charts and graphs? Ask someone to help you! For the low, low price of adding their name to the paper I'm 100% certain you can borrow an expert's time to add some dapper displays of useful information along with drastic wording and layout improvements.

The amount of papers in the wild that are just walls of jargon with completely useless, nearly-impossible-to-read charts and graphs is seemingly limitless.

Refreshing is the paper that a non-expert can read and understand! You don't have to ELI5 but well-written text and explanations are loved by all. The individual using it to gain actual knowledge will grok it from skimming and looking at the data anyway so you might as well take the time to explain some of the more complicated aspects like it's going to be read by a freshman STEM major (no need to go further back in education than that).

If you need help with grammar just paste a portion of your text into some LLM (even the small, locally-run models) and they usually do a pretty good job at finding and fixing such mistakes.

ryeguy_24
1 replies
15h45m

How does gradient descent work with these discrete ternary parameters? If you compute the partial derivative for a parameter, how do you decide how to nudge the parameter when updating during backpropagation? Do you only update if the "nudging amount" meets a threshold?

edflsafoiewq
0 replies
12h29m

From the paper: "While the weights and the activations are quantized to low precision, the gradients and the optimizer states are stored in high precision to ensure training stability and accuracy. Following the previous work [LSL+21], we maintain a latent weight in a high-precision format for the learnable parameters to accumulate the parameter updates. The latent weights are binarized on the fly during the forward pass and never used for the inference process."
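
A minimal sketch of the scheme that quote describes (my own reading, assuming a straight-through estimator; the class name and details are illustrative, not the authors' code):

    import torch

    class TernaryLinear(torch.nn.Module):
        """Keeps a full-precision latent weight; quantizes it on the fly."""
        def __init__(self, in_features, out_features):
            super().__init__()
            self.weight = torch.nn.Parameter(0.02 * torch.randn(out_features, in_features))

        def forward(self, x):
            w = self.weight
            scale = w.abs().mean().clamp(min=1e-5)
            w_q = (w / scale).round().clamp(-1, 1) * scale  # ternary values (times a scale)
            # Straight-through estimator: the forward pass uses w_q, the backward
            # pass treats the rounding as identity so gradients reach the latent weight.
            w_ste = w + (w_q - w).detach()
            return x @ w_ste.t()

    layer = TernaryLinear(16, 8)
    loss = layer(torch.randn(4, 16)).sum()
    loss.backward()  # gradient lands on the full-precision latent weight

At inference time only the quantized values (and the scale) would be kept; the latent weight exists purely so the optimizer has something continuous to update.
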
rapatel0
1 replies
1d1h

The mathematics of BNNs is sound. The Shannon entropy of a word is really small (I vaguely remember ~2 bits). Also, all neural networks are ridiculously over-provisioned.

I worked on this 7 years ago, trying to efficiently binarize CNNs from existing models. The difficulty was getting training to run without the losses going too high. I think vision models will be much more difficult to binarize, but you might not need to with CLIP if the vision encoder stays in regular math {fp16, int8}.

az226
0 replies
5h32m

What about text to speech models? Do you think ternary will work?

kandu
0 replies
21h3m

Also: training neural networks by turning connections on and off, or by just flipping the sign of the weights: https://arxiv.org/abs/2006.16627

nborwankar
1 replies
16h44m

“Integer arithmetic is all you need” ? NVIDIA stock arrow up or down?

hatthew
0 replies
16h34m

if true, nvidia number go down

klysm
1 replies
23h41m

Does this mean we can compile LLMs to run on FPGAs directly?

loa_in_
0 replies
23h17m

I don't know if ternary gate arrays are a thing, but if so then yes.

karmasimida
1 replies
23h36m

This is exciting news. If the 8B numbers are true, we could already run a model like Mixtral 8x7B, even on a single GPU?

But further into the development, we need comparisons at larger model sizes. 70B might be too much to ask, but 13B should be there at least.

cjbprime
0 replies
21h6m

You could already run Mixtral on the more expensive single consumer GPUs (with 24GB VRAM) before this paper, at e.g. 3-bits per weight.

joelthelion
1 replies
1d5h

Assuming this is confirmed, what's the impact on training?

Inference is definitely an issue for LLMs right now. But if training were suddenly possible for lone hackers (or maybe smaller companies), it would open up a lot of new possibilities as well.

lucubratory
0 replies
12h56m

In theory it should make training a lot easier too, particularly on CPUs. But I think you'll still need reasonably expensive compute to get a model anywhere close to the current big models, and you really can't ignore data. Data quality and quantity are both huge ingredients in model quality, at least as big as architecture. A good-quality, large dataset is still non-trivial to assemble, and certainly out of reach for lone hackers and most small companies.

checker659
1 replies
1d3h

If all the weights are either 1, 0 or -1, isn't this what biological neurons do?

nathan_compton
0 replies
1d3h

Not even remotely. I suppose you could kind of say that activations are boolean in the sense that neurons emit spikes, but arguably significant information is encoded in spike timing.

bilsbie
1 replies
21h17m

How would you use this in something like PyTorch? There’s no ternary data type.

edflsafoiewq
0 replies
14h15m

Widen it to a datatype it does have, like int8.
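
As a rough sketch of what that could look like (my own code, not an official kernel): keep the ternary values in an int8 tensor plus one scale, and widen back to the activation dtype for the matmul, since as far as I know plain torch.matmul won't take int8 inputs.

    import torch

    def ternarize(w, eps=1e-5):
        scale = w.abs().mean().clamp(min=eps)                # absmean scale
        q = (w / scale).round().clamp(-1, 1).to(torch.int8)  # values in {-1, 0, +1}
        return q, scale

    w = torch.randn(256, 128)   # full-precision weight
    x = torch.randn(4, 256)     # batch of activations
    q, scale = ternarize(w)

    # Storage is int8; for the product we widen back to float. A real kernel
    # would instead turn the {-1, 0, +1} pattern into additions and subtractions.
    y = x @ (q.to(x.dtype) * scale)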

bilsbie
1 replies
21h15m

Could there be some value in recognizing areas where the model needs finer grained weights and somehow using a different data type just in certain areas?

fabiospampinato
0 replies
21h12m

It seems tough to do. Besides, I'm not sure what the benefit would be: with that you can't do the optimized matrix multiplication anymore, and if you need more precision you can presumably just add more neurons and/or train for longer and/or with better data.

Avisite
1 replies
23h27m

Does quantization need to be all or nothing? With the kind of low-bit models we have seen, my assumption would be that only certain weights benefit from the extra precision. A mixture of precisions, with 2-bit, 3-bit, up to 8-bit weights, might perform well, but I am unsure whether any training process could identify the weights that need the extra precision.

kromem
0 replies
18h35m

Given that the weights just map to a virtual network structure anyway, my guess would be that as parameter sizes increase, any difference node precision might make will evaporate when models are trained from the ground up.

So moving to extremely high efficiency native ternary hardware like with optics is going to be a much better result than trying to mix precision in classical hardware.

We'll see, but this is one of those things I wouldn't have expected to be true, yet as soon as I see that it is, it kind of makes sense. If it holds up (and it probably will), it's going to kick off a hardware revolution in AI.

wenyuanyu
0 replies
1d4h

If this turns out to be true, it could indeed be a game changer... given the advanced AI chip shortage, and also the chip ban on China...

wenyuanyu
0 replies
1d4h

I wonder how the training process works...

ulnarkressty
0 replies
1d5h

Take this with a grain of salt until someone reproduces it. Improvements such as these require extraordinary evidence. Not to mention extreme quantization has been tried before.

sp332
0 replies
1d5h

1-bit LLMs remind me of a random forum post I read about SACD and limitations of the 1-bit DSD audio format. https://www.audiosciencereview.com/forum/index.php?threads/d... Accumulating approximate values in one bit leads to being "constantly overloaded", with any error correction overwriting all of your real signal from the next step. I think this trinary system might leave enough room to avoid this problem.

simonvc
0 replies
17h34m

The paper talks about LLMs a lot, but would this result hold for all Transformers? Are Ternary Transformers going to make things like Whisper faster/better?

rossjudson
0 replies
15h14m

I predict Daniel Lemire will build the most efficient training and inferencing systems, close to theoretical performance limits.

nutate
0 replies
1d3h

Triggered by the use of 1-bit to describe a trit.

naasking
0 replies
1d4h

Interesting return to ternary. Effectively, each weight says only whether it's correlated (+1), uncorrelated (0), or anti-correlated (-1) with the input, and the structure of the network is the actual computation over that information.
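
A tiny illustration of that view (my own example): with weights restricted to {-1, 0, +1}, each weight keeps, drops, or negates its input, so a dot product collapses into a signed sum with no multiplications.

    import torch

    x = torch.tensor([0.3, -1.2, 0.7, 2.0])            # inputs
    w = torch.tensor([1, 0, -1, 1], dtype=torch.int8)  # one ternary weight row

    y_mul = (x * w).sum()                               # ordinary dot product
    y_add = x[w == 1].sum() - x[w == -1].sum()          # additions/subtractions only
    assert torch.isclose(y_mul, y_add)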

modeless
0 replies
22h9m

Maybe a silly question but nonlinearity is important for neural nets. Wouldn't it make more sense for the three values to be e.g. (2, 0, -1) so they are not colinear?

Also, what are the prospects for FPGA implementations of this?

m3kw9
0 replies
17h16m

How much of a waste is using NVidia hardware for this?

lavp
0 replies
11h52m

What does “perform slightly better than Llama” mean exactly? A model like this needs to be trained from scratch right?

hoseja
0 replies
1d6h

Balanced ternary, my beloved.

fl0ki
0 replies
29m

Would there be value in distinguishing -0 and +0? If a 0 was quantized from a small negative or a small positive, it seems like retaining the sign is better than forgetting it.

The question remains whether the benefit and the simpler design are worth the loss of density.
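
For what it's worth, a back-of-the-envelope look at that density cost (my arithmetic, not from the paper):

    import math

    # A ternary weight carries log2(3) ≈ 1.58 bits, and 5 trits fit in one byte
    # because 3**5 = 243 <= 256. A four-symbol alphabet (-1, -0, +0, +1) needs a
    # full 2 bits per weight, i.e. only 4 weights per byte.
    print(math.log2(3))   # ~1.585 bits per ternary weight
    print(3 ** 5)         # 243, so five trits pack into 8 bits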

fgfm
0 replies
1d

It's funny how discoveries in NLP & computer vision complement each other. The replacement of multiplications by additions made me think of the AdderNet paper (https://arxiv.org/abs/1912.13200), which concluded that you suffer almost no performance drop.

Perhaps the accumulators in current hardware cannot leverage this to its full potential, but combined with such strict quantization, this would open LLMs up to the wider ML community much earlier than expected (when consumer hardware allows you to train near-SOTA LLMs from scratch on your machine).

farhanhubble
0 replies
16h14m

What's the benefit of using ternary encoding over just a binary representation? And if we have come so far is there potential for a more efficient algorithm than gradient descent?

elromulous
0 replies
1d2h

So for the uninitiated (me), does this mean the input is not a float (i.e. is quantized on input), such that all the math can be done with int operations?

This seems almost too good to be true.

Edit: Answering my own question, yes. The details are in the original bitnet paper: https://arxiv.org/abs/2310.11453
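
For a concrete sense of what that looks like, here's a rough sketch of the 8-bit absmax activation quantization described in that paper (my paraphrase in code, not the authors' implementation):

    import torch

    def quantize_activations(x, bits=8, eps=1e-5):
        qmax = 2 ** (bits - 1) - 1                   # 127 for 8 bits
        scale = qmax / x.abs().max().clamp(min=eps)  # absmax scaling
        xq = (x * scale).round().clamp(-qmax, qmax).to(torch.int8)
        return xq, scale                             # dequantize with xq / scale

    xq, scale = quantize_activations(torch.randn(4, 256))

With weights in {-1, 0, +1} and activations in int8, the heavy lifting inside the matmul can in principle be integer adds, with a couple of floating-point rescales at the edges.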

elijahbenizzy
0 replies
13h47m

There's an interesting mental model I've been toying with. At what point do LLMs just become circuit-shaped NNs with stochastic gradient descent backing them?

E.G. are we just determining the best program by rearranging 1s and 0s?

dr_dshiv
0 replies
3h7m

Wondering if this might have any impact on the use of quantum computers in LLM training/distillation…

brunooliv
0 replies
1d

At a practical level, does this mean that gguf files will become smaller?

arunk47
0 replies
20h42m

Okay wait, can I train my own llm yet?

anon291
0 replies
1d

This is something that's been tried many times before. 1-bit to 2-bit models and binary NNs have a long history.

TriangleEdge
0 replies
19h27m

How do you train these? Or is it only for already trained models?

Mizza
0 replies
1d1h

I hope somebody gives this team access to the good data and a lot of crunch, I'd love to see what happens when you train the big fella.

K0IN
0 replies
1d5h

When can we expect the first ~100+ million parameter models to run on a Raspberry Pi Pico?

Havoc
0 replies
23h29m

If true then I'm guessing this would make ASICs for this far more simple too, right?

BenoitEssiambre
0 replies
1d2h

Low-bit parameters are always talked about in terms of performance benefits, but I wonder whether letting the LLM combine parameters to represent values means it can select the resolution of each value, that is, use a kind of internal scientific notation to track the uncertainty of values. More low-bit parameters combined together means more precision and resolution; fewer can mean more uncertainty. This might allow the LLM to better calibrate the uncertainty of its knowledge in a Bayesian way, preventing the hallucinations that come from the overconfidence of overfitting on too many bits.

Animats
0 replies
15h19m

Well, that's 2 bits, but still...

LLMs have gone from 32-bit floating point numbers down to 16 and 8 bit values. Now 2 bits. It's a hint as to how evolution did it. The basic component is simple and has very wide tolerances. There are just a lot of them. That's something biology can evolve.

Alifatisk
0 replies
1d5h

If this paper (especially the results on Table 4) is true, then this is a game changer!

1ba9115454
0 replies
1m

Ternary is all you need.