I've been thinking about how far we've come with large language models (LLMs) and the challenge of making them almost perfect. It feels a lot like trying to get a spaceship to travel at the speed of light.
We’ve made impressive progress, getting these models to be quite accurate. But pushing from 90% to 99.9999999% accuracy? That takes an insane amount of data and computing power. It's like needing exponentially more energy as you get closer to light speed.
And just like we can’t actually reach the speed of light, there might be a practical limit to how accurate LLMs can get. Language is incredibly complex and full of ambiguities. The closer we aim for perfection, the harder it becomes. Each tiny improvement requires significantly more resources, and the gains become marginal.
To get LLMs to near-perfect accuracy, we'd need an infinite amount of data and computing power, which isn't feasible. So while LLMs are amazing and have come a long way, getting them to be nearly perfect is probably impossible—like reaching the speed of light.
Regardless, I try to appreciate the progress we've made while staying realistic about the challenges ahead. What do you think? Is this a fair analogy?
But it still might be worth it. A 90% accurate model will successfully complete a task consisting of 10 subtasks only 0.9^10 ≈ 35% of the time, while a 99% accurate model will do so about 90% of the time, making the former useless but the latter quite useful.
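Quick sanity check of those numbers, assuming the subtasks are independent and per-step accuracy is all that matters (a minimal sketch in Python):

    # Whole-task success for a 10-subtask chain, assuming each subtask
    # succeeds independently with the model's per-step accuracy.
    for p in (0.90, 0.99):
        print(f"per-step {p:.2f} -> whole task {p**10:.1%}")
    # per-step 0.90 -> whole task 34.9%
    # per-step 0.99 -> whole task 90.4%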
Yes, but a 90% accurate model that's 10x faster than a 99% accurate one can be run 3x to achieve higher accuracy while still outperforming the 99% model, for most things. For the math to be in the big model's favor, there would need to be problems that it could solve >90% of the time where the smaller model was <50%.
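Rough math behind that, assuming independent runs and a plain majority vote over 3 samples (no verifier):

    from math import comb

    def majority_of_3(p):
        # P(at least 2 of 3 independent runs are correct)
        return comb(3, 2) * p**2 * (1 - p) + p**3

    for p in (0.9, 0.6, 0.4):
        print(f"single run {p:.0%} -> majority of 3 {majority_of_3(p):.1%}")
    # single run 90% -> majority of 3 97.2%
    # single run 60% -> majority of 3 64.8%
    # single run 40% -> majority of 3 35.2%

Plain voting takes 90% to about 97.2% for roughly a third of the big model's cost; to actually clear 99% on a single question you'd need more samples or some way of checking answers (best-of-3 with a reliable verifier is 1 - 0.1^3 = 99.9%). And below 50% per-run accuracy, voting makes things worse, which is where that >90% vs <50% crossover comes from.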
The problem with your premise is that you don't necessarily know when said 90% accurate model produces the right output.
You can think of multiple runs with the 90% model as fuzzy parity bits in an error correcting code, so you kind of can.
Will that 90% model give you a more accurate answer after 3 tries?
So far experiments say yes, with an asterisk. Taking ensembles of weak models and combining them has been shown to be able to produce arbitrarily strong predictors/generators, but there are still a lot of challenges in learning how to scale the techniques to large language models. Current results have shown that an ensemble of GPT3.5 level models can reach near state of the art by combining ~6-10 shots of the prompt, but the ensemble technique used was very rudimentary and I expect that much better results could be had with tuning.
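The simplest version of the technique looks roughly like this sketch (sample_model is a hypothetical stand-in for whatever sampling call you use with temperature > 0; real setups compare answers more carefully than exact string match):

    from collections import Counter

    def ensemble_answer(prompt, sample_model, n_samples=8):
        # Draw several independent completions and keep the most common answer.
        answers = [sample_model(prompt) for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]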
"Data" isn't an inexhaustible resource, and also isn't fungible in the way energy is. Of the thousands of languages in the world, a fair chunk don't even have writing systems, and some have very few speakers left. Many are lost forever. Now ask the best llm trained on "all the data" to translate some fragment of some isolate language not in its training set and not very related to existing languages. You can't improve on that task by adding more sentences in English or by combining with learning on other modalities.
Synthetic data is the answer. For example, see the TinyStories dataset (https://arxiv.org/abs/2305.07759).
If you give the model the dictionary and grammar book as in-context instructions, it can do pretty well.
“Gemini v1.5 learns to translate from English to Kalamang purely in context, following a full linguistic manual at inference time. Kalamang is a language spoken by fewer than 200 speakers in western New Guinea. Gemini has never seen this language during training and is only provided with 500 pages of linguistic documentation, a dictionary, and ~400 parallel sentences in context. It basically acquires a sophisticated new skill in the neural activations, instead of gradient finetuning.”
Synthetic data might be the answer if you're fine with any data, but I haven't come across many synthetic datasets that are of high quality, and if you want high-quality output from an LLM, I'm not sure TinyStories et al. can provide that.
Here is just one example from Tiny Stories (https://huggingface.co/datasets/roneneldan/TinyStories/viewe...):
Hardly a high-quality "story", and an LLM trained on data like that won't produce high-quality output no matter how much you train it.
Edit: Another example from TinyStories, just because of how fun they end up being:
Do people really expect to be able to train on this and get high quality output? "Garbage in, garbage out", or however that goes...
A smelly smell that smells... smelly.
It's grammatically correct. The point is getting correct grammar despite the content being semantic nonsense, and it's still not settled how small a model can get while managing that. GPT-2's grammar was atrocious.
Taken directly from the abstract:
The point of TinyStories isn't to serve as an example of a sophisticated model, but rather to show that the emergent ability of producing coherent language can happen at smaller scales, and from a synthetic data set, no less. TinyStories is essentially the language model equivalent of a young child, and it's producing coherent language -- it's not producing grammatically correct nonsense like the famous "colorless green ideas sleep furiously" phrase from Chomsky.
I'm not really sure what your personal experience has to do with the viability of synthetic data; it's already been proven to be a useful resource. For example, Meta directly stated this upon the release of their Llama 3 model:
https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility...
I think I agree with your analogy, but would say 99% rather than 99.99999%.
Beyond that, I'm not entirely sure what a "perfect" LLM would even be defined as.
That makes sense; it just gets harder and harder to get more accurate. Same as humans, I suppose :)
Past 99%, what does "more accurate" mean? I think it will vary from person to person and use case to use case, which is why I personally don't foresee a world where an LLM or any form of AI/ML is ever perfectly accurate.
I'm struggling to think of any medium that has ever reached 100% accuracy, so to target that for an ML algorithm seems foolhardy
How close to 100% is close enough?
Arithmetic and logic are close enough that cosmic rays are the limiting factor, and "computer" used to be a profession.
I agree with this, because it does seem that if the training data isn't 100% accurate, the model can never return 100% accurate results. Which I guess, as humans, we don't either, but as a committee, one MAY argue we could. I'm torn lol.
I think you are probably right, but if humans are at 99.9% (which seems very unlikely) I don't think it will be long before you can trust a model more than a human expert.
Really though, I think this line of thinking is better to revisit in 5 or so years. LLMs are still very new, and seemingly every day new optimizations and strategies are being found. Let's at least hit a plateau before assessing limitations.
You will never be able to trust an LLM more than a human expert, because the human expert will use the best available tools (for example, LLMs), will understand "what the client wants", and will put the data in the right context. At best the human expert and the LLM will be indistinguishable, but I really doubt it. And I think it will take a long time.
At least that's my opinion; we'll see what happens.
You're not wrong, but when that happens, does it still count as "a human expert" doing it? A chess grandmaster is capable of using Stockfish, but it's not their victory when they do.
I've said this before but (as a noob) I don't think cramming all human knowledge into a model is the correct approach. It should be trained enough to understand language so that it can then go search the web or query a database for answers.
The more certain the domain, the more that is possible. If you have a document database that you trust, great. For example a support desk's knowledge base. And especially if you have an escape valve: "Did this solve your problem? If not, let's escalate this to a human."
But if you are searching the Internet, you'll find multiple answers — probably contradictory — and the next step is to ask the model to judge among them. Now you want all the intelligence you can muster. Unless you really trust the search engine, in which case yeah a small model seems great.
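To make the support-desk case concrete, the shape I have in mind is roughly this sketch (search_kb, ask_llm, and escalate_to_human are hypothetical stand-ins for the knowledge-base search, the model call, and the ticketing system):

    def answer_ticket(question, search_kb, ask_llm, escalate_to_human):
        # Retrieve from the trusted knowledge base, answer only from those
        # documents, and fall back to a human when the model can't.
        docs = search_kb(question, top_k=5)
        prompt = (
            "Answer using ONLY the documents below. "
            "If they don't contain the answer, reply exactly UNRESOLVED.\n\n"
            + "\n\n".join(docs)
            + "\n\nQuestion: " + question
        )
        answer = ask_llm(prompt)
        if "UNRESOLVED" in answer:
            return escalate_to_human(question)
        return answer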
Do we know that reasoning ability and inbuilt knowledge are coupled? It seems to me that having the reasoning ability sufficient to judge between search engine results might want a significantly different type of training than collecting facts.
Yeah, LLMs are just a nontrivial stepping stone. Humans don't need to consume the entire set of the world's knowledge, repeated from thousands of different mouths coming from different angles, to be able to learn to output human-like thought processes.
At some point we'll discover a new algorithm/architecture that can actually continuously learn from its environment with limited information and still produce amazing results like us.
Well, let's not forget that the large amount of information they ingest also leads to a superhuman level of knowledge, though I guess for certain kinds of agents that is not really needed anyway.
Yes and no. We don't need an insane amount of data to make these models accurate, if you have a small set of data that includes the benchmark questions they'll be "quite accurate" under examination.
The problem is not the amount of data, it's the quality of the data, full stop. Beyond that, there's something called the "No Free Lunch Theorem" that says that a fixed parameter model can't be good at everything, so trying to make a model smarter at one thing is going to make it dumber at another thing.
We'd be much better off training smaller models for specific domains and training an agent that can use tools, DeepMind-style.
My understanding is NFL only applies if the target function is chosen from a uniform distribution of all possible functions — i.e. the "everything" that NFL says you can't predict is more like "given this sequence from a PRNG (but we're not telling you which PRNG), infer the seed and the function" and less like "learn all the things a human could learn if only they had the time".
There’s also a rumor that models these days employ a large “safety” parachute behind their engines all the time. Some of these get so big that models become dumber right before your eyes.
If people aren't speed of light perfect (and they're not), why could a computer be? What does perfection even mean?