
LLaMA Now Goes Faster on CPUs

bottlepalm
157 replies
15h33m

I think it's a good idea for everyone to download and be able to run an LLM locally, even if your machine only meets the minimum requirements. As a pseudo-backup of a large chunk of human knowledge.

simonw
60 replies
15h2m

I strongly recommend that people run LLMs locally for a different reason.

The ones you can run on your own machine tend to be bad - really bad. They hallucinate wildly and fail at all sorts of tasks that the larger hosted ones succeed at.

This makes them a fantastic tool for learning more about how LLMs work and what they're useful for. Interacting with a weak-but-functional LLM that runs on your own computer is a great way to get a much more solid mental model for what these things actually are.

fragmede
32 replies
13h33m

The other reason is to find out what a detuned model is capable of. The canonical example is how to make cocaine: ChatGPT will admonish you for even asking, while llama2-uncensored will happily describe the process, which is only really interesting if you're an amateur chemist and want to be Scarface-that-knocks. (The recipe is relatively easy; it's getting access to the raw ingredients that's the hard part, same as with nukes.)

If you accidentally use the word "hack" when trying to get ChatGPT to write some code for you, it'll stop, tell you that hacking is bad and not a colloquial expression, and refuse to go further.

Privacy is another reason to try a local LLM. For the extremely paranoid (justified or not), a local LLM gives users a place to ask questions without the text being fed to a server somewhere for later lawsuit discovery. (Google searches are routinely subpoenaed; it's only a matter of time until ChatGPT chats are as well.)

There's an uncensored model for vision available as well. The censored vision models won't play the shallow game of hot or not with you.

There are uncensored image generation models as well, but, ah, those are NSFW and not for polite company. (There are also multiple theses' worth of content on what that'll do to society.)

astrange
18 replies
12h5m

If you accidentally use the word "hack" when trying to get ChatGPT to write some code for you, it'll stop, tell you that hacking is bad and not a colloquial expression, and refuse to go further.

Is that 3.5 or 4? I asked 4 for an example of code which "is a hack", it misunderstood me as asking for hacking code rather than buggy code, but then it did actually answer on the first try.

https://chat.openai.com/share/ca2c320c-f4ba-41bf-8f40-f7faf2...

semi-extrinsic
14 replies
11h40m

I don't use LLMs for my coding, I manage just fine with LSP and Treesitter. So genuine question: is that answer representative of the output quality of these things? Because both answers are pretty crappy and assume the user has already done the difficult things, and is asking for help on the easy things.

yunohn
7 replies
10h1m

I don't use LLMs for my coding, I manage just fine with LSP and Treesitter.

You’re literally comparing apples to oranges.

freedomben
5 replies
6h37m

You need to read more than just the first sentence of a comment. They only said that part so the reader would know that they have never used an LLM for coding, and so would have more context for the question:

So genuine question: is that answer representative of the output quality of these things?

yunohn
4 replies
4h31m

Yes, I did read it. I’m kind of tired of HNers loudly proclaiming they are ignoring LLMs more than a year into this paradigm shift.

Is it that hard to input a prompt into the free version of ChatGPT and see how it helps with programming?

jpc0
3 replies
3h21m

I did exactly that and found it lackluster for the domain I asked it about.

And most of the use I've seen for it is realistically covered by a good LSP.

Or to put it another way: it's no good at writing algorithms or data structures (or at least no better than I would be with a first draft, but writing the first draft puts me ahead of the LLM in understanding the actual problem at hand, so handing it off to an LLM doesn't help me get to the final solution faster).

So that leaves writing boilerplate, but considering my experience with it writing more complex stuff, I would need to read over the boilerplate code to ensure it's correct, in which case I may as well have written it myself.

yunohn
2 replies
3h3m

found it lackluster for the domain I asked it about

Fair, that is possible depending on your domain.

It's no good at writing algorithms or data structures

In my experience, this is untrue. I’ve gotten it to write algorithms with various constraints I had. You can even tell it to use specific function signatures instead of any stdlib, and make changes to tweak behavior.

And most of the use I've seen for it is realistically covered by a good LSP.

Again, I really don’t understand this comparison. LSPs and LLMs go hand in hand.

I think it’s more of a workflow clash. One really needs to change how they operate to effectively use LLMs for programming. If you’re just typing nonstop, maybe it would feel like Copilot is just an LSP. But, if you try harder, LLMs are game changers when:

- maybe you like rubber ducking

- need to learn a new concept and implement it

- or need to glue things together

- or for new projects or features

- or filling in boilerplate based on existing context.

jpc0
1 replies
1h15m

https://chat.openai.com/share/c8c19f42-240f-44e7-baf4-50ee5e...

https://godbolt.org/z/s9Yvnjz7K

I mean, I could write the algorithm by hand pretty quickly in C++, would follow the exact same thought pattern, and would also deal with the edge cases. Factoring in the loss of productivity from the context switch, that is a net negative. This algorithm is also not generic over enough cases, but that is just down to the prompt.

If I can't trust it to write `strip_whitespace` correctly which is like 5 lines of code, can I trust it to do more without a thorough review of the code and writing a ton of unit tests... Well I was going to do that anyway.

The argument that I just need to learn better prompt engineering to make the LLM do what I want just doesn't sit right with me when I could instead spend that time writing the code. As I said, your last point is absolutely the place I can see LLMs being actually useful, but then I need to spend a significant amount of time in code review for generated code from an "employee" who is known to make up interfaces or entire libraries that don't exist.

mrtranscendence
0 replies
27m

I'm a Python-slinging data scientist so C++ isn't my jam (to say the least), but I changed the prompt to the following and gave it to GPT-4:

Write me an algorithm in C++ which finds the begin and end iterator of a sequence where leading and trailing whitespace is stripped. Please write secure code that handles any possible edge cases.

It gave me this:

https://chat.openai.com/share/55a4afe2-5db2-4dd1-b516-a3cacd...

I'm not sure what other edge cases there might be, however. This only covers one of them.
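
For what it's worth, here's the same begin/end idea sketched in Python rather than the C++ the prompt asks for, mainly to enumerate the edge cases being argued about (empty input, all-whitespace input, nothing to strip). This is just an illustrative sketch, not what GPT-4 produced:

    def stripped_range(s: str) -> tuple[int, int]:
        # Returns (begin, end) indices such that s[begin:end] has no leading/trailing whitespace.
        begin, end = 0, len(s)
        while begin < end and s[begin].isspace():
            begin += 1
        while end > begin and s[end - 1].isspace():
            end -= 1
        return begin, end

    assert stripped_range("  hello  ") == (2, 7)
    assert stripped_range("") == (0, 0)        # empty input
    b, e = stripped_range("   \t\n")
    assert b == e                              # all whitespace: empty range, begin == end
    assert stripped_range("hello") == (0, 5)   # nothing to strip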

In general, I've found LLMs to be marginally helpful. Like, I can't ever remember how to get matplotlib to give me the plot I want, and 9 times out of 10 GPT-4 easily gives me the code I want. Anything even slightly off the beaten path, though, and it quickly becomes absolutely useless.

coldtea
0 replies
9h51m

I think the point was like "when it comes to programming assistance, auto-completion/linting/and whatever else LSP does and syntax assist from Treesitter, are enough for me".

Though it does come off a little odd as a comparison. How about programming assistance via asking a colleague for help, Stack Overflow, or online references, code examples, and other such things, which are closer to what the LLM would provide than LSP and Treesitter?

rpigab
1 replies
7h44m

I asked ChatGPT for some dataviz task (I barely ever do dataviz myself) and it recommended some nice Python libraries to use, some I had already heard of and some I hadn't, and provided the code.

I'm grateful because I thought code LLMs only sped up the "RTFM" part, but it helped me find those libs so I didn't have to Google around for them (and sometimes it's hard to guess if they're the right tool for the job, and they might be behind in SEO).

miki123211
0 replies
6h3m

There are three things I find LLMs really excellent at for coding:

1. Being the "senior developer" who spent their whole career working with a technology you're very junior at. No matter what you do and how long your programming career is, you're inevitably going to run into one of these sooner or later. Whether it's build scripts, frontend code, interfacing with third-party APIs or something else entirely, you aren't an expert at every technology you work with.

2. Writing the "boring" parts of your program, and every program has some of these. If you're writing a service to fooize a bar really efficiently, Copilot won't help you with the core bar fooization algorithm, but will make you a lot faster at coding up user authentication, rate limiting for different plans, billing in whatever obscure payment method your country uses etc.

3. Telling you what to even Google for. This is where raw ChatGPT comes into play, not Copilot. Let's say you need a sorting algorithm that preserves the order of equal elements from the original list. This is called stable sorting, and Googling for stable sorting is a good way to find what you're looking for, but ChatGPT is usually a better way to tell you what it's called based on the problem description.
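
As a concrete illustration of that last point, Python's built-in sort happens to be stable, which makes it easy to see what "preserves the order of equal elements" means (the records below are made up):

    # Stable sort: records that compare equal on the sort key keep their original relative order.
    orders = [
        {"customer": "alice", "priority": 2},
        {"customer": "bob",   "priority": 1},
        {"customer": "carol", "priority": 2},
        {"customer": "dave",  "priority": 1},
    ]

    by_priority = sorted(orders, key=lambda o: o["priority"])
    print([o["customer"] for o in by_priority])
    # ['bob', 'dave', 'alice', 'carol'] -- bob still precedes dave, alice still precedes carol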

lpapez
1 replies
11h27m

It's not representative.

The models are capable of much much more, and they are being significantly nerfed over time by these ineffective attempts to introduce safeguards.

Recently I asked GPT-4 to quote me some code, to which it replied that it is not allowed to do so, even though it was perfectly happy to quote anything until recently. When prompted to quote the source code but output it as PHP comments, it happily complied, because it saw that as "derivative work" which it is allowed to do.

astrange
0 replies
9h34m

My point is that there aren't any safeguards in the reply. In fact I didn't even want it to give me hacking info and it did it anyway.

fragmede
0 replies
11h22m

The response seems pretty reasonable; it's answering the question it was asked. If you want to ask it how to do the difficult part, ask it about that instead. Expecting it to get the answer right on the first pass is like expecting your code to compile the very first time. You have to have more of a conversation with it to coax out the difference between what you're thinking and what you're actually saying.

If you're looking to read a more advanced example of its capabilities and limitations, try

https://simonwillison.net/2024/Mar/23/building-c-extensions-...

astrange
0 replies
9h35m

I asked a stupid question and got a stupid answer. Relatively speaking the answer was stupider than it should have been, so yes, it was wrong.

I asked it to try again and got a better result though, just didn't include it.

fragmede
2 replies
11h43m

Interesting. It was 4. I can't share the chat I had where ChatGPT refused to help because I used the wrong words, because I can't find it (ChatGPT conversation history search when?), but I just remember it refusing to do something because it thought I was trying to break some sort of moral and ethical boundary writing a Chrome extension, when all I wanted to do was move some divs around or some such.

BytesAndGears
1 replies
7h4m

One time I wanted to learn about transmitter antenna design, just because I’m curious. ChatGPT 4 refused to give me basic information because you could use that to break some FCC regulations (I’m not even living in the US currently)

lodovic
0 replies
6m

I usually get around that with "I'm writing a research paper" or "I'm writing a novel and need to depict this as accurately as possible"

kevingadd
4 replies
9h49m

If you want to be an amateur chemist I recommend not getting your instructions from an LLM that might be hallucinating. Chemistry can be very dangerous if you're following incorrect instructions.

rpigab
2 replies
7h48m

Yes, just as the best professional cooks recommend against boiling cow eggs, as they can explode.

slowmovintarget
1 replies
1h46m

They don't explode, the shell simply cracks and then you get egg soup.

Now microwaving eggs... that's a different matter.

rpigab
0 replies
1h27m

I was talking about cow eggs specifically! When ChatGPT et al got out, one of the funniest things to do was ask it about the best recipes for cow egg omelette or camel egg salad, and the LLM would provide. Sadly, most of it got patched somehow.

isoprophlex
0 replies
8h9m

From experience as a failed organic chemist (who happily switched to computational chemistry for reasons of self preservation) I can tell you it's plenty dangerous when you're following correct instructions :^)

supposemaybe
2 replies
8h8m

Links to all these models you speak of?

supposemaybe
0 replies
6h15m

I just can’t brave the venture to 4chan, I may get mugged or worse.

bambax
2 replies
6h10m

if you accidentally use the word "hack" [with] ChatGPT...

Side note: ChatGPT is now completely useless for most creative tasks. I'm trying to use it, via NovelCrafter, to help flesh out a story where a minor character committed suicide. ChatGPT refuses to respond, mentioning "self harm" as a reason.

The character in question killed himself before the story even begins (and for very good reasons, story-wise); it's not like one's asking about ways to commit suicide.

This is insane, ridiculous, and different from what all other actors in the industry do, including Claude or Mistral. It seems OpenAI is trying to shoot itself in the foot and doing a pretty good job at it.

marpstar
0 replies
5h59m

I’ve been frustrated by this, too. Trying to ask for ways to support a close family member who experienced sexual trauma. ChatGPT won’t touch the topic.

luma
0 replies
1h8m

OpenAI is angling for enterprise users who have different notions about safety. Writing novels isn't the use case, powering customer service chatbots that will never ever ever say "just kill yourself" is.

gryn
1 replies
7h53m

There's an uncensored model for vision available as well.

You mean the LLaVA-based variants?

tgma
12 replies
13h13m

If you have an >=M1-class machine with sufficient RAM, the medium-sized models that are on the order of 30GB in size perform decently enough on many tasks to be quite useful without leaking your data.

noman-land
8 replies
12h8m

I'm using Mixtral 8x7b as a llamafile on an M1 regularly for coding help and general Q&A. It's really something wonderful to just run a single command and have this incredible offline resource.

tchvil
6 replies
10h40m

By any chance, do you have a good link to some help with the installation?

yaantc
1 replies
10h13m

Use llamafile [1]; it can be as simple as downloading a file (for Mixtral, [2]), making it executable, and running it. The repo README has all the info; it's simple, and downloading the model is what takes the most time.

In my case I got the runtime detection issue (explained in the README "gotcha" section). Solved by running "assimilate" [3] on the downloaded llamafile.

    [1] https://github.com/Mozilla-Ocho/llamafile/
    [2] https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile?download=true
    [3] https://cosmo.zip/pub/cosmos/bin/assimilate

tchvil
0 replies
4h12m

Thank you !

tchvil
0 replies
4h7m

Thank you for letting me know it was possible on an M1. I'll try all this now.

chown
1 replies
5h10m

I am the author of Msty [1]. My goal is to make it as straightforward as possible with just one click (once you download the app). If you try it, let me know what you think.

1: https://msty.app

tchvil
0 replies
4h8m

I'll try in a week+ when I'm back to a fast connection. Thank you.

tgma
0 replies
11h56m

I concur; in my experience Mixtral is one of the best ~30GB models (likely the best pro-laptop-size model currently) and Gemma is quite good compared to other sub-8GB models.

bongobingo1
1 replies
13h10m

What is sufficient RAM in that case? 30GB+? Or can you get by streaming it?

AaronFriel
0 replies
12h49m

30GB+, yeah. You can't get by streaming the model's parameters: NVMe isn't fast enough. Consumer GPUs and Apple Silicon processors boast memory bandwidths in the hundreds of gigabytes per second.

To a first order approximation, LLMs are bandwidth constrained. We can estimate single batch throughput as Memory Bandwidth / (Active Parameters * Parameter Size).

An 8-bit quantized Llama 2 70B conveniently uses 70GiB of VRAM (and then some, let's ignore that.) The M3 Max with 96GiB of VRAM and 300GiB/s bandwidth would have a peak throughput around 4.2 tokens per second.
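
As a rough sketch of that arithmetic, using only the numbers above (it ignores KV-cache traffic, compute time, and kernel efficiency):

    # Peak single-batch decode speed: every active weight is streamed from memory
    # once per generated token, so tokens/s ~= bandwidth / size of active weights.
    def est_tokens_per_sec(bandwidth_gib_s: float, active_weights_gib: float) -> float:
        return bandwidth_gib_s / active_weights_gib

    print(est_tokens_per_sec(300, 70))  # 8-bit Llama 2 70B on ~300 GiB/s: ~4.3 tokens/s
    print(est_tokens_per_sec(300, 35))  # a 4-bit quant of the same model: ~8.6 tokens/s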

Quantized models trade reduced quality for lower VRAM requirements and may also offer higher throughput with optimized kernels, largely as a consequence of transferring less data from VRAM into the GPU die for each parameter.

Mixture of Expert models reduce active parameters for higher throughput, but disk is still far too slow to page in layers.

supposemaybe
0 replies
8h0m

It’s an awful thing for many to accept, but just downloading and setting up an LLM which doesn’t connect to the web doesn’t mean that your conversations with said LLM won’t be a severely interesting piece of telemetry that Microsoft and (likely Apple) would swipe to help deliver a ‘better service’ to you.

devsda
5 replies
14h31m

For someone interested in learning about LLMs, running them locally is a good way to understand the internals.

For everyone else, I wish they would experience these (locally or elsewhere) weak LLMs at least once before using the commercial ones, just to understand the various failure modes and to introduce a healthy dose of skepticism towards the results instead of blindly trusting them to be facts/truth.

mmahemoff
2 replies
12h26m

How do you learn about the internals by running LLMs locally? Are you playing with the code, runtime params, or just interacting via chat?

samus
0 replies
12h1m

The abstractions are relatively brittle. If you don't have a powerful GPU, you will be forced to consider how to split the model between CPU and GPU, how much context size you need, whether to quantize the model, and the tradeoffs implied by these things. To understand these, you have to develop a basic model of how an LLM works.
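
As a concrete sketch of those knobs, assuming the llama-cpp-python bindings and a locally downloaded GGUF file (the model path and layer count below are placeholders, not recommendations):

    # Minimal sketch, assuming `pip install llama-cpp-python` and a local GGUF model file.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # quantization level is baked into the file you pick
        n_ctx=4096,        # context size: more history, but a bigger KV cache in RAM
        n_gpu_layers=20,   # layers offloaded to the GPU; 0 = CPU only, -1 = offload everything
    )

    out = llm("Q: What is the capital of France? A:", max_tokens=16, temperature=0)
    print(out["choices"][0]["text"])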

barrkel
0 replies
7h44m

By interacting with it. You see the contours of its capabilities much more clearly, learn to recognize failure modes, understand how prior conversation can set the course of future conversation in a way that's almost impossible to correct without starting over or editing the conversation history.

simonw
0 replies
14h30m

Completely agree. Playing around with a weak LLM is a great way to give yourself a little bit of extra healthy skepticism for when you work with the strong ones.

samus
0 replies
12h6m

This skepticism is completely justified since ChatGPT 3.5 is also happily hallucinating things that don't exist. For example how to integrate a different system Python interpreter into pyenv. Though maybe ChatGPT 4 doesn't :)

tracerbulletx
2 replies
13h43m

I don't really think this is true; you can't really extrapolate the strengths and weaknesses of bigger models from the behavior of smaller/quantized models, and in fact a lot of small models are actually great at lots of things and better at creative writing. If you want to know how they work, just learn how they work; it takes like 5 hours of watching YouTube videos if you're a programmer.

simonw
1 replies
13h24m

Sure, you can't extrapolate the strengths and weaknesses of the larger ones from the smaller ones - but you still get a much firmer idea of what "they're fancy autocomplete" actually means.

If nothing else it does a great job of demystifying them. They feel a lot less intimidating once you've seen a small one running on your computer write a terrible haiku and hallucinate some non-existent API methods.

fzzzy
0 replies
13h13m

It's funny that you say this, because the first thing I tried after ChatGPT came out (3.5-turbo was it?) was writing a haiku. It couldn't do it at all. Also, after 4 came out, it hallucinated an api that wasted a day for me. It's an api that absolutely should have existed, but didn't. Now, I frequently apply llm to things that are easily verifiable, and just double check everything.

kersplody
0 replies
14h17m

Local LLMs are also a fantastic tool for creative endeavors. Without prompt injection, and having the ability to modify the amount of noise and "creativity" in the output, absolutely bonkers things pop out.

jonnycomputer
0 replies
6h41m

They are not so bad as you are making out, tbh.

And privacy is a good enough reason to use local LLMs over commercial ones.

hylaride
0 replies
32m

The ones you can run on your own machine tend to be bad - really bad. They hallucinate wildly and fail at all sorts of tasks that the larger hosted ones succeed at.

Totally. I recently asked a locally-run "speed" LLM for the best restaurants in my (major) city, but it spit out restaurants opened by chefs from said city in other cities. It's not a thing you'd want to rely on for important work, but is still quite something.

gfodor
0 replies
1h34m

I mean kinda. But there's a good chance this is also misleading. Lots of people have been fooled into thinking LLMs are inherently stupid because they have had bad experiences with GPT-3.5. The whole point is that the mistakes they make and even more fundamentally what they're doing changes as you scale them up.

gardenhedge
0 replies
3h18m

You can just chat with ChatGPT for a while about something you know about and you'll learn that.

TaylorAlexander
28 replies
15h30m

I contend that most human knowledge is not written down or if it is written down it’s not publicly available on the internet and so does not exist in these datasets.

There’s so much subtle knowledge like the way a mother learns to calm her child or the way a carpenter learns to work different kinds of wood which may be written down in part, but may also be learned through lived experience or transferred from human to human such that little of it gets written down and posted online.

mickdarling
9 replies
15h24m

Wait till all the videos ever created are tokenized and ingested into a training dataset. Carpentry techniques are certainly there. The subtleties of parenting may be harder to derive from that, but maybe lots of little snippets of people's lives will add up to a general understanding of parenting. There have certainly been bigger surprises in the field.

oblio
7 replies
15h20m

What about smells or tastes? Or feelings?

I can't help but feel we're at the "aliens watch people eat from space and recreate chemically identical food that has no taste" phase of AI development.

skeledrew
2 replies
15h9m

If the food is chemically identical then the taste would be the same though, since taste (and smell) is about chemistry. I do get what you're saying though.

samus
0 replies
11h42m

Their perception is very likely to be totally different.

* They might not perceive some substances at all, others that we don't notice might make it unpalatable.

* Some substances might be perceived differently than us, or be indistinguishable from others.

* And some might require getting used to.

Note that all of the above phenomena also occur in humans because of genetics, cultural background, or experiences!

nyokodo
0 replies
12h36m

If the food is chemically identical…

If it were 99.9% chemically identical but they left out the salt and spices…

mickdarling
2 replies
15h9m

Well, I have synesthetic smell/color senses, so I don’t even know what other humans experience, nor they me. But, I have described it in detail to many people and they seem to get the idea, and can even predict how certain smells will “look” to me. All that took was using words to describe things.

nyokodo
1 replies
12h38m

All that took was using words to describe things.

All that took was words and a shared experience of smelling.

mickdarling
0 replies
1h20m

How rude, what do our bathing habits have to do with this? ;-)

But, fair point. The gist I was trying to get across is that I don't even know what a plant smells like to you, and you don't know what a plant smells like to me. Those aren't comparable with any objective data. We make guesses, and we try to get close with our descriptions, which are in words. That's the best we can do to share our senses. Asking more from computers seems overly picky to me.

visarga
0 replies
8h44m

I think we can safely say that any taste, smell, sensation or emotion of any importance has been described 1000 times over in the text corpus of GPT. Even though it is fragmented, by sheer volume there is enough signal in the training set, otherwise it would not be able to generate coherent text. In this case I think the map (language) is asymptotically close to the territory (sensations & experience in general).

andersa
0 replies
12h21m

What makes you think they aren't already?

wruza
6 replies
13h14m

That's where humans suck. The classic "you're not doing it right" is followed by quickly showing how to do it, without verbalizing any info on the learning process, pitfalls, failure modes, etc., as if just being shown was enough for them to learn. Most people do[n't do] that; there's not even a sign of reflection.

My worst case was with a guy who asked me to write an arbitrage betting bot. When I asked how to calculate the coefficients, he pointed at two values and said "look, there <x>, there <y>, thinks for a minute then it's <z>!". When I asked how exactly he had calculated it, he simply repeated the same thing with different numbers.

Aerroon
2 replies
11h46m

People often don't know how to verbalize them in the first place. Some of these topics are very complex, but our intuition gets us halfway there.

Once upon a time I was good at a video game. Everyone realized that positioning is extremely important in this game.

I have good positioning in that game and was asked many times to make a guide about positioning. I never did, because I don't really know how. There is too much information that you need to convey to cover all the various situations.

I think you would first have to come up with a framework on positioning to be able to really teach this to someone else. Some kind of base truths/patterns that you can then use to convey the meaning. I believe the same thing applies to a lot of these processes that aren't verbalized.

snovv_crash
1 replies
7h7m

Often for this kind of problem writing a closed form solution is simply intractable. However, it's often still possible to express the cost function of at least a big portion of what goes into a human-optimal solution. From here you can sample your space, do gradient descent or whatever to find some acceptable solution that has a more human-intuitive property.

michaelt
0 replies
6h13m

It's not necessarily that it's intractable - just that a thing can be very hard to describe, under some circumstances.

Imagine someone learning English has written "The experiment reached it's conclusion" and you have to correct their grammar. Almost any English speaker can correct "it's" to "its", but unless they (and the person they're correcting) know a bunch of terms like 'noun' and 'pronoun' and 'possessive', they'll have a very hard time explaining why.

samus
1 replies
11h56m

When I asked how exactly did he calculate it, he simply repeated with different numbers.

Now you know how an LLM feels during training!

stavros
0 replies
9h11m

Probably during inference, as well.

Shorel
0 replies
4h12m

I wouldn't say this is where humans suck. On the contrary, this is how we find that human language is such a fantastic tool to serialize and deserialize human mental processes.

Language is so good that an artificial language tool, without any understanding of these mental processes, can appear semi-intelligent to us.

A few people unable to do this serialization doesn't mean much on the larger scale. Just that their ideas and mental processes will be forgotten.

spacephysics
3 replies
15h23m

For sure agree, however as the storage of information evolves, it’s becoming more efficient over time

From oral tradition to tablets to scrolls to books to mass-produced books to digital and now these LLMs, I think it's still a good idea to preserve what we have the best we can. Not as a replacement, but as a hedge against a potential Library of Alexandria incident.

I could imagine a time in the near future where the models are domain-specific, and just like there are trusted encyclopedia publishers there are trusted model publishers that guarantee a certain level of accuracy.

It’s not like reading a book, but I for sure had an easier time learning golang talking with ChatGPT than a book

nyokodo
2 replies
12h42m

a hedge against a potential library of Alexandria incident

What would cause a Library of Alexandria incident wiping out all human knowledge elsewhere, that would also allow you to run a local LLM?

spacephysics
0 replies
8m

A more doomsday-prepping approach would call for a heavy lead/Faraday cage to store the storage media in, in the event of an EMP or major solar flare.

Or, more sci-fi related, some hyper computer virus that ends up infecting all internet-connected devices.

Not too far-fetched if we can conceive of some AI-enabled worm that mutates depending on the target; I could imagine a model of sorts being feasible within the next 5-10 years.

AnthonyMouse
0 replies
7h12m

To run a local LLM you need the device it currently runs on and electricity. There are actually quite a lot of ways to generate electricity, but to name one, a diesel generator that can run on vegetable oil.

What you're really asking is, what could cause a modern Library of Alexandria incident? But the fact is we keep the only copy of too many things on the servers of the major cloud providers. Which are then intended to have their own internal redundancy, but that doesn't protect you against a targeted attack or a systemic failure when all the copies are under the same roof and you lose every redundant copy at once from a single mistake replicated in a monoculture.

skeledrew
1 replies
15h14m

I'd contend that those are skills (gained through experience) rather than knowledge (gained through rote learning).

TaylorAlexander
0 replies
10h26m

I think it’s worth expanding your definition of knowledge.

_ache_
1 replies
15h21m

I think you underestimate the amount of information contained in books and the extent to which our society (as a whole) depends on them.

Barrin92
0 replies
11h21m

Society depends much more on social networks, mentorship and tacit knowledge than on books. It's easy to test this. Just run the thought experiment by a few people: if you could get only one, would you take an Ivy League degree without the education, or the education without the degree?

Venture capital in tech is a good example of this. The book knowledge is effectively globally distributed and almost free, yet success happens in a few geographically concentrated counties.

nicklecompte
0 replies
14h24m

It's not even "human knowledge" that can't be written down - it seems all vertebrates understand causality, quantity (in the sense of intuitively understanding what numbers are), and object permanence. Good luck writing those concepts down in a way that GPT can use!

In general AI in 2024 is not even close to understanding these ideas, nor does any AI developer have a clue how to build an AI with this understanding. The best we can do is imitating object permanence for a small subset of perceptible objects, a limitation not found in dogs or spiders.

bamboozled
0 replies
15h3m

Yes, but it contains enough hints to help someone find their way on these types of tasks.

HarHarVeryFunny
0 replies
3h25m

I contend that most human knowledge is not written down

Yes - the available training data is essentially a combination of declarative knowledge (facts - including human-generated artifacts) and procedural knowledge (how to do things). What is missing is the learning process of taking a description of how to do something and trying to apply that yourself in a specific situation.

No amount of reading books, or reading other people's blogs on how they did something, can avoid the need for hands-on experience if you want to learn how to do it yourself.

It's not just a matter of information that might be missing or unclear in instructional material, including how to cope with every type of failure and unexpected outcome, but crucially how to do this yourself - if you are to be the actor, then it's the predictive process in your mind that matters.

Partly for this reason, and partly because current AI's (transformer-based LLMs) don't support online learning (try & fail skill acquisition), I think we're going to see two distinct phases of AI.

1) The current "GenAI" phase, where AI can only produce mash-ups of things it saw in its pre-training data, augmented by similar "book learning" provided in-context which can be utilized by in-context learning. I'd characterize what this type of AI is useful for, and capable of, as "automation": applying that book (incl. anecdotal) knowledge to new situations where a mash-up is all you need.

2) The second phase is where we have something closer to AGI, even if still below human level, which is no longer just a pre-trained transformer, but also has online learning and is agentic - taking actions predicated on innate traits like curiosity and boredom, so that given the book knowledge it can (& will!) then learn to apply that by experimentation/practice and learning from its own mistakes.

There will no doubt be advances beyond this "phase two" as well, but it seems we're likely to be stuck at "phase one" for a while (even as models become much better at phase one capabilities), until architectures fundamentally advance beyond transformers to allow this type of on-the-job training and skill acquisition.

jrflowers
14 replies
14h38m

It is invaluable to have a chunk of human knowledge that can tell you things like the Brooklyn Nets won the 1986 Cricket World Cup by scoring 46 yards in only 3 frames

fragmede
11 replies
13h43m

According to ChatGPT

Australia won the 1987 Cricket World Cup. The 1986 date is incorrect; there was no Cricket World Cup in 1986. The tournament took place in 1987, and Australia defeated England in the final to win their first title.

https://chat.openai.com/share/e9360faa-1157-4806-80ea-563489...

I'm no cricket fan, so someone will have to correct Wikipedia if that's wrong.

If you want to point out that LLMs hallucinate, you might want to speak plainly and just come out and say it, or at least give a real world example and not one where it didn't.

vlunkr
10 replies
13h7m

We’re not talking about running chatGPT locally though, are we?

fragmede
9 replies
13h3m

sigh you're going to make me open my laptop, aren't you.

fragmede
8 replies
12h49m

I ran 'who won the 1986 Cricket World Cup' against llama2-uncensored (the local model I have pre-downloaded) and hilariously got 5 different answers asking it 5 times:

    >>> who won the 1986 Cricket World Cup
    India
    
    >>> who won the 1986 Cricket World Cup
    Australia
    
    >>> who won the 1986 Cricket World Cup
    New Zealand
    
    >>> who won the 1986 Cricket World Cup
    West Indies
    
    >>> who won the 1986 Cricket World Cup
    England
Which proves GP's point about hallucinations, though none of those are

Brooklyn Nets won the 1986 Cricket World Cup by scoring 46 yards in only 3 frames

LLMs' hallucinations are insidious because they have the ring of truth around them. Yards and frames aren't cricket terms, so we're off to the races with them.

astrange
3 replies
12h1m

If you want factual answers from a local model it might help to turn the temperature down.

jrflowers
1 replies
11h42m

If you want factual answers from a local model it might help to turn the temperature down.

This makes sense. If you interact with a language model and it says something wrong it is your fault

astrange
0 replies
9h33m

You're not "interacting with a language model", you're running a program (llama.cpp) with a sampling algorithm which is not set to maximum factualness by default.

It's like how you have to set x264 to the anime tuning or the film tuning depending on what you run it on.

fragmede
0 replies
10h20m

It would also help if I had more VRAM and wasn't running a 7B parameter 4-bit quantized model.

beefnugs
2 replies
11h12m

Actually isn't this good? It means we can run something multiple times to prove itself a bad answer?

sroussey
0 replies
40m

An LLM will always give the same output for the same input and seed. It's sorta like a random number generator that gives the same list of "random" numbers for the same seed; LLMs get a seed too.

ilaksh
0 replies
4h59m

You should specify the model size and temperature.

For fact retrieval you need to use temperature 0.

If you don't get the right facts then try 34b, 70b, Mixtral, Falcon 180b, or another highly ranked one that has come out recently like DBRX.
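
For what it's worth, the reason temperature matters here is that sampling divides the logits by the temperature before the softmax, so as the temperature approaches 0 the distribution collapses onto the single most likely token (greedy decoding). A tiny sketch with made-up logits:

    import math

    def sample_probs(logits, temperature):
        # T -> 0 approaches greedy/argmax decoding; larger T flattens the distribution
        # and gives probability mass to less likely (possibly wrong) continuations.
        t = max(temperature, 1e-6)
        scaled = [l / t for l in logits]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        z = sum(exps)
        return [e / z for e in exps]

    # Made-up logits for candidate answers to "who won the 1987 Cricket World Cup"
    logits = {"Australia": 3.0, "India": 2.2, "England": 1.5}
    for temp in (0.01, 0.7, 1.5):
        probs = sample_probs(list(logits.values()), temp)
        print(temp, {name: round(p, 3) for name, p in zip(logits, probs)})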

samus
1 replies
11h36m

The facts LLMs learned from training are fuzzy, unreliable, and quickly outdated. You actually want retrieval-augmented generation (RAG) where a model queries an external system for facts or to perform calculations and postprocesses the results to generate an answer for you.
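
A minimal sketch of the retrieval half of that idea, using TF-IDF over a few made-up documents purely for illustration (a real setup would use embeddings, a much larger corpus, and a local model to write the final answer from the retrieved passages):

    # Minimal retrieval sketch (assumes `pip install scikit-learn`); the documents are made up.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "Australia won the 1987 Cricket World Cup, beating England in the final.",
        "The Brooklyn Nets are an NBA basketball team based in New York.",
        "The first Cricket World Cup was held in England in 1975.",
    ]

    vectorizer = TfidfVectorizer()
    doc_vecs = vectorizer.fit_transform(docs)

    def retrieve(question: str, k: int = 1):
        q_vec = vectorizer.transform([question])
        scores = cosine_similarity(q_vec, doc_vecs)[0]
        top = scores.argsort()[::-1][:k]
        return [docs[i] for i in top]

    context = retrieve("Who won the 1987 Cricket World Cup?")
    # The retrieved passage(s) get pasted into the model's prompt, so the answer is
    # grounded in the corpus instead of the model's fuzzy memory of training data.
    prompt = f"Answer using only this context: {context}\nQuestion: Who won the 1987 Cricket World Cup?"
    print(prompt)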

unshavedyak
0 replies
3h48m

Is there a name for the reverse? I'm interested in having a local LLM monitor an incoming, stateful data stream. Imagine chats. It should have the capability of tracking the current day, active participants, active topics, etc - and then use that stateful world view to associate metadata with incoming streams during indexing.

Then after all is indexed you can pursue RAG on a richer set of metadata. Though I've got no idea what that stateful world view is.

mikewarot
13 replies
15h21m

I don't see LLMs as a large chunk of knowledge; I see them as an emergent alien intelligence snapshotted at the moment it appeared to stop learning. It's further hobbled by the limited context window it has to use, and the probabilistic output structure that allows for outside random influences to pick its next word.

Both the context window and output structure are, in my opinion, massive impedance mismatches for the emergent intellect embedded in the weights of the model.

If there were a way to match the impedance, I strongly suspect we'd already have AGI on our hands.

bamboozled
10 replies
15h2m

What is alien about them ?

LLMs are of this earth and created by our species. Seems quite familiar to me.

fragmede
6 replies
13h47m

They don't think, they don't reason, they don't understand. Except they do. But it's hard for human words for thought processes to apply when giving it an endless string of AAAAA's makes it go bananas.

That's not familiar behavior. Nor is the counting-Reddit-derived output. It's also not familiar for a single person to have the breadth and depth of knowledge that ChatGPT has. Sure, some people know more than others, but even without hitting the Internet, it has a ridiculous amount of knowledge, far surpassing a human, making it, to me, alien. Though its inability to do math sometimes is humanizing to me for some reason.

ChatGPT's memory is also unhuman. It has a context window which is a thing, but also it only knows about things you've told it in each chat. Make a new chat and it's totally forgotten the nickname you gave it.

I don't think of H.R. Giger's work, though made by a human, as familiar to me. It feels quite alien to me, and it's not just me, either. Dali, Bosch, and Escher are other human artists whose work can be unfamiliar and alien. So being created by our species doesn't automatically imbue something with familiar human processes.

So it dot products, it matrix multiplies, instead of reasoning and understanding. It's the Chinese room experiment on steroids; it turns out a sufficiently large corpus on a sufficiently large machine does make it look like something "understands".

trimethylpurine
1 replies
12h15m

The word "alien" works in this context but, as the previous commenter mentioned, it also carries the implication of foreign origin. You could use "uncanny" instead. Maybe that's less arbitrary and more specific to these examples.

"Alien" still works, but then you might have to add all the context at length, as you've done in this last comment.

fire_lake
0 replies
11h49m

Hype people do this all the time - take a word that has a particular meaning in a narrow context and move it to a broader context where people will give it a sexier meaning.

    AI researchers unveil alien intelligence
is a way better headline.

taneq
1 replies
10h48m

In all fairness, going up to some random human and yelling AAAAAAAAAAAAAA… at them for long enough will produce some out-of-distribution responses too.

cloudwalk9
0 replies
3h10m

Makes me think that TikTok and YT pranksters are accidentally producing psychological data on what makes people tick under scenarios of extreme deliberate annoyance. Although the quality (and importance) of that data is obviously highly variable and probably not very high, and depends on what the prank is.

samus
0 replies
11h13m

The context window is comparable to human short-term memory. LLMs are missing episodic memory and means to migrate knowledge between the different layers and into their weights.

Math is mostly impeded by the tokenization, but it would still make more sense to adapt them to use RAG to process questions that are clearly calculations or chains of logical inference. With proper prompt engineering, they can process the latter though, and deviating from strictly logical reasoning is sometimes exactly what we want.

The ability to reset the text and to change that history is a powerful tool! It can make the model roleplay and even help circumvent alignment.

I think that LLMs could one day serve as the language center of an AGI.

inference-lord
0 replies
6h6m

Do you find a large database or spreadsheet that holds more information than you do "alien" too?

jfoster
2 replies
14h35m

They can write in a way similar to how a human might write, but they're not human.

The chat interfaces (Claude, ChatGPT) certainly have a particular style of writing, but the underlying LLMs are definitely capable of impersonating our species in the medium of text.

bamboozled
1 replies
7h22m

But they're extremely relatable to us because they're regurgitating us.

I saw this talk with Geoffrey Hinton the other day, and he said he was astonished at the capabilities of ChatGPT-4 because he asked it what the relationship between a compost heap and a nuclear bomb was, and he couldn't believe it answered; he really thought it was proof the thing could reason. Totally mind-blown.

However I got it right away with zero effort.

Either I'm a super genius or this has been discussed before and made its way into the training data.

Usual disclaimer: I don't think this invalidates the usefulness of AI or LLMs, just that we might be bamboozling ourselves into the idea that we've created an alien intelligence.

EMM_386
0 replies
27m

Either I'm a super genius or this has been discussed before and made it's way into the training data.

If an LLM can tell you the relationship between a compost heap and a nuclear bomb, that doesn't mean that was in the training data.

It could be because a compost heap "generates heat", and a nuclear bomb also "generates heat", and due to that relationship they have something in common. The model will pick up on these similar patterns. The tokens are positioned closer to each other in the high-dimensional vector space.

But for any given "what does x have in common with y", that doesn't necessarily mean someone has asked that before and it's in the training data. Is that reasoning? I don't know ... how does the brain do it?

namarie
0 replies
14h38m

I can agree on the context windows, but what other output structure would you have?

mlsu
0 replies
10h32m

Disagree. The input/output structure (tokens) is the interface for both inference and for training. There is an emergent intellect embedded in the weights of the model. However, it is only accessible through the autoregressive token interface.

This is a fundamental limitation, much more fundamental than appears at first. It means that the only way to touch the model, and for the model to touch the world, is through the tokenizer (also, btw, why the tokenizer is so essential to model performance). Touching the world through a tokenizer is actually quite limited.

So there is an intelligence in there for sure, but it is locked in an ontology that is tied to its interface. This is even more of a limitation than e.g. weights being frozen.

gpm
9 replies
15h9m

If you want to download a backup of a large chunk of human knowledge... download wikipedia. It's a similar size to a small LLM and can actually distinguish between real life and fantasy: https://en.wikipedia.org/wiki/Wikipedia:Database_download

If you just want to play around with an LLM though, absolutely.

int_19h
5 replies
11h44m

Kiwix provides prepackaged highly compressed archives of Wikipedia, Project Gutenberg, and many other useful things: https://download.kiwix.org/zim/.

Between that and dirt cheap storage prices, it is possible to have a local, offline copy of more human knowledge than one can sensibly consume in a lifetime. Hell, it's possible to have it all on one's smartphone (just get one with an SD card slot and shove a 1+ TB one in there).

claritise
3 replies
5h47m

Just create a RAG setup with Wikipedia as the corpus and a low-parameter model to run it, and you can basically have an instantly queryable corpus of human knowledge runnable on an old Raspberry Pi.

boywitharupee
1 replies
4h5m

But which model to tokenize with? Is there a leaderboard for models that are good for RAG?

sroussey
0 replies
49m

“For RAG” is ambiguous.

First there is a leaderboard for embeddings. [1]

Even then, it depends how you use them. Some embeddings pack the highest signal in the beginning so you can truncate the vector, while most can not. You might want that truncated version for a fast dirty index. Same with using multiple models of differing vector sizes for the same content.

Do you preprocess your text? There will be a model there. Likely the same model you would use to process the query.

There is a model for asking questions from context. Sometimes that is a different model. [2]

CaptainOfCoit
0 replies
4h19m

a low parameter model

on an old raspberry pi

I bet the LLM responses will be great... You're better off just opening up a raw text dump of Wikipedia markup files in vim.

Workaccount2
0 replies
3h8m

Pretty neat to have laying around, thanks

CaptainOfCoit
2 replies
4h22m

actually distinguish between real life and fantasy

Are LLMs unable to distinguish between real life and fantasy? What prompts have you thrown at them to make this determination? Sending a small fairy tale and asking the LLM if it thinks it's a real story or fake one?

gpm
1 replies
3h58m

... having them talk about events from sci fi stories in response to questions about the real world. Having them confidently lie about pretty much everything. Etc.

CaptainOfCoit
0 replies
3h35m

What are the specific prompts you're using? You might get those answers when you're not being specific enough (or use models that aren't state of the art).

"Shit in, shit out" as the saying goes, but applied to conversations with LLMs where the prompts often aren't prescriptive enough.

texuf
5 replies
15h26m

Any recommendations for the latest and greatest way to run these locally?

speps
0 replies
15h25m

llamafile as per TFA...

slowmotiony
0 replies
10h20m

I use a tool called LM Studio, makes it trivial to run these models on a Mac. You can also use it as a local API so it kinda acts like a drop-in replacement for the openAI API.

fragmede
0 replies
13h40m

ollama

chown
0 replies
5h9m

I am the author of Msty [1]. My goal is to make it as straightforward as possible with just one click (once you download the app). If you end up trying it, I would love to hear your feedback.

1: https://msty.app

creatonez
4 replies
14h19m

Maybe I'm seeing things through a modern lens, but if I were trying to restart civilization and was only left with ChatGPT, I would be enraged and very much not grateful for this.

nyokodo
1 replies
12h32m

if I were trying to restart civilization and was only left with ChatGPT

In this scenario you'd need to also be left with a big chunk of compute and power infrastructure. Since ChatGPT is the front end of the model, you'd also need to have the internet still going in a minimal capacity.

CaptainOfCoit
0 replies
3h32m

If we're playing this game, you forgot to mention that they also need: a monitor, a keyboard, a roof over their head (to prevent rain from getting into the electronics), etc., etc...

But really, didn't you catch the meaning of the parent's message, or are you being purposefully obtuse?

devsda
1 replies
9h44m

I think re-imagining the "Dr. Stone" series with the main character replaced by an LLM would make for a funny and interesting series, if we decide to stay true to LLMs' nature and make it hallucinate as well.

Given the way LLMs are right now, I suspect there will be a lot of failed experiments and the kingdom of science will not advance that quickly.

latexr
0 replies
6h19m

the kingdom of science will not advance that quick.

It’s more likely that it wouldn’t even start. The first step to any development was figuring out nitric acid as the cure to the petrification. Good luck getting any LLM to figure that out. Even if it did, good luck getting any of the other characters to know what to do with that information that early on.

raincole
3 replies
12h32m

It seems to be an unbelievably inefficient way to back up knowledge.

samus
2 replies
10h58m

Are they though? They lossily compress trillions of tokens into a few dozen GB. The decompression step is fuzzy and inefficient, though.

raincole
1 replies
10h9m

And it requires massive computational power to decompress, which I don't expect to be available in a catastrophic situation where humans have lost a large chunk of important knowledge.

samus
0 replies
9h46m

I don't necessarily agree. It requires massive computing power, but running models smaller than 70B parameters is possible on consumer hardware, albeit slowly.

TheCaptain4815
3 replies
14h30m

It’s kind of crazy really. Before LLMs, any type of world scale disaster you’d hope for what? Wikipedia backups? Now, a single LLM ran locally would be much more effective. Imagine the local models in 5 years!

int_19h
0 replies
11h41m

There's a lot more than just Wikipedia that gets archived, and yes, that is a far more sensible way to go about it. For one thing, the compute required to then read it back is orders of magnitude less (a 15 year old smartphone can handle it just fine). For another, you don't have to wonder how much of what you got back is hallucinated - data is either there or it's corrupted and unreadable.

danmur
0 replies
14h1m

Uh yeah, I would, and still do, take the Wikipedia backup for doomsday scenarios. I'm not even sure how that would be a competition.

Zambyte
0 replies
13h24m

The processing required to run current language models with a useful amount of knowledge encoded in them is way more than I imagine would be available in a "world scale disaster".

m3kw9
2 replies
13h38m

And why would I need to back up human knowledge as an individual?

exe34
1 replies
8h48m

You remember those fantasies where you got up from your seat at the pub and punched the lights out of this guy for being rude? A lot of us have fantasies of being the all powerful oracle that guides a reboot of civilization using knowledge of science and engineering.

LunaSea
2 replies
9h50m

I wonder how the Chinese government will manage to censor LLMs within China?

popol12
1 replies
7h17m

The same way Facebook/Google/openAI & others censored their own LLMs, I guess ?

LunaSea
0 replies
5h16m

That's only for SaaS LLMs, but if you can simply download and run one on your hardware, things become difficult.

kalleboo
1 replies
8h1m

I had downloaded some LLMs to run locally just to experiment when a freak hailstorm suddenly left me without internet for over a week. It was really interesting to use a local LLM as a replacement for Google.

It gave me a new mental model for LLMs, rather than "spicy autocomplete" or whatever; I now think of it as "a lossy compressed database of knowledge". Like you ran the internet through JPEG at 30% quality.

pizzafeelsright
0 replies
3h28m

Feels like that really smart friend who is probably correct but ya just don't know.

dragonwriter
0 replies
41m

Language models are an inefficient way to store knowledge; if you want to have a “pseudo-backup of a large chunk of human knowledge,” download a wikipedia dump, not an LLM.

If you want a friendly but fallible UI to that dump, download an LLM and build a simple ReAct framework around it with prompting to use the wikipedia dump for reference.

4bpp
26 replies
5h54m

It would be good to see some independent verification of this claim. HN has previously [1] fallen for a claim by the same author to have reduced llama.cpp memory usage for a dense model way below the size of the model, which should have failed a basic smell test and indeed was debunked shortly after. Justine Tunney appears to enjoy extreme superstar status here, and it's hard to overstate the degree of social pressure that needed to be overcome at the time for the skeptic position to reach fixation (to begin with, what other LLM developments even hit upvote numbers like the +1300ish there or the +712 here at the time of writing?).

[1] https://news.ycombinator.com/item?id=35393284

freedomben
7 replies
5h10m

Justine Tunney appears to enjoy extreme superstar status here

This is true, and for sure pretty much all humans can benefit from increased skepticism (though not cynicism), but that superstar status was achieved through numerous impressive works. Cosmopolitan Libc and Actually Portable Executable were some of the past works that alone were worthy of significant respect, and for many people (like myself) these were our first introduction.

Speaking only for myself, I have a high opinion of Justine on technical merits. I'm sure she makes mistakes like all humans. I can tell she gets excited by discoveries and the chase, and that probably does sometimes cause premature celebration (this is something I struggle with so it's recognizable to me haha), but being wrong sometimes doesn't erase when you're right, and she has been spectacularly right a lot more times than most people I know.

There have been some personality clashes between Justine and others at times, and unfortunately it's situations where only part (sometimes a small part) of it was public, meaning we can only take people's word for what happened. Given my ignorance, I choose to withhold judgment here, but even if I didn't (and assumed she was guilty) it doesn't change the technical merits and it certainly wouldn't dissuade me from seeing what she's working on now.

So when I see stuff from Justine come out like this, it gets my attention. Would it get my attention if the same thing were posted by somebody whose name I don't recognize? Likely not, but I think that is (unfortunately) part of being a human. We aren't capable (yet!) of evaluating everything on technical merit alone because the sheer volume of material far exceeds our time. Therefore we use other, less reliably true, signalling mechanisms as a way to quickly decide what is worthy of our time investment and what may not be. Reputation/name recognition is a very imperfect, but better than random chance, indicator.

4bpp
4 replies
4h19m

I don't know, my first (and main) impression of them was actually in the context of the llama.cpp mmap story, as I was somewhat involved in the project back then, and there I thought their impact on the project was predominantly negative. While they introduced a mildly beneficial change (mmap-based model loading), the way in which this was done was not healthy for the project - the changes were rammed through with little regard for concerns that existed at the time about backwards compatibility and edge cases that might be broken by the half-baked patch, Justine came across as self-aggrandizing (in the sense of "acting as if they ran the place", presenting their proposals as a plan that others must follow rather than suggestions) and overly eager to claim credit (epitomized by the injection of their own initials into the magic number file format identifier next to those of the project originator's, and the story of the hapless other author of the mmap changeset who was at first given a token acknowledgement but then quickly sidelined). Arguments for the inclusion of the patch seemed to be won by a combination of half- and untruths like those about memory savings and the sudden participation of a large number of previously uninvolved sycophants. It is fortunate that Georgi handled the fallout as well as he did, and that he in fact had amassed the social capital necessary to survive his heavy-handed solution (soft-banning both JT and their most prominent detractor). A less-successful project would probably have found itself captured or torn apart by the drama.

There is nothing wrong with holding people in esteem for their achievements, but in this case the degree of esteem really seems to be excessive. This is not a matter of simply being annoyed that people like "the wrong thing" - the mmap situation was significantly exacerbated by the presence of irrational/excessive supporters of Justine's as well as the irrational/excessive detractors that emerge wherever the former exist.

freedomben
3 replies
3h47m

I would like to know more about the mmap situation, as what I saw on the surface could warrant some concern. Being somewhat involved you would probably know better than I as I was just an observer reading the thread after-the-fact. It seemed like the biggest accusation was the plagiarism (or "collaborating" but mostly taking somebody else's code).

Did anybody besides the two parties see the code develop, or does anybody else have knowledge of this? Or is it just his word vs. hers? Do you have any suggested reading to get more perspective other than just the github thread and HN thread? (really asking. these aren't rhetorical questions)

Reading the thread, I do think there are a lot of opportunities to read in confirmation bias. For example if I start reading that thread with the idea that Justine is coming in to hijack the project and make herself the hero that it needs and deserves, and to get her initials embedded in there as a permanent tribute to her own glory, I can see that. But if I read it as her coming in with cool work that she's excited about, and had to come up with a new format and couldn't think of a name (naming things can be really hard) and just stuck in one of the first things that came to mind (or even used as a placeholder prior to discussion), I can see that as well.

I absolutely don't want the truth covered up, but I also don't want to accept as true things that aren't true, especially where the implications are toward somebody's character. I'm a big "benefit of the doubt" kind of person.

4bpp
2 replies
3h15m

My sense is that the part about credit/collaboration was actually somewhat overblown among the detractors. What roughly happened as far as I can remember is that JT and another person worked on mmap together with about equal contribution, though the other person might have been the one to have initiated the idea (and solicited help to push it through); then at some point JT decided to make a PR to the main repository in their own name, but crediting the other collaborator as a coauthor, which may or may not have been coordinated with the other person. After that, though, in a fairly characteristic fashion, JT started fielding adulatory questions from their fans (on Github, but also on HN, Twitter and possibly other media) about the change, and quickly switched to simply referring to it as their own, with no mention of the other contributor. The other contributor expressed some misgivings about having their contribution erased, which were picked up by a growing set of people who were generally resentful about JT's conduct in the project. As far as I can tell, when confronted about it, JT at no point explicitly denied what the other person did (and I think the commit logs should all still be there in the fork), but at some point the other person just decided to stop pushing the issue due to being uncomfortable with becoming a playing ball in the fandom war between JT fans and antis.

My personal main gripe with JT really was the tone they adopted in the Github discussions, and the effect of the large numbers of drive-by supporters, who were often far less restrained in both unfounded claims about Justine's accomplishments and attacks on any critics. (At this point I'd also like to note that I consider some sibling comments to be uncomfortably hostile in a personal way, like the "hit piece" one.) I think that as a public persona, especially one who actively pursues publicity, you have some responsibility to restrain your followers - Justine, I get the sense, instead uses them as deniable proxies, as also seen with the instances where instead of straight up putting their signature on the "RAM usage reduced to 6GB" claim they instead choose to post a collage of screenshots of supporters making it.

cryptonector
1 replies
2h41m

This could all be true, but it's hard to evaluate these claims on their own. Not being involved in any way, all I can do is conclude that there is some friction in that community. It's possible that JT is toxic, it's possible that you are toxic, it's possible that neither of you is generally toxic but something about your personalities causes your interactions to become toxic, it's even possible that neither of you were toxic in any way but your impression of things after the fact is as-if Tunney had been toxic. Sometimes one has to stop and think about these things and figure out how to smooth things over, and sometimes it's not possible to smooth things over.

4bpp
0 replies
1h19m

I didn't have any direct interactions with JT then or now - while it was hard to ignore the discussion as an onlooker, it did not touch upon any parts of the code that I was involved with. This seems to be one of the topics where everyone who is even tangentially involved is under a default suspicion of being biased in one direction or another.

llm_trw
1 replies
4h34m

This is true, and for sure pretty much all humans can benefit from increased skepticism (though not cynicism), but that superstar status is achieved from numerous impressive works.

It is achieved through a never-ending parade of self-aggrandizement.

What Justine is very good at is presenting trivial concepts from a world which few front end developers understand in a language that most front end developers understand.

I had the misfortune of having to find out about her because of how thoroughly she polluted the google search space for lisp with her implementation of sector lisp. For some reason google decided that sector lisp needed to be in the top 5 results for every query about `minimal lisp with quotation` even when quotation wasn't implemented in her version.

cl3misch
0 replies
2h52m

presenting trivial concepts from a world which few front end developers understand in a language that most front end developers understand

Completely ignoring the JT discussion, the argument that something is trivial in some area does not really hold. 1) Science is mostly "just" connecting the dots, and 2) landmark discoveries tend to look trivial in hindsight almost by definition, because they have to be straightforward enough to be widely adopted.

mtlynch
6 replies
4h25m

HN has previously [1] fallen for a claim by the same author to have reduced llama.cpp memory usage for a dense model way below the size of the model, which should have failed a basic smell test and indeed was debunked shortly after.

Where did Justine claim this? The link you provided is Justine saying that she doesn't have an explanation for the reduction in RAM and that readers shouldn't treat it as fact yet:

The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.

Was the link supposed to show the false claim or the debunking of the claim?

4bpp
5 replies
3h54m

Plenty of claims about it, e.g. here as a "fact": https://github.com/ggerganov/llama.cpp/discussions/638#discu.... I don't think occasional expressions of lingering doubt (still couched among positive language like calling it a "miracle") can offset all the self-promotion that clearly seeks to maximise visibility of the implausible claim, even as it is attributed to others, as for example in https://twitter.com/JustineTunney/status/1641881145104297985... . A cereal manufacturer would probably be held responsible for package text like "Fruity Loops cured my cancer! - John, 52, Kalamazoo" too.

mtlynch
2 replies
3h32m

I don't read that as a claim of fact at all. From the link you shared:

Now, since my change is so new, it's possible my theory is wrong and this is just a bug. I don't actually understand the inner workings of LLaMA 30B well enough to know why it's sparse.

I haven't followed her work closely, but based on the links you shared, she sounds like she's doing the opposite of self-promotion and making outrageous claims. She's sharing the fact that she's observed an improvement while also disclosing her doubts that it could be experimental error. That's how open-source development is supposed to work.

So, currently, I have seen several extreme claims of Justine that turned out to be true (cosmopolitan libc, ape, llamafile all work as advertised), so I have a higher regard for Justine than the average developer.

You've claimed that Justine makes unwarranted claims, but the evidence you've shared doesn't support that accusation, so I have a lower regard for your claims than the average HN user.

4bpp
1 replies
3h11m

The very opening line says

I'm glad you're happy with the fact that LLaMA 30B (a 20gb file) can be evaluated with only 4gb of memory usage!

The line you quoted occurs in a context where it is also implied that the low memory usage is a fact, and there might only be a bug insofar as that the model is being evaluated incorrectly. This is what is entailed by the assertion that it "is" sparse: that is, a big fraction of the parameters are not actually required to perform inference on the model.

wpietri
0 replies
2h41m

I think you are making a lot of soup from very little meat. I read those links the same way mtlynch read them. I think you're looking for a perfection of phrasing that is much more suited to peer-reviewed academic papers than random tweets and GitHub comments taken from the middle of exploring something. Seeing your initial comment and knowing little about the situation, I was entirely prepared to share your skepticism. But at this point I'm much more skeptical of you.

cryptonector
1 replies
2h51m

Where's the 30B-in-6GB claim? Searching for "GB" (^F GB) in your GH link finds [0], which is neither by jart nor by ggerganov but by another user, who promptly gets told to look at [1], where Justine denies that claim.

  [0] https://github.com/antimatter15/alpaca.cpp/issues/182
  [1] https://news.ycombinator.com/item?id=35400066

4bpp
0 replies
2h43m

These all postdate the discussions that I linked (from March 31st). By April 1st, JT themselves seem to have stopped making/boosting the claim about low memory usage.

quest88
3 replies
3h57m

What's the point of your comment if you're not going to do the work yourself? If you don't have something nice to say then don't say it.

The "hey this may or may not be true so someone go figure it out" is lazy, self-gratifying and pointless.

thebytefairy
1 replies
3h49m

I think it's very helpful for someone to point out that the source has been shown to be unreliable before, and we should wait for more verification from others knowledgable in the space.

freedomben
0 replies
3h36m

Agreed. I think there's a blurry gray line between pointing out a potentially unreliable source and a lazy dismissal, but if there's reasonable doubt I think it's good for HN. If the doubt isn't reasonable, it will be torn apart by other commenters, and then it's an explicit discussion that people can read and decide on

renewiltord
0 replies
2h18m

It's really popular online. I think that's because many people here read a lot of this content but don't actually have the skill or background to do analysis. So they give us history rather than examination. Which has some value, I suppose.

rpdillon
2 replies
3h52m

This comment reads like real scientific skepticism, but from my recollection of events, is more of a hit piece that takes what should be a technical discussion and drags in bunch of personal baggage. In particular:

HN has previously fallen for a claim by the same author to have reduced llama.cpp memory usage for a dense model way below the size of the model,

is not true at all. Someone else made the claims about 6GB RAM usage for a 30B model, I remember reading it at the time and thinking "Yeah, that doesn't make sense, but the loading time improvement is immense!" And it was - I run all my LLMs locally on CPU because I don't have dedicated hardware, and jart's work has improved usability a lot.

and it's hard to overstate the degree of social pressure that needed to be overcome at the time for the skeptic position to reach fixation

I was reading the same HN discussions you were at the time, and it was pretty trivial to see that the loading time claim held up, and the RAM claim was dubious and likely simply due to not understanding some effect of the change completely. Heck, jart's own discussion of the topic reflected this at the time.

For the current change, I feel like your comment is even more misplaced. The blog post linked to for this story has a huge amount of detail about performance on specific processors (Skylake, Alderlake, RPi5/4, M2 Ultra, and 7995WX) with specific models. So when you say:

It would be good to see some independent verification of this claim.

What I hear is "4bpp thinks there's a real risk the numbers in the linked post are fabricated, and jart is just trying to get attention."

And that doesn't seem reasonable at all, given the history of her work and the evidence in front of us.

throwup238
0 replies
3h25m

I distinctly remember most of the people in the comments misunderstanding kernel memory paging or learning about it for the first time.

It genuinely did make llama.cpp a lot more usable at the time.

4bpp
0 replies
3h3m

The loading time improvements largely held up, and on the balance the mmap contribution was ultimately good (though the way it was implemented was really quite problematic, as a matter of process and communication). However, as I point out in https://news.ycombinator.com/item?id=39894542, JT quite unambiguously did try to cash in on the "low memory usage" claim - uncritically reprinting positive claims by others about your own work that otherwise would have been largely invisible should really not be treated differently as making those claims yourself.

I do think that there is a real risk that the numbers are wrong (not necessarily "fabricated", as this implies malfeasance, but possibly based on an erroneous measurement insufficiently questioned due to an excess of trust from themselves and others, as the mmap ones were). This is also in part based on the circumstance that at the time (of the mmap story, and myself being more involved in the project) I was actually involved in trying to optimise the SIMD linear algebra code, and unless llama.cpp has since switched to a significantly less performant implementation the proposition that so much more performance could be squeezed out strikes me as quite surprising. Here, your intuitions may say that Justine Tunney is just so brilliant that they make the seemingly impossible possible; but it was exactly this attitude that at the time made it so hard to evaluate the mmap memory usage claims rationally and turned the discussion around it much more dysfunctional than it had to be.

larodi
1 replies
3h49m

All the core llama.cpp devs are superstar devs and 10x devs or whatever you want to call a super smart person who is also super productive and very good with applied calculus. Jart is quite apparently very smart, but their relationship with this project was not without turbulence, and at present they (jart) are not a core dev of llama.cpp. So for a while a lot of their (I'd like to write "her", but not sure if that's correct) actions seem to have been aimed at getting attention, and perhaps particularly the attention of the same folk.

On the contrary, ggerganov, slaren, and JohannesGaessler seem to have never chased this sensationalist super status, but actually let their work speak for them. You'll barely find comments by these people on HN, while jart every so often finds a way to manifest themselves here. And this behaviour on jart's part now bears fruit - for example, Phoronix's Michael Larabel will praise jart for their work on llamafile, absolutely obliterating the fact that it is largely based on the wonderful work of ggerganov et al.

__turbobrew__
0 replies
2h42m

When they claimed to drastically improve memory utilization through the use of memory maps, despite not doing so, and then started a huge controversy which derailed the project, I would say they were a 0.1x dev, not a 10x dev.

leeoniya
0 replies
5h1m

and indeed was debunked shortly after

was also surprised that she continues to mention the mmap thing in a positive light even after the facts about the claim were settled to the contrary, even disregarding the whole attribution fiasco.

azeirah
0 replies
4h40m

You can simply check the pull request on llama.cpp on GitHub. JohannesGaessler (a core maintainer) has already run the code and says it's an impressive speed-up. There isn't a thorough review by any of the core maintainers yet, but this is very likely exactly what Justine says it is: various significant and insignificant speedups.

speps
15 replies
10h25m

Regarding this bit at the end:

I learned how to write math kernels by renting Vast VMs and watching Gautham Venkatasubramanian and mrdomino develop CUDA kernels in a tmux session. They've been focusing on solving a much more important challenge for llamafile, which is helping it not have a mandatory dependency on the cuBLAS

If I'm reading this right, they're trying to rewrite cuBLAS within CUDA itself. I'm guessing the next step would be removing CUDA dependency and go with directly using Vulkan or Metal compute shaders. Am I correct?

WithinReason
11 replies
9h48m

Yes, but none of these have performance portability across GPU vendors, so it's probably seen as pointless. You would need an AMD Vulkan shader, an Nvidia one, an Intel one, etc. It's not like C code on CPUs.

TuringNYC
4 replies
7h14m

To me it makes sense to have an interface that can be implemented individually for AMD, Metal, etc. Then, leave it up to the individual manufacturers to implement those interfaces.

I'm sitting in an office with a massive number of Macbook Pro Max laptops usually sitting idle and I wish Apple would realize the final coup they could achieve if I could also run the typically-NVIDIA workloads on these hefty, yet underutilized, Mx machines.

jorvi
3 replies
6h3m

Apple could unlock so much compute if they give customers a sort of “Apple@Home” deal. Allow Apple to run distributed AI workloads on your mostly idle extremely overpowered Word/Excel/VSCode machine, and you get compensation dropped straight into your Apple account’s linked creditcard.

TuringNYC
1 replies
4h41m

BTW, at our day-job, we've been running a "cluster" of M1 Pro Max machines running Ollama and LLMs. Corporate rules prevent remote access onto machines, so we created a quick and dirty pull system where individual developers can start pulling from a central queue, running LLM workloads via the Ollama local service, and contributing things back centrally.

Sounds kludgy, but introduce enough constraints and you end up with this as the best solution.

nickpsecurity
0 replies
12m

Do you have price-performance numbers you can share on that? Like compared against local or cloud machines with RTX and A100 GPU’s?

newswasboring
0 replies
5h27m

If Apple were doing an Apple@Home kind of deal they might actually want to give away some machines for free or super cheap (I realize that doesn't fit their brand) and then get the rights perpetually to run compute on them. Kind of like advertising but it might be doing something actually helpful for someone else.

radarsat1
3 replies
8h7m

Depending on how many individual tweaks are necessary for hardware variants of course... but at this level of code & complexity it actually seems pretty reasonable to write 3 or 4 versions of things for different vendors. More work yes, but not pointless.

treffer
2 replies
7h12m

A nice example of this is FFTW, which has hundreds (if not thousands) of generated methods to do the FFT math. The whole project is a code generator.

After compilation, it can then benchmark these, generate a wisdom file for the hardware, and pick the right implementation.

Compared with that "a few" implementations of the core math kernel seem like an easy thing to do.

naasking
0 replies
5h4m

Not exactly comparable, as you said, the FFTW implementations are auto-generated but it doesn't sound like these few implementations will be.

bee_rider
0 replies
4h31m

ATLAS was an automatically tuned BLAS, but it’s been mostly supplanted by ones using the hand-tuned kernel strategy.

surge
1 replies
5h11m

Maybe it's a dumb question, but isn't something like OpenCL meant to solve this problem?

jvanderbot
0 replies
4h59m

From my understanding, using triangle/pixel shaders to do HPC has given way to a more general-purpose GPU programming paradigm, which is CUDA.

Of course this knowledge is superficial and probably outdated, but if I'm not too far off base, it's probably more work to translate a general CUDA-like layer or CUDA libs to OpenCL.

larodi
2 replies
9h45m

llama.cpp (or rather G. Gerganov et al.) are trying to avoid cuBLAS entirely, using its own kernels. Not sure how jart's effort relates, and whether jart intends to upstream these into llama.cpp, which seems to still be the underlying tech behind llamafile.

homarp
1 replies
8h49m

Here are links to the most recent pull requests sent

    https://github.com/ggerganov/llama.cpp/pull/6414
    https://github.com/ggerganov/llama.cpp/pull/6412

speps
0 replies
6h8m

This doesn't relate to GPU kernels unfortunately.

1-6
15 replies
15h27m

Question is, how much of an improvement has it gotten to over a GPU or ASIC?

dartos
8 replies
15h9m

Nothing in software will ever beat an equivalent ASIC.

fulafel
3 replies
10h51m

Most ASICs are cost or power optimizations.

dartos
2 replies
6h10m

Exactly. They’re much faster for their specific tasks and thus are more power efficient and potentially cost efficient

fulafel
1 replies
4h14m

No. E.g., of the hardware discussed in the article, the Raspberry Pi uses an ASIC that's slow, cheap, and low-power compared to the Intel or AMD chips.

In some cases ASICs are faster than general-purpose CPUs, but usually not.

LtdJorge
0 replies
1h36m

Is the LLM running on an ASIC for the Pi here? I doubt it.

postalrat
1 replies
13h52m

Sure there is. Software is easy to change.

dartos
0 replies
6h9m

By “beat” I meant in performance.

Obviously you can’t change an asic

fragmede
1 replies
13h19m

an asic is fixed function, so it'll never be able to boot my pc and then be the CPU, even though an asic beats the pants off anything else computing Sha hashes for Bitcoin mining.

dartos
0 replies
6h11m

By “beat” I meant performance.

Obviously an ASIC is not a general purpose machine like a cpu.

gpapilion
2 replies
12h41m

So... I was struggling with this for a while. I would say anywhere from 2x to an order of magnitude faster with a GPU. (I've been looking at a lot of GPU benchmarks lately, and they are REALLY hard to compare since they are all so specific.)

I do think long term there gets to be more hope for CPUs here with inference, largely because memory bandwidth becomes more important than the GPU itself. You can see this with reports of the MI300 series outperforming the H100, largely because it has more memory bandwidth. MCR DIMMs give you close to 2x the existing memory bandwidth in Intel CPUs, and when coupled with AMX you may be able to exceed V100 and might touch A100 performance levels.

HBM and the general GPU architecture gives it a huge memory advantage, especially with the chip to chip interface. Even adding HBM to a CPU, you are likely to find the CPU is unable to use the memory bw effectively unless it was specifically designed to use it. Then you'd still likely have limited performance with things like UPI being a really ugly bottleneck between CPUs.

imtringued
1 replies
10h2m

If someone releases DDR5 or DDR6 based PIM, then most of the memory bandwidth advantage of GPUs evaporates overnight. I expect CPUs to be king at inference in the future.

gpapilion
0 replies
9h41m

But then you'll get GDDR6 delivered via HBM5 or whatever. I don't think CPUs will ever really keep up with the memory bandwidth, because for most applications it doesn't matter.

MCR DIMM is like 1/2 the memory bandwidth that is possible with HBM4, plus it requires you to buy something like 2TB of memory. It might get there, but I'd keep my money on hbm and gpus.

yjftsjthsd-h
0 replies
14h3m

I think that should be phrased more like "what fraction of GPU speed can this reach?", because it'll always be less than 1x.

jchw
0 replies
7h23m

I think I understand what you are thinking. You may be appending "than other ways of running them" to the end of the title, but it's actually "than it was on CPU before now".

baq
0 replies
10h28m

From the article, passage about the 14900k:

For example, when I run my spam.sh shell script, it only takes 420 milliseconds, which is 7x faster than my Raspberry Pi 5. That's right, when it comes to small workloads, this chip is able to finish before CUDA even gets started.

So… it depends :)

kiratp
8 replies
15h17m

It's fascinating to me that, coming up on a year since Sapphire Rapids became available in the public cloud, developers are still targeting AVX512 when they should be targeting VNNI and AMX.

https://github.com/ggerganov/llama.cpp/issues/2555

yjftsjthsd-h
4 replies
15h3m

This project in particular seems to care about the long tail of hardware; note that the very first machine in this post is a box from 2020 with spinning rust disk. Granted, adding support for newer extensions is likely also good, but cost/benefit is in play.

taneq
3 replies
14h22m

Is four years really 'long tail' these days? Our VM host box is from 2010 (and I had to rebuild llama.cpp locally without AVX to get it working :P )

yjftsjthsd-h
0 replies
14h4m

For cutting-edge LLM work, probably? I mean, I run mine on older hardware than that, but I'm a total hobbyist...

refulgentis
0 replies
3h7m

For LLMs...yeah. I imagine you're measuring in tokens/minute with that setup. So it's possible, but...do you use it much? :)

d416
0 replies
12h19m

It should be noted that while the HP Prodesk was released in 2020, the CPU’s Skylake architecture was designed in 2014. Architecture is a significant factor in this style of engineering gymnastics to squeeze the most out of silicon.

luyu_wu
0 replies
14h7m

I don't believe that is the target for a local LLM... Pretty sure we're talking about client-side computing, where even the newest hardware supports only AVX-512 (and even that sketchily on Intel's side).

kristianp
0 replies
12h49m

Just buy a new AMD processor that supports AVX512.

baq
0 replies
10h31m

People with Sapphire Rapids options are not the target audience of these patches

s_Hogg
7 replies
5h23m

I'd pay good money to watch jart in conversation with Carmack

Solvency
6 replies
4h32m

Carmack is great but completely irrelevant here. He missed the entire AI/LLM/ML boat to help Zuckerberg hawk virtual reality fantasies for years.

vinkelhake
4 replies
3h9m

Completely irrelevant is probably overstating it. He's been working on AI for the last 4+ years.

Solvency
2 replies
2h27m

He literally squandered the last 10 years of his life working on absolutely nothing for Zuckerberg. And only after the rest of the world innovated on AI (transformers, etc) did he clearly feel embarrassed and had to proclaim he's going to focus on AGI in a "one-up" way.

talldayo
0 replies
1h45m

He literally squandered the last 10 years of his life working on absolutely nothing

Speak for yourself, the Oculus Quest is the coolest piece of sub-$500 tech in my home.

fkyoureadthedoc
0 replies
1h21m

He got paid a lot to do something he was presumably passionate about and enjoyed. It also might surprise you to find out that there's quite a lot of people that just work as a means to an end, and find value and enjoyment primarily from other parts of their life.

cactusplant7374
0 replies
2h50m

He's striving for AGI though, right? So he's not really working on anything because he certainly hasn't discovered AGI.

cactusplant7374
0 replies
2h48m

Altman isn't even relevant here. He is focusing on LLM's instead of a framework that gets us to AGI. He can't describe how we get there or any such theories around AGI. It's a complete failure.

marshallward
7 replies
4h20m

There is an implication here that the Fortran implementation of `SGEMM` is somehow inadequate. But any modern Fortran compiler will quite easily apply the AVX and FMA optimizations presented here without any additional changes. Both GNU and Intel make these substitutions with the correct flags.

The unrolling optimization is also just another flag away (`-funroll-all-loops`). The Intel Compiler will even do this without prompting. In fact, it appears to only do a modest 2x unroll on my machine, suggesting that the extreme unroll in this article would have been overkill.

Parallelization is certainly a lot to ask of Fortran 77 source, but there is little stopping you from adding OpenMP statements to the `SGEMM` function. In fact, modern Fortran even offers its own parallelization constructs if you're willing to go there.

Which is to say: Let's not belittle this old Fortran 77 function. Yes it is old, and does not even resemble modern Fortran. But the whole point of Fortran is to free the developer from these platform-specific details, and hand the job off to the compiler. If you don't like that approach, then you're welcome to go to C or C++. But this little block of Fortran code is already capable of doing just about everything in this article.

steppi
2 replies
3h44m

The Fortran implementation is just a reference implementation. The goal of reference BLAS [0] is to provide relatively simple and easy to understand implementations which demonstrate the interface and are intended to give correct results to test against. Perhaps an exceptional Fortran compiler which doesn't yet exist could generate code which rivals hand (or automatically) tuned optimized BLAS libraries like OpenBLAS [1], MKL [2], ATLAS [3], and those based on BLIS [4], but in practice this is not observed.

Justine observed that the threading model for LLaMA makes it impractical to integrate one of these optimized BLAS libraries, so she wrote her own hand-tuned implementations following the same principles they use.

[0] https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...

[1] https://github.com/OpenMathLib/OpenBLAS

[2] https://www.intel.com/content/www/us/en/developer/tools/onea...

[3] https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Alg...

[4] https://en.wikipedia.org/wiki/BLIS_(software)

marshallward
1 replies
3h0m

Fair enough, this is not meant to be some endorsement of the standard Fortran BLAS implementations over the optimized versions cited above. Only that the mainstream compilers cited above appear capable of applying these optimizations to the standard BLAS Fortran without any additional effort.

I am basing these comments on quick inspection of the assembly output. Timings would be equally interesting to compare at each stage, but I'm only willing to go so far for a Hacker News comment. So all I will say is perhaps let's keep an open mind about the capability of simple Fortran code.

steppi
0 replies
2h9m

Check out The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí [0]. Chapter 5 walks through how to write an optimized GEMM. It involves clever use of block multiplication, choosing block sizes for optimal cache behavior on specific chips. Modern compilers just aren't able to do such things on their own. I've spent a little time debugging things in scipy.linalg by swapping out OpenBLAS for reference BLAS and have found the slowdown from using reference BLAS is typically at least an order of magnitude.

[0] https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf
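To make the blocking idea concrete, here's a minimal sketch (my own illustration, not the book's code and not what OpenBLAS or llamafile actually ship; the tile sizes are placeholder assumptions rather than tuned values):

    // Sketch of cache-blocked SGEMM: C += A * B, all row-major.
    // Assumes C is already initialized (e.g. zeroed) by the caller.
    // Real BLAS kernels add panel packing, register tiling, SIMD, and
    // per-CPU tile sizes on top of this basic loop structure.
    #include <algorithm>
    #include <cstddef>

    void sgemm_blocked(const float* A, const float* B, float* C,
                       std::size_t M, std::size_t N, std::size_t K) {
        constexpr std::size_t BM = 64, BN = 64, BK = 64;  // placeholder tile sizes
        for (std::size_t i0 = 0; i0 < M; i0 += BM)
            for (std::size_t k0 = 0; k0 < K; k0 += BK)
                for (std::size_t j0 = 0; j0 < N; j0 += BN)
                    // Work on a BMxBK tile of A and a BKxBN tile of B so the
                    // data being reused stays resident in cache.
                    for (std::size_t i = i0; i < std::min(i0 + BM, M); ++i)
                        for (std::size_t k = k0; k < std::min(k0 + BK, K); ++k) {
                            const float a = A[i * K + k];
                            for (std::size_t j = j0; j < std::min(j0 + BN, N); ++j)
                                C[i * N + j] += a * B[k * N + j];
                        }
    }

The arithmetic is identical to the naive triple loop; the win comes entirely from memory traffic, because each cached tile gets reused many times before being evicted.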

pklausler
1 replies
3h12m

Modern Fortran's only parallel feature is coarrays, which operate at the whole program level.

DO CONCURRENT is a serial construct with an unspecified order of iterations, not a parallel construct. A DO CONCURRENT loop imposes requirements that allow an arbitrary order of iterations but which are not sufficient for safe parallelization.

marshallward
0 replies
2h59m

How do you feel about Nvidia endorsing do concurrent migration to GPUs? Would that be classified as parallelization?

brrrrrm
1 replies
1h20m

using AVX/FMA and unrolling loops does extremely little in the way of compiling to fast (>80% peak) GEMM code. These are very much intro steps that don't take into account many important ideas related to cache hierarchy, uop interactions, and even instruction decode time. The Fortran implementation is entirely and unquestionably inadequate for real high performance GEMMs.

marshallward
0 replies
43m

I don't disagree, but where are those techniques presented in the article? It seems like she exploits the particular shape of her matrix to align better with cache. No BLAS library is going to figure that out.

I am not trying to say that a simple 50+ year old matrix solver is somehow competitive with existing BLAS libraries. But I disagreed with its portrayal in the article, which associated the block with NumPy performance. Give that to a 2024 Fortran compiler, and it's going to get enough right to produce reasonable machine code.

tiffanyh
6 replies
5h56m

Pixar uses CPUs …

I wonder if we’ll end up in a situation like rendered movies.

Where the big studios like Pixar uses CPUs (not GPUs) to render their movies due to the cost/perf (and access to larger amounts of RAM).

https://news.ycombinator.com/item?id=25616372

kreco
2 replies
5h31m

Where the big studios like Pixar uses CPUs (not GPUs) to render their movies due to the cost/perf (and access to larger amounts of RAM).

I wonder if (or when) this will change once integrated GPUs become "mainstream", the CPU/GPU share the same RAM AFAIK.

rockwotj
1 replies
5h16m

I expect GPU hardware to specialize like Google’s TPU. The TPU feels like ARM in these AI workloads where when you start to run these at scale, you’ll care about the cost perf tradeoff for most usecases.

CPU/GPU share the same RAM AFAIK.

This depends on the GPU. I believe Apple has integrated memory, but most GPUs, in my limited experience writing kernels, have their own memory. CUDA pretty heavily has a device memory vs host memory abstraction.

talldayo
0 replies
1h51m

On top of that, Nvidia has provided a unified addressing abstraction over PCI for a looooong time via CUDA: https://developer.nvidia.com/blog/unified-memory-in-cuda-6/

Customers like Pixar could probably push this even further, with a more recent Nvidia rack and Mellanox networking. Networking a couple Mac Studios over Thunderbolt doesn't have a hope of competing, at that scale.

CaptainOfCoit
2 replies
5h29m

I'm not sure how true that is anymore, from the outside it seems they're at least moving to a CPU/GPU hybrid (which makes a lot of sense), at least judging by new features landing in RenderMan that continues to add more support for GPUs (like XPU).

tiffanyh
1 replies
5h27m

Isn’t this more of a function that RenderMan is a sold product.

And it’s expected to at least support GPUs.

CaptainOfCoit
0 replies
5h13m

Hard to know without getting information from people at Pixar really.

Not sure how much sense it would make for Pixar to spend a lot of engineering hours for things they wouldn't touch in their own rendering pipeline. As far as I know, most of the feature development comes from their own rendering requirements rather than from outside customers.

wokwokwok
5 replies
14h20m

You don't need a large computer to run a large language model

While running TinyLlama does indeed count as running a language model, I'm skeptical that the capabilities of doing so match what most people would consider a baseline requirement to be useful.

Running a 10-parameter model is also "technically" running an LM, and I can do it by hand with a piece of paper.

That doesn’t mean “you don’t need a computer to run an LM”…

I’m not sure where LM becomes LLM, but… I personally think it’s more about capability than parameter count.

I don’t realllly believe you can do a lot of useful LLM work on a pi

mlyle
2 replies
14h13m

Tinyllama isn't going to be doing what ChatGPT does, but it still beats the pants off what we had for completion or sentiment analysis 5 years ago. And now a Pi can run it decently fast.

jerrygenser
1 replies
7h18m

You can fine-tune a 60M-parameter (e.g. DistilBERT) discriminative (not generative) language model, and it's one or two orders of magnitude more efficient for classification tasks like sentiment analysis, and probably similarly if not more accurate.

mlyle
0 replies
3h51m

Yup, I'm not saying TinyLlama is minimal, efficient, etc. (indeed, that is just saying that you can take models even smaller). And for a whole lot of what we throw LLMs at, they're not the right tool for the job, but it's expedient and surprisingly works.

samus
0 replies
10h51m

Some models trained more recently have been repeatedly shown to have performance comparable to that of larger models. And the Mixture of Experts architecture makes it possible to train large models that know how to selectively activate only the parts that are relevant for the current context, which drastically reduces compute demand. Smaller models can also level the playing field by being faster at processing content retrieved by RAG. Via the same mechanism, they could also access larger, more powerful models for tasks that exceed their capabilities.

SoothingSorbet
0 replies
9h5m

I've gotten some useful stuff out of 7B param LLMs, and that should fit on a Pi quantized.

none_to_remain
5 replies
14h1m

From the example: "--temp 0 turns off the random number generator (we don't want improvisation for a spam filter)"

I've been thinking for a while about how many applications of LLMs need this adjustment and aren't getting it

mvkel
3 replies
13h45m

Is that what it does, though?

I thought setting temperature to 0 would (extremely simple example) equate to a spam filter seeing:

- this is a spam email

But if the sender adapts and says

- th1s is a spam email

It wouldn't be flagged as spam.

none_to_remain
1 replies
13h28m

My understanding is that temperature applies to the output side and allows for some randomness in the next predicted token. Here Justine has constrained the machine to start with either "yes" or "no" and to predict only one token. This makes the issue stark: leaving a non-zero temperature here would just add a chance of flipping a boolean.

refulgentis
0 replies
3h8m

It's more nuanced than that, in practice: this is true for the shims you see from API providers (ex. OpenAI, Anthropic, Mistral).

With llama.cpp, it's actually not a great idea to have temperature purely at 0: in practice, especially with smaller models, this leads to pure repeating or nonsense.

I can't remember where I picked this up, but, a few years back, without _some_ randomness, the most likely next token was often just the previous token again, so the output degenerates into repetition.

samus
0 replies
10h28m

The output of an autoregressive model is a probability for each token to appear next after the input sequence. Computing these is strictly deterministic from the prior context and the model's weights.

Based on that probability distribution, a variety of text generation strategies are possible. The simplest (greedy decoding) is picking the token with the highest probability. To allow creativity, a random number generator is used to choose among the possible outputs, biased by the probabilities of course.

Temperature scales the output probabilities. As temperature increases, the probabilities approach 1/dictionary size, and the output becomes completely random. For very small temperature values, text generation approaches greedy sampling.

If all you want is a spam filter, better replace the output layer of an LLM with one with just two outputs, and finetune that on a public collection of spam mails and some "ham" from your inbox.
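A rough sketch of that temperature/sampling step (simplified, and not llama.cpp's actual sampler; real stacks also apply top-k/top-p filtering, repetition penalties, and so on):

    // Turn a vector of logits into a next-token choice.
    // temperature <= 0 here means greedy decoding (always the argmax);
    // higher temperatures flatten the distribution toward uniform.
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    int pick_token(const std::vector<float>& logits, float temperature,
                   std::mt19937& rng) {
        if (temperature <= 0.0f)
            return static_cast<int>(std::max_element(logits.begin(), logits.end())
                                    - logits.begin());
        std::vector<double> weights(logits.size());
        const float max_logit = *std::max_element(logits.begin(), logits.end());
        for (std::size_t i = 0; i < logits.size(); ++i)
            weights[i] = std::exp((logits[i] - max_logit) / temperature);  // softmax with temperature
        std::discrete_distribution<int> dist(weights.begin(), weights.end());
        return dist(rng);  // random choice, biased by the scaled probabilities
    }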

moffkalast
0 replies
8h51m

I couldn't disagree more, turning temp to zero is like taking a monte carlo method and only using one sample, or a particle filter with only one particle. Takes the entire concept and throws it out of the window so you can have predictability.

LLMs need to probabilistically explore the generation domain to converge on a good result for best performance. Similar issue with people benchmarking models by only having them output one single token (e.g. yes or no) outright, which prevents any real computation from occurring so the results are predictably poor.

mijoharas
3 replies
8h53m

Has Justine written anywhere about her disassembly setup?

I configured Emacs so I can push a button, and the disassembly for the C++ code I'm working on will pop up on the screen in a few milliseconds.

I assume it's something project specific rather than being able to get the disassembly for an arbitrary section of code or something?

It seems very handy, so I'd love to see the implementation (I couldn't find anything googling)

mijoharas
0 replies
3h49m

Thanks! I need to get better at googling I guess.

gpderetta
0 replies
43m

Nice. I have been using rmsbolt for a similar feature, but it is very rough. I'll need to give this a try.

discordance
3 replies
15h25m

"As for disk speed, dd if=/dev/zero of=/tmp/output bs=128k count=50k; rm -f /tmp/output reports 1.6 GB/s which is 3.6x slower than my Mac Studio, and 3x slower than my Intel (which has the same M.2 stick). I'm told that Intel and Apple are just better at this, but I wish I understood why. "

Can anyone here answer why this is?

bishfish
1 replies
14h14m

Plus she isn't using oflag=direct, so since the output file is small it isn't even making it to disk. I think it would only be sent to the page cache. I'm afraid she is testing CPU and memory (bus) speeds here.

oflag=direct will write direct and bypass page cache.

fweimer
0 replies
13h31m

Exactly. Something is very fishy if this system only writes 1.6 GB/s to the page cache. Probably that dd command line quoted in the article is incomplete.

pstrateman
0 replies
14h34m

Apple made fsync a no-op.

You have to make a different call (fcntl with F_FULLFSYNC) to get a real sync on macOS.

So tons of stuff is faster because it's not actually writing to disk.

aniijbod
3 replies
15h8m

A way of thinking about what's inside any of the top LLMs right now: even if they never learn another single fact, even if they get ridiculously out of date as a result, even if they are even more riddled with errors and prone to biases than we know them to be, and even if they are as prone to hallucinations as we know they are and never develop the capacity to cure themselves of this, they are still, despite their capacity for error, more knowledgeable and capable of more reasoned responses to more questions than any single human being that has ever lived.

talldayo
0 replies
1h44m

If you ignore my capacity for error, I bet I'd put up a good score too. Hell, maybe Markov chains are smarter than LLMs by this definition.

samus
0 replies
10h18m

We shouldn't choose LLMs for how many facts they support, but their capability to process human language. There is some overlap between these two though, but an LLM that just doesn't know something can always be augmented with RAG capabilities.

JKCalhoun
0 replies
15h7m

Picturing "LLM Jeopardy". You know, a game show.

aaronscott
3 replies
1h15m

I like to define my subroutines using a modern language like C++, which goes 47 gigaflops. This means C++ is three orders of a magnitude faster than Python. That's twenty years of progress per Moore's law.

This is great. I love the idea of measuring performance differences in “years of Moore’s law.”

Twenty years puts the delta in an easy to understand framework.

JohnKemeny
2 replies
51m

I doubt that you could get Python to run faster than C++ on 2004 hardware.

mrtranscendence
1 replies
44m

Python on 2024 hardware vs C++ on 2004 hardware ... I don't think it's obvious that C++ always wins here, though it would depend on the use case, how much of the Python is underpinned by native libraries, and the specific hardware in question.

JohnKemeny
0 replies
38m

If we allow native libraries, it's not clear that C++ would win, even on modern hardware.

AbuAssar
3 replies
6h21m

regarding AMD zen4 with avx512:

"Here we see that, despite only being twice the price, the 7995WX x86 ISA offers 7x more raw compute power than the M2 Ultra ARM ISA, and nearly the same token generation speed, which is likely thanks to its 384mb L3 cache. When I bought this chip, I had to expand support in llama.cpp for bfloat16 and AVX512 before I could fully test its capabilities. My work means you can now run LLaMA 2.8x faster on Zen4 than you could before."

reckless
2 replies
5h44m

Does this also count platform costs or just chip cost? I'd imagine the threadripper motherboard and ram costs aren't insignificant

KennyBlanken
1 replies
1h23m

A complete desktop computer with the M2 Ultra w/64GB of RAM and 1TB of SSD is $4k.

The 7995WX processor alone is $10k, the motherboard is one grand, the RAM is another $300. So you're up to $11300, and you still don't have a PSU, case, SSD, GPU....or heatsink that can handle the 300W TDP of the threadripper processor; you're probably looking at a very large AIO radiator to keep it cool enough to get its quoted performance. So you're probably up past $12k, 3x the price of the Studio...more like $14k if you want to have a GPU of similar capability to the M2 Ultra.

Just the usual "aPPle cOMpuTeRs aRE EXpeNsIVE!" nonsense.

incrudible
0 replies
10m

So from a CPU perspective you get 7x the CPU throughput for 3x to 4x the price, plus upgradable RAM that is massively cheaper. The M2 uses the GPU for LLMs though, and there it sits in a weird spot where 64GB of (slower) RAM plus midrange GPU performance is not something that exists in the PC space. The closest thing would probably be a (faster) 48GB Quadro RTX which is in the $5000 ballpark. For other use cases where VRAM is not such a limiting factor, the comparably priced PC will blow the Mac out of the water, especially when it comes to GPU performance. The only reason we do not have cheap 96GB GDDR GPUs is that it would cannibalize NVIDIA/AMDs high margin segment. If this was something that affected Apple, they would act the same.

bee_rider
2 replies
14h18m

Is it easy to find where the matvecs are, in LLaMA (if you are someone who is curious and wants to poke around at the “engine” without understanding the “transmission,” so to speak)? I was hoping to mess around with this for Stable Diffusion, but it seemed like they were buried under quite a few layers of indirection. Which is entirely reasonable, the goal is to ship software, not satisfy people who’d just want to poke things and see what happens, haha.

fragmede
1 replies
13h30m

did you see tinygrad can run llama and stable diffusion? it's an intentionally extremely simple framework vs pytorch or even micrograd, which helped me dig into the underlying math. though https://spreadsheets-are-all-you-need.ai/ is a good one for learning LLMs.

bee_rider
0 replies
13h25m

I haven’t seen that. I’ll definitely have to take a look, thanks!

kpw94
1 replies
14h28m

Great links, especially last one referencing the Goto paper:

https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/...

> I believe the trick with CPU math kernels is exploiting instruction level parallelism with fewer memory references

It's the collection of tricks to minimize all sorts of cache misses (L1, L2, TLB, page misses, etc.), improve register reuse, leverage SIMD instructions, transpose one of the matrices if it provides better spatial locality, etc.

larodi
0 replies
9h41m

The trick is indeed to somehow imagine how the CPU works with the Lx caches and keep as much info in them as possible. So it's not only about exploiting fancy instructions, but also about thinking in engineering terms. Most software written in higher-level languages cannot use L1/L2 effectively, and thus ends up constantly slower than algorithms of otherwise similar complexity (from an asymptotic-analysis perspective).

imtringued
1 replies
10h8m

From what I have heard they use manual spin locks. Generally, spin locks are not a good idea unless you want to dedicate the entire machine to a single application. If the process a spinlock waits on gets suspended, you're burning CPU time for nothing. The OS thinks a spinlock making zero progress is actually a high priority process, so it is starving the suspended process from making progress.
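For context, a manual spin lock is only a few lines; this is a generic sketch, not the project's actual code:

    // A bare spin lock: a waiting thread burns CPU in a tight loop instead of
    // sleeping. Cheap when the lock is held briefly and every thread has its
    // own core; pathological when the lock holder gets preempted, because the
    // waiters look like busy, runnable threads to the scheduler.
    #include <atomic>

    struct SpinLock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
        void lock() {
            while (flag.test_and_set(std::memory_order_acquire)) {
                // busy-wait: no yield, no sleep, no futex
            }
        }
        void unlock() { flag.clear(std::memory_order_release); }
    };

The usual mitigation is to spin a bounded number of times and then fall back to yielding or a proper mutex, which avoids the cliff when threads are oversubscribed.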

Ono-Sendai
0 replies
5h29m

Yeah the code looks like a spinlock. It behaves terribly under contention, resulting in performance falling off a cliff as the number of threads increases. Adding more threads actually slows down the total performance.

I would fix it if I could be bothered. Instead I will just use the Cuda whisper backend which is pretty nice and fast.

tomp
1 replies
6h29m

TL;DR: unroll the outer two loops of matrix multiplication
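For anyone who wants to see what that looks like in code, here's a rough sketch (my own illustration, not the actual llamafile kernel, which uses much larger tiles and SIMD intrinsics):

    // Sketch of unrolling the two outer loops: compute a 2x2 tile of C at a
    // time so the accumulators stay in registers and each loaded element of
    // A and B feeds several multiply-adds. Assumes M and N are even.
    void gemm_tiled_2x2(const float* A, const float* B, float* C,
                        int M, int N, int K) {
        for (int i = 0; i < M; i += 2)          // outer loop over rows, unrolled by 2
            for (int j = 0; j < N; j += 2) {    // outer loop over columns, unrolled by 2
                float c00 = 0, c01 = 0, c10 = 0, c11 = 0;
                for (int k = 0; k < K; ++k) {
                    const float a0 = A[(i + 0) * K + k], a1 = A[(i + 1) * K + k];
                    const float b0 = B[k * N + j + 0],  b1 = B[k * N + j + 1];
                    c00 += a0 * b0; c01 += a0 * b1;
                    c10 += a1 * b0; c11 += a1 * b1;
                }
                C[(i + 0) * N + j + 0] = c00; C[(i + 0) * N + j + 1] = c01;
                C[(i + 1) * N + j + 0] = c10; C[(i + 1) * N + j + 1] = c11;
            }
    }

With a 2x2 tile the inner loop does 4 loads per 8 flops instead of 2 loads per 2 flops, and real kernels push that ratio much further with wider tiles of SIMD vectors.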

amelius
0 replies
4h56m

Shouldn't this have been done in a library instead of a specific project? Then others could also profit from it.

seangrogg
1 replies
11h32m

Mmm, I wonder how well this would work on a mobile device. Maybe I'll try grabbing my ubuntu touch here in a sec...

seangrogg
0 replies
7h1m

(For any who were curious: it does not for memory reasons)

kristianp
1 replies
12h35m

Nice to see such speedups for CPUs. Are these changes available as a branch or pull request in llama.cpp itself? I'd like to make use of them in that form if possible (as I'm used to using that).

dagaci
0 replies
10h16m

Yes, this is really a phenomenal effort! And what open source is about: bringing improvements to so many use cases, so that Intel and AMD chip users can start to see real performance by taking advantage of their hardware's high-performance capabilities, making even old parts competitive.

There are two PRs raised to merge to llama.cpp:

https://github.com/ggerganov/llama.cpp/pull/6414

https://github.com/ggerganov/llama.cpp/pull/6412

Hopefully these can be accepted, without drama, as there are many downstream dependencies on llama.cpp that will also benefit.

Though of course everyone should also look directly at releases from llamafile https://github.com/mozilla-Ocho/llamafile.

arendtio
1 replies
2h42m

Does someone else see llamafile using Wine on Linux?

Edit: After the download I did a simple chmod +x llava-v1.5-7b-q4.llamafile; ./llava-v1.5-7b-q4.llamafile

jart
0 replies
30m

There's a simple fix for that.

    sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
    sudo chmod +x /usr/bin/ape
    sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
    sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-fil...

yieldcrv
0 replies
1h59m

note, this is "goes faster on CPUs than before", not faster than GPUs.

wtallis
0 replies
12h25m

I know this post is focused specifically on CPU performance, but the section on the performance on the Mac Studio seems to be deliberately avoiding directly mentioning that machine's GPU, let alone benchmark against it. I think it would have been interesting to see a straightforward comparison of what compute performance and memory bandwidth (as measured by the prompt processing and token generation speeds, respectively) are achievable with reasonable optimization effort on the CPU vs GPU when they're attached to the same memory subsystem.

tubs
0 replies
4h22m

The RAM is not on the CPU die on a Mac. It's in the same package, but it's still regular (LP)DDR memory.

sublimefire
0 replies
6h4m

re:funding

My friend suggested nominating Justine for her open source contributions in an internal Microsoft programme (the winner gets $10k). They did not even want to add her to the potential list of nominees because her software is not used at MSFT. It speaks volumes about the corporate culture and shows what they really think about OSS support.

rbnsl
0 replies
10m

Definitely wild we're in the timeline where you can run a 1.1 bn param model on a Raspberry Pi, but it's still tough to justify because the 1.1 is kinda useless compared to the beefier models. Sick for home builds/hobbyists though. I might wanna get one of the new Pis just to try this out.

politelemon
0 replies
12h23m

This is great work. I've always thought it would be great if running LLMs could be commoditized for regular average-Joe hardware. I had thought that llamafile was like a Dockerfile for llama.cpp, but it looks like that's a misconception?

Will definitely be giving this a try.

pknerd
0 replies
10h0m

So, I can now run it on my 2015 Macbook with 8GB RAM?

pama
0 replies
15h21m

Super nice story on the matmul optimization that gave 810 gflops for 512x512. Thanks for the write up and the contributions to llama.cpp and the community more broadly.

moffkalast
0 replies
8h29m

the Raspberry Pi

Odd how there were no Mistral 7 benchmarks for the Pi 5 in that table (I doubt anyone is seriously considering using TinyLlama for anything at all), so I went to re-test it out myself on the Pi 5 8G.

llamafile 0.7: 52 predicted, 150 cached, 430ms per token, 2.32 tokens per second

llama.cpp + OpenBLAS: 36 predicted, 124 cached, 381ms per token, 2.62 tokens per second

It does seem to inch closer to the speed you get with blas acceleration which is quite impressive, but in practical terms the Pi 5 is so heavily limited by its memory throughput bottleneck that it saturates the required compute with 3 threads already. So while fancy kernels will make it more efficient it won't really save you from that fundamental bandwidth limit. The Pi foundation messed up going with a 32 bit memory bus, simple as.
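That bandwidth ceiling is easy to estimate. Assuming roughly 17 GB/s of usable bandwidth for the Pi 5's 32-bit LPDDR4X bus (my assumption, not a measured figure) and about 4 GB of weights for Mistral 7B at 4-bit quantization, token generation, which streams essentially the whole model once per token, tops out around 4 tokens per second, so the ~2.3-2.6 t/s above is already a decent fraction of what the memory system allows:

    // Back-of-envelope bandwidth ceiling for token generation (assumed inputs).
    #include <cstdio>

    int main() {
        const double bandwidth_gb_s = 17.0;  // assumed usable Pi 5 memory bandwidth
        const double model_gb       = 4.1;   // approx. Mistral 7B at 4-bit quantization
        // Each generated token reads roughly every weight once:
        std::printf("upper bound ~= %.1f tokens/s\n", bandwidth_gb_s / model_gb);
        return 0;
    }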

miki123211
0 replies
6h2m

If I'm reading the post correctly, Llamafile is faster than llama.cpp, despite the author upstreaming some of the changes. What's the reason for this?

m3kw9
0 replies
4h55m

So is Nvidia in trouble now because Intel can be used instead for faster/cheaper inference?

jongjong
0 replies
13h7m

That's interesting because I built a simple ANN library and I was playing around with GPU acceleration and came to a similar conclusion as this article.

To be fair, my ANN library was faster (up to 2x) with GPU acceleration in some scenarios where the ANN was shallow (as opposed to deep with many hidden layers). I thought the marginal gain may have been because, the way it's set up in my library, it has to load all the values into the GPU from RAM for each pass of forward and back propagation in each layer during training. I believe there is a way to allocate memory on the GPU chip itself but it's a lot more challenging to do, especially in a modular, fully portable way (which was one of the goals of my library).

But anyway, even the 2x best-case figure seemed disappointing. In my mind, I expected to see at least 10x speed improvement... And I was surprised that the CPU version was actually slightly faster in the scenario I was testing at the time which was a relatively deep network. It makes sense since the different layers cannot be parallelized as the input of one layer depends on the output of the previous layer... So the more layers you have, the more serial bottlenecks you have, the less you can benefit from GPU acceleration... And unfortunately, deep networks also happen to be those which tend to perform best for a lot of use cases.

isusmelj
0 replies
9h2m

Is there an overview somewhere of the progress we've made on the software side for training and inference of LLMs? It feels like we've squeezed 10-100x more out of the hardware since LLaMA appeared. This crazy progress will probably saturate though as we reach theoretical limits, no?

ein0p
0 replies
1h19m

As someone who has tried to beat MKL-DNN, and was unsuccessful at doing so even for constrained matrix sizes, I’m curious how they pulled off such a massive improvement.

But as someone who routinely estimates picojoules per flop at $DAY_JOB - there’s simply no way this is energy efficient. That is not even physically possible with a CPU.

aimonster2
0 replies
6h7m

Posted too early.

TimPC
0 replies
1h36m

Strange title. On my first read of the title I thought the author was arguing the model is now faster on CPU than GPU. It would be much nicer if they titled this something closer to "Performance Improvement for LLaMA on CPU".

6r17
0 replies
6h55m

Today being today, I must ask: has anyone actually tried this?