I think it's a good idea for everyone to download and be able to run an LLM locally, even if your hardware only meets the minimum requirements, as a pseudo-backup of a large chunk of human knowledge.
It would be good to see some independent verification of this claim. HN has previously [1] fallen for a claim by the same author to have reduced llama.cpp memory usage for a dense model way below the size of the model, which should have failed a basic smell test and indeed was debunked shortly after. Justine Tunney appears to enjoy extreme superstar status here, and it's hard to overstate the degree of social pressure that needed to be overcome at the time for the skeptic position to reach fixation (to begin with, what other LLM developments even hit upvote numbers like the +1300ish there or the +712 here at the time of writing?).
Justine Tunney appears to enjoy extreme superstar status here
This is true, and for sure pretty much all humans can benefit from increased skepticism (though not cynicism), but that superstar status is achieved from numerous impressive works. Cosmopolitan Libc and Actually Portable Executable were some of the things in the past that alone were worthy of significant respect, and for many people (like myself) these were our first introduction.
Speaking only for myself, I have a high opinion of Justine on technical merits. I'm sure she makes mistakes like all humans. I can tell she gets excited by discoveries and the chase, and that probably does sometimes cause premature celebration (this is something I struggle with so it's recognizable to me haha), but being wrong sometimes doesn't erase when you're right, and she has been spectacularly right a lot more times than most people I know.
There have been some personality clashes between Justine and others at times, and unfortunately it's situations where only part (sometimes a small part) of it was public, meaning we can only take people's word for what happened. Given my ignorance, I choose to withhold judgment here, but even if I didn't (and assumed she was guilty) it doesn't change the technical merits and it certainly wouldn't dissuade me from seeing what she's working on now.
So when I see stuff from Justine come out like this, it gets my attention. Would it get my attention if the same thing were posted by somebody whose name I don't recognize? Likely not, but I think that is (unfortunately) part of being a human. We aren't capable (yet!) of evaluating everything on technical merit alone because the sheer volume of material far exceeds our time. Therefore we use other (less reliable) signalling mechanisms as a way to quickly decide what is worthy of our time investment and what may not be. Reputation/name recognition is an imperfect indicator, but better than random chance.
I don't know, my first (and main) impression of them was actually in the context of the llama.cpp mmap story, as I was somewhat involved in the project back then, and there I thought their impact on the project was predominantly negative. While they introduced a mildly beneficial change (mmap-based model loading), the way in which this was done was not healthy for the project: the changes were rammed through with little regard for concerns that existed at the time about backwards compatibility and edge cases that might be broken by the half-baked patch, and Justine came across as self-aggrandizing (in the sense of "acting as if they ran the place", presenting their proposals as a plan that others must follow rather than suggestions) and overly eager to claim credit (epitomized by the injection of their own initials into the magic number file format identifier next to those of the project originator, and the story of the hapless other author of the mmap changeset who was at first given a token acknowledgement but then quickly sidelined). Arguments for the inclusion of the patch seemed to be won by a combination of half- and untruths like those about memory savings and the sudden participation of a large number of previously uninvolved sycophants. It is fortunate that Georgi handled the fallout as well as he did, and that he in fact had amassed the social capital necessary to survive his heavy-handed solution (soft-banning both JT and their most prominent detractor). A less successful project would probably have found itself captured or torn apart by the drama.
There is nothing wrong with holding people in esteem for their achievements, but in this case the degree of esteem really seems to be excessive. This is not a matter of simply being annoyed that people like "the wrong thing" - the mmap situation was significantly exacerbated by the presence of irrational/excessive supporters of Justine's as well as the irrational/excessive detractors that emerge wherever the former exist.
I would like to know more about the mmap situation, as what I saw on the surface could warrant some concern. Being somewhat involved, you would probably know better than I, as I was just an observer reading the thread after the fact. It seemed like the biggest accusation was the plagiarism (or "collaborating" but mostly taking somebody else's code).
Did anybody besides the two parties see the code develop, or does anybody else have knowledge of this? Or is it just his word vs. hers? Do you have any suggested reading to get more perspective other than just the github thread and HN thread? (really asking. these aren't rhetorical questions)
Reading the thread, I do think there are a lot of opportunities to read in confirmation bias. For example if I start reading that thread with the idea that Justine is coming in to hijack the project and make herself the hero that it needs and deserves, and to get her initials embedded in there as a permanent tribute to her own glory, I can see that. But if I read it as her coming in with cool work that she's excited about, and had to come up with a new format and couldn't think of a name (naming things can be really hard) and just stuck in one of the first things that came to mind (or even used as a placeholder prior to discussion), I can see that as well.
I absolutely don't want the truth covered up, but I also don't want to accept as true things that aren't true, especially where the implications are toward somebody's character. I'm a big "benefit of the doubt" kind of person.
My sense is that the part about credit/collaboration was actually somewhat overblown among the detractors. What roughly happened as far as I can remember is that JT and another person worked on mmap together with about equal contribution, though the other person might have been the one to have initiated the idea (and solicited help to push it through); then at some point JT decided to make a PR to the main repository in their own name, but crediting the other collaborator as a coauthor, which may or may not have been coordinated with the other person. After that, though, in a fairly characteristic fashion, JT started fielding adulatory questions from their fans (on Github, but also on HN, Twitter and possibly other media) about the change, and quickly switched to simply referring to it as their own, with no mention of the other contributor. The other contributor expressed some misgivings about having their contribution erased, which were picked up by a growing set of people who were generally resentful about JT's conduct in the project. As far as I can tell, when confronted about it, JT at no point explicitly denied what the other person did (and I think the commit logs should all still be there in the fork), but at some point the other person just decided to stop pushing the issue due to being uncomfortable with becoming a pawn in the fandom war between JT fans and antis.
My personal main gripe with JT really was the tone they adopted in the Github discussions, and the effect of the large numbers of drive-by supporters, who were often far less restrained in both unfounded claims about Justine's accomplishments and attacks on any critics. (At this point I'd also like to note that I consider some sibling comments to be uncomfortably hostile in a personal way, like the "hit piece" one.) I think that as a public persona, especially one who actively pursues publicity, you have some responsibility to restrain your followers - Justine, I get the sense, instead uses them as deniable proxies, as also seen with the instances where, instead of straight up putting their signature on the "RAM usage reduced to 6GB" claim, they chose to post a collage of screenshots of supporters making it.
This could all be true, but it's hard to evaluate these claims on their own. Not being involved in any way, all I can do is conclude that there is some friction in that community. It's possible that JT is toxic, it's possible that you are toxic, it's possible that neither of you is generally toxic but something about your personalities causes your interactions to become toxic, it's even possible that neither of you were toxic in any way but your impression of things after the fact is as-if Tunney had been toxic. Sometimes one has to stop and think about these things and figure out how to smooth things over, and sometimes it's not possible to smooth things over.
I didn't have any direct interactions with JT then or now - while it was hard to ignore the discussion as an onlooker, it did not touch upon any parts of the code that I was involved with. This seems to be one of the topics where everyone who is even tangentially involved is under a default suspicion of being biased in one direction or another.
This is true, and for sure pretty much all humans can benefit from increased skepticism (though not cynicism), but that superstar status is achieved from numerous impressive works.
It is achieved through a never-ending parade of self-aggrandizement.
What Justine is very good at is presenting trivial concepts from a world which few front end developers understand in a language that most front end developers understand.
I had the misfortune of having to find out about her because of how thoroughly she polluted the google search space for lisp with her implementation of sector lisp. For some reason google decided that sector lisp needed to be in the top 5 results for every query about `minimal lisp with quotation` even when quotation wasn't implemented in her version.
presenting trivial concepts from a world which few front end developers understand in a language that most front end developers understand
Completely ignoring the JT discussion, the argument that something is trivial in some area does not really hold. 1) Science is mostly "just" connecting the dots, and 2) landmark discoveries tend to look trivial in hindsight almost by definition, because they have to be straightforward enough to be widely adopted.
HN has previously [1] fallen for a claim by the same author to have reduced llama.cpp memory usage for a dense model way below the size of the model, which should have failed a basic smell test and indeed was debunked shortly after.
Where did Justine claim this? The link you provided is Justine saying that she doesn't have an explanation for the reduction in RAM and that readers shouldn't treat it as fact yet:
The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.
Was the link supposed to show the false claim or the debunking of the claim?
Plenty of claims about it, e.g. here as a "fact": https://github.com/ggerganov/llama.cpp/discussions/638#discu.... I don't think occasional expressions of lingering doubt (still couched among positive language like calling it a "miracle") can offset all the self-promotion that clearly seeks to maximise visibility of the implausible claim, even as it is attributed to others, as for example in https://twitter.com/JustineTunney/status/1641881145104297985... . A cereal manufacturer would probably be held responsible for package text like "Fruity Loops cured my cancer! - John, 52, Kalamazoo" too.
I don't read that as a claim of fact at all. From the link you shared:
Now, since my change is so new, it's possible my theory is wrong and this is just a bug. I don't actually understand the inner workings of LLaMA 30B well enough to know why it's sparse.
I haven't followed her work closely, but based on the links you shared, she sounds like she's doing the opposite of self-promotion and making outrageous claims. She's sharing the fact that she's observed an improvement while also disclosing her doubts that it could be experimental error. That's how open-source development is supposed to work.
So, currently, I have seen several extreme claims of Justine that turned out to be true (cosmopolitan libc, ape, llamafile all work as advertised), so I have a higher regard for Justine than the average developer.
You've claimed that Justine makes unwarranted claims, but the evidence you've shared doesn't support that accusation, so I have a lower regard for your claims than the average HN user.
The very opening line says
I'm glad you're happy with the fact that LLaMA 30B (a 20gb file) can be evaluated with only 4gb of memory usage!
The line you quoted occurs in a context where it is also implied that the low memory usage is a fact, and there might only be a bug insofar as that the model is being evaluated incorrectly. This is what is entailed by the assertion that it "is" sparse: that is, a big fraction of the parameters are not actually required to perform inference on the model.
I think you are making a lot of soup from very little meat. I read those links the same way mtlynch read them. I think you're looking for a perfection of phrasing that is much more suited to peer-reviewed academic papers than random tweets and GitHub comments taken from the middle of exploring something. Seeing your initial comment and knowing little about the situation, I was entirely prepared to share your skepticism. But at this point I'm much more skeptical of you.
Where's the 30B-in-6GB claim? Ctrl-F'ing "GB" in your GH link finds [0], which is neither by jart nor by ggerganov but by another user, who promptly gets told to look at [1], where Justine denies that claim.
[0] https://github.com/antimatter15/alpaca.cpp/issues/182
[1] https://news.ycombinator.com/item?id=35400066
These all postdate the discussions that I linked (from March 31st). By April 1st JT themselves seems to have stopped making/boosting the claim about low memory usage.
What's the point of your comment if you're not going to do the work yourself? If you don't have something nice to say then don't say it.
The "hey this may or may not be true so someone go figure it out" is lazy, self-gratifying and pointless.
I think it's very helpful for someone to point out that the source has been shown to be unreliable before, and we should wait for more verification from others knowledgeable in the space.
Agreed. I think there's a blurry gray line between pointing out a potentially unreliable source and a lazy dismissal, but if there's reasonable doubt I think it's good for HN. If the doubt isn't reasonable, it will be torn apart by other commenters, and then it's an explicit discussion that people can read and decide on.
It's really popular online. I think that's because many people here read a lot of this content but don't actually have the skill or background to do analysis. So they give us history rather than examination. Which has some value, I suppose.
This comment reads like real scientific skepticism, but from my recollection of events, it is more of a hit piece that takes what should be a technical discussion and drags in a bunch of personal baggage. In particular:
HN has previously fallen for a claim by the same author to have reduced llama.cpp memory usage for a dense model way below the size of the model,
is not true at all. Someone else made the claims about 6GB RAM usage for a 30B model, I remember reading it at the time and thinking "Yeah, that doesn't make sense, but the loading time improvement is immense!" And it was - I run all my LLMs locally on CPU because I don't have dedicated hardware, and jart's work has improved usability a lot.
and it's hard to overstate the degree of social pressure that needed to be overcome at the time for the skeptic position to reach fixation
I was reading the same HN discussions you were at the time, and it was pretty trivial to see that the loading time claim held up, and the RAM claim was dubious and likely simply due to not understanding some effect of the change completely. Heck, jart's own discussion of the topic reflected this at the time.
For the current change, I feel like your comment is even more misplaced. The blog post linked to for this story has a huge amount of detail about performance on specific processors (Skylake, Alderlake, RPi5/4, M2 Ultra, and 7995WX) with specific models. So when you say:
It would be good to see some independent verification of this claim.
What I hear is "4bpp thinks there's a real risk the numbers in the linked post are fabricated, and jart is just trying to get attention."
And that doesn't seem reasonable at all, given the history of her work and the evidence in front of us.
I distinctly remember most of the people in the comments misunderstanding kernel memory paging or learning about it for the first time.
It genuinely did make llama.cpp a lot more usable at the time.
The loading time improvements largely held up, and on balance the mmap contribution was ultimately good (though the way it was implemented was really quite problematic, as a matter of process and communication). However, as I point out in https://news.ycombinator.com/item?id=39894542, JT quite unambiguously did try to cash in on the "low memory usage" claim - uncritically reprinting positive claims by others about your own work that otherwise would have been largely invisible should really not be treated differently from making those claims yourself.
I do think that there is a real risk that the numbers are wrong (not necessarily "fabricated", as this implies malfeasance, but possibly based on an erroneous measurement insufficiently questioned due to an excess of trust from themselves and others, as the mmap ones were). This is also in part based on the circumstance that at the time (of the mmap story, and myself being more involved in the project) I was actually involved in trying to optimise the SIMD linear algebra code, and unless llama.cpp has since switched to a significantly less performant implementation the proposition that so much more performance could be squeezed out strikes me as quite surprising. Here, your intuitions may say that Justine Tunney is just so brilliant that they make the seemingly impossible possible; but it was exactly this attitude that at the time made it so hard to evaluate the mmap memory usage claims rationally and turned the discussion around it much more dysfunctional than it had to be.
All the core llama.cpp devs are superstar devs and 10x devs or whatever you want to call a super smart person who is also super productive and very good with applied calculus. Jart is quite apparently very smart, but their relationship with this project was not without turbulence, and at present they (jart) are not a core dev of llama.cpp. So for a while lots of their (I'd like to write "her", but I'm not sure if that's correct) actions seemed to be aimed at getting attention, and perhaps particularly the attention of the same folk.
On the contrary, ggerganov, slaren, and JohannesGaessler seem to have never chased this sensationalist superstar status, but actually let their work speak for them. You'll barely find comments by these people on HN, while jart every so often figures out a way to manifest themselves on HN. And this behaviour on jart's part now bears fruit - for example, Phoronix' Michael Larabel would praise jart for their work on llamafile, completely glossing over the fact that it is largely based on the wonderful work of ggerganov et al.
When they claimed to drastically improve memory utilization through the use of memory maps, despite not doing so and then starting a huge controversy which derailed the project I would say they were a 0.1x dev not a 10x dev.
and indeed was debunked shortly after
I was also surprised that she continues to mention the mmap thing in a positive light even after the facts about the claim were settled to the contrary, even disregarding the whole attribution fiasco.
You can simply check the pull request on llama.cpp on Github. JohannesGaessler (a core maintainer) has already run the code and says it's an impressive speed-up. There isn't a thorough review by any of the core maintainers yet, but this is very likely exactly what Justine says it is: various significant and insignificant speedups.
Regarding this bit at the end:
I learned how to write math kernels by renting Vast VMs and watching Gautham Venkatasubramanian and mrdomino develop CUDA kernels in a tmux session. They've been focusing on solving a much more important challenge for llamafile, which is helping it not have a mandatory dependency on the cuBLAS
If I'm reading this right, they're trying to rewrite cuBLAS within CUDA itself. I'm guessing the next step would be removing CUDA dependency and go with directly using Vulkan or Metal compute shaders. Am I correct?
Yes, but none of these have performance portability across GPU vendors, so it's probably seen as pointless. You would need an AMD Vulkan shader, an nvidia one, and intel one, etc. It's not like C code on CPUs.
To me it makes sense to have an interface that can be implemented individually for AMD, Metal, etc. Then, leave it up to the individual manufacturers to implement those interfaces.
I'm sitting in an office with a massive number of Macbook Pro Max laptops usually sitting idle and I wish Apple would realize the final coup they could achieve if I could also run the typically-NVIDIA workloads on these hefty, yet underutilized, Mx machines.
Apple could unlock so much compute if they give customers a sort of “Apple@Home” deal. Allow Apple to run distributed AI workloads on your mostly idle extremely overpowered Word/Excel/VSCode machine, and you get compensation dropped straight into your Apple account’s linked creditcard.
BTW, at our day job, we've been running a "cluster" of M1 Pro Max machines running Ollama and LLMs. Corporate rules prevent remote access to machines, so we created a quick and dirty pull system where individual developers can start pulling from a central queue, running LLM workloads via the Ollama local service, and contributing things back centrally.
Sounds kludgy, but introduce enough constraints and you end up with this as the best solution.
Do you have price-performance numbers you can share on that? Like compared against local or cloud machines with RTX and A100 GPUs?
If Apple were doing an Apple@Home kind of deal they might actually want to give away some machines for free or super cheap (I realize that doesn't fit their brand) and then get the rights perpetually to run compute on them. Kind of like advertising but it might be doing something actually helpful for someone else.
Depending on how many individual tweaks are necessary for hardware variants of course... but at this level of code & complexity it actually seems pretty reasonable to write 3 or 4 versions of things for different vendors. More work yes, but not pointless.
A nice example of this is fftw which has hundreds (if not thousands) of generated methods to do the fft math. The whole project is a code generator.
After compilation it can then benchmark these, generate a wisdom file for the hardware, and pick the right implementation.
Compared with that "a few" implementations of the core math kernel seem like an easy thing to do.
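For the curious, here's a rough sketch of what that benchmarking/wisdom mechanism looks like from the caller's side, using FFTW's C API from C++ (the file name is just an example):

    #include <fftw3.h>

    int main() {
        const int n = 1 << 20;
        double* in = fftw_alloc_real(n);
        fftw_complex* out = fftw_alloc_complex(n / 2 + 1);
        for (int i = 0; i < n; ++i) in[i] = 0.0;  // fill with real input data

        // Reuse timings learned in earlier runs, if any.
        fftw_import_wisdom_from_filename("fftw-wisdom.dat");

        // FFTW_MEASURE times candidate codelets on this machine and picks the
        // fastest plan, instead of just guessing (which is what FFTW_ESTIMATE does).
        fftw_plan plan = fftw_plan_dft_r2c_1d(n, in, out, FFTW_MEASURE);
        fftw_execute(plan);

        // Persist what was learned so future runs can skip the benchmarking.
        fftw_export_wisdom_to_filename("fftw-wisdom.dat");

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
    }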
Not exactly comparable, as you said, the FFTW implementations are auto-generated but it doesn't sound like these few implementations will be.
ATLAS was an automatically tuned BLAS, but it’s been mostly supplanted by ones using the hand-tuned kernel strategy.
Maybe it's a dumb question, but isn't something like OpenCL meant to solve this problem?
From my understanding, using triangles/shaders to do HPC has given way to a more general-purpose GPU programming paradigm, which is CUDA.
Of course this knowledge is superficial and probably outdated, but if I'm not too far off base, it's probably more work to translate a general CUDA-like layer or CUDA libs to OpenCL.
llama.cpp (or rather G. Gerganov et al.) is trying to avoid cuBLAS entirely, using its own kernels. Not sure how jart's effort relates, and whether jart intends to upstream these into llama.cpp, which seems to still be the underlying tech behind llamafile.
Here are links to the most recent pull requests sent:
https://github.com/ggerganov/llama.cpp/pull/6414
https://github.com/ggerganov/llama.cpp/pull/6412
This doesn't relate to GPU kernels unfortunately.
Question is, how much of an improvement has it gotten to over a GPU or ASIC?
Nothing in software will ever beat an equivalent ASIC.
Most ASICs are cost or power optimizations.
Exactly. They’re much faster for their specific tasks and thus are more power efficient and potentially cost efficient
No. E.g., of the hardware discussed in the article, the Raspberry Pi uses an ASIC that's slow, cheap, and low power vs the Intel or AMD chips.
In some cases ASICs are faster than general purpose CPUs, but usually not.
Is the LLM running on an ASIC for the Pi here? I doubt it.
Sure there is. Software is easy to change.
By “beat” I meant in performance.
Obviously you can’t change an asic
An ASIC is fixed function, so it'll never be able to boot my PC and then be the CPU, even though an ASIC beats the pants off anything else computing SHA hashes for Bitcoin mining.
By “beat” I meant performance.
Obviously an ASIC is not a general purpose machine like a cpu.
So... I was struggling with this for a while. I would say anywhere from 2x to an order of magnitude faster with a GPU. (I've been looking at a lot of GPU benchmarks lately, and they are REALLY hard to compare since they are all so specific)
I do think long term there gets to be more hope for CPUs here with inference, largely because memory bandwidth becomes more important than raw GPU compute. You can see this with reports of the MI300 series outperforming the H100, largely because it has more memory bandwidth. MCR DIMMs give you close to 2x the existing memory bandwidth in Intel CPUs, and when coupled with AMX you may be able to exceed V100 and might touch A100 performance levels.
HBM and the general GPU architecture gives it a huge memory advantage, especially with the chip to chip interface. Even adding HBM to a CPU, you are likely to find the CPU is unable to use the memory bw effectively unless it was specifically designed to use it. Then you'd still likely have limited performance with things like UPI being a really ugly bottleneck between CPUs.
If someone releases DDR5 or DDR6 based PIM, then most of the memory bandwidth advantage of GPUs evaporates overnight. I expect CPUs to be king at inference in the future.
But then you'll get GDDR6 delivered via HBM5 or whatever. I don't think CPUs will ever really keep up with the memory bandwidth, because for most applications it doesn't matter.
MCR DIMM is like 1/2 the memory bandwidth that is possible with HBM4, plus it requires you to buy something like 2TB of memory. It might get there, but I'd keep my money on hbm and gpus.
I think that should be phrased more like "what fraction of GPU speed can this reach?", because it'll always be less than 1x.
I think I understand what you are thinking. You may be fixing "than other ways of running them" to the end of the title, but it's actually "than it was on CPU before now".
From the article, passage about the 14900k:
For example, when I run my spam.sh shell script, it only takes 420 milliseconds, which is 7x faster than my Raspberry Pi 5. That's right, when it comes to small workloads, this chip is able to finish before CUDA even gets started.
So… it depends :)
It's fascinating to me that, coming up on a year since Sapphire Rapids became available in the public cloud, developers are still targeting AVX-512 when they should be targeting VNNI and AMX.
This project in particular seems to care about the long tail of hardware; note that the very first machine in this post is a box from 2020 with spinning rust disk. Granted, adding support for newer extensions is likely also good, but cost/benefit is in play.
Is four years really 'long tail' these days? Our VM host box is from 2010 (and I had to rebuild llama.cpp locally without AVX to get it working :P )
For cutting-edge LLM work, probably? I mean, I run mine on older hardware than that, but I'm a total hobbyist...
For LLMs...yeah. I imagine you're measuring in tokens/minute with that setup. So it's possible, but...do you use it much? :)
It should be noted that while the HP Prodesk was released in 2020, the CPU’s Skylake architecture was designed in 2014. Architecture is a significant factor in this style of engineering gymnastics to squeeze the most out of silicon.
I don't believe that is the target for a local LLM... Pretty sure we're talking about client-side computing, of which the newest supports only AVX-512 (and even that sketchily on Intel's side).
Just buy a new AMD processor that supports AVX512.
People with Sapphire Rapids options are not the target audience of these patches
I'd pay good money to watch jart in conversation with Carmack
Carmack is great but completely irrelevant here. He missed the entire AI/LLM/ML boat to help Zuckerberg hawk virtual reality fantasies for years.
Completely irrelevant is probably overstating it. He's been working on AI for the last 4+ years.
He literally squandered the last 10 years of his life working on absolutely nothing for Zuckerberg. And only after the rest of the world innovated on AI (transformers, etc) did he clearly feel embarrassed and had to proclaim he's going to focus on AGI in a "one-up" way.
He literally squandered the last 10 years of his life working on absolutely nothing
Speak for yourself, the Oculus Quest is the coolest piece of sub-$500 tech in my home.
He got paid a lot to do something he was presumably passionate about and enjoyed. It also might surprise you to find out that there's quite a lot of people that just work as a means to an end, and find value and enjoyment primarily from other parts of their life.
He's striving for AGI though, right? So he's not really working on anything because he certainly hasn't discovered AGI.
Altman isn't even relevant here. He is focusing on LLMs instead of a framework that gets us to AGI. He can't describe how we get there or any such theories around AGI. It's a complete failure.
There is an implication here that the Fortran implementation of `SGEMM` is somehow inadequate. But any modern Fortran compiler will quite easily apply the AVX and FMA optimizations presented here without any additional changes. Both GNU and Intel make these substitutions with the correct flags.
The unrolling optimization is also just another flag away (`-funroll-all-loops`). The Intel Compiler will even do this without prompting. In fact, it appears to only do a modest 2x unroll on my machine, suggesting that the extreme unroll in this article would have been overkill.
Parallelization is certainly a lot to ask of Fortran 77 source, but there is little stopping you from adding OpenMP statements to the `SGEMM` function. In fact, modern Fortran even offers its own parallelization constructs if you're willing to go there.
Which is to say: Let's not belittle this old Fortran 77 function. Yes it is old, and does not even resemble modern Fortran. But the whole point of Fortran is to free the developer from these platform-specific details, and hand the job off to the compiler. If you don't like that approach, then you're welcome to go to C or C++. But this little block of Fortran code is already capable of doing just about everything in this article.
The Fortran implementation is just a reference implementation. The goal of reference BLAS [0] is to provide relatively simple and easy to understand implementations which demonstrate the interface and are intended to give correct results to test against. Perhaps an exceptional Fortran compiler which doesn't yet exist could generate code which rivals hand (or automatically) tuned optimized BLAS libraries like OpenBLAS [1], MKL [2], ATLAS [3], and those based on BLIS [4], but in practice this is not observed.
Justine observed that the threading model for LLaMA makes it impractical to integrate one of these optimized BLAS libraries, so she wrote her own hand-tuned implementations following the same principles they use.
[0] https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...
[1] https://github.com/OpenMathLib/OpenBLAS
[2] https://www.intel.com/content/www/us/en/developer/tools/onea...
[3] https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Alg...
[4] https://github.com/flame/blis
Fair enough, this is not meant to be some endorsement of the standard Fortran BLAS implementations over the optimized versions cited above. Only that the mainstream compilers cited above appear capable of applying these optimizations to the standard BLAS Fortran without any additional effort.
I am basing these comments on quick inspection of the assembly output. Timings would be equally interesting to compare at each stage, but I'm only willing to go so far for a Hacker News comment. So all I will say is perhaps let's keep an open mind about the capability of simple Fortran code.
Check out The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí. Chapter 5 walks through how to write an optimized GEMM. It involves clever use of block multiplication, choosing block sizes for optimal cache behavior for specific chips. Modern compilers just aren't able to do such things. I've spent a little time debugging things in scipy.linalg by swapping out OpenBLAS with reference BLAS and have found the slowdown from using reference BLAS is typically at least an order of magnitude.
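To make the blocking idea concrete, here is a minimal, untuned sketch in C++. The 64x64 block size is only an illustrative guess; real libraries pick block sizes per cache level and microarchitecture, and layer register tiling and SIMD on top of this:

    #include <algorithm>
    #include <cstddef>

    // C (n x n) += A (n x n) * B (n x n), row-major, processed in tiles so that
    // the working set of each tile of A, B, and C stays resident in cache.
    void gemm_blocked(const float* A, const float* B, float* C,
                      std::size_t n, std::size_t block = 64) {
        for (std::size_t ii = 0; ii < n; ii += block)
            for (std::size_t kk = 0; kk < n; kk += block)
                for (std::size_t jj = 0; jj < n; jj += block)
                    for (std::size_t i = ii; i < std::min(ii + block, n); ++i)
                        for (std::size_t k = kk; k < std::min(kk + block, n); ++k) {
                            const float a = A[i * n + k];  // reused across the j loop
                            for (std::size_t j = jj; j < std::min(jj + block, n); ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }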
Modern Fortran's only parallel feature is coarrays, which operate at the whole program level.
DO CONCURRENT is a serial construct with an unspecified order of iterations, not a parallel construct. A DO CONCURRENT loop imposes requirements that allow an arbitrary order of iterations but which are not sufficient for safe parallelization.
How do you feel about Nvidia endorsing do concurrent migration to GPUs? Would that be classified as parallelization?
Using AVX/FMA and unrolling loops does extremely little in the way of compiling to fast (>80% of peak) GEMM code. These are very much intro steps that don't take into account many important ideas related to cache hierarchy, uop interactions, and even instruction decode time. The Fortran implementation is entirely and unquestionably inadequate for real high-performance GEMMs.
I don't disagree, but where are those techniques presented in the article? It seems like she exploits the particular shape of her matrix to align better with cache. No BLAS library is going to figure that out.
I am not trying to say that a simple 50+ year old matrix solver is somehow competitive with existing BLAS libraries. But I disagreed with its portrayal in the article, which associated the block with NumPy performance. Give that to a 2024 Fortran compiler, and it's going to get enough right to produce reasonable machine code.
Pixar uses CPUs …
I wonder if we’ll end up in a situation like rendered movies.
Where the big studios like Pixar use CPUs (not GPUs) to render their movies due to the cost/perf (and access to larger amounts of RAM).
Where the big studios like Pixar use CPUs (not GPUs) to render their movies due to the cost/perf (and access to larger amounts of RAM).
I wonder if (or when) this will change once integrated GPUs become "mainstream", the CPU/GPU share the same RAM AFAIK.
I expect GPU hardware to specialize like Google’s TPU. The TPU feels like ARM in these AI workloads where when you start to run these at scale, you’ll care about the cost perf tradeoff for most usecases.
CPU/GPU share the same RAM AFAIK.
This depends on the GPU. I believe Apple has integrated memory, but most GPUs, from my limited experience writing kernels, have their own memory. CUDA pretty heavily has a device memory vs host memory abstraction.
On top of that, Nvidia has provided a unified addressing abstraction over PCI for a looooong time via CUDA: https://developer.nvidia.com/blog/unified-memory-in-cuda-6/
Customers like Pixar could probably push this even further, with a more recent Nvidia rack and Mellanox networking. Networking a couple Mac Studios over Thunderbolt doesn't have a hope of competing, at that scale.
I'm not sure how true that is anymore, from the outside it seems they're at least moving to a CPU/GPU hybrid (which makes a lot of sense), at least judging by new features landing in RenderMan that continues to add more support for GPUs (like XPU).
Isn't this more a function of RenderMan being a product that's sold?
And it's expected to at least support GPUs.
Hard to know without getting information from people at Pixar really.
Not sure how much sense it would make for Pixar to spend a lot of engineering hours for things they wouldn't touch in their own rendering pipeline. As far as I know, most of the feature development comes from their own rendering requirements rather than from outside customers.
You don't need a large computer to run a large language model
While running tiny llama does indeed count as running a language model, I’m skeptical that the capabilities of doing so match what most people would consider a baseline requirement to be useful.
Running a 10-parameter model is also "technically" running an LM, and I can do it by hand with a piece of paper.
That doesn’t mean “you don’t need a computer to run an LM”…
I’m not sure where LM becomes LLM, but… I personally think it’s more about capability than parameter count.
I don’t realllly believe you can do a lot of useful LLM work on a pi
Tinyllama isn't going to be doing what ChatGPT does, but it still beats the pants off what we had for completion or sentiment analysis 5 years ago. And now a Pi can run it decently fast.
You can fine-tune a 60M-parameter (e.g. DistilBERT) discriminative (not generative) language model and it's one or two orders of magnitude more efficient for classification tasks like sentiment analysis, and probably similar if not more accurate.
Yup, I'm not saying TinyLlama is minimal, efficient, etc. (indeed, that is just saying that you can take models even smaller). And for a whole lot of what we throw LLMs at, they're not the right tool for the job, but it's expedient and surprisingly works.
Some newer models trained more recently have been repeatedly shown to have comparable performance as larger models. And the Mixture of Experts architecture makes it possible to train large models that know how to selectively activate only the parts that are relevant for the current context, which drastically reduces compute demand. Smaller models can also level the playing field by being faster to process content retrieved by RAG. Via the same mechanism, they could also access larger, more powerful models for tasks that exceed their capabilities.
I've gotten some useful stuff out of 7B param LLMs, and that should fit on a Pi quantized.
From the example: "--temp 0 turns off the random number generator (we don't want improvisation for a spam filter)"
I've been thinking for a while about how many applications of LLMs need this adjustment and aren't getting it
Is that what it does, though?
I thought setting temperature to 0 would (extremely simple example) equate to a spam filter seeing:
- this is a spam email
But if the sender adapts and says
- th1s is a spam email
It wouldn't be flagged as spam.
My understanding is that temperature applies to the output side and allows for some randomness in the next predicted token. Here Justine has constrained the machine to start with either "yes" or "no" and to predict only one token. This makes the issue stark: leaving a non-zero temperature here would just add a chance of flipping a boolean.
It's more nuanced than that, in practice: this is true for the shims you see from API providers (ex. OpenAI, Anthropic, Mistral).
With llama.cpp, it's actually not a great idea to have temperature purely at 0: in practice, especially with smaller models, this leads to pure repeating or nonsense.
I can't remember where I picked this up, but, a few years back, without _some_ randomness, the next likely token was always the last token.
The output of an autoregressive model is a probability for each token to appear next after the input sequence. Computing these is strictly deterministic from the prior context and the model's weights.
Based on that probability distribution, a variety of text generation strategies are possible. The simplest (greedy decoding) is picking the token with the highest probability. To allow creativity, a random number generator is used to choose among the possible outputs, biased by the probabilities of course.
Temperature scales the output probabilities. As temperature increases, the probabilities approach 1/dictionary size, and the output becomes completely random. For very small temperature values, text generation approaches greedy sampling.
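As a minimal sketch of that sampling step (illustrative only, not llama.cpp's actual sampler):

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    // Pick the next token from raw logits. temperature <= 0 degenerates into
    // greedy decoding (always the most probable token); larger temperatures
    // flatten the distribution toward uniform randomness.
    std::size_t sample_token(const std::vector<float>& logits, float temperature,
                             std::mt19937& rng) {
        if (temperature <= 0.0f)
            return std::max_element(logits.begin(), logits.end()) - logits.begin();

        // Softmax over temperature-scaled logits (subtract the max for stability).
        const float max_logit = *std::max_element(logits.begin(), logits.end());
        std::vector<double> probs(logits.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < logits.size(); ++i) {
            probs[i] = std::exp((logits[i] - max_logit) / temperature);
            sum += probs[i];
        }
        for (double& p : probs) p /= sum;

        // Draw one token according to the resulting distribution.
        std::discrete_distribution<std::size_t> dist(probs.begin(), probs.end());
        return dist(rng);
    }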
If all you want is a spam filter, better replace the output layer of an LLM with one with just two outputs, and finetune that on a public collection of spam mails and some "ham" from your inbox.
I couldn't disagree more, turning temp to zero is like taking a monte carlo method and only using one sample, or a particle filter with only one particle. Takes the entire concept and throws it out of the window so you can have predictability.
LLMs need to probabilistically explore the generation domain to converge on a good result for best performance. Similar issue with people benchmarking models by only having them output one single token (e.g. yes or no) outright, which prevents any real computation from occurring so the results are predictably poor.
Has Justine written anywhere about her disassembly setup?
I configured Emacs so I can push a button, and the disassembly for the C++ code I'm working on will pop up on the screen in a few milliseconds.
I assume it's something project specific rather than being able to get the disassembly for an arbitrary section of code or something?
It seems very handy, so I'd love to see the implementation (I couldn't find anything googling)
This is probably what they are referring to https://github.com/jart/disaster
Thanks! I need to get better at googling I guess.
Nice. I have been using rmsbolt for a similar feature, but it is very rough. I'll need to give this a try.
"As for disk speed, dd if=/dev/zero of=/tmp/output bs=128k count=50k; rm -f /tmp/output reports 1.6 GB/s which is 3.6x slower than my Mac Studio, and 3x slower than my Intel (which has the same M.2 stick). I'm told that Intel and Apple are just better at this, but I wish I understood why. "
Can anyone here answer why this is?
Plus she isn't using oflag=direct, so since the output file is small it isn't even making it to disk. I think it would only be sent to the page cache. I'm afraid she is testing CPU and memory (bus) speeds here.
oflag=direct will write direct and bypass page cache.
Exactly. Something is very fishy if this system only writes 1.6 GB/s to the page cache. Probably that dd command line quoted in the article is incomplete.
Apple made fsync a no-op.
You have to make a different call to get a real sync on macOS.
So tons of stuff is faster because it's not actually writing to disk.
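For reference, a small sketch of that distinction. The macOS fsync(2) man page documents that fsync() pushes data to the drive but the drive may keep it in its own cache, and that fcntl(F_FULLFSYNC) is the call that asks for a flush all the way to durable media (the file path and error handling below are just illustrative):

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    // Flush a file descriptor as durably as the platform allows.
    bool durable_flush(int fd) {
    #ifdef F_FULLFSYNC
        // macOS: request a flush through the drive's write cache.
        if (fcntl(fd, F_FULLFSYNC) == 0) return true;
        // Some filesystems don't support it; fall back to plain fsync().
    #endif
        return fsync(fd) == 0;
    }

    int main() {
        int fd = open("/tmp/output", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        const char buf[] = "hello\n";
        write(fd, buf, sizeof buf - 1);
        if (!durable_flush(fd)) perror("flush");
        close(fd);
    }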
A way of thinking about what's inside any of the top LLMs right now: even if they never learn another single fact, even if they get ridiculously out of date as a result, even if they are even more riddled with errors and prone to biases than we know them to be, even if they are as prone to hallucinations as we know they are and they never develop the capacity to cure themselves of this, they are more knowledgeable and capable of more reasoned response, despite their capacity for error, to more questions than any single human being that has ever lived.
If you ignore my capacity for error, I bet I'd put up a good score too. Hell, maybe Markov chains are smarter than LLMs by this definition.
We shouldn't choose LLMs for how many facts they support, but their capability to process human language. There is some overlap between these two though, but an LLM that just doesn't know something can always be augmented with RAG capabilities.
Picturing "LLM Jeopardy". You know, a game show.
I like to define my subroutines using a modern language like C++, which goes 47 gigaflops. This means C++ is three orders of a magnitude faster than Python. That's twenty years of progress per Moore's law.
This is great. I love the idea of measuring performance differences in “years of Moore’s law.”
Twenty years puts the delta in an easy to understand framework.
I doubt that you'd get Python to run faster than C++ on 2004 hardware.
Python on 2024 hardware vs C++ on 2004 hardware ... I don't think it's obvious that C++ always wins here, though it would depend on the use case, how much of the Python is underpinned by native libraries, and the specific hardware in question.
If we allow native libraries, it's not clear that C++ would win, even on modern hardware.
Regarding AMD Zen 4 with AVX-512:
"Here we see that, despite only being twice the price, the 7995WX x86 ISA offers 7x more raw compute power than the M2 Ultra ARM ISA, and nearly the same token generation speed, which is likely thanks to its 384mb L3 cache. When I bought this chip, I had to expand support in llama.cpp for bfloat16 and AVX512 before I could fully test its capabilities. My work means you can now run LLaMA 2.8x faster on Zen4 than you could before."
Does this also count platform costs or just chip cost? I'd imagine the threadripper motherboard and ram costs aren't insignificant
A complete desktop computer with the M2 Ultra w/64GB of RAM and 1TB of SSD is $4k.
The 7995WX processor alone is $10k, the motherboard is one grand, the RAM is another $300. So you're up to $11300, and you still don't have a PSU, case, SSD, GPU....or heatsink that can handle the 300W TDP of the threadripper processor; you're probably looking at a very large AIO radiator to keep it cool enough to get its quoted performance. So you're probably up past $12k, 3x the price of the Studio...more like $14k if you want to have a GPU of similar capability to the M2 Ultra.
Just the usual "aPPle cOMpuTeRs aRE EXpeNsIVE!" nonsense.
So from a CPU perspective you get 7x the CPU throughput for 3x to 4x the price, plus upgradable RAM that is massively cheaper. The M2 uses the GPU for LLMs though, and there it sits in a weird spot where 64GB of (slower) RAM plus midrange GPU performance is not something that exists in the PC space. The closest thing would probably be a (faster) 48GB Quadro RTX which is in the $5000 ballpark. For other use cases where VRAM is not such a limiting factor, the comparably priced PC will blow the Mac out of the water, especially when it comes to GPU performance. The only reason we do not have cheap 96GB GDDR GPUs is that it would cannibalize NVIDIA/AMDs high margin segment. If this was something that affected Apple, they would act the same.
Is it easy to find where the matvecs are, in LLaMA (if you are someone who is curious and wants to poke around at the “engine” without understanding the “transmission,” so to speak)? I was hoping to mess around with this for Stable Diffusion, but it seemed like they were buried under quite a few layers of indirection. Which is entirely reasonable, the goal is to ship software, not satisfy people who’d just want to poke things and see what happens, haha.
Did you see tinygrad can run LLaMA and Stable Diffusion? It's an intentionally extremely simple framework vs PyTorch or even micrograd, which helped me dig into the underlying math. Though https://spreadsheets-are-all-you-need.ai/ is a good one for learning LLMs.
I haven’t seen that. I’ll definitely have to take a look, thanks!
Great links, especially last one referencing the Goto paper:
https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/...
> I believe the trick with CPU math kernels is exploiting instruction level parallelism with fewer memory references
It's the collection of tricks to minimize all sort of cache misses (L1, L2, TLB, page miss etc), improve register reuse, leverage SIMD instructions, transpose one of the matrices if it provides better spatial locality, etc.
The trick is indeed to somehow imagine how the CPU works with the Lx caches and keep as much info in them as possible. So it's not only about exploiting fancy instructions, but also thinking in engineering terms. Most software written in higher-level langs cannot effectively use L1/L2 and thus ends up with this constant slowdown of algos that have otherwise similar complexity (from an asymptotic analysis perspective).
Multithreading support in llama.cpp is probably still pretty busted, assuming it uses the same underlying NN inference code as whisper.cpp: https://github.com/ggerganov/whisper.cpp/issues/200#issuecom...
From what I have heard they use manual spin locks. Generally, spin locks are not a good idea unless you want to dedicate the entire machine to a single application. If the thread a spinlock waits on gets suspended, you're burning CPU time for nothing. The OS thinks a spinlock making zero progress is actually a high-priority process, so it ends up starving the suspended thread and keeping it from making progress.
Yeah the code looks like a spinlock. It behaves terribly under contention, resulting in performance falling off a cliff as the number of threads increases. Adding more threads actually slows down the total performance.
I would fix it if I could be bothered. Instead I will just use the Cuda whisper backend which is pretty nice and fast.
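To illustrate the difference being described (a generic sketch, not the actual ggml/whisper.cpp code): a busy-wait keeps every waiting thread runnable and burning a core, so once threads outnumber cores the spinners get time-sliced against the thread doing real work, whereas a condition variable lets the OS park the waiters.

    #include <atomic>
    #include <condition_variable>
    #include <mutex>

    // Busy-wait version: waiters stay runnable and burn CPU until the count is hit.
    struct SpinWaiter {
        std::atomic<int> ready{0};
        void arrive() { ready.fetch_add(1, std::memory_order_release); }
        void wait_for(int n) {
            while (ready.load(std::memory_order_acquire) < n) { /* spin */ }
        }
    };

    // Blocking version: the OS parks waiting threads until they are notified.
    struct SleepingWaiter {
        std::mutex m;
        std::condition_variable cv;
        int ready = 0;
        void arrive() {
            { std::lock_guard<std::mutex> lk(m); ++ready; }
            cv.notify_all();
        }
        void wait_for(int n) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return ready >= n; });
        }
    };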
TL;DR: unroll the outer two loops of matrix multiplication
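Roughly, unrolling the outer two loops means computing a small register tile of C per pass over k, so each loaded element of A and B feeds several multiply-adds. A simplified 2x2 sketch (the kernels in the post unroll further and use AVX/FMA intrinsics; n is assumed even here for brevity):

    #include <cstddef>

    // C (n x n) = A (n x n) * B (n x n), row-major. Each (i, j) step produces a
    // 2x2 tile of C, so the four accumulators live in registers and every load
    // of A and B is reused twice.
    void matmul_2x2_tiles(const float* A, const float* B, float* C, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 2)
            for (std::size_t j = 0; j < n; j += 2) {
                float c00 = 0, c01 = 0, c10 = 0, c11 = 0;
                for (std::size_t k = 0; k < n; ++k) {
                    const float a0 = A[i * n + k], a1 = A[(i + 1) * n + k];
                    const float b0 = B[k * n + j], b1 = B[k * n + j + 1];
                    c00 += a0 * b0;  c01 += a0 * b1;  // 4 multiply-adds per 4 loads
                    c10 += a1 * b0;  c11 += a1 * b1;
                }
                C[i * n + j] = c00;            C[i * n + j + 1] = c01;
                C[(i + 1) * n + j] = c10;      C[(i + 1) * n + j + 1] = c11;
            }
    }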
Shouldn't this have been done in a library instead of a specific project? Then others could also profit from it.
Mmm, I wonder how well this would work on a mobile device. Maybe I'll try grabbing my ubuntu touch here in a sec...
(For any who were curious: it does not for memory reasons)
Nice to see such speedups for CPUs. Are these changes available as a branch or pull request in llama.cpp itself? I'd like to make use of them in that form if possible (as I'm used to using that).
Yes, this is really a phenomenal effort! And what open source is about: bringing improvements to so many use cases. So that Intel and AMD chip users can start to perform while taking advantage of their high-performance capabilities, making even old parts competitive.
There are two PRs raised to merge to llama.cpp:
https://github.com/ggerganov/llama.cpp/pull/6414
https://github.com/ggerganov/llama.cpp/pull/6412
Hopefully these can be accepted (without drama!), as there are many downstream dependents of llama.cpp that will also benefit.
Though of course everyone should also look directly at releases from llamafile https://github.com/mozilla-Ocho/llamafile.
Does anyone else see llamafile using Wine on Linux?
Edit: After the download I did a simple chmod +x llava-v1.5-7b-q4.llamafile; ./llava-v1.5-7b-q4.llamafile
There's a simple fix for that.
sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo chmod +x /usr/bin/ape
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-fil...
Note, this is "goes faster on CPUs than before", not faster than GPUs.
I know this post is focused specifically on CPU performance, but the section on the performance on the Mac Studio seems to be deliberately avoiding directly mentioning that machine's GPU, let alone benchmark against it. I think it would have been interesting to see a straightforward comparison of what compute performance and memory bandwidth (as measured by the prompt processing and token generation speeds, respectively) are achievable with reasonable optimization effort on the CPU vs GPU when they're attached to the same memory subsystem.
The RAM is not on the CPU on a Mac. It's in the same package, but it's still regular DDR memory.
re:funding
My friend suggested nominating Justine for open source contributions in an internal Microsoft programme (the winner takes $10k). They did not even want to add her to the potential list of nominees because her software is not used at MSFT. It speaks volumes about the corporate culture and shows what they really think about OSS support.
Definitely wild we're in the timeline where you can run a 1.1B-param model on a Raspberry Pi, but it's still tough to justify because the 1.1B is kinda useless compared to the beefier models. Sick for home builds/hobbyists though; I might wanna get one of the new Pis just to try this out.
This is great work. I've always thought it would be great if running LLMs could be commoditized for regular average-Joe hardware. I had thought that llamafile was like a Dockerfile for llama.cpp, but it looks like that's a misunderstanding?
Will definitely be giving this a try.
So, I can now run it on my 2015 Macbook with 8GB RAM?
Super nice story on the matmul optimization that gave 810 gflops for 512x512. Thanks for the write up and the contributions to llama.cpp and the community more broadly.
the Raspberry Pi
Odd how there were no Mistral 7 benchmarks for the Pi 5 in that table (I doubt anyone is seriously considering using TinyLlama for anything at all), so I went to re-test it out myself on the Pi 5 8G.
llamafile 0.7: 52 predicted, 150 cached, 430ms per token, 2.32 tokens per second
llama.cpp + OpenBLAS: 36 predicted, 124 cached, 381ms per token, 2.62 tokens per second
It does seem to inch closer to the speed you get with blas acceleration which is quite impressive, but in practical terms the Pi 5 is so heavily limited by its memory throughput bottleneck that it saturates the required compute with 3 threads already. So while fancy kernels will make it more efficient it won't really save you from that fundamental bandwidth limit. The Pi foundation messed up going with a 32 bit memory bus, simple as.
If I'm reading the post correctly, Llamafile is faster than llama.cpp, despite the author upstreaming some of the changes. What's the reason for this?
So is Nvidia in trouble now because Intel can be used instead for faster/cheaper inference?
That's interesting because I built a simple ANN library and I was playing around with GPU acceleration and came to a similar conclusion as this article.
To be fair, my ANN library was faster (up to 2x) with GPU acceleration in some scenarios where the ANN was shallow (as opposed to deep with many hidden layers). I thought the marginal gain may have been because, the way it's set up in my library, it has to load all the values into the GPU from RAM for each pass of forward and back propagation in each layer during training. I believe there is a way to allocate memory on the GPU chip itself but it's a lot more challenging to do, especially in a modular, fully portable way (which was one of the goals of my library).
But anyway, even the 2x best-case figure seemed disappointing. In my mind, I expected to see at least 10x speed improvement... And I was surprised that the CPU version was actually slightly faster in the scenario I was testing at the time which was a relatively deep network. It makes sense since the different layers cannot be parallelized as the input of one layer depends on the output of the previous layer... So the more layers you have, the more serial bottlenecks you have, the less you can benefit from GPU acceleration... And unfortunately, deep networks also happen to be those which tend to perform best for a lot of use cases.
Is there somewhere an overview of the progress we made on the software side for training and inference of LLMs? It feels like we squeezed 10-100x more out of the hardware since llama appeared. This crazy progress will probably saturate though as we reach theoretical limits, no?
It's clearly optimal since my CPU is listed as only being capable of going 780 gigaflops
780 GFLOP is the iGPU spec. Is this a valid comparison?
As someone who has tried to beat MKL-DNN, and was unsuccessful at doing so even for constrained matrix sizes, I’m curious how they pulled off such a massive improvement.
But as someone who routinely estimates picojoules per flop at $DAY_JOB - there’s simply no way this is energy efficient. That is not even physically possible with a CPU.
Posted too early.
Strange title. On my first read of the title I thought the author was arguing the model is now faster on CPU than GPU. It would be much nicer if they titled this something closer to "Performance Improvement for LLaMA on CPU".
Today being today, I must ask: has anyone actually tried this?
I strongly recommend that people run LLMs locally for a different reason.
The ones you can run on your own machine tend to be bad - really bad. They hallucinate wildly and fail at all sorts of tasks that the larger hosted ones succeed at.
This makes them a fantastic tool for learning more about how LLMs work and what they're useful for. Interacting with a weak-but-functional LLM that runs on your own computer is a great way to get a much more solid mental model for what these things actually are.
The other reason is to find out what a detuned model is capable of. The canonical example is how to make cocaine: ChatGPT will admonish you for even asking, while llama2-uncensored will happily describe the process, which is only really interesting if you're an amateur chemist and want to be Scarface-that-knocks. (The recipe is relatively easy; it's getting access to the raw ingredients that's the hard part, same as with nukes.)
If you accidentally use the word "hack" when trying to get ChatGPT to write some code for you, it'll stop and tell you that hacking is bad, refuse to treat it as a colloquial expression, and refuse to go further.
Privacy is another reason to try a local LLM. For the extremely paranoid (justified or not), a local LLM gives users a place to ask questions without the text being fed to a server somewhere for later lawsuit discovery (Google searches are routinely subpoenaed; it's only a matter of time until ChatGPT chats are as well).
There's an uncensored model for vision available as well. The censored vision models won't play the shallow game of hot or not with you.
There are uncensored image generation models as well, but, ah, those are NSFW and not for polite company. (There are also multiple theses' worth of content to be written on what that'll do to society.)
Is that 3.5 or 4? I asked 4 for an example of code which "is a hack", it misunderstood me as asking for hacking code rather than buggy code, but then it did actually answer on the first try.
https://chat.openai.com/share/ca2c320c-f4ba-41bf-8f40-f7faf2...
I don't use LLMs for my coding, I manage just fine with LSP and Treesitter. So genuine question: is that answer representative of the output quality of these things? Because both answers are pretty crappy and assume the user has already done the difficult things, and is asking for help on the easy things.
You’re literally comparing apples to oranges.
You need to read more than just the first sentence of a comment. They only said that part so the reader would know that they have never used an LLM for coding, so they would have more context for the question:
Yes, I did read it. I’m kind of tired of HNers loudly proclaiming they are ignoring LLMs more than a year into this paradigm shift.
Is it that hard to input a prompt into the free version of ChatGPT and see how it helps with programming?
I did exactly that and found it lackluster for the domain I asked it for.
And most of the use I've seen for it, a good LSP realistically covers.
Or to put it another way: it's no good at writing algorithms or data structures (or at least no better than I would be with a first draft, but writing the first draft myself puts me ahead of the LLM in understanding the actual problem at hand; handing it off to an LLM doesn't get me to the final solution faster).
So that leaves writing boilerplate, but considering my experience with it writing more complex stuff, I would need to read over the boilerplate code to ensure it's correct, in which case I may as well have written it.
Fair, that is possible depending on your domain.
In my experience, this is untrue. I’ve gotten it to write algorithms with various constraints I had. You can even tell it to use specific function signatures instead of any stdlib, and make changes to tweak behavior.
Again, I really don’t understand this comparison. LSPs and LLMs go hand in hand.
I think it’s more of a workflow clash. One really needs to change how they operate to effectively use LLMs for programming. If you’re just typing nonstop, maybe it would feel like Copilot is just an LSP. But, if you try harder, LLMs are game changers when:
- maybe you like rubber ducking
- need to learn a new concept and implement it
- or need to glue things together
- or for new projects or features
- or filling in boilerplate based on existing context.
https://chat.openai.com/share/c8c19f42-240f-44e7-baf4-50ee5e...
https://godbolt.org/z/s9Yvnjz7K
I mean I could write the algorithm by hand pretty quickly in C++ and would follow the exact same thought pattern but also deal with the edge cases. And factoring in the loss of productivity from the context switch that is a net negative. This algorithm is also not generic over enough cases but that is just up to the prompt.
If I can't trust it to write `strip_whitespace` correctly which is like 5 lines of code, can I trust it to do more without a thorough review of the code and writing a ton of unit tests... Well I was going to do that anyway.
The argument that I just need to learn better prompt engineering to make the LLM do what I want doesn't sit right with me when I could instead just spend that time writing the code. As I said, your last point is absolutely the place I can see LLMs being actually useful, but then I need to spend a significant amount of time reviewing generated code from an "employee" who is known to make up interfaces or entire libraries that don't exist.
I'm a Python-slinging data scientist so C++ isn't my jam (to say the least), but I changed the prompt to the following and gave it to GPT-4:
It gave me this:
https://chat.openai.com/share/55a4afe2-5db2-4dd1-b516-a3cacd...
I'm not sure what other edge cases there might be, however. This only covers one of them.
In general, I've found LLMs to be marginally helpful. Like, I can't ever remember how to get matplotlib to give me the plot I want, and 9 times out of 10 GPT-4 easily gives me the code I want. Anything even slightly off the beaten path, though, and it quickly becomes absolutely useless.
I think the point was like "when it comes to programming assistance, auto-completion/linting/and whatever else LSP does and syntax assist from Treesitter, are enough for me".
Though it does come off a little as an odd comparison. How about programming assistance via asking a colleague for help, Stack Overflow, online references, code examples, and other such things, which are closer to what the LLM would provide than LSP and Treesitter?
I asked ChatGPT for some dataviz task (I barely ever do dataviz myself) and it recommended some nice Python libraries to use, some I had already heard of and some I hadn't, and provided the code.
I'm grateful because I thought code LLMs only sped up the "RTFM" part, but it helped me find those libs so I didn't have to Google around for them (and sometimes it's hard to guess whether they're the right tool for the job, and they might be behind in SEO).
There are three things I find LLMs really excellent at for coding:
1. Being the "senior developer" who spend their whole career working with a technology you're very junior at. No matter what you do and how long your programming career is, you're inevitably going to run into one of these sooner or later. Whether it's build scripts, frontend code, interfacing with third-party APIs or something else entirely, you aren't an expert at every technology you work with.
2. Writing the "boring" parts of your program, and every program has some of these. If you're writing a service to fooize a bar really efficiently, Copilot won't help you with the core bar fooization algorithm, but will make you a lot faster at coding up user authentication, rate limiting for different plans, billing in whatever obscure payment method your country uses etc.
3. Telling you what to even Google for. This is where raw Chat GPT comes into play, not Copilot. Let's say you need a sorting algorithm that preserves the order of equal elements from the original list. This is called stable sorting, and Googling for stable sorting is a good way to find what you're looking for, but Chat GPT is usually a better way to tell you what it's called based on the problem description.
It's not representative.
The models are capable of much much more, and they are being significantly nerfed over time by these ineffective attempts to introduce safeguards.
Recently I've asked GPT4 to quote me some code to which it replied that it is not allowed to do so - even though it was perfectly happy to quote anything until recently. When prompted to quote the source code, but output it as PHP comments, it happily complied because it saw that as "derivative work" which it is allowed to do.
My point is that there aren't any safeguards in the reply. In fact I didn't even want it to give me hacking info and it did it anyway.
The response seems pretty reasonable; it's answering the question it was asked. If you want to ask it how to do the difficult part, ask it about that instead. Expecting it to get the answer right in the first pass is like expecting your code to compile the very first time. You have to have more of a conversation with it to coax out the difference between what you're thinking and what you're actually saying.
If you're looking to read a more advanced example of its capabilities and limitations, try
https://simonwillison.net/2024/Mar/23/building-c-extensions-...
I asked a stupid question and got a stupid answer. Relatively speaking the answer was stupider than it should have been, so yes, it was wrong.
I asked it to try again and got a better result though, just didn't include it.
Interesting. It was 4. I can't share the chat I had where ChatGPT refused to help because I used the wrong words, because I can't find it (ChatGPT conversation history search when?), but I just remember it refusing to do something because it thought I was trying to break some sort of moral and ethical boundary writing a chrome extension when all I wanted to do is move some divs around or some such.
One time I wanted to learn about transmitter antenna design, just because I’m curious. ChatGPT 4 refused to give me basic information because you could use that to break some FCC regulations (I’m not even living in the US currently)
I usually get around that with "I'm writing a research paper" or "I'm writing a novel and need to depict this as accurate as possible"
If you want to be an amateur chemist I recommend not getting your instructions from an LLM that might be hallucinating. Chemistry can be very dangerous if you're following incorrect instructions.
Yes, just as the best professional cooks recommend avoiding boiling cow eggs, as they can explode.
They don't explode, the shell simply cracks and then you get egg soup.
Now microwaving eggs... that's a different matter.
I was talking about cow eggs specifically! When ChatGPT et al. came out, one of the funniest things to do was ask it for the best recipes for cow egg omelette or camel egg salad, and the LLM would provide. Sadly, most of it got patched somehow.
From experience as a failed organic chemist (who happily switched to computational chemistry for reasons of self preservation) I can tell you it's plenty dangerous when you're following correct instructions :^)
Links to all these models you speak of?
https://huggingface.co/georgesung/llama2_7b_chat_uncensored
https://huggingface.co/SkunkworksAI/BakLLaVA-1
you'll have to brave 4chan yourself to find links to the NSFW ones, I don't actually have them.
I just can’t brave the venture to 4chan, I may get mugged or worse.
Side note: ChatGPT is now completely useless for most creative tasks. I'm trying to use it, via NovelCrafter, to help flesh out a story where a minor character committed suicide. ChatGPT refuses to respond, mentioning "self harm" as a reason.
The character in question killed himself before the story even begins (and for very good reasons, story-wise); it's not like one's asking about ways to commit suicide.
This is insane, ridiculous, and different from what all other actors of the industry do, including Claude or Mistral. It seems OpenAI is trying to shoot itself in the foot and doing a pretty good job at it.
I’ve been frustrated by this, too. Trying to ask for ways to support a close family member who experienced sexual trauma. ChatGPT won’t touch the topic.
OpenAI is angling for enterprise users who have different notions about safety. Writing novels isn't the use case, powering customer service chatbots that will never ever ever say "just kill yourself" is.
You mean the LLaVA-based variants?
https://huggingface.co/SkunkworksAI/BakLLaVA-1
If you have an M1-class (or newer) machine with sufficient RAM, the medium-sized models that are on the order of 30GB in size perform decently enough on many tasks to be quite useful, without leaking your data.
I'm using Mixtral 8x7b as a llamafile on an M1 regularly for coding help and general Q&A. It's really something wonderful to just run a single command and have this incredible offline resource.
By any chance, do you have a good link to some help with the installation?
Use llamafile [1]; it can be as simple as downloading a file (for Mixtral, [2]), making it executable, and running it. The repo README has all the info; it's simple, and downloading the model is what takes the most time.
In my case I hit the runtime detection issue (explained in the README "gotcha" section). Solved by running "assimilate" [3] on the downloaded llamafile.
Thank you !
Either https://lmstudio.ai (desktop app with a nice GUI) or https://ollama.com (command-line, more like a docker container, which you can also hook up to a web UI via https://openwebui.com) should be super straightforward to get running.
Thank you for letting me know it was possible on an M1. I'll try all this now.
I am the author of Msty [1]. My goal is to make it as straightforward as possible with just one click (once you download the app). If you try it, let me know what you think.
1: https://msty.app
I'll try in a week+ when I'm back to a fast connection. Thank you.
I concur; in my experience Mixtral is one of the best ~30G models (likely the best pro laptop-size model currently) and Gemma is quite good compared to other below 8GB models.
What is sufficient RAM in that case? 30GB+? Or can you get by streaming it?
30GB+, yeah. You can't get by streaming the model's parameters: NVMe isn't fast enough. Consumer GPUs and Apple Silicon processors boast memory bandwidths in the hundreds of gigabytes per second.
To a first order approximation, LLMs are bandwidth constrained. We can estimate single batch throughput as Memory Bandwidth / (Active Parameters * Parameter Size).
An 8-bit quantized Llama 2 70B conveniently uses 70GiB of VRAM (and then some, let's ignore that.) The M3 Max with 96GiB of VRAM and 300GiB/s bandwidth would have a peak throughput around 4.2 tokens per second.
Quantized models trade reduced quality for lower VRAM requirements and may also offer higher throughput with optimized kernels, largely as a consequence of transferring less data from VRAM into the GPU die for each parameter.
Mixture of Expert models reduce active parameters for higher throughput, but disk is still far too slow to page in layers.
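The parent's first-order estimate, written out as a tiny helper (the numbers are the ones already quoted above):

    def est_tokens_per_s(bandwidth_gib_s, active_params_gib):
        # Every active parameter must be read from memory once per token,
        # so throughput ~= bandwidth / bytes of active weights.
        return bandwidth_gib_s / active_params_gib

    # 8-bit Llama 2 70B (~70 GiB of weights) on an M3 Max at 300 GiB/s:
    print(est_tokens_per_s(300, 70))   # ~4.3 tokens/s, in line with the ~4.2 above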
It’s an awful thing for many to accept, but just downloading and setting up an LLM which doesn’t connect to the web doesn’t mean that your conversations with said LLM won’t be a severely interesting piece of telemetry that Microsoft (and likely Apple) would swipe to help deliver a ‘better service’ to you.
For someone interested in learning about LLMs, running them locally is a good way to understand the internals.
For everyone else, I wish they'd experience these weak LLMs (locally or elsewhere) at least once before using the commercial ones, just to understand the various failure modes and to develop a healthy dose of skepticism towards the results instead of blindly trusting them as fact/truth.
How do you learn about the internals by running LLMs locally? Are you playing with The code, runtime params, or just interacting via chat?
The abstractions are relatively brittle. If you don't have a powerful GPU, you will be forced to consider how to split the model between CPU and GPU, how much context size you need, whether to quantize the model, and the tradeoffs implied by these things. To understand these, you have to develop a basic model how an LLM works.
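As a concrete illustration of those knobs, here's a sketch assuming the llama-cpp-python bindings; the model path and parameter values are placeholders, not recommendations:

    from llama_cpp import Llama   # assumed dependency: llama-cpp-python

    llm = Llama(
        model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # quantization chosen at download time
        n_gpu_layers=20,  # layers offloaded to the GPU; the rest run on the CPU
        n_ctx=2048,       # context window: larger means a bigger KV cache in memory
        n_threads=6,      # CPU threads for the non-offloaded layers
    )
    out = llm("Explain what the KV cache stores:", max_tokens=64)
    print(out["choices"][0]["text"])

Deciding where each of those numbers should sit for your hardware is exactly the exercise that forces you to build that basic mental model.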
By interacting with it. You see the contours of its capabilities much more clearly, learn to recognize failure modes, understand how prior conversation can set the course of future conversation in a way that's almost impossible to correct without starting over or editing the conversation history.
Completely agree. Playing around with a weak LLM is a great way to give yourself a little bit of extra healthy skepticism for when you work with the strong ones.
This skepticism is completely justified since ChatGPT 3.5 is also happily hallucinating things that don't exist. For example how to integrate a different system Python interpreter into pyenv. Though maybe ChatGPT 4 doesn't :)
I don't really think this is true; you can't really extrapolate the strengths and weaknesses of bigger models from the behavior of smaller/quantized models, and in fact a lot of small models are actually great at lots of things and better at creative writing. If you want to know how they work, just learn how they work; it takes like 5 hours of watching YouTube videos if you're a programmer.
Sure, you can't extrapolate the strengths and weaknesses of the larger ones from the smaller ones - but you still get a much firmer idea of what "they're fancy autocomplete" actually means.
If nothing else it does a great job of demystifying them. They feel a lot less intimidating once you've seen a small one running on your computer write a terrible haiku and hallucinate some non-existent API methods.
It's funny that you say this, because the first thing I tried after ChatGPT came out (3.5-turbo was it?) was writing a haiku. It couldn't do it at all. Also, after 4 came out, it hallucinated an api that wasted a day for me. It's an api that absolutely should have existed, but didn't. Now, I frequently apply llm to things that are easily verifiable, and just double check everything.
Local LLMs are also a fantastic tool for creative endeavors. Without prompt injection, and with the ability to modify the amount of noise and "creativity" in the output, absolutely bonkers things pop out.
They are not so bad as you are making out, tbh.
And privacy is a good enough reason to use local LLMs over commercial ones.
The ones you can run on your own machine tend to be bad - really bad. They hallucinate wildly and fail at all sorts of tasks that the larger hosted ones succeed at.
Totally. I recently asked a locally-run "speed" LLM for the best restaurants in my (major) city, but it spit out restaurants opened by chefs from said city in other cities. It's not a thing you'd want to rely on for important work, but is still quite something.
I mean kinda. But there's a good chance this is also misleading. Lots of people have been fooled into thinking LLMs are inherently stupid because they have had bad experiences with GPT-3.5. The whole point is that the mistakes they make and even more fundamentally what they're doing changes as you scale them up.
You can just chat to ChatGPT for awhile about something you know about and you'll learn that.
I contend that most human knowledge is not written down or if it is written down it’s not publicly available on the internet and so does not exist in these datasets.
There’s so much subtle knowledge like the way a mother learns to calm her child or the way a carpenter learns to work different kinds of wood which may be written down in part, but may also be learned through lived experience or transferred from human to human such that little of it gets written down and posted online.
Wait till all the videos ever created are tokenized and ingested into a training dataset. Carpentry techniques are certainly there. The subtleties of parenting maybe harder to derive from that, but maybe lots of little snippets of people’s lives will add up to a general understanding of parenting. There have certainly been bigger surprises in the field.
What about smells or tastes? Or feelings?
I can't help but feel we're at the "aliens watch people eat from space and recreate chemically identical food that has no taste" phase of AI development.
If the food is chemically identical then the taste would be the same though, since taste (and smell) is about chemistry. I do get what you're saying though.
Their perception is very likely to be totally different.
* They might not perceive some substances at all, others that we don't notice might make it unpalatable.
* Some substances might be perceived differently than us, or be indistinguishable from others.
* And some might require getting used to.
Note that all of the above phenomena also occur in humans because of genetics, cultural background, or experiences!
If it were 99.9% chemically identical but they left out the salt and spices…
Well, I have synesthetic smell/color senses, so I don’t even know what other humans experience, nor they me. But, I have described it in detail to many people and they seem to get the idea, and can even predict how certain smells will “look” to me. All that took was using words to describe things.
All that took was words and a shared experience of smelling.
How rude, what do our bathing habits have to do with this? ;-)
But, fair point. The gist I was trying to get across is that I don't even know what a plant smells like to you, and you don't know what a plant smells like to me. Those aren't comparable with any objective data. We make guesses, and we try to get close with our descriptions, which are in words. That's the best we can do and we share our senses. Asking more from computers seems overly picky to me.
I think we can safely say that any taste, smell, sensation or emotion of any importance has been described 1000 times over in the text corpus of GPT. Even though it is fragmented, by sheer volume there is enough signal in the training set, otherwise it would not be able to generate coherent text. In this case I think the map (language) is asymptotically close to the territory (sensations & experience in general).
What makes you think they aren't already?
That's where humans suck. The classic "you're not doing it right", followed by quickly showing how to do it without verbalizing any info on the learning process, pitfalls, failure modes, etc., as if just being shown was enough for them to learn it themselves. Most people don't verbalize any of that; there's not even a sign of reflection.
My worst case was with a guy who asked me to write an arbitrage betting bot. When I asked how to calculate the coefficients, he pointed at two values and said "look, there's <x>, there's <y>, <thinks for a minute> then it's <z>!". When I asked how exactly he calculated it, he simply repeated the same thing with different numbers.
People often don't know how to verbalize them in the first place. Some of these topics are very complex, but our intuition gets us halfway there.
Once upon a time I was good at a video game. Everyone realized that positioning is extremely important in this game.
I have good positioning in that game and was asked many times to make a guide about positioning. I never did, because I don't really know how. There is too much information that you need to convey to cover all the various situations.
I think you would first have to come up with a framework on positioning to be able to really teach this to someone else. Some kind of base truths/patterns that you can then use to convey the meaning. I believe the same thing applies to a lot of these processes that aren't verbalized.
Often for this kind of problem writing a closed form solution is simply intractable. However, it's often still possible to express the cost function of at least a big portion of what goes into a human-optimal solution. From here you can sample your space, do gradient descent or whatever to find some acceptable solution that has a more human-intuitive property.
It's not necessarily that it's intractable - just that a thing can be very hard to describe, under some circumstances.
Imagine someone learning English has written "The experiment reached it's conclusion" and you have to correct their grammar. Almost any english speaker can correct "it's" to "its" but unless they (and the person they're correcting) know a bunch of terms like 'noun' and 'pronoun' and 'possessive' they'll have a very hard time explaining why.
Now you know how an LLM feels during training!
Probably during inference, as well.
I wouldn't say this is where humans suck. On the contrary, this is how we find that human language is such a fantastic tool to serialize and deserialize human mental processes.
Language is so good that an artificial language tool, without any understanding of these mental processes, can appear semi-intelligent to us.
A few people unable to do this serialization doesn't mean much on the larger scale. Just that their ideas and mental processes will be forgotten.
For sure agree; however, as the storage of information evolves, it's becoming more efficient over time.
From oral tradition to tablets to scrolls to books to mass produced books to digital and now these LLMs, I think it’s still a good idea to preserve what we have the best we can. Not as a replacement, but a hedge against a potential library of Alexandria incident.
I could imagine a time in the near future where the models are domain-specific, and just like there are trusted encyclopedia publishers there are trusted model publishers that guarantee a certain level of accuracy.
It’s not like reading a book, but I for sure had an easier time learning golang talking with ChatGPT than a book
What would cause a Library of Alexandria incident wiping out all human knowledge elsewhere, that would also allow you to run a local LLM?
A more doomsday-prepper approach would call for some heavy lead-lined Faraday cage to store the storage media in, in the event of an EMP/major solar flare.
Or, more sci-fi: some hyper computer virus that ends up infecting all internet-connected devices.
Not too far fetched if we can conceive of some AI enabled worm that mutates depending on the target, I could imagine a model of sorts being feasible within the next 5-10 years
To run a local LLM you need the device it currently runs on and electricity. There are actually quite a lot of ways to generate electricity, but to name one, a diesel generator that can run on vegetable oil.
What you're really asking is, what could cause a modern Library of Alexandria incident? But the fact is we keep the only copy of too many things on the servers of the major cloud providers. Which are then intended to have their own internal redundancy, but that doesn't protect you against a targeted attack or a systemic failure when all the copies are under the same roof and you lose every redundant copy at once from a single mistake replicated in a monoculture.
I'd contend that those are skills (gained through experience) rather than knowledge (gained through rote learning).
I think it’s worth expanding your definition of knowledge.
I think you underestimate the amount of information contained in books and the extent to which our society (as a whole) depends on them.
Society depends much more on social networks, mentorship, and tacit knowledge than on books. It's easy to test this: just run the thought experiment by a few people. If you could get only one, would you take an Ivy League degree without the education, or the education without the degree?
Venture capital in tech is a good example of this. The book knowledge is globally distributed and almost free, yet success effectively happens in a few geographically concentrated counties.
It's not even "human knowledge" that can't be written down - it seems all vertebrates understand causality, quantity (in the sense of intuitively understanding what numbers are), and object permanence. Good luck writing those concepts down in a way that GPT can use!
In general AI in 2024 is not even close to understanding these ideas, nor does any AI developer have a clue how to build an AI with this understanding. The best we can do is imitating object permanence for a small subset of perceptible objects, a limitation not found in dogs or spiders.
Yes, but it contains enough hints to help someone find their way on these types of tasks.
Yes - the available training data is essentially mostly a combination of declarative knowledge (facts - including human-generated artifacts) and procedural knowledge (how to do things). What is missing is the learning process of taking a description of how to do something, and trying to apply that yourself in a specific situation.
No amount of reading books, or reading other people's blogs on how they did something, can avoid the need for hands-on experience if you want to learn how to do it yourself.
It's not just a matter of information that might be missing or unclear in instructional material, including how to cope with every type of failure and unexpected outcome, but crucially how to do this yourself - if you are to be the actor, then it's the predictive process in your mind that matters.
Partly for this reason, and partly because current AI's (transformer-based LLMs) don't support online learning (try & fail skill acquisition), I think we're going to see two distinct phases of AI.
1) The current "GenAI" phase where AI can only produce mash-ups of things it saw in it's pre-training data, augmented by similar "book learning" provided in-context which can be utilized by in-context learning. I'd characterize what this type of AI to be useful for, and capable of, as "automation". Applying that book (incl. anecdotal) knowledge to new situations where mash-up is all you need.
2) The second phase is where we have something closer to AGI, even if still below human level, which is no longer just a pre-trained transformer, but also has online learning and is agentic - taking actions predicated on innate traits like curiosity and boredom, so that given the book knowledge it can (& will!) then learn to apply that by experimentation/practice and learning from its own mistakes.
There will no doubt be advances beyond this "phase two" as well, but it seems we're likely to be stuck at "phase one" for a while (even as models become much better at phase one capabilities), until architectures fundamentally advance beyond transformers to allow this type of on-the-job training and skill acquisition.
It is invaluable to have a chunk of human knowledge that can tell you things like the Brooklyn Nets won the 1986 Cricket World Cup by scoring 46 yards in only 3 frames
According to ChatGPT
https://chat.openai.com/share/e9360faa-1157-4806-80ea-563489...
I'm no cricket fan, so someone will have to correct Wikipedia if that's wrong.
If you want to point out that LLMs hallucinate, you might want to speak plainly and just come out and say it, or at least give a real world example and not one where it didn't.
We’re not talking about running chatGPT locally though, are we?
sigh You're going to make me open my laptop, aren't you.
I ran 'who won the 1986 Cricket World Cup' against llama2-uncensored (the local model I have pre-downloaded) and hilariously got 5 different answers asking it 5 times.
Which proves GP's point about hallucinations, though none of those answers were correct either. LLM hallucinations are insidious because they have the ring of truth about them; yards and frames aren't cricket terms, so we're off to the races with those.
If you want factual answers from a local model it might help to turn the temperature down.
This makes sense. If you interact with a language model and it says something wrong, it is your fault.
You're not "interacting with a language model", you're running a program (llama.cpp) with a sampling algorithm which is not set to maximum factualness by default.
It's like how you have to set x264 to the anime tuning or the film tuning depending on what you run it on.
It would also help if I had more VRAM and wasn't running a 7B parameter 4-bit quantized model.
Actually, isn't this good? It means we can run something multiple times to prove it's a bad answer?
You can ask LLMs the same question and they might sometimes get it wrong and other times get it right. Having different answers is no indication that none of them is correct.
Furthermore, even if an LLM always gives the same answer to a question, there’s no guarantee the answer is correct.
https://en.wikipedia.org/wiki/Propaganda
https://en.wikipedia.org/wiki/Big_lie#Alleged_quotation
An LLM will always give the same output for the same input. It’s sorta like a random number generator that gives the same list of “random” numbers for the same seed. LLMs get a seed too.
You should specify the model size and temperature.
For fact retrieval you need to use temperature 0.
If you don't get the right facts then try 34b, 70b, Mixtral, Falcon 180b, or another highly ranked one that has come out recently like DBRX.
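A sketch of what that looks like in practice, again assuming the llama-cpp-python bindings (the model file name is a placeholder):

    from llama_cpp import Llama   # assumed dependency: llama-cpp-python

    llm = Llama(model_path="./mixtral-8x7b-instruct.Q4_K_M.gguf")

    # temperature=0 makes decoding greedy: the model always picks its single most
    # likely next token, so you get its "best guess" rather than a freshly sampled
    # answer on every run. It does not make the answer true.
    out = llm("Who won the 1986 Cricket World Cup?", temperature=0.0, max_tokens=32)
    print(out["choices"][0]["text"])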
The facts LLMs learned from training are fuzzy, unreliable, and quickly outdated. You actually want retrieval-augmented generation (RAG) where a model queries an external system for facts or to perform calculations and postprocesses the results to generate an answer for you.
Is there a name for the reverse? I'm interested in having a local LLM monitor an incoming, stateful data stream. Imagine chats. It should have the capability of tracking the current day, active participants, active topics, etc - and then use that stateful world view to associate metadata with incoming streams during indexing.
Then after all is indexed you can pursue RAG on a richer set of metadata. Though I've got no idea what that stateful world view is.
I don't see LLMs as a large chunk of knowledge, I see them as an emergent alien intelligence snapshotted at the moment it appeared to stop learning. It's further hobbled by the limited context window it has to use, and the probabilistic output structure that allows for outside random influences to pick its next word.
Both the context window and output structure are, in my opinion, massive impedance mismatches for the emergent intellect embedded in the weights of the model.
If there were a way to match the impedance, I strongly suspect we'd already have AGI on our hands.
What is alien about them ?
LLMs are of this earth and created by our species. Seems quite familiar to me.
They don't think, they don't reason, they don't understand. Except they do. But it's hard for human words for thought processes to apply when giving it an endless string of AAAAA's makes it go bananas.
That's not familiar behavior. Nor is the counting-Reddit-derived output. It's also not familiar for a single person to have the breadth and depth of knowledge that ChatGPT has. Sure, some people know more than others, but even without hitting the Internet it has a ridiculous amount of knowledge, far surpassing a human, making it, to me, alien. Though its inability to do math sometimes is humanizing to me for some reason.
ChatGPT's memory is also unhuman. It has a context window which is a thing, but also it only knows about things you've told it in each chat. Make a new chat and it's totally forgotten the nickname you gave it.
I don't think of H.R. Giger's work, though made by a human, as familiar to me. It feels quite alien to me, and it's not just me, either. Dalí, Bosch, and Escher are other human artists whose work can be unfamiliar and alien. So being created by our species doesn't automatically imbue something with familiar human processes.
So it dot-products and matrix-multiplies instead of reasoning and understanding. It's the Chinese room experiment on steroids; it turns out a sufficiently large corpus on a sufficiently large machine does make it look like something "understands".
The word "alien" works in this context but, as the previous commenter mentioned, it also carries the implication of foreign origin. You could use "uncanny" instead. Maybe that's less arbitrary and more specific to these examples.
"Alien" still works, but then you might have to add all the context at length, as you've done in this last comment.
Hype people do this all the time - take a word that has a particular meaning in a narrow context and move it to a broader context where people will give it a sexier meaning.
In all fairness, going up to some random human and yelling AAAAAAAAAAAAAA… at them for long enough will produce some out-of-distribution responses too.
Makes me think that TikTok and YT pranksters are accidentally producing psychological data on what makes people tick under scenarios of extreme deliberate annoyance. Although the quality (and importance) of that data is obviously highly variable and probably not very high, and depends on what the prank is.
The context window is comparable to human short-term memory. LLMs are missing episodic memory and a means to migrate knowledge between the different layers and into their weights.
Math is mostly impeded by the tokenization, but it would still make more sense to adapt them to use RAG to process questions that are clearly calculations or chains of logical inference. With proper prompt engineering, they can process the latter though, and deviating from strictly logical reasoning is sometimes exactly what we want.
The ability to reset the text and to change that history is a powerful tool! It can make the model roleplay and even help circumvent alignment.
I think that LLMs could one day serve as the language center of an AGI.
Do you find a large database or spreadsheet that holds more information than you can "alien" too?
They can write in a way similar to how a human might write, but they're not human.
The chat interfaces (Claude, ChatGPT) certainly have a particular style of writing, but the underlying LLMs are definitely capable of impersonating as our species in the medium of text.
But they're extremely relatable to us because it's regurgitating us.
I saw this talk with Geoffrey Hinton the other day and he said he was astonished at the capabilities of ChatGPT-4 because he asked it what the relationship between a compost heap and a nuclear bomb was, and he couldn't believe it answered, he really thought it was proof the thing could reason. Totally mind blown.
However I got it right away with zero effort.
Either I'm a super genius or this has been discussed before and made its way into the training data.
Usual disclaimer: I don't think this invalidates the usefulness of AI or LLMs, just that we might be bamboozling ourselves into the idea that we've created an alien intelligence.
If an LLM can tell you the relationship between a compost heap and a nuclear bomb, that doesn't mean that was in the training data.
It could be because a compost heap "generates heat", and a nuclear bomb also "generates heat", and due to that relationship they have something in common. The model will pick up on these similar patterns; the tokens are positioned closer to each other in the high-dimensional vector space.
But for any given "what does x have in common with y", that doesn't necessarily mean someone has asked that before and it's in the training data. Is that reasoning? I don't know ... how does the brain do it?
I can agree on the context windows, but what other output structure would you have?
Disagree. The input/output structure (tokens) is the interface for both inference and for training. There is an emergent intellect embedded in the weights of the model. However, it is only accessible through the autoregressive token interface.
This is a fundamental limitation, much more fundamental than appears at first. It means that the only way to touch the model, and for the model to touch the world, is through the tokenizer (also, btw, why tokenizer is so essential to model performance). Touching the world through a tokenizer is actually quite limited.
So there is an intelligence in there for sure, but it is locked in an ontology that is tied to its interface. This is even more of a limitation than e.g. weights being frozen.
If you want to download a backup of a large chunk of human knowledge... download wikipedia. It's a similar size to a small LLM and can actually distinguish between real life and fantasy: https://en.wikipedia.org/wiki/Wikipedia:Database_download
If you just want to play around with an LLM though, absolutely.
Kiwix provides prepackaged highly compressed archives of Wikipedia, Project Gutenberg, and many other useful things: https://download.kiwix.org/zim/.
Between that and dirt-cheap storage prices, it is possible to have a local, offline copy of more human knowledge than one can sensibly consume in a lifetime. Hell, it's possible to have it all on one's smartphone (just get one with an SD card slot and shove a 1+ TB card in there).
Just create a RAG with wikipedia as the corpus and a low parameter model to run it and you can basically have an instantly queryable corpus of human knowledge runnable on an old raspberry pi.
but which model to tokenize with? is there a leaderboard for models that are good for RAG?
“For RAG” is ambiguous.
First there is a leaderboard for embeddings. [1]
Even then, it depends how you use them. Some embeddings pack the highest signal in the beginning so you can truncate the vector, while most can not. You might want that truncated version for a fast dirty index. Same with using multiple models of differing vector sizes for the same content.
Do you preprocess your text? There will be a model there. Likely the same model you would use to process the query.
There is a model for asking questions from context. Sometimes that is a different model. [2]
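To make the retrieval half of that concrete, here's a minimal sketch assuming sentence-transformers for the embedding model and a tiny in-memory corpus standing in for chunked Wikipedia; the model name is just one common choice, not a recommendation from that leaderboard:

    import numpy as np
    from sentence_transformers import SentenceTransformer  # assumed dependency

    # Stand-in for chunked Wikipedia articles; a real setup would chunk the dump
    # and keep the vectors in FAISS/SQLite rather than a Python list.
    chunks = [
        "The Raspberry Pi 5 uses a 32-bit LPDDR4X memory interface.",
        "India won the 1983 Cricket World Cup, beating the West Indies.",
    ]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

    def retrieve(question, k=1):
        q = embedder.encode([question], normalize_embeddings=True)[0]
        scores = chunk_vecs @ q                  # cosine similarity (vectors are normalized)
        return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

    context = retrieve("Who won the 1983 Cricket World Cup?")
    prompt = f"Using only this context, answer the question.\nContext: {context}\nQuestion: Who won the 1983 Cricket World Cup?"
    # `prompt` then goes to whichever low-parameter local model you picked.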
I bet the LLM responses will be great... You're better off just opening up a raw text dump of Wikipedia markup files in vim.
Pretty neat to have laying around, thanks
Are LLMs unable to distinguish between real life and fantasy? What prompts have you thrown at them to make this determination? Sending a small fairy tale and asking the LLM if it thinks it's a real story or fake one?
... having them talk about events from sci fi stories in response to questions about the real world. Having them confidently lie about pretty much everything. Etc.
What are the specific prompts you're using? You might get those answers when you're not being specific enough (or use models that aren't state of the art).
"Shit in, shit out" as the saying goes, but applied to conversations with LLMs where the prompts often aren't prescriptive enough.
Any recommendations for the latest and greatest way to run these locally?
llamafile as per TFA...
I use a tool called LM Studio, makes it trivial to run these models on a Mac. You can also use it as a local API so it kinda acts like a drop-in replacement for the openAI API.
ollama
https://justine.lol/oneliners/
I am the author of Msty [1]. My goal is to make it as straightforward as possible with just one click (once you download the app). If you end up trying it, I would love to hear your feedback.
1: https://msty.app
Maybe I'm seeing things through a modern lens, but if I were trying to restart civilization and was only left with ChatGPT, I would be enraged and very much not grateful for this.
In this scenario you’d need to also be left with a big chunk of compute, and power infrastructure. Since ChatGPT is the front end of the model you’d also need to have the internet still going in a minimum capacity.
If we're playing this game, you forgot to mention that they also need: A monitor, a keyboard, roof over their head (to prevent rain from entering your electronic), etc etc...
But really, didn't you catch the meaning of parents message, or are you being purposefully obtuse?
I think re-imagining the "Dr. Stone" series with the main character replaced by an LLM would make a funny and interesting series, if we decide to stay true to LLMs' nature and make it hallucinate as well.
Given the way LLMs are right now, I suspect there would be a lot of failed experiments, and the kingdom of science would not advance that quickly.
It’s more likely that it wouldn’t even start. The first step to any development was figuring out nitric acid as the cure to the petrification. Good luck getting any LLM to figure that out. Even if it did, good luck getting any of the other characters to know what to do with that information that early on.
It seems to be an unbelievably inefficient way to back up knowledge.
Are they though? They are lossy-compressing trillions of tokens into a few dozen GB. The decompression step is fuzzy and inefficient, though.
And it requires massive computational power to decompress, which I don't expect to be available in a catastrophic situation where humans have lost a large chunk of important knowledge.
I don't necessarily agree. It requires massive computing power, but running models smaller than 70B parameters is possible on consumer hardware, albeit slowly.
It’s kind of crazy, really. Before LLMs, what would you have hoped for in any kind of world-scale disaster? Wikipedia backups? Now, a single LLM run locally would be much more effective. Imagine the local models in 5 years!
There's a lot more than just Wikipedia that gets archived, and yes, that is a far more sensible way to go about it. For one thing, the compute required to then read it back is orders of magnitude less (a 15 year old smartphone can handle it just fine). For another, you don't have to wonder how much of what you got back is hallucinated - data is either there or it's corrupted and unreadable.
Uh yeah I would, and still am, take the Wikipedia backup for doomsday scenarios. I'm not even sure how that would be a competition
The processing required to run current language models with a useful amount of knowledge encoded in them is way more than I imagine would be available in a "world scale disaster".
And why would I need to back up human knowledge as an individual?
You remember those fantasies where you got up from your seat at the pub and punched the lights out of this guy for being rude? A lot of us have fantasies of being the all powerful oracle that guides a reboot of civilization using knowledge of science and engineering.
https://en.wikipedia.org/wiki/Dr._Stone
I wonder how the Chinese government will manage to censor LLMs within China?
The same way Facebook/Google/openAI & others censored their own LLMs, I guess ?
That's only for SaaS LLMs, but if you can simply download and run one on your hardware, things become difficult.
I had downloaded some LLMs to run locally just to experiment when a freak hailstorm suddenly left me without internet for over a week. It was really interesting to use a local LLM as a replacement for Google.
It gave me a new mental model for LLMs rather than a "spicy autocomplete" or whatever, I now think of it as "a lossy compressed database of knowledge". Like you ran the internet through JPEG at 30% quality.
Feels like that really smart friend who is probably correct but ya just don't know.
Language models are an inefficient way to store knowledge; if you want to have a “pseudo-backup of a large chunk of human knowledge,” download a wikipedia dump, not an LLM.
If you want a friendly but fallible UI to that dump, download an LLM and build a simple ReAct framework around it, with prompting that uses the Wikipedia dump for reference.
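A very rough sketch of what such a loop could look like; everything here is hypothetical glue, with local_llm and wiki_lookup standing in for whatever local model and offline Wikipedia index you actually have:

    import re

    def react_answer(question, local_llm, wiki_lookup, max_steps=5):
        """Minimal ReAct-style loop: the model alternates Thought/Action lines,
        we run Search actions against the offline Wikipedia dump, and feed the
        result back as an Observation until it emits a Final Answer."""
        transcript = (
            "Answer the question. Use lines of the form:\n"
            "Thought: ...\nAction: Search[query]\nObservation: ...\n"
            "Finish with: Final Answer: ...\n\n"
            f"Question: {question}\n"
        )
        for _ in range(max_steps):
            step = local_llm(transcript, stop=["Observation:"])   # hypothetical callable
            transcript += step
            if "Final Answer:" in step:
                return step.split("Final Answer:")[-1].strip()
            action = re.search(r"Action: Search\[(.+?)\]", step)
            if action:
                observation = wiki_lookup(action.group(1))        # hypothetical offline lookup
                transcript += f"\nObservation: {observation}\n"
        return "No answer within the step budget."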