Perhaps somewhat off topic, does anyone know how stuff like ggml compares to runtimes (tensorflow lite, onnxruntime, etc.)?
With all such optimizations, isn't llama too slow to be run on a CPU with a reasonable amount of parameters?
I'm gpu poor so I use cpu to run large models. For example, llamafile running mixtral 8x22b on my threadripper can chew through long legal documents and give me advice in a few minutes, using f16 weights and kahan summation. Show me someone who has that many graphics cards.
Are you "gpu poor" by choice? I would have thought you, of all people, would be "gpu wealthy" like few can imagine.
I'm a hobbyist developer. I haven't been employed in six years. In the past few months I've been fortunate enough to find work as a TVC for Mozilla, which has been wonderful. It's a good lifestyle that affords me a great deal of freedom and autonomy, but certainly not a lot of GPUs. I have an RTX 4090, a few XTX cards, and I rent stuff from Vast occasionally, but that's about it.
I love how RTX 4090 is now "GPU poor"
It can't run most of the open source models being released today. It's not useful if your job is to keep up with LLM releases.
What are your tokens/s on that setup?
I don't see how that conclusion follows from this optimization.
Models get better for equal parameter counts, CPUs get faster, and the good folks at Intel, AMD, and ARM are probably working very hard to catch up with Apple silicon in terms of memory architecture. I can see this being very relevant in a couple of years.
How much do these silu and softmax improvements affect the LLM inference speed as a whole? Correct me if I'm wrong but I feel that this change will only have a small effect as the majority of the time is spent doing matrix multiplications.
Overwhelming majority of flops is indeed spent on matmuls, but softmax disproportionately uses memory bandwidth, so it generally takes much longer than you'd expect from just looking at flops.
Why does it disproportionately use bandwidth?
In transformers the attention matrix is N*N, so there are a lot of values to go over. Typically makes it memory bandwidth bound, not compute bound.
Wouldn’t the softmax typically be “fused” with the matmul though?
Yes, but as far as I understand this is only really usefully possible with FlashAttention. (The main idea is that you have to use the log-sum-exp trick when computing the softmax, but you don't know the final max activation up front when streaming, so you have to rescale your partial sums whenever it changes.)
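To make the rescaling concrete, here's a minimal scalar sketch of the online softmax idea (my own illustrative code, not llama.cpp's): keep a running max and a running sum of exponentials, and rescale the sum whenever the max changes.

    #include <math.h>
    #include <stddef.h>

    void softmax_online(const float *x, float *y, size_t n) {
        float m = -INFINITY;   /* running max (numerical stability) */
        float d = 0.0f;        /* running sum of exp(x[i] - m) */
        for (size_t i = 0; i < n; i++) {
            float m_new = x[i] > m ? x[i] : m;
            /* rescale the old sum whenever the running max changes */
            d = d * expf(m - m_new) + expf(x[i] - m_new);
            m = m_new;
        }
        for (size_t i = 0; i < n; i++)
            y[i] = expf(x[i] - m) / d;
    }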
Oooooh, I forgot that the self attention layer has a softmax. I thought this was referring to a softmax on the dense forward layer. Thanks!
Next question: does the softmax in the SA block cause it to be bandwidth bound—won’t it have to materialize all the parameters of the N^2 matrix either way? Does SM cause redundant data reads?
If cpu softmax were limited by memory bandwidth, then these vectorization optimizations wouldn't improve performance.
About 20 years ago, I was programming for the Hughes radar signal processor, a highly parallel pipelined machine which accounted for much of Hughes' success in radar processing. Anyway, I needed to compute e^x for 0 < x < 1. The processor had a multiply, so I used four 256-long tables of e^x, one for each possible 8-bit value in the 4 blocks of the 32-bit word, and multiplied them to get the final value. It was about 5 times as fast as the previous best e^x routine. That machine was fun! It is obsolete now, but for many years it could process radar signals faster than processors that were nominally many times faster.
In case anyone else wasn't quite following this, I think the idea is
e^x
= e^(a+b+c+d) (where abcd are the bytes of x)
= e^a * e^b * e^c * e^d
and then you make a lookup table for each of those e^a, e^b values.
(I fudged "a" here, where it's really more like "high byte << 24", but I think you just make your lookup table for e^a be a map from a => e^(a<<24), and similar for the other bytes.)
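A rough sketch of how I imagine that looks in C (the table layout and fixed-point scaling are my guesses at the scheme, not the original radar code):

    #include <math.h>
    #include <stdint.h>

    static double T0[256], T1[256], T2[256], T3[256];

    void init_tables(void) {
        for (int i = 0; i < 256; i++) {
            T0[i] = exp(i / 256.0);          /* high byte: i * 2^-8  */
            T1[i] = exp(i / 65536.0);        /* i * 2^-16 */
            T2[i] = exp(i / 16777216.0);     /* i * 2^-24 */
            T3[i] = exp(i / 4294967296.0);   /* i * 2^-32 */
        }
    }

    double exp_frac(double x) {              /* 0 <= x < 1 */
        uint32_t q = (uint32_t)(x * 4294967296.0);   /* 32-bit fixed point */
        return T0[(q >> 24) & 0xff] * T1[(q >> 16) & 0xff]
             * T2[(q >>  8) & 0xff] * T3[ q        & 0xff];
    }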
When I took Huffman's Cybernetics course as an undergrad, his tests were full of multiplication and you couldn't use a calculator, so he provided a table of exponents and logarithms, and you'd look up in one table, sum the values, then look up the result in another table. I was not a fan.
This is literally what a slide rule is.
In their day, slide rules were the most efficient easily available option.
Slide rules were a great teaching aid, because they required you to (1) keep track of powers of 10 for yourself, and (2) understand the exponential difficulty of seeking more digits in a result.
The first point lets you guess the order of magnitude of a result before doing the detailed calculation, and that is a great way to find errors.
The second one reveals why teachers get so annoyed when 9-digit answers are given to problems with 2-digit data.
I do wish some company would start producing cheap plastic slide rules again, so my students could try them out. The online ones are not as effective as actually seeing something in the real world, squinting your eyes to guess where the cursor lies between two lines on the rule. (Hm, I can get more digits on the left end of the rule than I can on the right ...)
ThinkGeek sold a cheap plastic slide rule once. Its motion was too sticky for me to, well, stick with learning to use it. I agree there's a good product idea here if you can do it better.
A good YouTube channel “Welch labs” recently did a good history of the logarithm and how it revolutionized the speed at which people could do math
replaces short[65536] look up table
Is that not quite dim to begin with (having a LUT the size of the whole L1 cache?) or does it work surprisingly well because of some probabilistic fudging?
"Why you (probably) shouldn't use a lookup table" (https://specbranch.com/posts/lookup-tables/) gives some discussion about when it's appropriate generally. My narrow experience is that you can do a lot of real-time calculation before it's faster to do a lookup.
One case that doesn't mention explicitly: sometimes your problem is lucky enough that you have the choice between:
1. for each function, for each datum, call function with datum
2. for each datum, for each function, call function with datum
For 1, if your main data is fetched predictably but does not saturate bandwidth, it can be beneficial to use even a fairly large LUT. Note also that while the actual L1i footprint is semi-ignorable, things like branch prediction benefit much more from remaining "in cache".
For 2, assuming other functions also have their own miscellaneous data, the non-LUT approach is likely to win. But you should probably benchmark against #1.
But sometimes you can get the best of both worlds:
3. for each arbitrary chunk of data (smaller than L3, or maybe even L2 depending on when your throughput saturates), for each function, for each datum in the chunk, call function with datum (rough sketch below)
Yes, you should of course do whole-system benchmarks, but being able to guess helps you write a useful implementation to benchmark in the first place.
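To make #3 concrete, the loop shape I have in mind is roughly this (the CHUNK size and the function-pointer setup are just placeholders, not from any particular codebase):

    #include <stddef.h>

    #define CHUNK 4096   /* tune so a chunk stays resident in L2/L3 */

    typedef void (*kernel_fn)(float *data, size_t n);

    void process_all(float *data, size_t n, kernel_fn *fns, size_t nfns) {
        for (size_t start = 0; start < n; start += CHUNK) {
            size_t len = (n - start < CHUNK) ? (n - start) : CHUNK;
            for (size_t f = 0; f < nfns; f++)
                fns[f](data + start, len);   /* chunk is hot in cache for every fn */
        }
    }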
In the two cases you have specified here, #2 is almost always the winner on performance. So much so that in performance-sensitive code, many people will (justifiably) default to it without a benchmark. Computers almost always operate in a memory bandwidth bound state, and have comparatively idle cores, and #1 is likely to just be wasteful of the resource that will almost always be the binding constraint.
Examples are ECS systems in games, and async run-to-completion runtimes on servers. HPC systems also tend to operate this way.
Also, in the interest of disclosure, I wrote the blog post you are responding to.
For some parsing, serialization, or filtering tasks, you end up with quite large lookup tables of shuffles that would otherwise be pretty costly to compute on the fly, but real workloads hit only a tiny part of the lookup table at a time. (This comment fits entirely within his point about benchmarks, I guess)
The lookup table does surprisingly well because the workload is otherwise extremely cache-hostile, and it doesn't really matter if you blow up your L1 cache, none of the data you evicted because you needed to fit the LUT was ever going to be reused anyway.
ML loads in general are streaming loads that linearly load the entire dataset for every iteration.
One interesting option for these big memory scans on x86 and ARM CPUs is using the non-temporal load/store instructions. Those actually bypass caching* and may help with the cache pressure of LLM workloads that just do scans. The lookup table is still probably the wrong solution even with this sort of thing.
* Not quite all of it - There are still buffers to do write combining and some read caching on scans.
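For illustration, a streaming pass with non-temporal stores might look roughly like this with AVX2 intrinsics (my own sketch, not llama.cpp code; the destination must be 32-byte aligned and n a multiple of 8):

    #include <immintrin.h>
    #include <stddef.h>

    void scale_stream(float *dst, const float *src, float s, size_t n) {
        __m256 vs = _mm256_set1_ps(s);
        for (size_t i = 0; i < n; i += 8) {
            __m256 v = _mm256_loadu_ps(src + i);
            /* non-temporal store: the result bypasses the cache hierarchy */
            _mm256_stream_ps(dst + i, _mm256_mul_ps(v, vs));
        }
        _mm_sfence();  /* order NT stores before subsequent normal accesses */
    }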
off topic, but sheesh. I was skimming this and thought, "This seems like a crazy optimization. It's complex and the code in question has had a ton of eyes on it already." and then I saw the contributor and was like "Of course it's jart. It's always jart with the crazy good solutions."
well done :)
It's not something they teach in asymptotic analysis lectures, right? Which reminds me of a professor who famously said that the constant everyone's overlooking can, in engineering terms, very much eat your head.
A big C could be $3M in losses per year vs $2M in profits.
Or always coming in second on contracts or sales and getting $0.
Unrelated but I see this sentiment a lot. Algorithms (and related courses) are supposed to be largely mathematical/theory based so it sort of has to be hardware agnostic.
Imo an algorithms course doesn't really need to teach you about lowering the constant factor (unless the optimizations for lowering said factor are reasonably abstracted away from physical HW); that's the purpose of a Computer Architecture or Operating Systems course.
And realistically in the real world you'd be benchmarking these things anyway if a supposedly "optimal" algorithm is really the bottleneck, and you'd apply context specific optimizations to speed it up.
When I did tutoring for an algorithms class, we’d generally make gestures at the fact that the big-O cost doesn’t tell you everything. But I can’t think of any way to handle that on a systemic level. Just plot n vs t and observe that the big-O prediction comes true for large n, but it gets complicated for small n.
After all, nobody would have invented big-O if it wasn’t much easier than the alternative!
Mostly looks scary* because that's just how it is with intrinsics syntax in C and C++. As with many things there, this pain is mostly self-inflicted.
There are C++ libraries that allow for C#-style SIMD and hardware intrinsics syntax, as far as I'm aware. It comes at a disadvantage in that you can't directly look up mnemonics in the ISA documentation, though.
*not to dismiss the importance of the work done there, just highlighting that it could have been much more accessible to a wider audience. But I'm not going to suggest something that everyone here would consider preposterous, such as rewriting the inference back-end in C#, just yet.
adapted from arm limited optimized routine
Shoulders of giants and all that
(in llama.cpp, for CPU)
I developed this originally for llamafile, which was included in the last two releases: https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.2 Now we're upstreaming it to the llama.cpp project. There are other performance enhancements you can currently only get from llamafile, such as Kawrakow's work making K quants go much faster.
Wait, computing SiLU directly using some numerical analysis is probably a lot faster than doing an exp each time. Is there a significant perf impact to doing this?
With expf() most of the work had already been done and I could kill two birds with one stone. If you want to do the math for doing SiLU directly, that'd be an awesome change I'd happily merge into llamafile. You might even be able to get that into PyTorch and even bigger name projects.
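For reference, SiLU in terms of expf() is just x * sigmoid(x); a direct approximation, as the parent suggests, would replace this whole expression rather than just the exponential:

    #include <math.h>

    static inline float silu(float x) {
        return x / (1.0f + expf(-x));   /* x * sigmoid(x) */
    }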
Is that just because nobody has made an effort yet to port them upstream, or is there something inherently difficult about making those changes work in llama.cpp?
I get the impression most llama.cpp users are interested in running models on GPU. AFAICT this optimization is CPU-only. Don't get me wrong, it's a huge one! And it opens the door to running llama.cpp on more and more edge devices.
One can vectorize LUTs as well https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
I wrote about the kinds of things that are possible with LUTs awhile back https://darkcephas.blogspot.com/2018/10/validating-utf8-stri...
Yes, but a direct `exp` implementation is only like 10-20 FMAs depending on how much accuracy you want. No gathering or permuting will really compete with straight math.
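For a feel of what "straight math" means here, a bare-bones sketch: range-reduce to exp(x) = 2^k * exp(r) with |r| <= ln(2)/2, then a short polynomial for exp(r). I've used plain Taylor coefficients for readability; real implementations (e.g. the ARM optimized routines) use minimax coefficients and handle overflow and special cases.

    #include <math.h>

    float expf_sketch(float x) {
        const float ln2 = 0.69314718056f;
        float k = nearbyintf(x / ln2);
        float r = x - k * ln2;                 /* |r| <= ln2/2 */
        /* exp(r) ~= 1 + r + r^2/2 + r^3/6 + r^4/24 + r^5/120 (a handful of FMAs) */
        float p = 1.0f / 120.0f;
        p = p * r + 1.0f / 24.0f;
        p = p * r + 1.0f / 6.0f;
        p = p * r + 0.5f;
        p = p * r + 1.0f;
        p = p * r + 1.0f;
        return ldexpf(p, (int)k);              /* scale by 2^k */
    }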
With AVX-512 one can have a 128-byte table with one vector of lookups produced each cycle :)
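Concretely, once the 16 doubles are sitting in two vectors, each batch of lookups is a single in-register permute, something like:

    #include <immintrin.h>

    /* 16-entry double table (128 bytes) kept in registers; per-lane lookup
       is one vpermi2pd. Index lanes must be in [0, 15]. Illustrative only. */
    __m512d lut16_lookup(const double tab[16], __m512i idx) {
        __m512d lo = _mm512_loadu_pd(tab);       /* entries 0..7  */
        __m512d hi = _mm512_loadu_pd(tab + 8);   /* entries 8..15 */
        return _mm512_permutex2var_pd(lo, idx, hi);
    }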
Yes, I have an AVX512 double precision exp implementation that does this thanks to vpermi2pd. This approach was also recommended by the Intel optimization manual -- a great resource.
I just went with straight math for single-precision, though.
On similar lines, faster tanh https://github.com/microsoft/onnxruntime/pull/20612
Great work. But what's their goal? Are they trying to make that GeLU approximation go faster? Things would probably go a lot faster going back to the erff().
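For context, the two formulations in question (textbook definitions, not the onnxruntime code):

    #include <math.h>

    static inline float gelu_erf(float x) {       /* exact form */
        return 0.5f * x * (1.0f + erff(x * 0.70710678f));          /* x / sqrt(2) */
    }

    static inline float gelu_tanh(float x) {      /* common tanh approximation */
        const float c = 0.7978845608f;                             /* sqrt(2/pi) */
        return 0.5f * x * (1.0f + tanhf(c * (x + 0.044715f * x * x * x)));
    }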
At this point, is gguf/llama.cpp a more performant solution for unbatched inference on CUDA devices, or is exllamav2+flashattention still reigning supreme?
The difference is negligible on 2x 4090. There are more important differences like 4 bit KV cache.
This does help the gguf usecase of partial offloading to GPU? It'll help the CPU part be faster too?
I'm intimately familiar with these; I maintain ONNX and llama.cpp Flutter libraries across all 6 True Platforms.
Quick opinionated TL;DR:
- llama.cpp for LLMs; it can also do Whisper via its core dependency, GGML.
- ONNX for everything else.
- TF is the Apple of ML: it's great if you're completely wedded to the Google ML ecosystem. Virtually dead outside that. (Something absurd, like 94%, of HF models are PyTorch.)
- the only chance I'd have to do a direct comparison of inference performance is Whisper in ONNX vs. GGML; someone got my llama.cpp lib running with Whisper and didn't report a significant perf difference
I don’t think that’s a terribly fair description of Google - AWS’s chips (Inferentia and Trainium) both have robust XLA/JAX support. Plus JAX now exports MLIR, so there is a really compelling JAX -> IREE pipeline: JAX models can more or less be deployed anywhere, even on bare metal embedded devices.
You're right, if you need to go from data => model running in web app, I'd do TF - the inartful Apple analogy is meant to indicate that: great vertical integration.
For local inference of an existing model, TF Lite pales in comparison to ONNX. ONNX goes out of its way to get ~anything running ~anywhere on the best accelerator available on the platform.* AFAIK TF Lite only helps if your model was in TF.
And there simply isn't an LLM scene for TensorFlow, so it "loses" to llama.cpp for that. There isn't an ONNX LLM scene either, though. (see below)
* There's one key exception...until recently...LLMs! ONNX's model format was limited due to protobuf; IIRC the cap was 2-4 GB. Part of the Phi-3 announcement was a library they've been stubbing out on top of ONNX, but more specialized for LLMs. That being said, I haven't seen any LLMs in it except Phi-3, and it's an absolute mess: the library was announced weeks ahead of when it was planned to be released, and then throw in the standard 6-week slippage; I'm probably not trying it again until June.
I didn’t mention tensorflow or TF Lite? IREE is a different technology, it is essentially a compiler/VM for MLIR. Since JAX exports MLIR, you can run JAX on any platform supported by IREE (which is basically any platform from embedded hardware to datacenter GPUs). It is less mature than ONNX at this point, but much more promising when it comes to deploying on edge devices.
I know.
OP didn't mention JAX, or IREE.
I avoided having a knee-jerk "Incorrect, the thing you are talking about is off-topic" reaction because it's inimical to friendly conversation.
Mentioning ONNX and llama.cpp made it clear they were talking about inference. In that context, TF Lite isn't helpful unless you have TF. JAX is irrelevant. IREE is expressly not there yet[1] and has nothing to do with Google.
[1] cf. the GitHub "IREE is still in its early phase. We have settled down on the overarching infrastructure". From a llama.cpp/ONNX/local inference perspective, there's little more than "well, we'll get MLIR, then we can make N binaries for N model x arch x platform options." Doesn't sound so helpful to me, but I got it easy right now, models are treated as data instead of code in both ONNX and llama.cpp. I'm not certain that's the right thing long-term, e.g. it incentivizes "kitchen sink" ML frameworks, people always want me to add a model to the library, rather than use my library as a dependency.
Speaking as someone who deploys ML models on edge devices - having the MLIR and being able to make binaries for different platforms is terrifically useful! You’re often supporting a range of devices (e.g. video game consoles, phones, embedded, etc) and want to tune the compilation to maximize the performance for your deployment target. And there are a lot of algorithms/models in robotics and animation that simply won’t work with ONNX as they involve computing gradients (which torch cannot export, and the gradients ONNXRuntime give you only work in training sessions, which aren’t as widely supported).
Also, if you have JAX you can get TF and therefore TFLite. And IREE is part of the OpenXLA project, which Google started.
What are the True Platforms, in this case?
I'm guessing MacOS, Linux, Android, Windows, iOS, and Web?
Correct
What’re your thoughts on MLX? It’s been phenomenal on my MBP.
No time* to try it unfortunately :( Sounds great, though, and Mac just kicks the pants off every other platform on local inference thanks to Metal. I imagine MLX must extend that lead to the point that Qualcomm/Google have to make a serious investment in open source acceleration. The cheapest iPhone from 2022 kicks the most expensive Android from 2023 (Pixel Fold) around the block, 2x on inference: 12 tkns/s vs. 6/s.
* it sounded like a great idea to do an OpenAI LLM x search app. Then it sounded like a great idea to add embeddings locally for privacy (thus, ONNX). Then it sounded like a great idea to do a local LLM (thus, llama.cpp). Then it sounded like a great idea to differentiate by being on all platforms, supported equally. Really taxing. Think I went too far this time. It works but, jeez, the workload...hopefully, after release, it turns out maintenance load is relatively low
On what hardware exactly?
Probably should have specified that! I’m referring to cpu inference.
On what CPU?
Typically I'm considering embedded applications (armv8, armv7)