
New exponent functions that make SiLU and SoftMax 2x faster, at full accuracy

koe123
15 replies
21h37m

Perhaps somewhat off topic, but does anyone know how stuff like GGML compares to runtimes (TensorFlow Lite, ONNX Runtime, etc.)?

refulgentis
10 replies
21h16m

I'm intimately familiar; I maintain ONNX and llama.cpp Flutter libraries across all 6 True Platforms.

Quick opinionated TL;DR:

- llama.cpp for LLMs; its core dependency, GGML, can also run Whisper.

- ONNX for everything else.

- TF is the Apple of ML: great if you're completely wedded to the Google ML ecosystem, virtually dead outside it. (Something absurd, like 94%, of HF models are PyTorch.)

- The only direct inference-performance comparison I could do is Whisper in ONNX vs. GGML; someone got my llama.cpp lib running with Whisper and didn't report a significant perf difference.

catgary
4 replies
19h54m

I don't think that's a terribly fair description of Google - AWS's chips (Inferentia and Trainium) both have robust XLA/JAX support. Plus JAX now exports MLIR, so there's a really compelling JAX -> IREE pipeline that lets JAX models be deployed more or less anywhere, even on bare-metal embedded devices.

refulgentis
3 replies
19h41m

You're right; if you need to go from data => model running in a web app, I'd do TF. The inartful Apple analogy was meant to indicate exactly that: great vertical integration.

For local inference of an existing model, TF Lite pales in comparison to ONNX. ONNX goes out of its way to get ~anything running ~anywhere, on the best accelerator available on the platform.* AFAIK TF Lite only helps if your model was in TF.

And there simply isn't an LLM scene for TensorFlow, so it "loses" to llama.cpp for that. There isn't an ONNX LLM scene either, though. (see below)

* There's one key exception, until recently: LLMs! ONNX's model format was size-limited because of protobuf; IIRC the cap was 2-4 GB. Part of the Phi-3 announcement was a library they've been stubbing out on top of ONNX, more specialized for LLMs. That said, I haven't seen any LLMs in it except Phi-3, and it's an absolute mess: the library was announced weeks ahead of when it was planned to be released, and once you throw in the standard 6-week slippage, I'm probably not trying it again until June.

catgary
2 replies
19h19m

I didn’t mention tensorflow or TF Lite? IREE is a different technology, it is essentially a compiler/VM for MLIR. Since JAX exports MLIR, you can run JAX on any platform supported by IREE (which is basically any platform from embedded hardware to datacenter GPUs). It is less mature than ONNX at this point, but much more promising when it comes to deploying on edge devices.

refulgentis
1 replies
19h6m

I didn’t mention tensorflow or TF Lite

I know.

OP didn't mention JAX, or IREE.

I avoided having a knee-jerk "Incorrect, the thing you are talking about is off-topic" reaction because it's inimical to friendly conversation.

Mentioning ONNX and llama.cpp made it clear they were talking about inference. In that context, TF Lite isn't helpful unless you have TF. JAX is irrelevant. IREE is expressly not there yet[1] and has nothing to do with Google.

[1] cf. the GitHub: "IREE is still in its early phase. We have settled down on the overarching infrastructure". From a llama.cpp/ONNX/local-inference perspective, there's little more than "well, we'll get MLIR, then we can make N binaries for N model x arch x platform options." That doesn't sound so helpful to me, but I've got it easy right now: models are treated as data instead of code in both ONNX and llama.cpp. I'm not certain that's the right thing long-term, e.g. it incentivizes "kitchen sink" ML frameworks; people always want me to add a model to the library, rather than use my library as a dependency.

catgary
0 replies
17h59m

Speaking as someone who deploys ML models on edge devices - having the MLIR and being able to make binaries for different platforms is terrifically useful! You're often supporting a range of devices (e.g. video game consoles, phones, embedded) and want to tune the compilation to maximize performance for your deployment target. And there are a lot of algorithms/models in robotics and animation that simply won't work with ONNX because they involve computing gradients (which torch cannot export, and the gradients ONNXRuntime gives you only work in training sessions, which aren't as widely supported).

Also, if you have JAX you can get TF and therefore TFLite. And IREE is part of the OpenXLA project, which Google started.

svnt
2 replies
20h52m

What are the True Platforms, in this case?

crthpl
1 replies
20h46m

I'm guessing MacOS, Linux, Android, Windows, iOS, and Web?

refulgentis
0 replies
19h53m

Correct

a_wild_dandan
1 replies
20h5m

What’re your thoughts on MLX? It’s been phenomenal on my MBP.

refulgentis
0 replies
19h34m

No time* to try it unfortunately :( It sounds great, though, and Mac just kicks the pants off every other platform on local inference thanks to Metal. I imagine MLX extends that lead to the point where Qualcomm/Google have to make a serious investment in open-source acceleration. The cheapest iPhone from 2022 kicks the most expensive Android from 2023 (Pixel Fold) around the block, 2x on inference: 12 tokens/s vs. 6.

* It sounded like a great idea to do an OpenAI LLM x search app. Then it sounded like a great idea to add local embeddings for privacy (thus, ONNX). Then it sounded like a great idea to do a local LLM (thus, llama.cpp). Then it sounded like a great idea to differentiate by being on all platforms, supported equally. Really taxing. Think I went too far this time. It works but, jeez, the workload... Hopefully, after release, it turns out the maintenance load is relatively low.

ilaksh
3 replies
21h26m

On what hardware exactly?

koe123
2 replies
20h17m

Probably should have specified that! I'm referring to CPU inference.

ilaksh
1 replies
18h37m

On what CPU?

koe123
0 replies
10h47m

Typically I'm considering embedded applications (ARMv8, ARMv7).

DeathArrow
8 replies
12h22m

Even with all such optimizations, isn't Llama too slow to run on a CPU with a reasonable number of parameters?

jart
5 replies
11h54m

I'm GPU poor, so I use the CPU to run large models. For example, llamafile running Mixtral 8x22B on my Threadripper can chew through long legal documents and give me advice in a few minutes, using f16 weights and Kahan summation. Show me someone who has that many graphics cards.
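
For anyone who hasn't run into it, Kahan summation is compensated summation: you carry a correction term so the low-order bits lost by each addition get fed back in. A minimal scalar sketch of the idea (mine, not the actual llamafile code):

    #include <cstddef>

    // Minimal sketch of Kahan (compensated) summation. Useful when
    // accumulating many small values (e.g. products of f16 weights) into a
    // float, where a naive sum would steadily lose low-order bits.
    float kahan_sum(const float *x, std::size_t n) {
        float sum = 0.0f;
        float c = 0.0f;                // running compensation (lost bits)
        for (std::size_t i = 0; i < n; ++i) {
            float y = x[i] - c;        // apply the previous correction
            float t = sum + y;         // low-order bits of y may be lost here
            c = (t - sum) - y;         // recover exactly what was lost
            sum = t;
        }
        return sum;                    // note: -ffast-math would optimize c away
    }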

bsenftner
3 replies
6h56m

Are you "gpu poor" by choice? I would have thought you, of all people, would be "gpu wealthy" like few can imagine.

jart
2 replies
6h20m

I'm a hobbyist developer. I haven't been employed in six years. In the past few months I've been fortunate enough to find work as a TVC for Mozilla, which has been wonderful. It's a good lifestyle that affords me a great deal of freedom and autonomy, but certainly not a lot of GPUs. I have an RTX 4090, a few XTX cards, and I rent stuff from Vast occasionally, but that's about it.

baobabKoodaa
1 replies
3h6m

I love how an RTX 4090 is now "GPU poor".

jart
0 replies
2h52m

It can't run most of the open source models being released today. It's not useful if your job is to keep up with LLM releases.

kkzz99
0 replies
8h10m

What are your tokens/s on that setup?

nxpnsv
0 replies
12h17m

I don't see how that conclusion follows from this optimization.

baq
0 replies
12h15m

Models get better at equal parameter counts, CPUs get faster, and the good folks at Intel, AMD, and Arm are probably working very hard to catch up with Apple silicon in terms of memory architecture. I can see this being very relevant in a couple of years.

mysteria
7 replies
20h24m

How much do these SiLU and softmax improvements affect LLM inference speed as a whole? Correct me if I'm wrong, but I'd expect this change to have only a small effect, since the majority of the time is spent doing matrix multiplications.

terafo
6 replies
19h26m

The overwhelming majority of FLOPs is indeed spent on matmuls, but softmax disproportionately uses memory bandwidth, so it generally takes much longer than you'd expect from looking at FLOPs alone.

cgearhart
4 replies
19h11m

Why does it disproportionately use bandwidth?

jacobn
3 replies
18h15m

In transformers the attention matrix is N×N, so there are a lot of values to go over. That typically makes it memory-bandwidth bound, not compute bound.

bjornsing
1 replies
14h48m

Wouldn’t the softmax typically be “fused” with the matmul though?

anewhnaccount2
0 replies
13h7m

Yes, but as far as I understand this is only really usefully possible with FlashAttention. (The main idea is that you have to use the log-sum-exp trick when computing the softmax, but you don't know the max activation up front when streaming through the values, so you have to rescale the partial results as you go.)
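
For the curious, a rough sketch of that online rescaling (my illustration of the idea, not FlashAttention's code):

    #include <cmath>
    #include <cstddef>

    // Streaming ("online") softmax: keep a running max m and a running sum s
    // of exp(x[i] - m); whenever a larger max shows up, rescale the sum
    // accumulated so far instead of making an extra pass to find the max.
    void online_softmax(const float *x, float *out, std::size_t n) {
        float m = -INFINITY;  // running max
        float s = 0.0f;       // running sum of exp(x[i] - m)
        for (std::size_t i = 0; i < n; ++i) {
            float m_new = x[i] > m ? x[i] : m;
            s = s * std::exp(m - m_new) + std::exp(x[i] - m_new);
            m = m_new;
        }
        for (std::size_t i = 0; i < n; ++i)
            out[i] = std::exp(x[i] - m) / s;  // numerically safe normalization
    }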

cgearhart
0 replies
17h56m

Oooooh, I forgot that the self attention layer has a softmax. I thought this was referring to a softmax on the dense forward layer. Thanks!

Next question: does the softmax in the SA block cause it to be bandwidth bound - won't it have to materialize all N^2 entries of the matrix either way? Does the softmax cause redundant data reads?

tehsauce
0 replies
18h33m

If CPU softmax were limited by memory bandwidth, these vectorization optimizations wouldn't improve performance.

mjcohen
7 replies
19h50m

About 20 years ago, I was programming for the Hughes radar signal processor, a highly parallel pipelined machine that accounted for much of Hughes' success in radar processing. Anyway, I needed to compute e^x for 0 < x < 1. The processor had a multiply, so I used four 256-entry tables of e^x, one for each possible 8-bit value in the 4 blocks of the 32-bit word, and multiplied the table entries to get the final value. It was about 5 times as fast as the previous best e^x routine. That machine was fun! It is obsolete now, but for many years it could process radar signals faster than processors that were nominally many times faster.

evmar
6 replies
18h4m

In case anyone else wasn't quite following this, I think the idea is

e^x

= e^(a+b+c+d) (where a, b, c, d are the byte fields of x)

= e^a * e^b * e^c * e^d

and then you make a lookup table for each of those e^a, e^b values.

(I fudged "a" here, where it's really more like "high byte << 24", but I think you just make your lookup table for e^a be a map from a => e^(a<<24), and similar for the other bytes.)
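
A toy reconstruction in C++, treating x in [0, 1) as a 32-bit fixed-point fraction (my own sketch of the idea, obviously not the Hughes code):

    #include <cmath>
    #include <cstdint>

    static double tab[4][256];

    // Byte k of the fraction has weight 2^-(8*(k+1)), so table k stores
    // e^(i * 2^-(8*(k+1))) for i = 0..255.
    void init_exp_tables() {
        double scale = 1.0 / 256.0;
        for (int k = 0; k < 4; ++k) {
            for (int i = 0; i < 256; ++i)
                tab[k][i] = std::exp(i * scale);
            scale /= 256.0;
        }
    }

    // e^x for x = frac / 2^32, i.e. 0 <= x < 1: four lookups, three multiplies.
    double exp_frac(std::uint32_t frac) {
        return tab[0][(frac >> 24) & 0xff] *
               tab[1][(frac >> 16) & 0xff] *
               tab[2][(frac >>  8) & 0xff] *
               tab[3][ frac        & 0xff];
    }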

dekhn
5 replies
16h46m

When I took Huffman's Cybernetics course as an undergrad, his tests were full of multiplication and you couldn't use a calculator, so he provided a table of exponents and logarithms, and you'd look up in one table, sum the values, then look up the result in another table. I was not a fan.

gdevenyi
4 replies
14h17m

This is literally what a slide rule is.

bigbacaloa
2 replies
12h39m

In their day, slide rules were the most efficient easily available option.

bluenose69
1 replies
6h51m

Slide rules were a great teaching aid, because they required you to (1) keep track of powers of 10 for yourself, and (2) understand the exponential difficulty of seeking more digits in a result.

The first point lets you guess the order of magnitude of a result before doing the detailed calculation, and that is a great way to find errors.

The second one reveals why teachers get so annoyed when 9-digit answers are given to problems with 2-digit data.

I do wish some company would start producing cheap plastic slide rules again, so my students could try them out. The online ones are not as effective as actually seeing something in the real world, squinting your eyes to guess where the cursor lies between two lines on the rule. (Hm, I can get more digits on the left end of the rule than I can on the right ...)

abecedarius
0 replies
3h23m

ThinkGeek sold a cheap plastic slide rule once. Its motion was too sticky for me to, well, stick with learning to use it. I agree there's a good product idea here if you can do it better.

acc_297
0 replies
6h0m

The YouTube channel "Welch Labs" recently did a good history of the logarithm and how it revolutionized the speed at which people could do math.

mhh__
6 replies
20h26m

replaces short[65536] look up table

Is that not quite dim to begin with (having a LUT the size of the whole L1 cache), or does it work surprisingly well because of some probabilistic fudging?

andy99
3 replies
20h13m

"Why you (probably) shouldn't use a lookup table" (https://specbranch.com/posts/lookup-tables/) gives some discussion about when it's appropriate generally. My narrow experience is that you can do a lot of real-time calculation before it's faster to do a lookup.

o11c
1 replies
19h24m

One case that post doesn't mention explicitly: sometimes your problem is lucky enough that you have the choice between:

1. for each function, for each datum, call function with datum

2. for each datum, for each function, call function with datum

For 1, if your main data is fetched predictably but does not saturate bandwidth, it can be beneficial to use even a fairly large LUT. Note also that while actual L1i pressure is semi-ignorable, things like branch prediction benefit much more from remaining "in cache".

For 2, assuming other functions also have their own miscellaneous data, the non-LUT approach is likely to win. But you should probably benchmark against #1.

But sometimes you can get the best of both worlds (sketched after this list):

3. for each arbitrary chunk of data (smaller than L3, or maybe even L2 depending on when your throughput saturates), for each function, for each datum in the chunk, call function with datum

Yes, you should of course do whole-system benchmarks, but being able to guess helps you write a useful implementation to benchmark in the first place.
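
Roughly, #3 looks like this (a hypothetical sketch; `functions`, `data`, and the chunk size are placeholders, not anything from a real codebase):

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Ordering #3: walk the data in cache-sized chunks, and run every
    // function over a chunk before moving on. Each function's tables/state
    // stay hot while it runs, and the chunk itself stays resident in cache
    // across the whole function loop.
    void apply_all(const std::vector<std::function<void(float &)>> &functions,
                   std::vector<float> &data) {
        const std::size_t chunk = 64 * 1024;  // ~256 KiB of floats; tune to L2/L3
        for (std::size_t i = 0; i < data.size(); i += chunk) {
            const std::size_t end = std::min(i + chunk, data.size());
            for (const auto &f : functions)
                for (std::size_t j = i; j < end; ++j)
                    f(data[j]);
        }
    }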

pclmulqdq
0 replies
18h40m

In the two cases you have specified here, #2 is almost always the winner on performance. So much so that in performance-sensitive code, many people will (justifiably) default to it without a benchmark. Computers almost always operate in a memory bandwidth bound state, and have comparatively idle cores, and #1 is likely to just be wasteful of the resource that will almost always be the binding constraint.

Examples are ECS systems in games, and async run-to-completion runtimes on servers. HPC systems also tend to operate this way.

Also, in the interest of disclosure, I wrote the blog post you are responding to.

anonymoushn
0 replies
14h41m

For some parsing, serialization, or filtering tasks, you end up with quite large lookup tables of shuffles that would otherwise be pretty costly to compute on the fly, but real workloads hit only a tiny part of the lookup table at a time. (This comment fits entirely within his point about benchmarks, I guess)

Tuna-Fish
1 replies
19h49m

The lookup table does surprisingly well because the workload is otherwise extremely cache-hostile; it doesn't really matter if you blow up your L1 cache, because none of the data you evicted to make room for the LUT was ever going to be reused anyway.

ML loads in general are streaming loads that linearly load the entire dataset for every iteration.

pclmulqdq
0 replies
18h53m

One interesting option for these big memory scans on x86 and ARM CPUs is using the non-temporal load/store instructions. Those actually bypass caching* and may help with the cache pressure of LLM workloads that just do scans. The lookup table is still probably the wrong solution even with this sort of thing.

* Not quite all of it - there are still buffers that do write combining, and some read caching on scans.
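
A minimal sketch of what that looks like with AVX intrinsics (my illustration, assuming AVX2 and 32-byte-aligned buffers; not from any particular codebase):

    #include <immintrin.h>
    #include <cstddef>

    // Stream src through the core and write a scaled copy to dst without
    // keeping either in cache: MOVNTDQA for the non-temporal load hint
    // (on ordinary write-back memory it is mostly just a hint), MOVNTPS
    // for the non-temporal (write-combining) store.
    void scale_stream(const float *src, float *dst, std::size_t n, float s) {
        const __m256 vs = _mm256_set1_ps(s);
        for (std::size_t i = 0; i + 8 <= n; i += 8) {
            __m256i raw = _mm256_stream_load_si256((const __m256i *)(src + i));
            __m256 v = _mm256_castsi256_ps(raw);
            _mm256_stream_ps(dst + i, _mm256_mul_ps(v, vs));
        }
        _mm_sfence();  // make the streaming stores globally visible
    }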

SeanAnderson
6 replies
20h38m

off topic, but sheesh. I was skimming this and thought, "This seems like a crazy optimization. It's complex and the code in question has had a ton of eyes on it already." and then I saw the contributor and was like "Of course it's jart. It's always jart with the crazy good solutions."

well done :)

larodi
3 replies
17h26m

It's not something they teach in asymptotic analysis lectures, right? Which reminds me of a professor who famously said: that constant everyone's just overlooking can, in engineering terms, very much eat your head.

hinkley
0 replies
14h15m

A big C could be $3M in losses per year vs $2M in profits.

Or always coming in second on contracts or sales and getting $0.

boshalfoshal
0 replies
16h58m

Unrelated but I see this sentiment a lot. Algorithms (and related courses) are supposed to be largely mathematical/theory based so it sort of has to be hardware agnostic.

Imo an algorithms course doesn't really need to teach you about lowering the constant factor (unless the optimizations for lowering said factor are reasonably abstracted away from the physical HW); that's the purpose of a Computer Architecture or Operating Systems course.

And realistically in the real world you'd be benchmarking these things anyway if a supposedly "optimal" algorithm is really the bottleneck, and you'd apply context specific optimizations to speed it up.

bee_rider
0 replies
40m

When I did tutoring for an algorithms class, we'd generally make gestures at the fact that the big-O cost doesn't tell you everything. But I can't think of any way to handle that on a systemic level. Just plot n vs. t and observe that the big-O prediction comes true for large n, but it gets complicated for small n.

After all, nobody would have invented big-O if it wasn’t much easier than the alternative!

neonsunset
0 replies
20h13m

Mostly looks scary* because that's just how it is with intrinsics syntax in C and C++. As with many things there, this pain is mostly self-inflicted.

There are C++ libraries that allow for C#-style SIMD and hardware intrinsics syntax, as far as I'm aware. The disadvantage is that you can't directly look up mnemonics in the ISA documentation, though.

* Not to seem as if I'm dismissing the importance of the work done there, just highlighting that it could have been much more accessible to a wider audience. But I'm not gonna suggest something everyone here would consider preposterous, such as rewriting the inference back-end in C#, just yet.

mhh__
0 replies
20h28m

adapted from arm limited optimized routine

Shoulders of giants and all that

a_wild_dandan
5 replies
21h24m

(in llama.cpp, for CPU)

jart
4 replies
20h58m

I developed this originally for llamafile; it was included in the last two releases: https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.2 Now we're upstreaming it to the llama.cpp project. There are other performance enhancements you can currently only get from llamafile, such as Kawrakow's work making K quants go much faster.

pclmulqdq
1 replies
18h17m

Wait, computing SiLU directly using some numerical analysis is probably a lot faster than doing an exp each time. Is there a significant perf impact to doing this?

jart
0 replies
18h13m

With expf() most of the work had already been done and I could kill two birds with one stone. If you want to do the math for doing SiLU directly, that'd be an awesome change I'd happily merge into llamafile. You might even be able to get that into PyTorch and even bigger name projects.

breakingcups
1 replies
20h41m

Is that just because nobody has made an effort yet to port them upstream, or is there something inherently difficult about making those changes work in llama.cpp?

leblancfg
0 replies
5h50m

I get the impression most llama.cpp users are interested in running models on GPU, and AFAICT this optimization is CPU-only. Don't get me wrong, it's a huge one! And it opens the door to running llama.cpp on more and more edge devices.

boulos
2 replies
15h2m

Yes, but a direct `exp` implementation is only like 10-20 FMAs depending on how much accuracy you want. No gathering or permuting will really compete with straight math.
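
For reference, the usual shape of such a routine, as a rough scalar sketch: range-reduce, short polynomial, scale by 2^n. The coefficients below are plain Taylor terms rather than the tuned minimax ones a real implementation would use, and overflow/underflow handling is omitted:

    #include <cmath>
    #include <cstdint>
    #include <cstring>

    float expf_sketch(float x) {
        // Split x = n*ln2 + r with r in [-ln2/2, ln2/2]; ln2 is split into
        // hi/lo parts so the reduction stays accurate in float.
        const float log2e  = 1.4426950409f;
        const float ln2_hi = 6.93145752e-1f;
        const float ln2_lo = 1.42860677e-6f;
        float n = std::nearbyint(x * log2e);
        float r = std::fma(n, -ln2_hi, x);
        r = std::fma(n, -ln2_lo, r);

        // Degree-5 polynomial for e^r, evaluated with FMAs (Horner form).
        float p = 1.0f / 120.0f;
        p = std::fma(p, r, 1.0f / 24.0f);
        p = std::fma(p, r, 1.0f / 6.0f);
        p = std::fma(p, r, 0.5f);
        p = std::fma(p, r, 1.0f);
        p = std::fma(p, r, 1.0f);

        // Multiply by 2^n by constructing the float exponent bits directly.
        std::uint32_t bits = (std::uint32_t)((int)n + 127) << 23;
        float scale;
        std::memcpy(&scale, &bits, sizeof scale);
        return p * scale;
    }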

janwas
1 replies
8h9m

With AVX-512 one can have a 128-byte table with one vector of lookups produced each cycle :)
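
Concretely, something like this (my illustration): the 16-entry double table lives in two zmm registers, and _mm512_permutex2var_pd (vpermi2pd) pulls one entry per 64-bit lane with no memory access at all:

    #include <immintrin.h>

    // idx holds a 4-bit index (0..15) in each 64-bit lane: bit 3 selects
    // tab_lo vs tab_hi, bits 2:0 select the element within that register.
    __m512d lut16_lookup(__m512i idx, __m512d tab_lo, __m512d tab_hi) {
        return _mm512_permutex2var_pd(tab_lo, idx, tab_hi);
    }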

celrod
0 replies
7h47m

Yes, I have an AVX-512 double-precision exp implementation that does this thanks to vpermi2pd. This approach was also recommended by the Intel optimization manual -- a great resource.

I just went with straight math for single-precision, though.

jart
0 replies
12h29m

Great work. But what's their goal? Are they trying to make that GeLU approximation go faster? Things would probably go a lot faster by going back to erff().

lxe
1 replies
19h52m

At this point, is gguf/llama.cpp a more performant solution for unbatched inference on CUDA devices, or is exllamav2+flashattention still reigning supreme?

GuuD
0 replies
19h27m

The difference is negligible on 2x 4090. There are more important differences, like the 4-bit KV cache.

KaoruAoiShiho
0 replies
19h39m

Does this help the GGUF use case of partial offloading to GPU? Will it make the CPU part faster too?