
New exponent functions that make SiLU and SoftMax 2x faster, at full accuracy

koe123
15 replies
21h37m

Perhaps somewhat off topic, but does anyone know how stuff like GGML compares to runtimes (TensorFlow Lite, ONNX Runtime, etc.)?

refulgentis
10 replies
21h16m

I'm intimately familiar; I maintain ONNX and llama.cpp Flutter libraries across all 6 True Platforms.

Quick opinionated TL;DR:

- llama.cpp for LLMs; its core dependency, GGML, can also run Whisper.

- ONNX for everything else.

- TF is the Apple of ML: great if you're completely wedded to the Google ML ecosystem, virtually dead outside it. (Something absurd, like 94%, of HF models are PyTorch.)

- The only direct inference-performance comparison I could do is Whisper in ONNX vs. GGML; someone got my llama.cpp lib running with Whisper and didn't report a significant perf difference.

catgary
4 replies
19h54m

I don't think that's a terribly fair description of Google - AWS's chips (Inferentia and Trainium) both have robust XLA/JAX support. Plus JAX now exports MLIR, so there's a really compelling JAX -> IREE pipeline that lets JAX models be deployed more or less anywhere, even on bare-metal embedded devices.

refulgentis
3 replies
19h41m

You're right; if you need to go from data => model running in a web app, I'd do TF. The inartful Apple analogy was meant to indicate exactly that: great vertical integration.

For local inference of an existing model, TF Lite pales in comparison to ONNX. ONNX goes out of its way to get ~anything running ~anywhere, on the best accelerator available on the platform.* AFAIK TF Lite only helps if your model was in TF.

And there simply isn't an LLM scene for TensorFlow, so it "loses" to llama.cpp for that. There isn't an ONNX LLM scene either, though. (see below)

* There's one key exception, until recently: LLMs! ONNX's model format was size-limited because of protobuf; IIRC the cap was 2-4 GB. Part of the Phi-3 announcement was a library they've been stubbing out on top of ONNX, more specialized for LLMs. That said, I haven't seen any LLMs in it except Phi-3, and it's an absolute mess: the library was announced weeks ahead of when it was planned to be released, and once you throw in the standard 6-week slippage, I'm probably not trying it again until June.

catgary
2 replies
19h19m

I didn’t mention tensorflow or TF Lite? IREE is a different technology, it is essentially a compiler/VM for MLIR. Since JAX exports MLIR, you can run JAX on any platform supported by IREE (which is basically any platform from embedded hardware to datacenter GPUs). It is less mature than ONNX at this point, but much more promising when it comes to deploying on edge devices.

refulgentis
1 replies
19h6m

I didn’t mention tensorflow or TF Lite

I know.

OP didn't mention JAX, or IREE.

I avoided having a knee-jerk "Incorrect, the thing you are talking about is off-topic" reaction because it's inimical to friendly conversation.

Mentioning ONNX and llama.cpp made it clear they were talking about inference. In that context, TF Lite isn't helpful unless you have TF. JAX is irrelevant. IREE is expressly not there yet[1] and has nothing to do with Google.

[1] cf. the GitHub: "IREE is still in its early phase. We have settled down on the overarching infrastructure". From a llama.cpp/ONNX/local-inference perspective, there's little more than "well, we'll get MLIR, then we can make N binaries for N model x arch x platform options." That doesn't sound so helpful to me, but I've got it easy right now: models are treated as data instead of code in both ONNX and llama.cpp. I'm not certain that's the right thing long-term, e.g. it incentivizes "kitchen sink" ML frameworks; people always want me to add a model to the library, rather than use my library as a dependency.

catgary
0 replies
17h59m

Speaking as someone who deploys ML models on edge devices - having the MLIR and being able to make binaries for different platforms is terrifically useful! You're often supporting a range of devices (e.g. video game consoles, phones, embedded) and want to tune the compilation to maximize performance for your deployment target. And there are a lot of algorithms/models in robotics and animation that simply won't work with ONNX because they involve computing gradients (which torch cannot export, and the gradients ONNXRuntime gives you only work in training sessions, which aren't as widely supported).

Also, if you have JAX you can get TF and therefore TFLite. And IREE is part of the OpenXLA project, which Google started.

svnt
2 replies
20h52m

What are the True Platforms, in this case?

crthpl
1 replies
20h46m

I'm guessing MacOS, Linux, Android, Windows, iOS, and Web?

refulgentis
0 replies
19h53m

Correct

a_wild_dandan
1 replies
20h5m

What’re your thoughts on MLX? It’s been phenomenal on my MBP.

refulgentis
0 replies
19h34m

No time* to try it unfortunately :( It sounds great, though, and Mac just kicks the pants off every other platform on local inference thanks to Metal. I imagine MLX extends that lead to the point where Qualcomm/Google have to make a serious investment in open-source acceleration. The cheapest iPhone from 2022 kicks the most expensive Android from 2023 (Pixel Fold) around the block, 2x on inference: 12 tokens/s vs. 6.

* It sounded like a great idea to do an OpenAI LLM x search app. Then it sounded like a great idea to add local embeddings for privacy (thus, ONNX). Then it sounded like a great idea to do a local LLM (thus, llama.cpp). Then it sounded like a great idea to differentiate by being on all platforms, supported equally. Really taxing. Think I went too far this time. It works but, jeez, the workload... Hopefully, after release, it turns out the maintenance load is relatively low.

ilaksh
3 replies
21h26m

On what hardware exactly?

koe123
2 replies
20h17m

Probably should have specified that! I'm referring to CPU inference.

ilaksh
1 replies
18h37m

On what CPU?

koe123
0 replies
10h47m

Typically I'm considering embedded applications (ARMv8, ARMv7).

DeathArrow
8 replies
12h22m

Even with all such optimizations, isn't Llama too slow to run on a CPU with a reasonable number of parameters?

jart
5 replies
11h54m

I'm GPU poor, so I use the CPU to run large models. For example, llamafile running Mixtral 8x22B on my Threadripper can chew through long legal documents and give me advice in a few minutes, using f16 weights and Kahan summation. Show me someone who has that many graphics cards.
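
For anyone who hasn't run into it, Kahan summation is compensated summation: you carry a correction term so the low-order bits lost by each addition get fed back in. A minimal scalar sketch of the idea (mine, not the actual llamafile code):

    #include <cstddef>

    // Minimal sketch of Kahan (compensated) summation. Useful when
    // accumulating many small values (e.g. products of f16 weights) into a
    // float, where a naive sum would steadily lose low-order bits.
    float kahan_sum(const float *x, std::size_t n) {
        float sum = 0.0f;
        float c = 0.0f;                // running compensation (lost bits)
        for (std::size_t i = 0; i < n; ++i) {
            float y = x[i] - c;        // apply the previous correction
            float t = sum + y;         // low-order bits of y may be lost here
            c = (t - sum) - y;         // recover exactly what was lost
            sum = t;
        }
        return sum;                    // note: -ffast-math would optimize c away
    }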

bsenftner
3 replies
6h56m

Are you "gpu poor" by choice? I would have thought you, of all people, would be "gpu wealthy" like few can imagine.

jart
2 replies
6h20m

I'm a hobbyist developer. I haven't been employed in six years. In the past few months I've been fortunate enough to find work as a TVC for Mozilla, which has been wonderful. It's a good lifestyle that affords me a great deal of freedom and autonomy, but certainly not a lot of GPUs. I have an RTX 4090, a few XTX cards, and I rent stuff from Vast occasionally, but that's about it.

baobabKoodaa
1 replies
3h6m

I love how an RTX 4090 is now "GPU poor".

jart
0 replies
2h52m

It can't run most of the open source models being released today. It's not useful if your job is to keep up with LLM releases.

kkzz99
0 replies
8h10m

What are your tokens/s on that setup?

nxpnsv
0 replies
12h17m

I don't see how that conclusion follows from this optimization.

baq
0 replies
12h15m

Models get better at equal parameter counts, CPUs get faster, and the good folks at Intel, AMD, and Arm are probably working very hard to catch up with Apple silicon in terms of memory architecture. I can see this being very relevant in a couple of years.

mysteria
7 replies
20h24m

How much do these SiLU and softmax improvements affect LLM inference speed as a whole? Correct me if I'm wrong, but I'd expect this change to have only a small effect, since the majority of the time is spent doing matrix multiplications.

terafo
6 replies
19h26m

The overwhelming majority of FLOPs is indeed spent on matmuls, but softmax disproportionately uses memory bandwidth, so it generally takes much longer than you'd expect from looking at FLOPs alone.

cgearhart
4 replies
19h11m

Why does it disproportionately use bandwidth?

jacobn
3 replies
18h15m

In transformers the attention matrix is N×N, so there are a lot of values to go over. That typically makes it memory-bandwidth bound, not compute bound.

bjornsing
1 replies
14h48m

Wouldn’t the softmax typically be “fused” with the matmul though?

anewhnaccount2
0 replies
13h7m

Yes, but as far as I understand this is only really usefully possible with FlashAttention. (The main idea is that you have to use the log-sum-exp trick when computing the softmax, but you don't know the max activation up front when streaming through the values, so you have to rescale the partial results as you go.)
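
For the curious, a rough sketch of that online rescaling (my illustration of the idea, not FlashAttention's code):

    #include <cmath>
    #include <cstddef>

    // Streaming ("online") softmax: keep a running max m and a running sum s
    // of exp(x[i] - m); whenever a larger max shows up, rescale the sum
    // accumulated so far instead of making an extra pass to find the max.
    void online_softmax(const float *x, float *out, std::size_t n) {
        float m = -INFINITY;  // running max
        float s = 0.0f;       // running sum of exp(x[i] - m)
        for (std::size_t i = 0; i < n; ++i) {
            float m_new = x[i] > m ? x[i] : m;
            s = s * std::exp(m - m_new) + std::exp(x[i] - m_new);
            m = m_new;
        }
        for (std::size_t i = 0; i < n; ++i)
            out[i] = std::exp(x[i] - m) / s;  // numerically safe normalization
    }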

cgearhart
0 replies
17h56m

Oooooh, I forgot that the self attention layer has a softmax. I thought this was referring to a softmax on the dense forward layer. Thanks!

Next question: does the softmax in the SA block cause it to be bandwidth bound - won't it have to materialize all N^2 entries of the matrix either way? Does the softmax cause redundant data reads?

tehsauce
0 replies
18h33m

If CPU softmax were limited by memory bandwidth, these vectorization optimizations wouldn't improve performance.

mjcohen
7 replies
19h50m

About 20 years ago, I was programming for the Hughes radar signal processor, a highly parallel pipelined machine that accounted for much of Hughes' success in radar processing. Anyway, I needed to compute e^x for 0 < x < 1. The processor had a multiply, so I used four 256-entry tables of e^x, one for each possible 8-bit value in the 4 blocks of the 32-bit word, and multiplied the table entries to get the final value. It was about 5 times as fast as the previous best e^x routine. That machine was fun! It is obsolete now, but for many years it could process radar signals faster than processors that were nominally many times faster.

evmar
6 replies
18h4m

In case anyone else wasn't quite following this, I think the idea is

e^x

= e^(a+b+c+d) (where a, b, c, d are the byte fields of x)

= e^a * e^b * e^c * e^d

and then you make a lookup table for each of those e^a, e^b values.

(I fudged "a" here, where it's really more like "high byte << 24", but I think you just make your lookup table for e^a be a map from a => e^(a<<24), and similar for the other bytes.)
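
A toy reconstruction in C++, treating x in [0, 1) as a 32-bit fixed-point fraction (my own sketch of the idea, obviously not the Hughes code):

    #include <cmath>
    #include <cstdint>

    static double tab[4][256];

    // Byte k of the fraction has weight 2^-(8*(k+1)), so table k stores
    // e^(i * 2^-(8*(k+1))) for i = 0..255.
    void init_exp_tables() {
        double scale = 1.0 / 256.0;
        for (int k = 0; k < 4; ++k) {
            for (int i = 0; i < 256; ++i)
                tab[k][i] = std::exp(i * scale);
            scale /= 256.0;
        }
    }

    // e^x for x = frac / 2^32, i.e. 0 <= x < 1: four lookups, three multiplies.
    double exp_frac(std::uint32_t frac) {
        return tab[0][(frac >> 24) & 0xff] *
               tab[1][(frac >> 16) & 0xff] *
               tab[2][(frac >>  8) & 0xff] *
               tab[3][ frac        & 0xff];
    }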

dekhn
5 replies
16h46m

When I took Huffman's Cybernetics course as an undergrad, his tests were full of multiplication and you couldn't use a calculator, so he provided a table of exponents and logarithms, and you'd look up in one table, sum the values, then look up the result in another table. I was not a fan.

gdevenyi
4 replies
14h17m

This is literally what a slide rule is.

bigbacaloa
2 replies
12h39m

In their day, slide rules were the most efficient easily available option.

bluenose69
1 replies
6h51m

Slide rules were a great teaching aid, because they required you to (1) keep track of powers of 10 for yourself, and (2) understand the exponential difficulty of seeking more digits in a result.

The first point lets you guess the order of magnitude of a result before doing the detailed calculation, and that is a great way to find errors.

The second one reveals why teachers get so annoyed when 9-digit answers are given to problems with 2-digit data.

I do wish some company would start producing cheap plastic slide rules again, so my students could try them out. The online ones are not as effective as actually seeing something in the real world, squinting your eyes to guess where the cursor lies between two lines on the rule. (Hm, I can get more digits on the left end of the rule than I can on the right ...)

abecedarius
0 replies
3h23m

ThinkGeek sold a cheap plastic slide rule once. Its motion was too sticky for me to, well, stick with learning to use it. I agree there's a good product idea here if you can do it better.

acc_297
0 replies
6h0m

The YouTube channel "Welch Labs" recently did a good history of the logarithm and how it revolutionized the speed at which people could do math.

mhh__
6 replies
20h26m

replaces short[65536] look up table

Is that not quite dim to begin with (having a LUT the size of the whole L1 cache), or does it work surprisingly well because of some probabilistic fudging?

andy99
3 replies
20h13m

"Why you (probably) shouldn't use a lookup table" (https://specbranch.com/posts/lookup-tables/) gives some discussion about when it's appropriate generally. My narrow experience is that you can do a lot of real-time calculation before it's faster to do a lookup.

o11c
1 replies
19h24m

One case that post doesn't mention explicitly: sometimes your problem is lucky enough that you have the choice between:

1. for each function, for each datum, call function with datum

2. for each datum, for each function, call function with datum

For 1, if your main data is fetched predictably but does not saturate bandwidth, it can be beneficial to use even a fairly large LUT. Note also that while actual L1i pressure is semi-ignorable, things like branch prediction benefit much more from remaining "in cache".

For 2, assuming other functions also have their own miscellaneous data, the non-LUT approach is likely to win. But you should probably benchmark against #1.

But sometimes you can get the best of both worlds (sketched after this list):

3. for each arbitrary chunk of data (smaller than L3, or maybe even L2 depending on when your throughput saturates), for each function, for each datum in the chunk, call function with datum

Yes, you should of course do whole-system benchmarks, but being able to guess helps you write a useful implementation to benchmark in the first place.
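
Roughly, #3 looks like this (a hypothetical sketch; `functions`, `data`, and the chunk size are placeholders, not anything from a real codebase):

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Ordering #3: walk the data in cache-sized chunks, and run every
    // function over a chunk before moving on. Each function's tables/state
    // stay hot while it runs, and the chunk itself stays resident in cache
    // across the whole function loop.
    void apply_all(const std::vector<std::function<void(float &)>> &functions,
                   std::vector<float> &data) {
        const std::size_t chunk = 64 * 1024;  // ~256 KiB of floats; tune to L2/L3
        for (std::size_t i = 0; i < data.size(); i += chunk) {
            const std::size_t end = std::min(i + chunk, data.size());
            for (const auto &f : functions)
                for (std::size_t j = i; j < end; ++j)
                    f(data[j]);
        }
    }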

pclmulqdq
0 replies
18h40m

In the two cases you have specified here, #2 is almost always the winner on performance. So much so that in performance-sensitive code, many people will (justifiably) default to it without a benchmark. Computers almost always operate in a memory bandwidth bound state, and have comparatively idle cores, and #1 is likely to just be wasteful of the resource that will almost always be the binding constraint.

Examples are ECS systems in games, and async run-to-completion runtimes on servers. HPC systems also tend to operate this way.

Also, in the interest of disclosure, I wrote the blog post you are responding to.

anonymoushn
0 replies
14h41m

For some parsing, serialization, or filtering tasks, you end up with quite large lookup tables of shuffles that would otherwise be pretty costly to compute on the fly, but real workloads hit only a tiny part of the lookup table at a time. (This comment fits entirely within his point about benchmarks, I guess)

Tuna-Fish
1 replies
19h49m

The lookup table does surprisingly well because the workload is otherwise extremely cache-hostile; it doesn't really matter if you blow up your L1 cache, because none of the data you evicted to make room for the LUT was ever going to be reused anyway.

ML loads in general are streaming loads that linearly load the entire dataset for every iteration.

pclmulqdq
0 replies
18h53m

One interesting option for these big memory scans on x86 and ARM CPUs is using the non-temporal load/store instructions. Those actually bypass caching* and may help with the cache pressure of LLM workloads that just do scans. The lookup table is still probably the wrong solution even with this sort of thing.

* Not quite all of it - there are still buffers that do write combining, and some read caching on scans.
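
A minimal sketch of what that looks like with AVX intrinsics (my illustration, assuming AVX2 and 32-byte-aligned buffers; not from any particular codebase):

    #include <immintrin.h>
    #include <cstddef>

    // Stream src through the core and write a scaled copy to dst without
    // keeping either in cache: MOVNTDQA for the non-temporal load hint
    // (on ordinary write-back memory it is mostly just a hint), MOVNTPS
    // for the non-temporal (write-combining) store.
    void scale_stream(const float *src, float *dst, std::size_t n, float s) {
        const __m256 vs = _mm256_set1_ps(s);
        for (std::size_t i = 0; i + 8 <= n; i += 8) {
            __m256i raw = _mm256_stream_load_si256((const __m256i *)(src + i));
            __m256 v = _mm256_castsi256_ps(raw);
            _mm256_stream_ps(dst + i, _mm256_mul_ps(v, vs));
        }
        _mm_sfence();  // make the streaming stores globally visible
    }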

SeanAnderson
6 replies
20h38m

off topic, but sheesh. I was skimming this and thought, "This seems like a crazy optimization. It's complex and the code in question has had a ton of eyes on it already." and then I saw the contributor and was like "Of course it's jart. It's always jart with the crazy good solutions."

well done :)

larodi
3 replies
17h26m

It's not something they teach in asymptotic analysis lectures, right? Which reminds me of a professor who famously said: that constant everyone's just overlooking can, in engineering terms, very much eat your head.

hinkley
0 replies
14h15m

A big C could be $3M in losses per year vs $2M in profits.

Or always coming in second on contracts or sales and getting $0.

boshalfoshal
0 replies
16h58m

Unrelated but I see this sentiment a lot. Algorithms (and related courses) are supposed to be largely mathematical/theory based so it sort of has to be hardware agnostic.

Imo an algorithms course doesn't really need to teach you about lowering the constant factor (unless the optimizations for lowering said factor are reasonably abstracted away from the physical HW); that's the purpose of a Computer Architecture or Operating Systems course.

And realistically in the real world you'd be benchmarking these things anyway if a supposedly "optimal" algorithm is really the bottleneck, and you'd apply context specific optimizations to speed it up.

bee_rider
0 replies
40m

When I did tutoring for an algorithms class, we'd generally make gestures at the fact that the big-O cost doesn't tell you everything. But I can't think of any way to handle that on a systemic level. Just plot n vs. t and observe that the big-O prediction comes true for large n, but it gets complicated for small n.

After all, nobody would have invented big-O if it wasn’t much easier than the alternative!

neonsunset
0 replies
20h13m

Mostly looks scary* because that's just how it is with intrinsics syntax in C and C++. As with many things there, this pain is mostly self-inflicted.

There are C++ libraries that allow for C#-style SIMD and hardware intrinsics syntax, as far as I'm aware. The disadvantage is that you can't directly look up mnemonics in the ISA documentation, though.

* Not to seem as if I'm dismissing the importance of the work done there, just highlighting that it could have been much more accessible to a wider audience. But I'm not gonna suggest something everyone here would consider preposterous, such as rewriting the inference back-end in C#, just yet.

mhh__
0 replies
20h28m

adapted from arm limited optimized routine

Shoulders of giants and all that

a_wild_dandan
5 replies
21h24m

(in llama.cpp, for CPU)

jart
4 replies
20h58m

I developed this originally for llamafile; it was included in the last two releases: https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.2 Now we're upstreaming it to the llama.cpp project. There are other performance enhancements you can currently only get from llamafile, such as Kawrakow's work making K quants go much faster.

pclmulqdq
1 replies
18h17m

Wait, computing SiLU directly using some numerical analysis is probably a lot faster than doing an exp each time. Is there a significant perf impact to doing this?

jart
0 replies
18h13m

With expf() most of the work had already been done and I could kill two birds with one stone. If you want to do the math for doing SiLU directly, that'd be an awesome change I'd happily merge into llamafile. You might even be able to get that into PyTorch and even bigger name projects.

breakingcups
1 replies
20h41m

Is that just because nobody has made an effort yet to port them upstream, or is there something inherently difficult about making those changes work in llama.cpp?

leblancfg
0 replies
5h50m

I get the impression most llama.cpp users are interested in running models on GPU, and AFAICT this optimization is CPU-only. Don't get me wrong, it's a huge one! And it opens the door to running llama.cpp on more and more edge devices.

boulos
2 replies
15h2m

Yes, but a direct `exp` implementation is only like 10-20 FMAs depending on how much accuracy you want. No gathering or permuting will really compete with straight math.
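
For reference, the usual shape of such a routine, as a rough scalar sketch: range-reduce, short polynomial, scale by 2^n. The coefficients below are plain Taylor terms rather than the tuned minimax ones a real implementation would use, and overflow/underflow handling is omitted:

    #include <cmath>
    #include <cstdint>
    #include <cstring>

    float expf_sketch(float x) {
        // Split x = n*ln2 + r with r in [-ln2/2, ln2/2]; ln2 is split into
        // hi/lo parts so the reduction stays accurate in float.
        const float log2e  = 1.4426950409f;
        const float ln2_hi = 6.93145752e-1f;
        const float ln2_lo = 1.42860677e-6f;
        float n = std::nearbyint(x * log2e);
        float r = std::fma(n, -ln2_hi, x);
        r = std::fma(n, -ln2_lo, r);

        // Degree-5 polynomial for e^r, evaluated with FMAs (Horner form).
        float p = 1.0f / 120.0f;
        p = std::fma(p, r, 1.0f / 24.0f);
        p = std::fma(p, r, 1.0f / 6.0f);
        p = std::fma(p, r, 0.5f);
        p = std::fma(p, r, 1.0f);
        p = std::fma(p, r, 1.0f);

        // Multiply by 2^n by constructing the float exponent bits directly.
        std::uint32_t bits = (std::uint32_t)((int)n + 127) << 23;
        float scale;
        std::memcpy(&scale, &bits, sizeof scale);
        return p * scale;
    }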

janwas
1 replies
8h9m

With AVX-512 one can have a 128-byte table with one vector of lookups produced each cycle :)
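
Concretely, something like this (my illustration): the 16-entry double table lives in two zmm registers, and _mm512_permutex2var_pd (vpermi2pd) pulls one entry per 64-bit lane with no memory access at all:

    #include <immintrin.h>

    // idx holds a 4-bit index (0..15) in each 64-bit lane: bit 3 selects
    // tab_lo vs tab_hi, bits 2:0 select the element within that register.
    __m512d lut16_lookup(__m512i idx, __m512d tab_lo, __m512d tab_hi) {
        return _mm512_permutex2var_pd(tab_lo, idx, tab_hi);
    }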

celrod
0 replies
7h47m

Yes, I have an AVX-512 double-precision exp implementation that does this thanks to vpermi2pd. This approach was also recommended by the Intel optimization manual -- a great resource.

I just went with straight math for single-precision, though.

jart
0 replies
12h29m

Great work. But what's their goal? Are they trying to make that GeLU approximation go faster? Things would probably go a lot faster by going back to erff().

lxe
1 replies
19h52m

At this point, is gguf/llama.cpp a more performant solution for unbatched inference on CUDA devices, or is exllamav2+flashattention still reigning supreme?

GuuD
0 replies
19h27m

The difference is negligible on 2x 4090. There are more important differences, like the 4-bit KV cache.

KaoruAoiShiho
0 replies
19h39m

Does this help the GGUF use case of partial offloading to GPU? Will it make the CPU part faster too?