
Beating NumPy matrix multiplication in 150 lines of C

david-gpu
9 replies
8h8m

There is no reason to burden one side with Python while the other side is C, when they could just as easily have performed an apples-to-apples comparison where both sides are written in C, one calling a BLAS library while the other calls this other implementation.

moomin
8 replies
7h52m

Python is the right thing to compare to here, because it is easily the most popular way to perform these computations in the modern day. Specifically using numpy. The overhead isn't that high, but as mentioned elsewhere in this thread, calling it correctly is important. Pitting naive numpy code against tuned C code is definitely not a fair comparison.

stinos
7 replies
6h46m

Python is the right thing to compare to here, because it is easily the most popular way to perform these computations in the modern day. Specifically using numpy.

By that reasoning, wouldn't it make more sense to wrap their C code and maybe even make it operate on numpy's array representation, so it can be called from Python?
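For illustration, a minimal sketch of what that wrapping could look like on the C side (the function names here are made up, not the author's actual API): a plain C ABI is all ctypes needs, and NumPy can hand over its buffer via ndarray.ctypes.data_as(...), so both sides operate on the same memory.

    /* Hypothetical wrapper, not the author's actual API: a plain C ABI
     * entry point that a Python caller could load with ctypes and feed
     * NumPy buffers directly (row-major, float32, no copies). */

    /* assume this is the optimized kernel from the post */
    void matmul_kernel(const float *A, const float *B, float *C,
                       int M, int N, int K);

    /* exported symbol: C = A (MxK) * B (KxN) */
    void matmul_py(const float *A, const float *B, float *C,
                   int M, int N, int K)
    {
        matmul_kernel(A, B, C, M, N, K);
    }

Built with something like cc -O3 -march=native -fPIC -shared, it could then be benchmarked from the same Python script as np.matmul.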

moomin
4 replies
6h20m

I think it’s okay to say “This is the benchmark, now I’m going to compare it against something else.” It’s up to the reader to decide if a 3% (or 300%) improvement is worth the investment if it involves learning a whole other language.

david-gpu
2 replies
6h13m

If that was the goal, they should have compared NumPy to BLAS. What they did was compare OpenBLAS wrapped in NumPy with their C code. It is not a reasonable comparison to make.

Look, I'm trying to be charitable to the authors, hard as that might be.

cnity
1 replies
4h15m

There is some reason in this comparison. You might want to answer the question: "if I pick the common approach to matrix multiplication in the world of Data Science (numpy), how far off is the performance from some potential ideal reference implementation?"

I actually do have that question niggling in the back of my mind when I use something like NumPy. I don't necessarily care exactly _where_ the overhead comes from; I might just be interested in whether it's close to ideal or not.

david-gpu
0 replies
2h12m

If that was your question, you would compare against a number of BLAS libraries, which are already well optimized.

What they are doing here is patting themselves on the back after handicapping the competition. Not to mention that they have given themselves the chance to cherry pick the very best hyperparameters for this particular comparison while BLAS is limited to using heuristics to guess which of their kernels will suit this particular combination of hardware and parameters.

The authors need to be called out for this contrived comparison.

pmeira
0 replies
5h31m

It's a muddy comparison given that NumPy is commonly used with other BLAS implementations, which the author even lists but doesn't properly address. Anaconda defaults to Intel oneAPI MKL, for example, and that's a widely used distribution. Not that I think MKL would do great on AMD hardware; BLIS is probably a better alternative.

The author also says "(...) implementation follows the BLIS design", but then proceeds to compare *only* with OpenBLAS. I'd love to see a more thorough analysis, and using C directly would make it easier to compare multiple BLAS libs.

lamcny
1 replies
6h8m

Popular does not mean best.

Suppose that this blog post were part of a series that questions the axiom (largely bolstered by academic marketing) that one needs Python to do array computing. Then it is valid to compare C directly to NumPy.

It isn't even far-fetched. The understanding gained by implementing something in C is far greater than the understanding gained by rearranging PyTorch or NumPy snippets.

That said, the Python overhead should not be very high if M=1000, N=1000, K=1000 was used. The article is a bit vague about the array sizes; these figures come from somewhere in the middle of it.

CleanRoomClub
0 replies
5h4m

Python is popular precisely because non-programmers are able to rearrange snippets and write rudimentary programs without a huge time investment in learning the language or tooling. It's a very high-level language with plenty of syntactic sugar and a lot of data science libraries that call into C code for performance, which makes it great for data scientists.

It would be a huge detriment and time sink for these data scientists to take the time to learn to write an equivalent C program if their ultimate goal is to do data science.

ks2048
6 replies
15h44m

This looks like a nice write-up and implementation. I'm left wondering: what is the "trick"? How does it manage to beat OpenBLAS, which is assembly+C optimized over decades for this exact problem? The post goes into detail about caching, etc. Is BLAS not taking advantage of these things, or is this more tuned to this specific processor?

sbstp
2 replies
14h59m

Maybe -march=native gives it an edge, as it compiles for this exact CPU model whereas numpy is compiled for a more generic (older) x86-64 baseline. -march=native would get v4 on a recent Ryzen CPU, where numpy is probably targeting v1 or v2.

https://en.wikipedia.org/wiki/X86-64#Microarchitecture_level...
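A quick way to see what that buys you: GCC and Clang expose the enabled ISA extensions as predefined macros (standard macro names, nothing project-specific), so a tiny probe shows the difference between a generic wheel build and -march=native.

    #include <stdio.h>

    int main(void) {
    #if defined(__AVX512F__)
        puts("AVX-512 enabled (x86-64-v4 class target)");
    #elif defined(__AVX2__) && defined(__FMA__)
        puts("AVX2 + FMA enabled (x86-64-v3 class target)");
    #else
        puts("baseline x86-64 only");
    #endif
        return 0;
    }

Compiled with -march=native on a Zen 4 part this should report the AVX-512 branch; a portable x86-64 build falls through to the last one.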

KeplerBoy
0 replies
11h7m

np.matmul just uses whatever BLAS library your NumPy distribution was configured for/shipped with.

Could be MKL (I believe the conda version comes with it), but it could also be an ancient version of OpenBLAS you already had installed. So yeah, being faster than np.matmul probably just means your NumPy is not installed optimally.

hansvm
1 replies
14h46m

- OpenBLAS isn't _that_ optimized for any specific modern architecture.

- The matrices weren't that big. Numpy has cffi overhead.

- The perf difference was much more noticeable with _peak_ throughput rather than _mean_ throughput, which matters for almost no applications (a few, admittedly, but even where "peak" is close to the right measure you usually want something like the mean of the top-k results or the proportion finishing under some latency bound, ...).

- The benchmarking code they displayed runs through Python's allocator for numpy and is suggestive of not going through any allocator for the C implementation. Everything might be fine, but that'd be the first place I checked for microbenchmarking errors or discrepancies (most numpy routines allow in-place operations; given that that's known to be a bottleneck in some applications of numpy, I'd be tempted to explicitly examine benchmarks for in-place versions of both).

- Numpy has some bounds checking and error handling code which runs regardless of the underlying implementation. That's part of why it's so bleedingly slow for small matrices compared to even vanilla Python lists (they tested bigger matrices too, so this isn't the only effect, but I'll mention it anyway). It's hard to make something faster when you add a few thousand cycles of pure overhead.

- This was a very principled approach to saturating the relevant caches. It's "obvious" in some sense, but clear engineering improvements are worth highlighting in discussions like this, in the sense that OpenBLAS, even with many man-hours, likely hasn't thought of everything.

And so on. Anyone can rattle off differences. A proper explanation requires an actual deep-dive into both chunks of code.

robxorb
0 replies
11h22m

To your third point: it looks as if, even if the lines of mean values were averaged, this post's code would still be a clear winner.

ipsum2
0 replies
9h31m

A comparison with NumPy 2.0 should be better for NumPy, because it integrates Google Highway for better SIMD across different microarchitectures.

jstrong
3 replies
15h53m

in terms of comparing to numpy, how much overhead would there be from Python (vs. running the numpy C code alone)?

SJC_Hacker
2 replies
15h27m

Python can efficiently call C libs if you use ctypes and native pointers, which is what numpy does. Of course, it depends on the expected layout.

If you want to convert to Python lists, it is going to take time. Not sure about Python arrays.

dgan
0 replies
11h12m

I don't recall the link, but there was a GitHub repo with comparisons of CFFI implementations in different languages, and from what I remember Python was about 3 orders of magnitude slower than, say, Lua or OCaml.

Edit: ha, found it https://news.ycombinator.com/from?site=github.com/dyu

brnt
0 replies
14h11m

If you use numpy, then you use an ndarray, which you can create from a C array for 'free' (no copying, just set a pointer).

SushiHippie
3 replies
10h4m

In the README, it says:

Important! Please don’t expect peak performance without fine-tuning the hyperparameters, such as the number of threads, kernel and block sizes, unless you are running it on a Ryzen 7700(X). More on this in the tutorial.

I think I'll need a TL;DR on what to change all these values to.

I have a Ryzen 7950X, and as a first test I tried to only change NTHREADS to 32 in benchmark.c, but matmul.c performed worse than NumPy on my machine.

So I took a look at the other values present in benchmark.c, but MC and NC are already calculated from the number of threads (so these are probably already 'fine-tuned'?), and I couldn't really understand how KC = 1000 fits the 7700(X) (the author's CPU) and how I'd need to adjust it for the 7950X (with the information from the article).
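For what it's worth, here is a rough sketch of the BLIS-style cache-sizing rule of thumb (the article says it follows the BLIS design), not the author's actual tuning logic; the microkernel tile and cache sizes below are assumptions you would have to replace with your CPU's real numbers.

    #include <stdio.h>

    int main(void) {
        /* assumed values: MR x NR microkernel tile, per-core L1d/L2, L3 slice */
        const long MR = 16, NR = 6;
        const long L1 = 32L   * 1024;
        const long L2 = 1024L * 1024;
        const long L3 = 32L   * 1024 * 1024;
        const long fs = (long)sizeof(float);

        /* KC: a KC x NR sliver of B should stay resident in (part of) L1 */
        long KC = (L1 / 2) / (NR * fs);
        /* MC: an MC x KC block of A should fit in L2, rounded to MR */
        long MC = (L2 / 2) / (KC * fs) / MR * MR;
        /* NC: a KC x NC panel of B should fit in L3, rounded to NR */
        long NC = (L3 / 2) / (KC * fs) / NR * NR;

        printf("KC=%ld MC=%ld NC=%ld\n", KC, MC, NC);
        return 0;
    }

Per-core L1 and L2 are the same on the 7700X and 7950X, so by this logic KC would not be the first thing to change; the thread count and how MC/NC are split across more cores seem like the likelier knobs.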

salykova
1 replies
2h21m

as we discussed earlier, the code really needs Clang to attain high performance

teo_zero
2 replies
8h34m

Does it make sense to compare a C executable with an interpreted Python program that calls a compiled library? Is the difference due to the algorithm or the call stack?

rustybolt
0 replies
8h15m

Yes

lamcny
0 replies
6h4m

Yes, as far as I can tell the array sizes were large enough to make the wrapping overhead negligible.

ssivark
2 replies
15h49m

Most common coding patterns leave a lot of performance unclaimed, by not fully specializing to the hardware. This article is an interesting example. For another interesting demonstration, see this CS classic "There's plenty of room at the top"

https://www.science.org/doi/10.1126/science.aam9744

hmaarrfk
0 replies
13h24m

Thanks for sharing. That was a great read

salykova
1 replies
12h30m

We were actively chatting with Justine yesterday, seems like the implementation is at least 2x faster than tinyBLAS on her workstation. The whole discussion is in Mozilla AI discord: https://discord.com/invite/NSnjHmT5xY

salykova
0 replies
10h9m

"off-topic" channel

KerrAvon
2 replies
16h6m

The article claims this is portable C. Given the use of Intel intrinsics, what happens if you try to compile it for ARM64?

tempay
0 replies
15h59m

I think they mean portable between modern x86 CPUs as opposed to truly portable.

rurban
0 replies
11h24m

You'd first need to convert it to portable SIMD intrinsics. There are several libraries for that.

marmaduke
1 replies
10h56m

Very nice write-up. Those are the kind of matrix sizes that MKL is fairly good at; might be worth a comparison as well?

Also, if you were designing for smaller cases, say MNK=16 or 32, how would you approach it differently? I'm implementing neural ODEs and this is one point I've been considering.

pmeira
0 replies
5h26m

Don't forget BLIS itself!

dzaima
1 replies
15h21m

Though not at all part of the hot path, the inefficiency of the mask generation ('bit_mask' usage) nags me. Some more efficient methods include creating a global constant array of {-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0} and loading from it at element offsets 16-m and 8-m, or comparing constant vector {0,1,2,3,4,...} with broadcasted m and m-8.

Very moot nitpick though, given that this is for only one column of the matrix, the following loops of maskload/maskstore will take significantly more time (esp. store, which is still slow on Zen 4[1] despite the AVX-512 instruction (whose only difference is taking the mask in a mask register) being 6x faster), and clang autovectorizes the shifting anyways (maybe like 2-3x slower than my suggestions).

[1]: https://uops.info/table.html?search=vmaskmovps&cb_lat=on&cb_...
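For the record, a minimal sketch of the compare-with-iota variant suggested above (assuming 8-wide AVX2 and the usual maskload/maskstore intrinsics, not the post's exact code):

    #include <immintrin.h>

    /* all-ones in lanes whose index is < m, zeros elsewhere (0 <= m <= 8) */
    static inline __m256i tail_mask(int m) {
        const __m256i iota = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        return _mm256_cmpgt_epi32(_mm256_set1_epi32(m), iota);  /* m > i */
    }

    /* e.g. a 16-row tail handled as two 8-wide vectors, passing m and m-8:
       __m256 lo = _mm256_maskload_ps(p,     tail_mask(m));
       __m256 hi = _mm256_maskload_ps(p + 8, tail_mask(m - 8));            */

When m <= 8 the second mask comes out all zero, so the same two calls cover the whole tail without branching.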

salykova
0 replies
12h37m

Hi! I'm the author of the article. It's really my first time optimizing C code and using intrinsics, so I'm definitely not an expert in this area, but I'm willing to learn more! Many thanks for your feedback; I truly appreciate comments that provide new perspectives.

Regarding "creating a constant global array and loading from it" - if I recall correctly, I've tested this approach and it was a bit slower than bit mask shifting. But let me re-test this to be 100% sure.

"Comparing a constant vector {0, 1, 2, 3, 4, ...} with broadcasted m and m-8" - good idea, I will try it!

azornathogron
1 replies
17h6m

I've only skimmed so far, but this post has a lot of detail and explanation. Looks like a pretty great description of how fast matrix multiplications are implemented to take into account architectural concerns. Goes on my reading list!

ghghgfdfgh
0 replies
16h16m

I don't save articles often, but maybe once every few months I see something I know I will enjoy reading again even after 1 or 2 years. Keep up the great work OP!

Bayes7
1 replies
3h43m

This was a great read, thanks a lot! On a side note, does anyone have a good guess what tool/software they used to create the visualisations of the matrix multiplications or the memory layout?

salykova
0 replies
1h18m

excalidraw <3

le-mark
0 replies
16h34m

This is my first time writing a blog post. If you enjoy it, please subscribe and share it!

Great job! Self-publishing things like this was a hallmark of the early internet that I, for one, sorely miss.

epr
0 replies
40m

If the point of this article is that there's generally performance left on the table, then if anything it understates how much room there is for improvement, considering how much effort goes into matmul libraries compared to most other software.

Getting a 10-1000x or more improvement on existing code is very common without putting in a ton of effort, if the code was not already heavily optimized. These are listed roughly in order of importance, but performance is often such a non-consideration for most developers that a little effort goes a long way.

1. Most importantly, is the algorithm a good choice? Can we eliminate some work entirely? (this is what algo interviews are testing for)

2. Can we eliminate round trips to the kernel and similar heavy operations? The most common huge gain here is replacing tons of malloc calls with a custom allocator.

3. Can we vectorize? Explicit vector intrinsics like in the blog post are great, but you can often get the same machine code by reorganizing your data into arrays / structs of arrays rather than arrays of structs (see the sketch after this list).

4. Can we optimize for cache efficiency? If you already reorganized for vectors this might already be handled, but this can get more complicated with parallel code if you can't isolate data to one thread (false sharing, etc.)

5. Can we do anything else that's hardware specific? This can be anything from using intrinsics to hand-coding assembly.
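As an illustration of point 3 above, a toy sketch (names invented here, not from the post): the same reduction over a struct-of-arrays layout gives the compiler contiguous floats to vectorize, while the array-of-structs version forces strided loads.

    #include <stddef.h>

    struct PointAoS  { float x, y, z, w; };        /* array of structs */
    struct PointsSoA { float *x, *y, *z, *w; };    /* struct of arrays */

    float sum_x_aos(const struct PointAoS *p, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) s += p[i].x;   /* 16-byte stride */
        return s;
    }

    float sum_x_soa(const struct PointsSoA *p, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) s += p->x[i];  /* contiguous loads */
        return s;
    }

With -O3 and -ffast-math the second loop typically becomes packed vector adds; the first needs gathers or shuffles.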

bjourne
0 replies
3h35m

Good write-up, and commendable of you to make your benchmark so easily repeatable. On my 16-core Xeon(R) W-2245 CPU @ 3.90GHz, matmul.c takes about 1.41 seconds to multiply 8192x8192 matrices when compiled with gcc -O3 and 1.47 seconds when compiled with clang -O2, while NumPy does it in 1.07 seconds. I believe an AVX-512 kernel would be significantly faster. Another reason for the lackluster performance may be OpenMP; IME, you can reduce overhead by managing the thread pool explicitly with pthreads (and use sysconf(_SC_NPROCESSORS_ONLN) instead of hard-coding the thread count).
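For anyone following along, a minimal example of the sysconf() suggestion (assuming an OpenMP build; with raw pthreads you'd use the same count to size the pool):

    #include <stdio.h>
    #include <unistd.h>
    #include <omp.h>

    int main(void) {
        long n = sysconf(_SC_NPROCESSORS_ONLN);  /* cores currently online */
        if (n < 1) n = 1;                        /* sysconf can return -1 */
        omp_set_num_threads((int)n);             /* instead of a hard-coded NTHREADS */
        printf("using %ld threads\n", n);
        return 0;
    }

Compile with -fopenmp; whether counting SMT siblings as "cores" helps or hurts here is something to benchmark.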