
Designing a SIMD Algorithm from Scratch

Aardwolf
18 replies
1d1h

It is crazy sometimes how you try to program something the best you ever could with classical C++, and then someone comes along and makes a version using SIMD that is over 10x faster (but with less portable code).

I really wish compilers were better at auto vectorization, and that there were some support for annotations in the language to locally allow reordering some operations, ...

kolbe
7 replies
1d

There are unfortunately a lot of serial guarantees that are unavoidable even if compilers could optimize perfectly. For example: 'for(double v: vec) sum+=v'

Floating point addition is not associative, so summing the values strictly in order is not the same as keeping one partial sum per SIMD lane (every 8th element, say) and combining them at the end, which is how a vectorized reduction works. So even though this is an obvious optimization for compilers, they will prioritize the serial guarantee over the optimization unless you tell them to relax that particular guarantee.
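To make that concrete, here is a rough sketch of the two summation orders (in Rust, since that's what the article uses); they round differently in general, so the compiler cannot substitute one for the other on its own:

  // Strictly left-to-right, as the source demands.
  fn sum_serial(xs: &[f64]) -> f64 {
      let mut tot = 0.0;
      for &x in xs {
          tot += x;
      }
      tot
  }

  // Roughly what a 4-wide vectorized reduction does: four strided partial
  // sums, combined only at the end. Same numbers, different rounding.
  fn sum_reassociated(xs: &[f64]) -> f64 {
      let mut acc = [0.0f64; 4];
      for chunk in xs.chunks_exact(4) {
          for lane in 0..4 {
              acc[lane] += chunk[lane];
          }
      }
      let tail = &xs[xs.len() - xs.len() % 4..];
      acc.iter().sum::<f64>() + tail.iter().sum::<f64>()
  }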

It's a mess and I agree with janwas: use a library (and in particular: use Google Highway) or something like Intel's ISPC when your hot path needs this.

Aardwolf
6 replies
1d

Hence my suggestion for language support. Add some language syntax where you can say: "add these doubles, order doesn't matter", which allows the compiler to use SIMD

kolbe
5 replies
1d

-ffast-math if you want to put the onus on the compiler.

Problem is, you need to enforce that flag on all user compilations, and I don't know the equivalent for MSVC. In-language support would be nice.

Aardwolf
4 replies
23h59m

-ffast-math is global though and can break other libraries. I mean local, vectorization-compatible but portable syntax, without actually writing the SIMD yourself (let the compiler do the work)

eesmith
1 replies
22h8m

According to https://stackoverflow.com/questions/40699071/can-i-make-my-c... you annotate it with: __attribute__((optimize("-ffast-math")))

I tried it with:

  double sum1(double arr[128]) {
    double tot = 0.0;
    for (int i=0; i<128; i++) {
      tot += arr[i];
    }
    return tot;
  }

  __attribute__((optimize("-ffast-math")))
  double sum2(double arr[128]) {
    double tot = 0.0;
    for (int i=0; i<128; i++) {
      tot += arr[i];
    }
    return tot;
  }
The Compiler Explorer (gcc 13.2, --std=c++20 -march=native -O3) generates two different bodies for those:

    sum1(double*):
        lea     rax, [rdi+1024]
        vxorpd  xmm0, xmm0, xmm0
    .L2:
        vaddsd  xmm0, xmm0, QWORD PTR [rdi]
        add     rdi, 32
        vaddsd  xmm0, xmm0, QWORD PTR [rdi-24]
        vaddsd  xmm0, xmm0, QWORD PTR [rdi-16]
        vaddsd  xmm0, xmm0, QWORD PTR [rdi-8]
        cmp     rax, rdi
        jne     .L2
        ret
    sum2(double*):
        lea     rax, [rdi+1024]
        vxorpd  xmm0, xmm0, xmm0
    .L6:
        vaddpd  ymm0, ymm0, YMMWORD PTR [rdi]
        add     rdi, 32
        cmp     rax, rdi
        jne     .L6
        vextractf64x2   xmm1, ymm0, 0x1
        vaddpd  xmm1, xmm1, xmm0
        vunpckhpd       xmm0, xmm1, xmm1
        vaddpd  xmm0, xmm0, xmm1
        vzeroupper
        ret
It is still compiler-specific and non-portable, but at least it is not global.

gpderetta
0 replies
4h58m

I'm guilty of having used it, but from the doc: "The optimize attribute should be used for debugging purposes only. It is not suitable in production code."

It can be very unreliable.

singhrac
0 replies
10h36m

One of the things I struggled with in Rust’s portable_simd is that I think fadd_fast and simd intrinsics don’t mix well. This makes writing dot products only ok, rather than really easy. Here’s (one) reference: https://internals.rust-lang.org/t/suggestion-ffastmath-intri...
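For context, this is roughly the shape of a hand-rolled dot product with portable_simd (a sketch on nightly; the exact trait path for reduce_sum has moved around between nightlies):

  #![feature(portable_simd)]
  use std::simd::{f32x8, num::SimdFloat};

  // Accumulate products in a vector register, reduce once at the end,
  // and handle the non-multiple-of-8 tail with scalar code.
  pub fn dot(a: &[f32], b: &[f32]) -> f32 {
      assert_eq!(a.len(), b.len());
      let split = a.len() - a.len() % 8;
      let mut acc = f32x8::splat(0.0);
      for (ca, cb) in a[..split].chunks_exact(8).zip(b[..split].chunks_exact(8)) {
          acc += f32x8::from_slice(ca) * f32x8::from_slice(cb);
      }
      let mut tot = acc.reduce_sum();
      for (x, y) in a[split..].iter().zip(&b[split..]) {
          tot += x * y;
      }
      tot
  }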

kolbe
0 replies
23h54m

Totally fair. I'm with you.

gumby
5 replies
22h21m

> It is crazy sometimes how you try to program something the best you ever could with classical C++, and then someone comes along and makes a version using SIMD that is over 10x faster (but with less portable code).

But that’s one of the points of a systems programming language (of which C++ is one) — it tries to be portably as efficient as possible but makes it easy to do target-specific programming when required.

> I really wish compilers were better at auto vectorization

FORTRAN compilers sure are, since aliasing is not allowed. C++ is kneecapped by following C’s memory model.

Aardwolf
2 replies
18h21m

But why is C++ refusing to add "restrict" to its spec? C has restrict in its spec, C++ does not. Of course compilers inherit restrict from C, but it's not super portable due to C++ not officially supporting it. But why?!

WanderPanda
0 replies
13h8m

I was wondering the same! I just recently came across the restrict keyword (thanks to someone mentioning it on HN), but I had never seen it mentioned before, even though I was looking into all sorts of optimizations (also, ChatGPT would never bring it up on its own).

Since with templating everything seems to be going header only anyways, I feel like we should give the compiler more power to reason about the memory layout, even through multiple levels of function invocation/inlining.

I wrote a library where basically all shapes can be inferred at compile time (through templates), but the compiler only rarely seems to take advantage of this.

Galanwe
0 replies
13h17m

In or out of the spec doesn't really matter. You can use __restrict__ in C++ and any compiler will diligently accept it, save for the occasional warning.

The larger issue is that C and C++ have convoluted aliasing rules which make reasoning about and flagging non-aliased regions hard. The most innocent of casts can break strict aliasing rules and send you into horrible debugging scenarios. The sad side effect is people adding -fno-strict-aliasing to -O2/3.

secondcoming
0 replies
22h0m

It's not just aliasing. gcc is quite bad at autovectorising in general; clang is much better.

Here's a trivial algorithm that gcc borks over because 'bool' is used instead of 'int'; even after fixing that, the codegen isn't great:

https://godbolt.org/z/hf7szo78E

pklausler
0 replies
21h24m

Fortran compilers can assume that (many) dummy arguments to a particular procedure reference do not alias each other, true. But there are many other ways in which aliasing can happen in a conforming Fortran program.

dragontamer
1 replies
21h24m

You could just use CUDA, which is C++ designed for GPUs, the ultimate SIMD machine of today.

Or ROCm (basically CUDA but for AMD).

I always was a fan of Microsoft's C++ AMP, though. I thought that was the easiest to get into. Too bad it never stuck.

touisteur
0 replies
11m

DPC++/SYCL might be more interesting in this area (GPU code looking like evolved C++).

ISPC is an interesting take on 'building a vector DSL which can easily be integrated in a C++ build-chain'.

Though calling GPUs SIMD nowadays is kind of reductive, since you also have to contend with a very constrained memory hierarchy, and you don't write or think of your code as SIMD so much as SIMT. Also, they don't give access to most of the bit tricks, or the rigmarole of shuffles, that one would expect from a SIMD processor.

janwas
0 replies
1d

This happens regularly in my experience :) Also, when using a SIMD wrapper library, it can actually be quite portable in practice.

cmovq
0 replies
21h20m

Good SIMD code requires careful consideration of how your data is laid out in memory. This makes auto vectorization really difficult since the compiler can’t fix your data for you outside of very local contexts

dist1ll
13 replies
1d7h

Great article, and it’s pretty interesting to see portable SIMD in use. I tried reproducing the benchmarks on a Zen 3 system, and I’m getting identical speedups. On my M1 MBP, the perf gains start around 1.4x, gradually increasing to 2x @ 110 bytes of input length. A bit less gain than on x86_64, but I’d say it fulfils its goal.

Although looking at the code, this article confirms my experience that Rust has rather poor ergonomics for simd and pointer-related work (and performance engineering in general).

pcwalton
11 replies
1d5h

I disagree with the ergonomics being poor. C++ SSE intrinsics are significantly worse, with ugly underscores and names that are hard to remember.

dist1ll
4 replies
1d5h

Being more ergonomic than C++ is a low bar anyway; we should hold ourselves to a higher standard. For starters, field offset syntax is painful (see Gankra's post on this topic) and pointer manipulation lacks syntactic sugar. I also believe unsigned integer promotion and UFCS would improve ergonomics in high-perf code (although I'm aware this is not going to be included in Rust).

lifthrasiir
3 replies
1d4h

At that point it is better to have some kind of DSL that should not be in the main language, because it would target a much lower level than a typical program. The best effort I've seen in this space is Google's Highway [1] (not to be confused with HighwayHash), and I even once attempted to recreate it in Rust, but it is still a bit far from my ideal.

[1] https://github.com/google/highway

janwas
1 replies
1d

Thanks :) We're always curious to hear how it could be improved?

lifthrasiir
0 replies
14h57m

I think Highway is already good enough as a non-DSL. I had something like GLSL in mind, but more blended into the parent language.

superjan
0 replies
20h56m

I agree about having a DSL. It may not need to be low level; you could have an array language like APL/J. But GLSL would work for me too; the important thing is to lure developers into exposing parallelism.

flohofwoe
3 replies
1d5h

Check out Clang's extended vector extensions for C; IMHO that's the perfect way to add portable SIMD to a language. Doing it through a library will always be clumsy.

lifthrasiir
1 replies
1d5h

It quickly runs into __builtin functions though, which is not very different from intrinsics with better, portable names. Rust's `Simd` type is much more consistent with the corresponding scalar types by comparison.

flohofwoe
0 replies
1d4h

True, I think the builtins were necessary where the existing C operators don't map well to a specific SIMD feature (from what I've seen it's mostly horizontal "reduce" operations that fold SIMD lanes into a single value). Still, a lot can be done without having to pepper the code with builtin functions, and the result is much more readable than purely intrinsic-based code.

But a "new" language like Rust would have the freedom to add special language syntax and operators for such things instead of going the C++ way of "just add it to the stdlib" (which IMHO is almost always the wrong place... it should either go into a regular cargo dependency, or into the language itself).

vt240
0 replies
19h54m

I loved the Intel Cilk Plus project. I was sad to see it was abandoned. It always felt like a very natural syntax at least to me.

secondcoming
0 replies
1d5h

Those intrinsics are specified by Intel, not C++

bottled_poe
0 replies
1d3h

As if semantics are a significant productivity factor, get off ya high horse.

cmrdporcupine
0 replies
1d3h

Haven't read through the article yet, but as a Rust eng... I sort of agree, but ... I mean, pointer & raw memory related work is deliberately kneecapped in the name of safety; the language wants you to really think about what you're doing.

But yep, portable SIMD in Rust is still not a good story compared to C++. And getting down to raw byte regions, pointer & buffer manipulation, etc. requires becoming comfortable with Pin, MaybeUninit, etc. Both portable_simd and allocator_api have been sitting unstable for years. And the barrier to entry is a bit higher, and it's more awkward ... mostly on purpose.

But there's nothing stopping one from building one's own abstractions (or using 3rd party crates etc) to make these things more ergonomic within one's own program?

orf
9 replies
1d8h

Interesting article! Right at the start, the first example of a non-vectorized popcnt implementation is said to produce “comically bad code, frankly”, but in release mode with the native target CPU, it does appear to be vectorising the function fairly ok?

https://godbolt.org/z/WE1Eq65jY

eesmith
5 replies
1d7h

The following should produce equivalent output:

  pub fn popcnt(x: u32) -> u32 {
    x.count_ones()
  }
which gets compiled to:

  example::popcnt:
        popcnt  eax, edi
        ret
For large bit vectors, an AVX2 implementation can outperform POPCNT. See "Faster Population Counts Using AVX2 Instructions" at https://academic.oup.com/comjnl/article/61/1/111/3852071

32 bits is not large enough, and the code Rust produces is indeed comically bad.

orf
3 replies
1d7h

Would the equivalent C++ code run through LLVM generate a different output?

asb
2 replies
1d6h

LLVM has a LoopIdiomRecognize pass which will, amongst other patterns, try to detect a loop implementing popcount [1] and convert it to the llvm.ctpop.* intrinsic [2]. I've not looked at why it's not matching this loop, but the sibling comment suggesting to just use `.count_ones` seems like the right answer for Rust. In C or C++ I'd strongly recommend using `__builtin_popcount` (provided by both Clang and GCC).

[1]: https://github.com/llvm/llvm-project/blob/08a6968127f04a40d7... [2]: https://llvm.org/docs/LangRef.html#llvm-ctpop-intrinsic

celrod
1 replies
1d1h

`std::popcount` was added to the standard library in C++20 https://en.cppreference.com/w/cpp/numeric/popcount

asb
0 replies
23h34m

Thanks for pointing that out - definitely a better choice for C++.

saagarjha
1 replies
1d8h

I mean ideally you’d probably want this to be lowered to a popcnt instruction.

celrod
0 replies
1d8h

Yes, just because you see SIMD instructions doesn't mean the code is fast. You need the correct instructions.

Not relevant to this example, which is `popcount`ing only a single number, but AVX512 did also introduce SIMD popcount instructions for getting many inputs at a time. Also true for other useful bit-funs like leading or trailing zeros. So if you're using zmm registers, you can do way better than that godbolt example.

the8472
0 replies
22h58m

Autovectorization is hit or miss. I recently wrote this, which has to count the bits in a result mask of vector operations, and it turns into popcnt just fine.

https://godbolt.org/z/zT9Whcnco

uo21tp5hoyg
5 replies
1d8h

Wow I really like that little mini-map on the right...

joennlae
4 replies
1d6h

+1. Does anyone know how to do that?

Klaster_1
1 replies
1d5h

Found a canvas-based library for this: https://larsjung.de/pagemap/. Definitely not what OP uses, where the minimap is a shrunk copy of the content markup, with all the drawbacks, such as page search finding the same item twice.

orlp
0 replies
1d5h

The author should really add at least aria-hidden="true" to the minimap element.

m1el
0 replies
1d5h

The minimap contains a copy of the content, but with `transform: scale`. The rest is handling `window.onscroll` and mouse events on the overlay.

alex_suzuki
5 replies
1d5h

Awesome writeup. Leaves me with the distinct impression that „I‘ll never be this smart“.

hnisoss
1 replies
1d4h

Ah, it's just not your field of work. Just like the average person is not a software engineer or physicist or... you get the point. A few months of dedicated study and you would be able to do something similar for sure.

VitruviusBen
0 replies
1d3h

Dunning-Kruger effect?

cmrdporcupine
0 replies
1d3h

If you ever had occasion to have an employer and/or project where this was called for, you could probably "be this smart"; it just comes down to interest and necessity.

I dip in and out of these waters (performance optimization, more systems/bare-metal engineering), but basically only on personal projects. I wish I had jobs where it was more called for, but it's not what most industry work needs.

anonzzzies
0 replies
1d

Try doing AoC '23 in APL/J/K, BQN, or Python/NumPy (meaning not idiomatic Python, but everything with NumPy), or CUDA, etc. It's fun and it will teach you these types of smarts; a lot of the post maps very naturally onto how you think about solving things in those languages. After a while you start seeing problems in that form.

intalentive
3 replies
21h25m

SIMD is pretty intuitive if you’ve used deep learning libraries, NumPy, or array-based or even functional languages.

westurner
1 replies
21h17m

NumPy roadmap: https://numpy.org/neps/roadmap.html :

> Improvements to NumPy’s performance are important to many users. We have focused this effort on Universal SIMD (see NEP 38 — Using SIMD optimization instructions for performance) intrinsics which provide nice improvements across various hardware platforms via an abstraction layer. The infrastructure is in place, and we welcome follow-on PRs to add SIMD support across all relevant NumPy functions

"NEP 38 — Using SIMD optimization instructions for performance" (2019) https://numpy.org/neps/nep-0038-SIMD-optimizations.html#nep3...

NumPy docs > CPU/SIMD Optimizations: https://numpy.org/doc/stable/reference/simd/index.html

std::simd: https://doc.rust-lang.org/std/simd/index.html

"Show HN: SimSIMD vs SciPy: How AVX-512 and SVE make SIMD nicer and ML 10x faster" (2023-10) https://news.ycombinator.com/item?id=37808036

"Standard library support for SIMD" (2023-10) https://discuss.python.org/t/standard-library-support-for-si...

westurner
0 replies
13h26m

Automatic vectorization > Techniques: https://en.wikipedia.org/wiki/Automatic_vectorization#Techni...

SIMD: Single instruction, multiple data: https://en.wikipedia.org/wiki/Single_instruction,_multiple_d...

Category:SIMD computing: https://en.wikipedia.org/wiki/Category:SIMD_computing

Vectorization: Introduction: https://news.ycombinator.com/item?id=36159017 :

GPGPU > Vectorization, Stream Processing > Compute kernels: https://en.wikipedia.org/wiki/General-purpose_computing_on_g...

lifthrasiir
0 replies
7h7m

I don't think so. SIMD is more like a weird mix of vector processing and bizarre mathematical transformations originating from low-level facts, and the latter becomes more significant when SIMD targets shorter inputs.

I recently helped my friend who needed a fast CCSDS Reed-Solomon decoder in Python. The existing Python library was really slow, like 40 blocks per second, while the same algorithm written in C# claimed 10,000+ blocks per second. It turned out that the Python version made use of Numpy but so badly that it was much faster to eschew Numpy. After some additional optimizations to avoid modulos, my version easily handled ~2,000 blocks per second without Numpy.

kookamamie
2 replies
11h19m

ISPC is just better compared to bolting SIMD onto C++ or Rust. Also, it supports dynamic dispatching, a feature that is painful to implement yourself.

anonymoushn
1 replies
11h0m

Is it easy to call into ISPC from C++ and Rust for small subroutines like the ones in TFA?

kookamamie
0 replies
8h56m

Yes, extremely easy. ISPC emits objects and header files, which can be directly linked/included from C++. Dynamic dispatching happens automatically and the widest version of the called function gets selected according to the runtime architecture.

gbin
1 replies
1d9h

This is a pretty good trial for Rust Simd! What were the most surprising quirks of it when you inspected the generated code?

celeritascelery
0 replies
1d4h

Not OP, but one thing that surprised me was that if you are using Rust Simd in a library and part of the code is marked #[inline] but other parts are not, you might see catastrophic performance regressions. We saw an issue where the SIMD version was over 10x slower because we missed marking one function as inline. Essentially rustc converted it from an intrinsic into a regular function call.

https://github.com/rust-lang/rust/issues/107617#issuecomment...
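A sketch of the pattern (a hypothetical helper, not the code from that issue; trait paths are for a recent nightly): in a library crate, dropping the attribute from a one-liner like this can leave downstream crates with a real function call instead of a couple of instructions:

  #![feature(portable_simd)]
  use std::simd::{cmp::SimdPartialEq, u8x16};

  // Without #[inline], a downstream crate (a separate codegen unit) may
  // call this as an out-of-line function instead of folding it into the
  // caller's SIMD loop.
  #[inline]
  pub fn count_matches(block: u8x16, needle: u8) -> u32 {
      block.simd_eq(u8x16::splat(needle)).to_bitmask().count_ones()
  }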

Const-me
1 replies
1d5h

> _mm256_cvtps_epu32 that represent a low-level operation in a specific instruction set (this is a float to int cast from AVX2)

That instruction is from AVX-512. There is no AVX2 float-to-int cast. There is one in AVX1, but the int is signed and the instruction is _mm256_cvtps_epi32.

mattsan
0 replies
1d3h

In Lua we have indexes off by one and in SIMD we have keyboard letters off by one

whalesalad
0 replies
1d1h

the minimap on this page is so cool

hansvm
0 replies
2h23m

This seems like a trick question… just an add right?

This sort of thing is part of why you'd often like to target an intermediate vector representation and let the compiler figure out the details. E.g., on Haswell chips you had multiple floating point execution units per core, and the CPU could definitely execute more than one pipelined FP operation simultaneously, but only one of those could be an `add` instruction. If you had a bunch of additions to do which didn't rely on the previously computed results (avoiding stalls), you could double your addition throughput by also dispatching a fused-multiply-add instruction where the multiplicand was just one. That _could_ execute at the same time as a normal vector FP addition.
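A rough sketch of that trick using Rust's portable SIMD (illustrative only; whether the mul_add actually lands on the FMA port rather than a libm call depends on the target and the backend):

  #![feature(portable_simd)]
  use std::simd::{f64x4, num::SimdFloat, StdFloat};

  // Two independent accumulators per iteration: one plain vector add and
  // one multiply-by-1.0 fused add, so both can be in flight at once.
  pub fn sum_two_accumulators(xs: &[f64]) -> f64 {
      let ones = f64x4::splat(1.0);
      let mut add_acc = f64x4::splat(0.0);
      let mut fma_acc = f64x4::splat(0.0);
      let split = xs.len() - xs.len() % 8;
      for pair in xs[..split].chunks_exact(8) {
          add_acc += f64x4::from_slice(&pair[..4]);
          fma_acc = f64x4::from_slice(&pair[4..]).mul_add(ones, fma_acc);
      }
      (add_acc + fma_acc).reduce_sum() + xs[split..].iter().sum::<f64>()
  }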

dzaima
0 replies
1d6h

Small note - while the compiler wasn't able to optimize that specific popcount implementation to a single instruction, it can for some other implementations, though it's of course very picky: https://godbolt.org/z/T69KxWWW8

anonymoushn
0 replies
1d2h

How does this compare to fastbase64[0]? Great article, I'm happy to see this sort of thing online. I wish I could share the author's optimism about portable SIMD libraries.

[0]: https://github.com/lemire/fastbase64