
tolower() with AVX-512

anderskaseorg
29 replies
1d14h

Note that the “unsafe read beyond of death” trick is considered undefined behavior in the Rust and LLVM memory model, even if it’s allowed by the underlying hardware. Like any undefined behavior, compilers are allowed to assume it doesn’t happen for the purpose of optimization, leading to results you don’t expect. The only way around this is to use inline assembly.

https://github.com/ogxd/gxhash/issues/82
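
For illustration, a rough sketch of what the inline-assembly escape hatch could look like (GCC/Clang extended asm on x86-64, compiled with -mavx512f assumed; load_overread is a made-up name). The asm block hides the 64-byte read from the compiler's memory model, so the optimizer cannot reason about the bytes past the end of the allocation:

    #include <immintrin.h>

    // Load 64 bytes starting at p, possibly reading past the end of the
    // object. The compiler only sees an opaque asm block taking a pointer,
    // so it cannot assume the whole 64-byte range is a valid object.
    static inline __m512i load_overread(const void *p) {
        __m512i v;
        __asm__ volatile("vmovdqu64 (%1), %0"
                         : "=v"(v)
                         : "r"(p)
                         : "memory");
        return v;
    }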

dzaima
12 replies
1d13h

It would be neat to have non-assembly options for things like this. A "load with unspecified elements for any values past the end of the allocation, UB only if the hardware doesn't like it" thing shouldn't be hard to support, even if just as an alias for the respective assembly invocations.

Additional neatness would be being able to request a guarantee that all allocations - malloc, stack, constants - have at least, say, 64 bytes of non-faulting addresses after them, though that is significantly more complex, requiring cooperation between a bunch of parts.

Annoying thing is that this is trivial with a custom allocator (as long as the compiler isn't told to consider the custom sub-allocations as separate), but then you're stuck not being able to use your SIMD stuff on anything outside your custom heap due to the very tiny chance of segfaulting.

Sanitizers/valgrind don't necessarily become pointless with this even - the past-the-end values are still undefined, can be tracked as such, and error on use.

tomsmeding
5 replies
1d10h

A "load with unspecified elements for any values past the end of the allocation, UB only if the hardware doesn't like it" thing shouldn't be hard to support

Not an expert, but to me this sounds like you want an alternative where behaviour for a read beyond the end of an allocation is merely implementation-defined, not undefined. That means the implementation (e.g. LLVM) has to document what they do — which may be platform-dependent — and the choice of whether it becomes undefined is up to the implementation.

The natural thing to do here for the implementation is of course to say "I'm just going to emit the load instruction, it may crash your program, better be prepared".

dzaima
4 replies
1d6h

Here it'd be perfectly fine to define it as "completely arbitrary bits past the end, potentially even differing between back-to-back calls of loading the same memory"; specific backends will end up refining that of course. In LLVM those bytes would behave as freeze(poison).

tomsmeding
3 replies
1d1h

Not every platform in existence will return data when asked to access stuff out of bounds, even when sufficiently aligned. So you wouldn't want to bake into the standard that valid bits must be returned; you'd want to allow crashing, in the standard. An implementation might then define that for suitably aligned addresses, data will be returned (just not necessarily sensible data).

dzaima
2 replies
1d1h

It should still be with "UB only if the hardware doesn't like it", of course. If weird funky hardware not following usual memory paging is of worry, providing a "memory_protection_granularity" constant is trivial, to be used instead of the page size for the check (and said funky hardware could set it to 1, thus always failing).

Alternatively, a different API would be returning an optional of the loaded data, having the stdlib/language/backend convert that to the appropriate boundary check (or always returning a None if impossible).

Ideally there'd be languages that can be at least configured into providing more "unsafe" useful things, even if at the expense of not having the code be compilable targeting funky hardware that no one would run the software in question on anyway.

jart
1 replies
7h5m

What about tools like ASAN? I want it to be able to tell me if I read a single character out of bounds. Tools like ASAN can't do this if the language gets rid of undefined behavior. The reason why undefined behavior is undefined is because it's such a degenerate state for a program to exist in that any attempt by a language to imbue it with a particular blessed meaning is, to put it politely, crazy; like trying to prove a theorem that's allowed to have some contradictions.

dzaima
0 replies
5h12m

Indeed, if you want an immediate error on every out-of-bounds read, this won't be suitable. I do think one should always have the option to not opt into this. But there still exist use-cases where the benefit of being able to do partially-past-the-end loads would significantly outweigh this downside.

That said, clang's MemorySanitizer, and, similarly, valgrind, could still produce errors via tracking which bytes are undefined within registers; might be somewhat delayed between load and error, but still shouldn't allow such out-of-bound values to be used for much.

And, anyway, as this load would be a separate instruction/builtin (if so decided), UB of regular operations is unaffected. If the sanitizer in question doesn't track (partial) register definedness, it could just accept all of these explicitly-potentially-OoB loads; indeed not ideal, but the alternative is not being able to write such performant code at all.

And there are already people doing this, just limited to doing so with data within a custom allocator. It would just be nice to have a mechanism to not be fully ruled out of using standard tooling at least for testing.

capitol_
2 replies
1d9h

"UB only if the hardware doesn't like it" sounds like you want to shift the complexity from the developers who know the problem domain best to the packagers.

As soon as the thing is packaged to run on a Raspberry Pi or something else that doesn't like it, it will start to generate CVEs and be a major pain.

dzaima
0 replies
1d6h

This shouldn't ever be a security vulnerability, outside of perhaps denial of service from segfaults (though I'm pretty sure you'd find hardware with no page faults before finding one with pages less than 4KB; and of course, if you wanted to not be hard-coding 4KB, a compiler providing a "minimum page size" constant for the target architecture should be possible, and could return 1 on page-less hardware). But, yes, as with many optimizations, getting them wrong could end up badly.

anonymoushn
0 replies
1d6h

For the case of specific vector extensions that imply specific cache line sizes, and loads that do not span multiple cache lines, I don't think you could run into issues.

TinkersW
1 replies
18h17m

The simplest solution, and the one I use, is that all SIMD-related buffers use a custom allocator (actually everything uses it), and it always rounds the allocation size up to the SIMD width.
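
A minimal sketch of that rounding idea (SIMD_WIDTH and simd_malloc are made-up names; C11 aligned_alloc assumed). With the size rounded up, aligned full-width loads and stores over the logical length never cross the end of the allocation:

    #include <stdlib.h>

    #define SIMD_WIDTH 64  /* assumed vector width in bytes (AVX-512) */

    // Round the request up to a multiple of the SIMD width, so a full-width
    // access at any aligned offset below the requested size stays in bounds.
    static void *simd_malloc(size_t n) {
        size_t padded = (n + SIMD_WIDTH - 1) & ~(size_t)(SIMD_WIDTH - 1);
        return aligned_alloc(SIMD_WIDTH, padded ? padded : SIMD_WIDTH);
    }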

Masked loads kinda suck; they are a tiny bit slower, and you now need a mask and you need to compute the mask.

dzaima
0 replies
17h30m

This is what I do too (in my case I don't round up the allocation size and just let loads & stores potentially see the next object (doing tail stores via load+blend+store where needed; only works if multithreaded heap mutation isn't required though)).

The one case it can be annoying is passing pointers to constant data to custom-heap-assuming functions - e.g. to get a pointer to [n,n-1,n-2,...,2,1,0] for, say, any n≤64, make a global of [64,63,...,2,1,0] and offset its pointer; but you end up needing to add padding to the global, and this materializes as avoidable binary size increase as the "padding" could just be other constants from anywhere else. Copying the constant to the custom heap would be extra startup time and more memory usage (not sharable between processes).

the8472
0 replies
1d10h

The sanctioned way to do this would be masked aligned load intrinsics: alignment avoids page faults, masking avoids reading undef bits, and being an intrinsic conveys the intent to the compiler, so it knows this is not an OOB read.

The other option that I've seen discussed is adding a freezing load to LLVM that turns the undef bits into some unspecified but valid bit patterns.
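
For concreteness, a rough sketch of the masked-intrinsic approach for a string tail, in the spirit of the article but not its exact code (AVX-512BW + BMI2 intrinsics, -mavx512bw -mbmi2 assumed; tolower_tail is a made-up name). The mask keeps both the load and the store from touching bytes past len, and AVX-512 masked accesses do not fault on masked-off elements:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    // Lower-case the ASCII letters in a tail of up to 64 bytes with a single
    // masked load / compare / add / masked store, no scalar cleanup loop.
    static void tolower_tail(uint8_t *dst, const uint8_t *src, size_t len /* <= 64 */) {
        __mmask64 m = _bzhi_u64(~0ULL, (unsigned)len);       // low `len` bits set
        __m512i bytes = _mm512_maskz_loadu_epi8(m, src);
        __mmask64 upper = _mm512_cmpge_epu8_mask(bytes, _mm512_set1_epi8('A'))
                        & _mm512_cmple_epu8_mask(bytes, _mm512_set1_epi8('Z'));
        __m512i lowered = _mm512_mask_add_epi8(bytes, upper, bytes,
                                               _mm512_set1_epi8(0x20));
        _mm512_mask_storeu_epi8(dst, m, lowered);
    }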

torstenvl
7 replies
1d5h

Like any undefined behavior, compilers are allowed to assume it doesn’t happen for the purpose of optimization, leading to results you don’t expect

No. First, undefined behavior is a term of art in the C standard, so the idea of generalizing it is nonsensical. Second, ANSI C explicitly does not allow this assumption, and ISO C—while more open ended—doesn't specifically justify this assumption. The entire "UB = assume it cannot happen" thing is grossly dishonest FUD.

anderskaseorg
6 replies
1d1h

Both (unsafe) Rust and LLVM have their own concepts of undefined behavior (https://doc.rust-lang.org/reference/behavior-considered-unde..., https://llvm.org/docs/LangRef.html#undefined-values), and while some of the details vary by language, compilers for all of these languages do in fact optimize based on the assumption that undefined behavior is not invoked in the abstract execution model. This is a real thing (https://blog.llvm.org/2011/05/what-every-c-programmer-should..., https://highassurance.rs/chp3/undef.html), and any debate about whether it’s “justified” is many decades late and entirely academic (but we don’t have any other methodology for building optimizers of the quality that programmers expect from compiled languages).

torstenvl
5 replies
1d

any debate about whether it’s “justified” is many decades late

"It's always been this way so it's impossible to address." Forgive me if I'm not convinced.

From https://highassurance.rs/chp3/undef.html:

In other words, should a developer inadvertently trigger UB, the program can do absolutely anything.

Well, no. It is "behavior . . . for which this International Standard imposes no requirements." There are restraints and constraints beyond the ISO standard.

realloc(p, 0) is now undefined in C23. However, every mainstream OS and compiler specifies the correct behavior for that environment. It is simply not. true. that the program can do anything. What is true is that the range of behavior is not restricted by the ISO standard.

ghshephard
4 replies
22h42m

At least 15 years ago, the team responsible for developing both our ASICs and the core routing/switching code for the firmware on our networking devices worked under the consensus understanding that "Undefined Behavior" (in C, in their case) meant precisely that - it could trigger a nuclear launch, blow up the planet, etc. There was absolutely no behavior (within the confines of the limitations of the C compiler(s) for the various hardware platforms we built on) that was off the table.

A very significant amount of their effort, time, focus went into understanding precisely what was, and was not "Undefined Behavior" - the instant you did anything that was "undefined" anything that happened after that was fair game.

They also did zero dynamic memory allocation after starting an application. All memory was allocated on startup based on initial config settings.

My sense in watching that extraordinarily skilled team was that the logic and features they were building (on a very complex FHSS MAC) were secondary to convincing the compiler and hardware to do what the specifications and definitions said should happen. The great firmware developers were also pretty solid language lawyers.

torstenvl
3 replies
20h57m

I'm not saying developers who are careful about UB aren't doing the right thing. They are absolutely doing the right thing.

What I am saying is that a compiler that sees

    int8_t x;
    float x;
and does anything other than "terminating a translation or execution (with the issuance of a diagnostic message)" is doing the wrong thing.

I am also saying that a compiler that offers -fwrapv and formats your hard drive on int x = INT_MAX; x++; rather than "behaving during translation or program execution in a documented manner characteristic of the environment" is pathological, violates the spirit of the ANSI and ISO standards, and violates the letter of the ANSI standard.

anderskaseorg
2 replies
18h49m

You may not like it, and it’s within your rights not to like it, but the reality is that compilers do treat UB this way, and it’s not “grossly dishonest FUD” to point out that this is the reality. Here’s a test case demonstrating that UB can actually format your hard drive: https://bugs.llvm.org/show_bug.cgi?id=49599.

Note that one of the differences between C and Rust is that integer overflow is not UB in Rust (it panics in debug mode and wraps in release mode: https://doc.rust-lang.org/book/ch03-02-data-types.html#integ...). But there are other sources of UB in unsafe Rust, such as reads through a pointer not allowed by the memory model.

torstenvl
1 replies
18h28m

the reality is that compilers do treat UB this way

No, they really don't. https://godbolt.org/z/M6hcTx3aY

Clang will not blow your computer up just because you use #pragma STDC FENV_ROUND <direction> in accordance with its documentation.

I don't know why it became popular to make UB into a bogeyman and act like it's an intractable problem, but it isn't. Compilers handle most UB just fine.* It's better for everyone if we can focus on the actual problems rather than blowing it out of proportion with sweeping generalizations.

* All undefined behavior is undefined, but some undefined behavior is more undefined than others.

anderskaseorg
0 replies
17h27m

It seems like you think I’m saying that all compilers are required to recognize all forms of UB and format your hard drive. Of course not; some UB in one specification may be defined in another more restrictive spec, some UB may be defined by compiler flags like -fwrapv, some UB might be detected and converted to a compile-time or runtime error, and some UB might happen to behave like you expect because your compiler didn’t implement certain optimizations yet or you just got lucky.

It sounds like you agree that programmers should avoid relying on luck for UB safety. If you have a way to prove that this UB is actually safe and not just lucky, feel free to present it. Until then, I stand by everything I’ve said.

IshKebab
7 replies
1d11h

Is that even true at a hardware level? What if you read into an unmapped page or into protected memory? (I haven't read the code, maybe it has alignment guarantees that avoid this?)

mpweiher
3 replies
1d5h

You make sure you don't do that.

A trick to avoid reading beyond the end of the buffer is to make sure the end of the buffer lies on the same page. Typically, the OS will allocate memory in pages of 4KB, thus we can make a function that checks whether it is okay to read beyond or if we should fallback to the copy version.

-- https://ogxd.github.io/articles/unsafe-read-beyond-of-death/
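
As a rough sketch of the check being described (4 KiB pages and a 64-byte vector assumed; same_page is a made-up name), a full-width read starting at p avoids crossing into the next page only if:

    #include <stdint.h>
    #include <stdbool.h>

    // True if a 64-byte read starting at p stays within the 4 KiB page
    // containing p, i.e. it cannot fault on an unmapped following page.
    static bool same_page(const void *p) {
        return ((uintptr_t)p & 4095) <= 4096 - 64;
    }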

IshKebab
2 replies
1d3h

That's not a guarantee. On some systems memory protection can be sub-page (not sure about x86).

But it sounds like the masking feature mentioned in a sibling comment takes care of it anyway.

slaymaker1907
0 replies
1d

There's a debate on how unsafe/unsound this technique actually is. https://github.com/ogxd/gxhash/issues/82

I definitely see the conundrum since the dangerous code is such a huge performance gain.

dzaima
0 replies
1d3h

Masking is nice, but not available everywhere (i.e. intel is still making new generations of CPUs without AVX-512, and apple silicon doesn't have any masked loads/stores either).

It might not be the nicest thing to assume to be the case on all hardware, but it shouldn't be too unreasonable to put it under an "if (arch_has_a_minimum_page_size)". So many things already assume at least 4KB pages, Intel/AMD aren't gonna break like half the world. If anything, they'd want to make larger pages to make larger L1 caches more feasible.

b3orn
2 replies
1d10h

The code uses unaligned load and store instructions, so it should be possible to trigger memory access to unmapped addresses.

janwas
0 replies
23h46m

Unfortunately, AMD's masked AVX2 instructions reserve the right to fault even for masked-off elements :(

pdpi
20 replies
1d17h

While we're having fun with that:

    # Capitalising an eszett changes the string length.
    >>> "straße".upper()
    'STRASSE'
    
    # If you don't specify the locale, round-trip upper/lower case
    # messes up the dotless i used in turkic languages.
    >>> 'ı'.upper().lower()
    'i'

kleiba
19 replies
1d14h

Up until a few years ago, the first conversion would have been the official way to write a German word that contains the ligature ß (a lower-case letter) in all caps, because there was no corresponding upper-case letter. However, in 2017, an upper-case variant [1] was introduced into the official German orthography, so the conversion cited here should no longer be necessary.

[1] https://en.wikipedia.org/wiki/Capital_%E1%BA%9E

Tomte
14 replies
1d13h

It is extremely unusual, and the addition to Unicode was very controversial.

Basically, some type designers liked to play with capital ß and thought it cool to have it included into Unicode. There was a whole campaign for inclusion, and it was a big mistake.

Because even though existing use must be shown to merit inclusion, they only managed to find a single book (a dictionary) printed in more than a few copies. From East Germany. And that book only used the capital ß for a single edition (out of dozens of editions) and reverted straight back to double s. Somehow, that still was enough for Unicode.

Capital ß is a shit show, and it only confuses native speakers, because close to none of them have ever seen that glyph before.

It has no real historic lineage (like the long s, for example, that pretty much nobody under 80 knows, either), it is a faux-retro modern design, an idle plaything.

jabiko
4 replies
1d11h

Capital ß is a shit show, and it only confuses native speakers, because close to none of them have ever seen that glyph before.

I don't think it's particularly confusing, but it is a great addition. As of 2017 the rules allow both "SS" and "ẞ".

Moreover, the current version of DIN 5008 (Rules for Writing and Design in Text and Information Processing) actually recommends preferring "ẞ" over "SS".

gwervc
3 replies
1d11h

What's confusing for me is that the German government seems to change its mind radically on the question every few years. If I counted properly, there have already been two orthographic changes since I was schooled (German as a second language).

cedilla
1 replies
1d9h

The government isn't involved in orthography. There is an official orthography that public officials must use, but it follows the rules of the relevant non-governmental organizations.

Tomte
0 replies
1d5h

No, the government set the rules in 1996 and modified them in 2006. That was new; before that, orthography was set by convention and common usage.

Ostensibly, the reform only applied to officials in government agencies, but that also included both teachers and students in schools, and today using the old orthography in university exams is marked as an error (although many lecturers don't care and won't mark it up). Just went through it a few months ago.

Today, new orthography is not optional, at workplaces you will be corrected (and sometimes chastised) for using old orthography. Just recently a colleague went through a document I wrote and changed every "daß" to "dass". Nothing else was changed.

The ship has sailed, though, since many cohorts of students have gone through it now. I would just like people to be tolerant of what we older people learned in school. I don't want to re-learn everything. Just leave me in peace.

jabiko
0 replies
1d10h

I'm not sure what radical changes you are referring to. There was an orthography reform in 1996 (with a transitional period until 2005), but other than that only minor changes occurred.

kleiba
3 replies
1d13h

Well, what can you do... amtlich is amtlich.

And the behavior of a function that converts ß to double upper-case S can certainly be discussed too, if only for the fact that a round-trip toUpper().toLower() will not give you back the original word.

josefx
1 replies
1d12h

if only for the fact that a round-trip toUpper().toLower() will not give you back the original word.

It is an inherently destructive round trip, especially in a language that makes excessive use of upper case compared to English. When you have the word "wagen", did it originally refer to a "car" or to someone "daring" something?

vlabakje90
0 replies
1d10h

I'm not a German speaker but this is definitely not exclusive to German nor is it caused by 'excessive' capitalization. The use of capital letters to denote nouns only helps to disambiguate it in German. While in English, this distinction is never clear just from the spelling of the isolated word alone. E.g. 'contract' can either mean 'an agreement' as a noun or 'to decrease in size' as a verb. There are plenty of other examples.

I agree though that this makes the round-trip inherently destructive.

layer8
0 replies
1d12h

Note that the addition to Unicode came first, and the addition to the official German orthography only came nine years later (and might not have happened at all without the Unicode inclusion). In addition, it remains only a variant in the latter, alongside the existing SS.

jeroenhd
2 replies
1d5h

Unicode attempts to render all human-written documents. That's why U+A66E (ꙮ), a character found once in one single document, is still considered acceptable.

ẞ is not a good capital form of ß, but if that one single book ever gets transcribed into digital text, a Unicode code point is necessary for that to happen.

I doubt systems that can't deal with the ß -> SS transcription will be able to deal with ẞ in the first place.

account42
0 replies
1d4h

ꙮ is a farce like many other Unicode inclusions. Good for a laugh but best forgotten. If you want to encode that level of stylistic choice then you really need to start encoding the specific font used too.

Next you are going to ask Unicode to include all fancy stylized initials which are of course specific to the paragraph in a single book they are used in? At that point just embed SVGs directly in Unicode because the fight for any semantic meaning has been lost.

Tomte
0 replies
1d5h

It's not as simple as you make it out to be. Klingon script was rejected, partly because of "Lack of evidence of usage in published literature", despite many web sites and more than a handful of published books using it.

nuancebydefault
0 replies
1d12h

As a non native German speaker I really don't understand all the fuß around the capitalneSS of ß

kkt365
0 replies
1d8h

As a conlanger, I much appreciated the addition of a capital ß to Unicode. I use this letter in my conlang, and several words start with it, so it'd be natural to have a capital version of it to start sentences with. I was relieved to learn there is a defined spot for this letter in Unicode.

layer8
3 replies
1d13h

The upper-case ẞ remains very unconventional, and the Unicode casing algorithm continues to specify upper-case conversion to SS.

kleiba
2 replies
1d13h

Correct. This raises the question of what should be the basis of the Unicode casing algorithm: what is commonly practiced by users of a specific language (how do you measure this reliably in less clear cases than this one?) or what an official "specification" of the language defines (if such a thing exists and is widely accepted, especially when the language is spoken in more than one country)?

stefs
1 replies
1d11h

IMO: the official specification. If it doesn't match common usage after some time, the spec should be revised.

Pretty sure the actual question would be: what if there are multiple conflicting official specs?

mschuster91
0 replies
1d10h

If it doesn't match common usage after some time, the spec should be revised

Well, the problem with any kind of language spec change is that it can take decades until it gets accepted widely.

Young people are the first adopters as they get it force-fed by schools, but getting the age bracket 40+ to adopt is a real challenge. Germany's 1996 project led to years-long battles and partial reversals in 2004/06, and really old people to this day haven't accepted it.

Asooka
3 replies
1d19h

There is a difference between strings used internally, usually as IDs, and text entered by a human. For the former you'd always use straight ASCII in an 8-bit encoding; for the latter ... things get difficult. A straightforward example is DNS names - they can technically contain almost any Unicode, but that is always converted to a very limited subset of ASCII for actual DNS resolution, which in turn is case-insensitive.

There are of course things like programming languages with case-insensitive identifiers that support all human writing systems in Unicode. If that's what you're dealing with, you have my condolences.

tialaramex
0 replies
1d17h

On the wire DNS questions and answers are case preserving but not case sensitive which is important for correctness. DNS was designed a long time ago and doesn't have enough places to hide randomness (~30 bits at most are available) to protect it against a realistic attacker, so, for most names we do a terrible trick in the real world - we randomize the case during querying. This is called DNS-0x20 for obvious reasons.

Suppose a web browser wants to know news.ycombinator.com AAAA? but bad guys know it's about to ask this (e.g. they use JS to force that lookup whenever they want). They can shove a billion bogus answers (one for every possible random value) onto the wire and have a great chance of tricking the browser into accepting one that seemingly answers the question it just asked. But if we instead pick random cases - asking about, say, NeWS.yCOmbinAtOR.cOM - we can ignore answers for nEWS.yCOMBINATOR.cOM or news.ycombinator.com or NEWS.YCOMBINATOR.COM or any other spelling. Bad guys now need to do many orders of magnitude more expensive work for the same results.
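
A rough sketch of that trick (randomize_case is a made-up name; rand() stands in for a real entropy source): flip each letter of the query name to a random case before sending, and only accept answers whose name matches byte-for-byte.

    #include <ctype.h>
    #include <stdlib.h>

    // DNS-0x20: randomize the case of every letter in the query name so a
    // spoofed answer must also guess the exact casing to be accepted.
    static void randomize_case(char *name) {
        for (char *p = name; *p != '\0'; p++) {
            if (isalpha((unsigned char)*p))
                *p = (char)((rand() & 1) ? toupper((unsigned char)*p)
                                         : tolower((unsigned char)*p));
        }
    }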

mananaysiempre
0 replies
1d17h

For what it’s worth, with IDNs you’re still going to do a kind of case folding using stringprep before doing the query, and that isn’t really better than the table GP linked. ASCII-only case-insensitive matching is indeed a thing, but it’s usually mutually exclusive with (non-programmer) user-entered data.

account42
0 replies
1d3h

There are of course things like programming languages with case-insensitive identifiers that support all human writing systems in Unicode. If that's what you're dealing with, you have my condolences.

Fun times when an upgrade of the Unicode library used by your compiler changes the semantics of your program.

fanf2
1 replies
1d20h

Heh, thank goodness I don’t have to deal with all that! This code is ascii-only because it arose from working on the DNS. There are other protocols that are ascii-case-insensitive so it turns up a lot in the hot path of many servers.

tialaramex
0 replies
1d17h

This is presumably Rust's u8::to_ascii_lowercase rather than C's tolower, since tolower is locale sensitive (which you don't care about) and has Undefined Behaviour (because of course it does; it's a C function and who cares about correctness).

Or possibly u8::make_ascii_lowercase which is the same function but with in-place mutation.
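
For reference, the C-side undefined behaviour being alluded to is the classic ctype pitfall: the argument must be EOF or a value representable as unsigned char, so passing a plain char holding a negative value (any byte >= 0x80 on platforms where char is signed) is UB.

    #include <ctype.h>

    char lower_bad(char c)  { return (char)tolower(c);                /* UB if c < 0 */ }
    char lower_good(char c) { return (char)tolower((unsigned char)c); /* well-defined */ }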

atoav
0 replies
1d17h

Note for the first example which transforms the German maße to MASSE that the German language has an uppercase Eszett as well: ẞ

This is as of now not widely deployed and few fonts support it, but in theory it is there now.

Remnant44
24 replies
1d11h

Neat and performant code like the article makes me very curious how the competition will shake out between AMD's AVX512 implementation and Intel's upcoming AVX10. The entire point of AVX10 seems to be to resolve Intel's P vs E core situation, while AMD seems to have taken a better approach of using either full width (Zen5) or double-pumped 256bit (Zen4, Zen5 mobile) as appropriate to the situation, while making the API seamless.

The big gains delivered in the article are all on a double-pumped Zen4 core! AVX512 brings a lot to the table, so it's quite frustrating that Intel market-segmented support for it so heavily as to completely inhibit its adoption in broad-based client code.

Tuna-Fish
16 replies
1d11h

If Intel actually implements AVX10/256 on every CPU they ship going forwards, it will eventually win simply by availability. The market has repeatedly and thoroughly rejected dispatching to different code paths based on CPU, so the only SIMD implementation that actually matters is the lowest common denominator. And since AVX10.1/256 and AVX512VL have a shared subset, that will be what people will eventually target once enough time has passed and nearly everyone has a CPU that can support it.

AVX512 will continue to give AMD some easy wins on the few benchmarking apps that were actually updated to support it, but if Intel sticks with the AVX10 plan I expect that AMD will eventually just use the double-pumped SIMD pipes for everything, just because they are the more efficient way to support AVX10/256 while retaining AVX512 compatibility.

Intel made a lot of bad choices in the past decade, but segmenting the market based on instruction set has to be one of the worst. They just chose to kill all the momentum and interest in their newest and best innovations. Hopefully they actually add AVX10/256 support to the whole lineup, because the width is the least interesting part of AVX512; the masked operations especially are a lot more important.

janwas
7 replies
1d9h

Dispatching is actually heavily used. Image/video codecs, cryptography, and ML libraries routinely use it, because the lowest common denominator is very low indeed.

Tuna-Fish
6 replies
1d9h

The things you listed are probably less than 1% of all client loads. Video codecs and ML mostly run on accelerators and cryptography is a tiny proportion of all loads.

zekica
2 replies
1d7h

Cryptography is every download you ever do in a browser, bitlocker or equivalent encrypted disk access. Add to that gzip or brotli compression (common in HTTP). Hardware decoders for newer video codecs (such as AV1) are not that common (less than 50% of devices) so most youtube watch time on desktops/laptops is software decoding. It's a lot more than 1%.

Const-me
1 replies
1d4h

Cryptography is every download you ever do in a browser, bitlocker or equivalent encrypted disk access

For IO bandwidth-bound heavy lifting these things typically use the AES algorithm. Hardware support for that algorithm has been widely available inside CPU cores for more than a decade: https://en.wikipedia.org/wiki/AES_instruction_set#x86_archit... That hardware support is precisely what enabled widespread use of HTTPS and full disk encryption. Before AES-NI it was too slow, or it required the specialized accelerator chips found in web servers in the 2000s that needed to encrypt/decrypt HTTPS traffic.

I don't think people use AVX2 or AVX512 for AES because AES-NI is way faster. The runtime dispatch needs just a few implementations: hardware-based AES to use on 99% of the computers, and a couple of legacy SSE-only versions.

adrian_b
0 replies
1d

The original AES-NI (which were SSE instructions) and also their initial AVX counterparts performed 128-bit operations.

Later, 256-bit AES instructions were introduced, but significantly later than AVX2. Such 256-bit AES instructions, which double the AES throughput, are available in more recent CPUs, like AMD Zen 3 and Intel Alder Lake (the so-called VAES instructions).

Some of the more recent CPUs with AVX-512 support have added 512-bit AES instructions, for an extra doubling of the AES throughput.

Zen 5 (desktop and server) doubles the AES throughput in comparison with Zen 4, similarly with the double throughput for other vector operations.

In conclusion, on x86 CPUs there are many alternatives for AES, which have different throughputs: 128-bit SSE instructions since Westmere, 128-bit AVX instructions since Sandy Bridge, 256-bit VAES AVX instructions since Zen 3 and Alder Lake and 512-bit AVX-512 instructions since Ice Lake, but only in the CPUs with AVX-512 support.

richardwhiuk
2 replies
1d7h

Surely h264 encode and decode is substantial, given the large amount of video consumed?

Const-me
1 replies
1d7h

I don't believe many people encode or especially decode video with CPU-running codecs.

Modern GPUs include hardware accelerators for popular video codecs, these are typically way more power efficient than even AVX-512 software implementations.

account42
0 replies
1d4h

I don't believe many people encode or especially decode video with CPU-running codecs.

Believe what you want but as soon as realtime is not a concern and quality matters you'll be using CPU-based encoders unless you have special hardware for your use case.

Modern GPUs include hardware accelerators for popular video codecs

... with shit quality encoders because they are designed for speed first.

Decoding is a different matter, but even there older hardware can easily end up not supporting codecs (or profiles of them) that you come across.

TinkersW
2 replies
1d5h

I don't get saying mask operations are more important than width?

Mask operations can be trivially emulated with vblend; it is one extra instruction.
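
(A rough sketch of that emulation with AVX2 intrinsics, where the "mask" is a byte vector of 0x00/0xFF rather than a k register; masked_addb is a made-up name, -mavx2 assumed.)

    #include <immintrin.h>

    // Emulated masked byte add: compute the unmasked result, then blend,
    // keeping the old value wherever the mask byte is 0x00.
    static __m256i masked_addb(__m256i old, __m256i addend, __m256i mask) {
        __m256i added = _mm256_add_epi8(old, addend);
        return _mm256_blendv_epi8(old, added, mask);
    }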

Width can't be emulated, you just are stuck running half speed.

This take keeps getting repeated, but doesn't appear to be backed up by reality.

Intel hasn't even put AVX10 on their upcoming chips (Skymont), so it appears to be going nowhere.

fanf2
0 replies
1d4h

The important feature of AVX-512 demonstrated in my blog post is masked loads and stores, which can't be emulated with vblend.

account42
0 replies
1d4h

Mask operations can be trivially emulated with vblend, it is one extra instruction..

For unaligned loads where you can't guarantee that the entire vector is on a mapped page?

nemetroid
1 replies
1d7h

The market has repeatedly and thoroughly rejected dispatching to different code paths based on CPU, so the only SIMD implementation that actually matters is the lowest common denominator.

RHEL just moved up to x86_64-v2, equivalent to 2009 level CPUs. And they’re an early mover, Fedora/Ubuntu/Arch have not done the same.

Glibc has used CPU dispatching for str* and mem* functions for over a decade.

IWeldMelons
0 replies
6h34m

FreeBSD added AVX512 dispatch to its C library, AFAIK.

Remnant44
1 replies
1d11h

I agree. I still hesitate to ship code that requires even AVX1, even though it was first introduced in 2011(!).

AVX512 really improves the instruction set. Not just the masking, but it also fills some really big holes, adding instructions for things AVX2 doesn't have a good solution for.

At the same time I also DO have plenty of code that could definitely use the compute throughput improvement of 512 bit vectors. But it's definitely a more niche usage. You have to at least nominally satisfy that you: 1) Benefit from 2x the ALU throughput 2) Live mostly in the cache 3) Are not worth running on the GPU instead.

Const-me
0 replies
1d7h

It depends on the product and the market.

I’ve been shipping software that requires AVX2 and FMA3 because it is a professional CAM/CAE application. Our users typically don’t use sub-$100 Intel processors like Pentium and Celeron. The key is communication, you need to make sure users are aware of the hardware requirements before they buy.

Another example, in the server market, you can almost always count on having at least AVX2 support because most servers today are cloud-based. For people running these cloud services, power efficiency is crucial due to the high cost of electricity, so they tend to replace old hardware regularly.

On the other hand, for desktop software aimed at a broad audience, there are valid reasons to hesitate before shipping code that requires AVX1.

umanwizard
0 replies
1d9h

The market has repeatedly and thoroughly rejected dispatching to different code paths based on CPU

What do you mean? At least numpy and pytorch (the only numeric libraries I'm familiar with) both use runtime dispatching.

Aardwolf
5 replies
1d7h

I don't understand why Intel doesn't make their E-cores use double-pumped AVX512 to solve this issue (or just make P-core-only CPUs for desktops like they should). They have had years to fix this by now. It's annoying that despite AMD's support, market share keeps adoption from happening anyway, and the AVX10 thing will unfortunately allow them to hold back the world even longer.

What I like to see (for desktop) is: better cores, more cores, well standardized instruction sets that unlock useful things (wide SIMD, float16, gather/scatter, ...). AMD is doing pretty well at this. What Intel is doing instead: put weak cores alongside decent cores, cripple the decent cores to keep up with the weak cores, release CPUs with the same amount of cores as before for many generations in a row, use the weak cores to make it sound like they have more cores than they have, bring out way too many variants of instructions sets to ever have a useful common set, drop support for their own promising sounding instructions

I just really dislike anything Intel has come out with or plans to come out with lately :p

My cycle of preference of manufacturer (for desktop computers) has been: 90s: Intel. Early 2000s: AMD (Pentium 4 was so meh). Late 2000s + 2010s: Intel. Now: AMD again. What will Intel do to gain a foothold again (that isn't sabotaging the other)? We need to keep the competition going, or the other side may get too comfortable.

pbsd
3 replies
21h53m

A credible but unconfirmed rumor I've read is that Intel didn't want to do it because of the 512-bit shuffles. The E-cores (including those using the upcoming Skymont microarchitecture) are natively 128-bit and already double-pump AVX2, so to quad-pump AVX-512 would result in many-uops 512-bit shuffle instructions made out of 128-bit uops.

This would render the shuffles unusable, because you'd unpredictably have them costing anywhere from 1 uop/cycle to 10-30 uops/cycles depending on which core you are on at the moment. A situation similar to PEXT/PDEP, which cost almost nothing on Intel and dozens of cycles on AMD until a couple of generations ago.

Why does Zen 4 not have this problem? First, they're only double-pumping instead of quad-pumping. Secondly, while most of their AVX-512 implementation is double-pumped, there seems to be a full-length shuffle unit in there.

Aardwolf
2 replies
21h10m

Interesting stuff! If a shuffle is just crossed wires without any logic (as far as I know shuffle is a completely predetermined bit permutation), wouldn't it fit on the E-cores easily enough?

pbsd
1 replies
21h4m

AVX-512 allows arbitrary shuffles, e.g., shuffle the 64 bytes in zmm0 with indices from zmm1 into zmm2. Simple shuffles like unpacks etc aren't really an issue.

dzaima
0 replies
21h0m

Worse yet (for wiring complexity or required uops, anyway), AVX-512 also has shuffles with two data inputs, i.e. each of the 64 bytes of result can come from any of 128 different input bytes, selected by another 64-byte register.
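
In intrinsic form (names from Intel's intrinsics guide; AVX512_VBMI assumed), the two shuffles being discussed look roughly like this:

    #include <immintrin.h>

    // One data input: each result byte is any of the 64 bytes of `data`.
    __m512i shuffle1(__m512i idx, __m512i data) {
        return _mm512_permutexvar_epi8(idx, data);
    }

    // Two data inputs: each result byte is any of the 128 bytes of a/b,
    // selected by the index register.
    __m512i shuffle2(__m512i a, __m512i idx, __m512i b) {
        return _mm512_permutex2var_epi8(a, idx, b);
    }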

IWeldMelons
0 replies
6h21m

What they did is actually even worse: they briefly introduced AVX-512 FP16 in Alder Lake - excellent, turbo-fast instructions for AI - only to immediately drop it.

neonsunset
0 replies
1d2h

Zen 4 AVX512 implementation is not double-pumped and tech journos need to stop calling it that, because it has specific meaning that does not match what takes place.

It simply decodes operations on ZMM registers into multiple uOPS and schedules them to free 256b units. In addition, Zen 4 has special handling of 512b full-width shuffles, with dedicated hardware to avoid doing very expensive emulation. As a result, Zen 4 with its 4 256b SIMD units still acts like a very strong 2x512b core. There is nothing cheap about this implementation and it is probably the best rendition for consumer hardware so far.

kolbe
20 replies
1d19h

Did you try something like this? The autovectorizer looks pretty clean to me.

https://godbolt.org/z/1c5joKK5n

fanf2
19 replies
1d19h

That’s basically the same as `tolower1` - see the bullet points below the graph

kolbe
18 replies
1d18h

Well, the conundrum sits with the fact that this is the disassembly of the main loop:

        vmovdqu64       zmm3, zmmword ptr [rdi + rcx]
        vmovdqu64       zmm4, zmmword ptr [rdi + rcx + 64]
        vmovdqu64       zmm5, zmmword ptr [rdi + rcx + 128]
        vmovdqu64       zmm6, zmmword ptr [rdi + rcx + 192]
        vpaddb  zmm7, zmm3, zmm0
        vpaddb  zmm8, zmm4, zmm0
        vpaddb  zmm9, zmm5, zmm0
        vpaddb  zmm10, zmm6, zmm0
        vpcmpltub       k1, zmm7, zmm1
        vpcmpltub       k2, zmm8, zmm1
        vpcmpltub       k3, zmm9, zmm1
        vpcmpltub       k4, zmm10, zmm1
        vpaddb  zmm3 {k1}, zmm3, zmm2
        vpaddb  zmm4 {k2}, zmm4, zmm2
        vpaddb  zmm5 {k3}, zmm5, zmm2
        vpaddb  zmm6 {k4}, zmm6, zmm2
        vmovdqu64       zmmword ptr [rdi + rcx], zmm3
        vmovdqu64       zmmword ptr [rdi + rcx + 64], zmm4
        vmovdqu64       zmmword ptr [rdi + rcx + 128], zmm5
        vmovdqu64       zmmword ptr [rdi + rcx + 192], zmm6
...which, upon first glance, is similar to yet better than the intrinsics version you wrote. Additionally it has cleaner tail handling.

fanf2
16 replies
1d18h

The tail handling is the main point of the post. The tolower1 line on the chart is very spiky because the autovectorizer doesn’t use masked loads and stores for tails, and instead does something slower that tanks performance. The tolower64 line is smoother and rises faster because masked loads and stores make it easier to handle strings shorter than the vector size.

kolbe
15 replies
1d15h

There is something to be said for using masked load/stores instead of downgrading the SIMD register types. But to my eye, something is clearly wrong with your code in such a way that the "naive" autovectorized version should be blowing yours out of the water at 256B. Your code leaves a sparse instruction pipeline, and thus needs to run through the loop 4 times for every iteration of the autovectorized version. So, something is wrong here, as far as I know. I was just trying to make sure you'd covered the basics, like ensuring you were targeting the right architecture with your build and whatnot.

This is yours. Basically the same instructions, but taking up way more space:

        vmovdqu64       zmm3, zmmword ptr [rsi]
        vpaddb  zmm4, zmm3, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm3 {k1}, zmm3, zmm2
        vmovdqu64       zmmword ptr [rdi], zmm3

dzaima
14 replies
1d14h

Unrolling on AVX-512, especially on Zen 4 with its double-pumped almost-everything, isn't particularly significant; the 512-bit store alone has a reciprocal throughput of 2 cycles/zmm, which gives a pretty comfortable best possible target of 2 cycles/iteration. With Zen 4 being 6-wide out of the op cache, that's fine for up to 12 instructions in the loop (and with separate ports for scalar & SIMD, the per-iteration scalar op overhead is irrelevant).

Interestingly enough, clang 16 doesn't unroll the intrinsics version, but trunk does, which'd make the entire point moot.

The benchmark in question, as per the article, tests over 1MB in blocks, so it'll be at L2 or L3 speeds anyway.

Clang downgrades to ymm, which'll handle a 32-byte tail, but after that it does a plain scalar loop, up to a massive 31 iterations. Whereas masking appears to be approximately free on Zen 4, so I'd be surprised if the scalar tail would be better at even, like, 4 bytes (perhaps there could be some minor throughput gained by splitting up into optional unmasked ymm and then a masked ymm, but even that's probably questionable just from the extra logic/decode overhead).

Also worth considering that in real-world usage the masking version could be significantly better just from being branchless for up to 64-byte inputs.

kolbe
12 replies
1d2h

L2 speeds are ~180GB/s on Zen 4. That's also a part of my confusion. This should be line speed.

I do not have the same experience as you with unrolling AVX512 loops on Zen 4. I recall even with double pumping, you can do 2.5 per cycle. As you noted with the stores, it takes two cycles, so you can put 4 into a 3.25 cycle pipeline, instead of 8 cycles. With 5 dependent ops covering 4.5 cycles, this should be a significant win.

I'm not defending the 31 length loop to clean up the mod32 leftover section. That is bad. But it doesn't answer why 256B is 4x slower than line speed and not significantly faster than the unrolled intrinsic version.

In my experience, I never used the masked load store because some platforms did an actual read over the masked-away parts, and could segfault. I recall hearing from a reliable source that Zen 4 doesn't do that, but didn't see official documentation for it. Clang may actually be avoiding the masked cleanup for that reason.

To top it off, I also always found it faster to just stagger the index back instead of doing a masked load/store whenever the length is longer than 64 on calculations like this. That is, if size=80, do 0-63, and then 16-79 (which is an optimization Clang doesn't do either, for some reason).

Finally, what really really confuses me is that whenever I write a benchmark like this:

         for(size_t i = 0; i < sizeof(src); i++) {
                 src[i] = (uint8_t)(i & 63) + 32;
         }
-O3 will do something absurd like just return the answer, since the inputs are constexpr. I can understand why the intrinsic version might confuse the compiler, but the clearly written one should totally have broken the benchmark and overwritten it with a constexpr answer.

dzaima
6 replies
1d1h

Also of concern is that the input, and thus the loads & stores here, are intentionally bumped to cycle through all possible alignments, thus ending up unaligned most of the time, in which case it should be 3 cycles per store (I think?).

I don't understand your point about pipelining - OoO should mean that, as long as there's enough decode bandwidth and per-iteration scalar overhead doesn't overwhelm scalar execution resources, all SIMD ops can run at full force up to the most contended resource (store here), no?

That said, yeah, ~44GB/s is actually still pretty slow here, even for L3.

Masked load/store faulting was problematic on AVX2 (in addition to being pretty slow on AMD (which actually continues into Zen 4, despite it having fast AVX-512 versions)); AVX-512's should always be fine, and compilers already output them: https://godbolt.org/z/98sY57TE1

Intrinsics shouldn't "confuse" clang - clang lowers them, where possible, to the same LLVM instructions that the autovectorizer would generate. Both clang and gcc can even convert an intrinsics-based memcpy/memset impl to a libc call (as annoying as that may be)!

If you want a compiler to not optimize out computation, you can add something like `__asm__ volatile(""::"r"(src):"memory");` after the loop to make it operate as if the contents of src were modified/read.
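
Applied to the initialization loop quoted earlier, that suggestion looks roughly like this (GCC/Clang extended asm assumed; src and init are illustrative names):

    #include <stdint.h>
    #include <stddef.h>

    uint8_t src[1 << 20];

    void init(void) {
        for (size_t i = 0; i < sizeof(src); i++) {
            src[i] = (uint8_t)(i & 63) + 32;
        }
        // Empty asm with src as input and a "memory" clobber: the compiler
        // must assume src is read here, so the loop cannot be folded away.
        __asm__ volatile("" :: "r"(src) : "memory");
    }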

kolbe
4 replies
1d

I don't understand your point about pipelining - OoO should mean that, as long as there's enough decode bandwidth and per-iteration scalar overhead doesn't overwhelm scalar execution resources, all SIMD ops can run at full force up to the most contended resource (store here), no?

You are reaching the limits of my understanding, but my level of knowledge is that store may have reciprocal throughput of 2, but it only occupies two ops (from double pumping a single one) over those two cycles, while the CPU pipeline can handle doing 10. For store in particular, nothing is dependent on it completing, so it can be "thrown into the wind" so to speak. But here's my approximation of the pipeline of a single thread, where dashes separate ops

LOADU.0 - LOADU.1 - _ - _ - _ - _ - ADD.0 - ADD.1 - _ - _ - CMP.0 - CMP.1 - _ - _ - _ - _ - ADD.0 - ADD.1 - _ - _ - STORE.0 - STORE.1 - [start again, because nothing is dependent on STORE completing]

So, that's 10 ops and 12 empty spots that can be filled by doing 1.2 more loops simultaneously.

I do want to know why clang isn't using the masked load/store. If it's willing to do it on a dot-product, it should do it here as well. It makes me want to figure out what is blocking it (usually some guarantee that 99.9% of developers don't know they're making).

dzaima
2 replies
1d

The store will still take up throughput even if nothing depends on it right now - there is limited hardware available for copying data from the register file to the cache, and its limit is two 32-byte stores per cycle, which you'll have to pay one way or another at some point.

With out-of-order execution, the layout of instructions in the source just doesn't matter at all - the CPU will hold multiple iterations of the loop in the reorder buffer, and assign execution units from multiple iterations.

e.g. see: https://uica.uops.info/?code=vmovdqu64%20zmm3%2C%20zmmword%2...

(click run, then Open Trace); That's Tiger Lake, not Zen 4, but still displays how instructions from multiple iterations execute in parallel. Zen 4's double-pumping doesn't change the big picture, only essentially meaning that each zmm instr is split into two ymm ones (they might not even need to be on the same port, i.e. double-pumping is really the wrong term, but whatever).

kolbe
1 replies
22h19m

The store will still take up throughput even if nothing depends on it right now - there is limited hardware available for copying data from the register file to the cache, and its limit is two 32-byte stores per cycle, which you'll have to pay one way or another at some point.

Sure. But that limit is one cycle--not two. This is getting pretty above my pay grade.

That tool is nifty, but I couldn't really figure out why it supports your assertion. I plugged in the two loop bodies and got these predicted throughput results:

Unrolled 4x: uiCA 6.00 llvm-mca 6.20

Regular: uiCA 2.00 llvm-mca 2.40

LLVM pretty strongly supports my experience that unrolling/reordering should be a substantial gain here, no? uiCA still has a meaningful gain as well.

dzaima
0 replies
22h9m

Sorry, mixed things up - store has a limit of one 32-byte store per cycle (load is what has two 32-byte/cycle). Of additional note is Zen 4's tiny store queue of 64 32-byte entries, which could be at play (esp. with it pointing at L3; or it might not be important, I don't know how exactly it interacts with things).

The uiCA link was just to show how out-of-order execution works; the exact timings don't apply as they're for Tiger Lake, not Zen 4. My assertion being that your diagram of spaces between instructions is completely meaningless with OoO execution, as those empty spaces will be filled with instructions from another iteration of the loop or other surrounding code.

Clang is extremely unroll-happy in general; from its perspective, it's ~free to do, has basically zero downsides (other than code size, but no one's benchmarking that too much) and can indeed maybe sometimes improve things slightly.

dzaima
0 replies
23h53m

On clang masked load/store tail - there are some flags to change its behavior, but seems like nothing does just a masked tail, and what does exist still has work to do, e.g. here's a magic incantation that I found looking through LLVM's source code to make it do a single body for the entire loop: https://godbolt.org/z/1Kb6qTx31. It's hilariously inefficient though, and likely meant for SVE which is intentionally designed for things like this.

fanf2
0 replies
1d

Clang rewrites my tolower64() into something rather more clever, and it adds a 2x unrolled version. (gcc keeps the object code much closer to what I wrote.) But from my point of view the important thing is that Clang retains the masked load and store for short string fragments. Its autovectorizer is not able to generate masked loads and stores.

I used Clang 11 for copybytes64() because it is unable to recognize the memcpy() idiom, whereas Clang 16 does turn it into memcpy() - which is slower!

IWeldMelons
2 replies
6h38m

Never heard about reading the masked area. Source?

dzaima
1 replies
5h7m

AMD64 manual volume 4 (https://www.amd.com/content/dam/amd/en/documents/processor-t...):

Exception and trap behavior for elements not selected for loading or storing from/to memory is implementation dependent. For instance, a given implementation may signal a data breakpoint or a page fault for doublewords that are zero-masked and not actually written.

IWeldMelons
0 replies
1h42m

I wonder if it ever happens IRL, as this would severely restrict the usefulness.

EDIT: does not seem to be applicable to AVX-512, only to AVX 1.

According to this PDF by Intel, AVX-512 suppresses faults for masked accesses.

https://gcc.gnu.org/wiki/cauldron2014?action=AttachFile&do=g...

fanf2
0 replies
1d1h

The benchmark measures the time to copy about 1 MiByte, in chunks of various lengths from 1 byte to 1 kilobyte. I wanted to take into account differences in alignment in the source and destination strings, so there are a few bytes between each source and destination string, which are not counted as part of the megabyte.

On my Zen 4 CPU the L2 cache is 1 MiB per core, so because the working set is over 2 MB I think the relevant speed limit is the L3 cache.

To be sure I was measuring what I thought I was, I compiled each function separately to avoid interference from inlining, code motion, loop fusion, etc. Of course in real code it's more likely that you would want to encourage inlining, not prevent it!

fanf2
0 replies
1d11h

Yeah the spikes in the “tolower1” line illustrate the 32 byte tail pretty nicely.

I should maybe draw a version of the chart covering just small strings, but it’s an SVG so you can zoom right in. The “tolower1” line shows relatively poor performance compared to “tolower64”, tho it is hard to see for strings less than 8 bytes.

harry8
0 replies
1d17h

The autovectorizer can, at times, when you check it, produce reasonable SIMD code with a given compiler from simple, clear C code. Is gcc as good in this case? Are previous versions of clang (that may be employed by users) going to work out as well? How do you check in your build that the compiler did the right thing? If you touch the C code in any way, will it silently do something you don't want?

Autovectorization is great and improving.

Joker_vD
12 replies
1d18h

...you know, while I personally think that the RISC approach was an honest mistake, stuff like this makes me see why some people wanted to get rid of complex instructions.

Well, supposedly RISC-V implementations will have none of this malarkey while still rivaling x64/ARM64 in processing speed at comparable technology/clock rates/prices, just with plain old loads-and-xors-and-stores?

pca006132
4 replies
1d18h

RISC-V has SIMD extension as well. Even when there is no SIMD, prefetching or instruction selection/scheduling will have a big impact on the performance, so it is unlikely one can easily write a few lines of assembly and get to a similar level of performance.

Narishma
3 replies
1d3h

I don't think RISC-V's SIMD extension is very popular. At least I can't think of any available core implementing it. The vector extension is much more common.

Narishma
0 replies
21h27m

That's the vector extension (V) rather than packed SIMD (P).

Findecanor
0 replies
1d2h

The "P" (Packed SIMD) extension is still under development. It uses GPRs and is intended for smaller cores for which V would be too heavyweight.

The proposal originates with Andes, and one of their own ISAs. They have several RISC-V cores with an early draft of it.

dralley
2 replies
1d17h

Do the RISC-V vector instructions cover the whole gamut that x86 does? (or at least the modern AVX-512 / AVX-10 coding style)

dzaima
1 replies
1d17h

RVV has: masking for everything (though for things like loop tail handling (or even main body) using VL is better and much nicer); all the usual int & FP widths; indexed (gather & scatter) & strided & segmented loads & stores (all potentially masked); all operations support all types where at all possible - including integer division of all widths, and three sign variations for the high half of the 128-bit result of a 64-bit int multiply; And (of course) has 8-bit shifts, which AVX-512 somehow doesn't have.

All while being scalable, i.e. minimum vector register width (VLEN) is 128-bit for the 'v' extension, but hardware can implement up to 65536-bit vectors (and software can choose to either pretend they're 128-bit, or can be written such that it portably scales automatically); and if you want more than 128 bits portably there's LMUL, allowing grouping registers up to 8 in a group, giving up to at-least-1024-bit registers.

For shuffles it has vrgather, which supports all element width lookups and can move any element to any other element (yes, including at LMUL=8, though as you can imagine it can be expected to slow down quadratically with LMUL; and could even become a problem at LMUL=1 for hardware with large VLEN, whenever that becomes a thing).

fanf2
0 replies
1d17h

Thanks for those details, it sounds like it should be very nice for short strings, and more like SVE than AVX

dzaima
0 replies
1d18h

RISC-V does have RVV, which similarly can do SIMD, has masking, but also has a vector length separate from masks: https://godbolt.org/z/rrEW85snh. Complete with its own set of ~40000 C intrinsics (https://dzaima.github.io/intrinsics-viewer).

Though, granted, RVV is significantly more uniform than AVX-512 (albeit at the cost of not having some useful goodies).
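
As a rough sketch (using the v1.0 C intrinsics, untested), a VL-driven tolower loop needs no separate tail handling at all - the last iteration simply gets a shorter VL:

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    void tolower_rvv(uint8_t *dst, const uint8_t *src, size_t len) {
        while (len > 0) {
            size_t vl = __riscv_vsetvl_e8m8(len);
            vuint8m8_t c = __riscv_vle8_v_u8m8(src, vl);
            /* is_upper = (c - 'A') < 26, the usual unsigned range trick */
            vbool1_t is_upper = __riscv_vmsltu_vx_u8m8_b1(
                __riscv_vsub_vx_u8m8(c, 'A', vl), 26, vl);
            /* add 0x20 only in the masked lanes; other bytes pass through */
            vuint8m8_t lower = __riscv_vadd_vx_u8m8_mu(is_upper, c, c, 0x20, vl);
            __riscv_vse8_v_u8m8(dst, lower, vl);
            src += vl; dst += vl; len -= vl;
        }
    }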

bjoli
0 replies
1d7h

Considering all x86 processors I know about use a RISC architecture internally, I am not sure what actual benefits you get from a CISC.

Tuna-Fish
0 replies
1d11h

Complicated vector instructions like these are not really antithetical to RISC.

The core of modern RISC thought is basically: "The laws of physics mean that no matter how much hardware you throw at it, only some kinds of instructions can be implemented in a performant way. We should only include these kinds of instructions in the instruction set." Then you build more complex operations out of these simple building blocks, and because every instruction provided can be reasonably implemented to run really fast, the CPU itself can be fast.

Masked vector adds belong in the set of instructions that can be implemented to be fast, and that's why they are included in the RVV RISC-V extension. An example of an instruction that cannot be implemented to be fast would be the humble x86 load+add, where you first look up a value in memory, and then add it to a register. The only reasonable way to implement this to be fast is to just split it into two separate operations which are also dispatched separately, and that is precisely what modern x86 does.

neonsunset
9 replies
1d18h

Mask add looks neat! I wish there was a way to directly manipulate AVX512's mask registers in .NET intrinsics but for now we have to live with "recognized idioms".

For anyone interested, the author's core loop in ASM is as compiled by GCC

    .L3:
        vmovdqu8        zmm0, ZMMWORD PTR [rcx+rax]
        vmovdqa64       zmm1, zmm0
        vpcmpb  k1, zmm0, zmm4, 5
        vpcmpb  k0, zmm0, zmm3, 2
        kandq   k1, k1, k0
        vpaddb  zmm1{k1}, zmm0, zmm2
        vmovdqu8        ZMMWORD PTR [rdi+rax], zmm1
        add     rax, 64
        cmp     rax, r8
        jne     .L3
uiCA (CQA/MAQAO) (https://uica.uops.info/, make sure to pick CQA + Ice Lake) says it achieves a nice 32 B/cycle on Ice Lake. If you multiply by 3 to match a 3 GHz clock, this gives us about 96 GB/s, assuming memory access is not a bottleneck (it always is in such algorithms).

But this does not seem as close to optimal utilization as it could be. Using Clang instead yields a much better, nicely unrolled result with better instruction selection.

    .LBB0_9:
        vmovdqu64       zmm3, zmmword ptr [rsi]
        vmovdqu64       zmm5, zmmword ptr [rsi + 64]
        vmovdqu64       zmm6, zmmword ptr [rsi + 128]
        add     rdx, -512
        vpaddb  zmm4, zmm3, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm4, zmm5, zmm0
        vpaddb  zmm3 {k1}, zmm3, zmm2
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm4, zmm6, zmm0
        vpaddb  zmm5 {k1}, zmm5, zmm2
        vmovdqu64       zmmword ptr [rcx], zmm3
        vpcmpltub       k1, zmm4, zmm1
        vmovdqu64       zmmword ptr [rcx + 64], zmm5
        vmovdqu64       zmm5, zmmword ptr [rsi + 192]
        vpaddb  zmm6 {k1}, zmm6, zmm2
        vmovdqu64       zmmword ptr [rcx + 128], zmm6
        vmovdqu64       zmm6, zmmword ptr [rsi + 256]
        vpaddb  zmm4, zmm5, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm4, zmm6, zmm0
        vpaddb  zmm5 {k1}, zmm5, zmm2
        vpcmpltub       k1, zmm4, zmm1
        vmovdqu64       zmmword ptr [rcx + 192], zmm5
        vmovdqu64       zmm5, zmmword ptr [rsi + 320]
        vpaddb  zmm6 {k1}, zmm6, zmm2
        vmovdqu64       zmmword ptr [rcx + 256], zmm6
        vmovdqu64       zmm6, zmmword ptr [rsi + 384]
        vpaddb  zmm4, zmm5, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm4, zmm6, zmm0
        vpaddb  zmm5 {k1}, zmm5, zmm2
        vpcmpltub       k1, zmm4, zmm1
        vmovdqu64       zmmword ptr [rcx + 320], zmm5
        vmovdqu64       zmm5, zmmword ptr [rsi + 448]
        vpaddb  zmm6 {k1}, zmm6, zmm2
        add     rsi, 512
        vmovdqu64       zmmword ptr [rcx + 384], zmm6
        vpaddb  zmm4, zmm5, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm5 {k1}, zmm5, zmm2
        vmovdqu64       zmmword ptr [rcx + 448], zmm5
        add     rcx, 512
        cmp     rdx, 63
        ja      .LBB0_9
This extracts a more impressive 42.67 B/cycle. I don't think even the L2 cache can sustain such throughput, but it's nice to know that medium-length strings get up/downcased in about the same time it takes light from your screen to reach your cornea.

The core loop for short lengths is one instruction shorter:

    .LBB0_5:
        vmovdqu64       zmm3, zmmword ptr [rsi + rcx]
        vpaddb  zmm4, zmm3, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm3 {k1}, zmm3, zmm2
        vmovdqu64       zmmword ptr [rax + rcx], zmm3
        add     rcx, 64
        cmp     r8, rcx
        jne     .LBB0_5
Some months ago I wrote a similar ASCII-in-UTF-8 upcase/downcase implementation in C#: https://github.com/U8String/U8String/blob/main/Sources/U8Str...

(the unrolled conversion for below-vector-width lengths is required because short strings dominate most codebases, so handling them fast is important - the switch compiles to a jump table and then a branchless fall-through to the return)

For now it goes only as wide as 256 bits, since that already saturates e.g. Zen 3 or Zen 4, which have only 4x256-bit SIMD units (even though Zen 4 can do fancy 512-bit shuffles natively and has a very good 512-bit implementation). The core loop compiles to a compact

       G_M48884_IG05:
              vmovups  ymm3, ymmword ptr [rdi+rax]
              vpaddb   ymm4, ymm3, ymm1
              vpcmpgtb ymm4, ymm2, ymm4
              vpand    ymm4, ymm4, ymm0
              vpor     ymm3, ymm4, ymm3
              vmovups  ymmword ptr [rsi+rax], ymm3
              add      rax, 32
              cmp      rax, rdx
              jbe      SHORT G_M48884_IG05
Side by side with C ones: https://godbolt.org/z/eTGYhTPan

I believe you can also achieve a similar 3-instruction conversion with AVX-512's vpternlogd: when I had access to AVX-512 hardware, this is what .NET optimized it to for 256-bit width + AVX512VL, but strangely enough I can't make it do so for 512-bit width right now.

You may notice a failed SWAR attempt for the switch-dispatch case, and I was wondering what kind of license your posts are distributed under? (I gave up on it back then because the per-element fall-through was already fast enough, but if yours passes the test suite, I'd love to use it haha)

inopinatus
3 replies
1d17h

Clang and GCC take differing approaches to intrinsics: Clang is more likely than GCC to deviate from the Intel guide's specified opcodes & algorithms, and this is particularly noticeable with AVX-512 instructions. It's understandable given their respective architectures. Sometimes the result is an improvement, sometimes it is a detriment.

A couple of years ago I worked on a heavily vectorized project that was intended to compile with either, and wound up maintaining inline asm and .S in the repository for specific targets alongside the C reference version. That made for some ugly Makefile shenanigans, and also meant including benchmarking as part of the test suite. It adds up to considerable maintenance burden, so the takeaway for me was that using Intrinsics as a low-level means to improve on the autovectorizer should be only very sparingly considered.

Edit to add: quick example, from my notes during that project, https://godbolt.org/z/T4Pjhrz5d ; the GCC output is what was expected, the Clang output was a surprise, and noticeably slower in practice, even when inlined. When looped (or similarly if unrolled), uiCA clocks it at 7 cycles to GCC's 4, and this was borne out by their benchmark performance in application code, in which this function was performed a few billion times in the course of a brute-forcing algorithm (which is to say, it mattered). I recall finding other issues where a dive into the LLVM codebase suggested that Clang 16 might be entirely unable to issue some masked AVX-512 instructions due to internal refactorings.

glandium
1 replies
1d14h

Did you file bugs with testcases?

inopinatus
0 replies
1d10h

Sadly didn’t have time (this was not a funded project and I am far from sufficiently up to speed on LLVM internals). I still hope to get around to writing something up during the next break.

Remnant44
0 replies
1d14h

I've run into the same behavior with clang and intrinsics. While I appreciate the fact that they're trying to optimize the intrinsics usage, there really does need to be a flag or pragma you can pass that says something along the lines of "no really, give me what I asked for." In some cases I have found that the code it produces is a significant pessimization from what I had selected.

Sesse__
2 replies
1d10h

One thing that's not all that obvious from this large chunk of assembly is that Clang manages to rewrite

  (x >= 'a' && x <= 'z')
into

  (x - 'a') < <some constant>
which saves one instruction (and sometimes, a register load due to weird opcode encoding thingies).
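
In scalar terms the rewrite is just this (a tiny sketch; it relies on unsigned wraparound, so anything below 'a' wraps to a large value and fails the comparison):

    static int is_ascii_lower(unsigned char x) {
        return (unsigned char)(x - 'a') < 26;
    }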

neonsunset
1 replies
1d7h

This is a common transformation for range-check style comparisons. It is always nice to see that LLVM reasons about it in a generalized fashion and applies it to vector operations, and it is what I ultimately ended up with when writing an implementation similar to the author's.

Sesse__
0 replies
20h38m

I sometimes write intrinsic comparisons (e.g., something like "x < 'a'" instead of "_mm_cmplt_epi8(x, _mm_set1_epi8('a'))") just to make sure the compiler understands what's going on and can do that sort of transformations. (It also sometimes makes for more code sharing between e.g. x86 and NEON.)
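
Concretely, something like this with the GCC/Clang vector extensions (a sketch; the v16u8 typedef is mine, and recent GCC/Clang broadcast the scalar operands):

    #include <stdint.h>

    typedef uint8_t v16u8 __attribute__((vector_size(16)));

    v16u8 tolower16(v16u8 x) {
        /* comparison lanes are all-ones when true, all-zeros when false */
        v16u8 is_upper = (v16u8)((x >= 'A') & (x <= 'Z'));
        return x | (is_upper & 0x20);
    }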

fanf2
1 replies
1d17h

Thanks for that analysis, very informative!

I have not tried to get the best possible performance: at first I wanted to see if it would work at all, and the fact that my first attempt performed really well was a bonus! My main point of interest is strings less than the size of a vector register, and getting rid of the troughs in the throughput chart.

You can click through the link to the code at the end of the blog post, which has all the licence details. It is 0BSD or MIT-0 except for the parts written originally for BIND which are MPL-2.0.

neonsunset
0 replies
1d17h

Just to clarify - this was about the SWAR version, as my interest, similar to what you mentioned in a sibling comment, is mainly in below-vector-width strings, because these are what developers most frequently pass to these kinds of functions. I was going to see whether it can further "amortize" the jump table with fall-through for variable lengths below 16 in my C# UTF-8 string implementation that is going to use it.

In any case, thanks for writing the article. For some reason there is quite a bit of negativity among users and developers towards AVX-512 - it is seen as "AVX2 but twice as wide", which of course it isn't, but most do not know that.

Also, you may want to simply do the overlapping conversion for the tail instead of masking out the elements, which is usually faster on the benchmarks (at least give it a try), and also align the source and destination for the loop to avoid split loads/stores.
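
Something along these lines, for example (only a sketch for len >= 64, not your code; the tail block is simply recomputed over the last 64 bytes, which is harmless for a pure byte-wise transform like this):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* lowercase exactly 64 bytes (AVX-512BW) */
    static void tolower_block64(uint8_t *dst, const uint8_t *src) {
        __m512i c = _mm512_loadu_si512((const void *)src);
        __mmask64 ge_A = _mm512_cmpge_epu8_mask(c, _mm512_set1_epi8('A'));
        __mmask64 le_Z = _mm512_cmple_epu8_mask(c, _mm512_set1_epi8('Z'));
        __m512i lo = _mm512_mask_add_epi8(c, ge_A & le_Z, c, _mm512_set1_epi8(0x20));
        _mm512_storeu_si512((void *)dst, lo);
    }

    /* full blocks, then one overlapping block ending at the last byte */
    void tolower_overlap(uint8_t *dst, const uint8_t *src, size_t len) {
        size_t i = 0;
        for (; i + 64 <= len; i += 64)
            tolower_block64(dst + i, src + i);
        if (i < len)                        /* assumes len >= 64 */
            tolower_block64(dst + len - 64, src + len - 64);
    }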

xyst
3 replies
1d17h

Now account for Unicode characters (ie, Â -> â) and I’ll be impressed.

Shame most programmers only care about ASCII. There is a whole world that exists outside of the standard [a-z,A-Z,0-9] character set

rurban
2 replies
1d13h

He wrote about tolower(), not towlower()/wcslwr().

Brananarchy
1 replies
1d12h

C tolower() is locale aware and handles utf-8 as well as many other encodings

fanf2
0 replies
1d10h

It can only handle single-byte character sets, not UTF-8

ipunchghosts
2 replies
1d18h

I'm lost, what is swar?

singron
0 replies
1d18h

"SIMD Within A Register"

I think the implication is that you can pack multiple items into an ordinary register and effectively get SIMD even if you aren't using explicit SIMD instructions. E.g. if you pack a 31-bit and a 32-bit number into a 64-bit register (you need one spare bit for the carry), you can do 2 adds with a single 64-bit add.

Games have used these tricks for graphics to pack RGB(A) values into 32 bit integers. E.g. this code from scummvm interpolates 2 16-bit RGB pixels (6 total components) packed into a 32-bit value. https://github.com/scummvm/scummvm/blob/master/graphics/scal...
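
For tolower specifically, a SWAR version over 8 bytes at a time looks roughly like this (one common formulation of the trick; I haven't checked it against the article's code):

    #include <stdint.h>

    /* lowercase 8 ASCII bytes packed in a uint64_t; bytes with the high
       bit set (non-ASCII) are left untouched */
    uint64_t tolower8(uint64_t octets) {
        uint64_t all      = 0x0101010101010101;
        uint64_t heptets  = octets & (0x7F * all);        /* low 7 bits of each byte */
        uint64_t is_gt_Z  = heptets + (0x7F - 'Z') * all; /* 0x80 where byte > 'Z'  */
        uint64_t is_ge_A  = heptets + (0x80 - 'A') * all; /* 0x80 where byte >= 'A' */
        uint64_t is_upper = (is_ge_A ^ is_gt_Z) & ~octets & (0x80 * all);
        return octets | (is_upper >> 2);                  /* 0x80 >> 2 == 0x20 */
    }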

paulryanrogers
0 replies
1d18h

SIMD within a register

LelouBil
2 replies
1d18h

Finally, we do a masked add. We pass c twice: bytes from the first c are copied to the result when is_uppper is false, and when is_upper is true the result is c + to_upper.

Is that an error in the post? Shouldn't it do the addition when is_upper is false and copy the same when it is true?

jmalicki
0 replies
1d18h

The operation is `tolower`

Capital A is 0x41, lowercase a is 0x61; the difference is the 0x20 bit.

The addition of 0x20 needs to happen when is_upper is true.

fanf2
0 replies
1d8h

D'oh, I must have seen your comment 10 times before I realised the to_upper variable is named backwards and should be called to_lower. Thanks for pointing out that it was confusing! I've fixed the post and the code.

dolmen
1 replies
1d9h

Unfortunately those SWAR optimizations are only useful for strings that are aligned on an 8-byte address.

If your SWAR algorithm is applied to a non-aligned string, it is often slower than the original algorithm.

And splitting the algorithm into 3 parts (handling the beginning up to an aligned address, then the aligned part, and then the less-than-8-byte tail) takes even more instructions.

Here is a similar case on a false claim of a faster utf8.IsValid in Go, with benchmarks: https://github.com/sugawarayuuta/charcoal/pull/1

Findecanor
0 replies
1d8h

Masked SIMD operations, which are in AVX-512 and ARM SVE, were intended to solve that problem. Memory operations can then still be aligned and full-vector-width all the time, but masked to only those elements that are valid.

Even if a masked vector-memory operation is unaligned and crosses into an unmapped or protected page, that will not cause a fault if those lanes are masked off. There are even special load instructions that will reduce the vector length to end at the first element that would have caused a fault, for operations such as strlen() where the length is not known beforehand.
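
For example, on AVX-512 a length-limited load can be written as a masked load (a sketch; lanes that are masked off are zeroed and, as above, do not fault):

    #include <immintrin.h>
    #include <stddef.h>

    /* load at most 64 bytes, enabling only the first `len` lanes */
    static __m512i load_upto64(const void *p, size_t len) {
        __mmask64 valid = len >= 64 ? ~(__mmask64)0
                                    : (((__mmask64)1 << len) - 1);
        return _mm512_maskz_loadu_epi8(valid, p);
    }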

rowanG077
0 replies
1d1h

In the past I have added a black border around an image to be able to completely avoid the read-beyond-buffer SIMD problem. It worked very well and allowed us to beat some OpenCV implementation in terms of speed. But you don't always have full control over the input like that.