Reasonable sketch. This is missing the caller/callee save distinction and makes the usual error of assigning a subset of the input registers to output.
It's optimistic about debuggers understanding non-C-like calling conventions, which I'd expect to be an abject failure regardless of what DWARF might be able to encode.
Changing ABI with optimization setting interacts really badly with separate compilation.
Shuffling arguments around in bin packing fashion does work but introduces a lot of complexity in the compiler, not sure it's worth it relative to left to right first fit. It also makes it difficult for the developer to predict where arguments will end up.
The general plan of having different calling conventions for addresses that escape than for those that don't is sound. Peeling off a prologue that does the impedance matching works well.
Rust probably should be willing to have a different calling convention to C, though I'm not sure it should be a hardcoded one that every function uses. It seems an obvious thing to embed in the type system to me, and allowing developer control over calling conventions removes one of the performance advantages of assembly.
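Worth noting the convention is already part of a function's type today; the open question is just how much choice the developer gets. A quick illustration (compiles as-is, function names are just examples):

```rust
// The ABI string is part of the function (and function-pointer) type, so a
// default-ABI function and a "C"-ABI function are already distinct types.
fn rust_abi(x: i32) -> i32 { x + 1 }
extern "C" fn c_abi(x: i32) -> i32 { x + 1 }

fn main() {
    let a: fn(i32) -> i32 = rust_abi;         // default (unspecified) Rust ABI
    let b: extern "C" fn(i32) -> i32 = c_abi; // C ABI: a different pointer type
    // let c: fn(i32) -> i32 = c_abi;         // type error: the ABIs don't match
    println!("{} {}", a(1), b(1));
}
```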
Out of curiosity, what's so problematic about using some input registers as output registers? On the caller's side, you'd want to vacate the output registers between any two function calls regardless. And it occurs pretty widely in syscall conventions, to my binary-golfing detriment.
Is it for the ease of the callee, so that it can set up the output values while keeping the input values in place? That would suggest trying to avoid overlap (by placing the output registers at the end of the input sequence), but I don't see how it would totally contraindicate any overlap.
You should use all the input registers as output registers, unless your arch is doing some sliding window thing. The x64 proposal linked uses six to pass arguments in and three to return results. So returning six integers means three in registers, three on the stack, with three registers that were free to use containing nothing in particular.
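Not the proposal's actual lowering, but if you want to poke at the status quo yourself, a probe like this (fed to godbolt or cargo-show-asm) shows how many of the six values come back in registers versus memory under your toolchain's current ABI:

```rust
// Returning six integers; inspect the generated asm to see how the return
// value is split between registers and a hidden memory return slot.
#[inline(never)]
pub fn six_ints() -> (u64, u64, u64, u64, u64, u64) {
    (1, 2, 3, 4, 5, 6)
}
```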
The LLVM calling conventions for x86 only allow returning 3 integer registers, 4 vector registers, and 2 x87 floating point registers (er, stack slots technically because x87 is weird).
Limiting a newly designed Rust ABI to whatever LLVM happens to support at the moment seems unnecessarily limiting. Yeah, you'd need to write some C++ to implement it, but that's not the end of the world, especially compared to getting stuck with arbitrary limits in your ABI for the next decade or two.
This sort of thing is why integer division by 0 is UB in Rust on targets where it's not UB, because it's UB in LLVM :)
I stared at this really hard, and I eventually couldn't figure out what you mean here.
Obviously naively just dividing integers by zero in Rust will panic, because that's what is defined to happen.
So you have to be thinking about a specific case where it's defined not to panic. But what case? There isn't an unchecked_div defined on the integers. The wrapping and saturating variants panic for division by zero, as do the various edge cases like div_floor.
What case are you thinking of where "integer division by 0 is UB in Rust" ?
The poster is both correct and incorrect. It is true that LLVM has only two instructions for integer division, udiv and sdiv, and both treat division by zero as UB; as a consequence, Rust used to have UB when it encountered a division by zero.
But Rust has since solved this by inserting a check before every division that could plausibly see a zero divisor (it might even be every division, I don't know the specifics), which tests for zero and defines the consequences.
So divisions in Rust aren't just divisions; they come with an additional check as overhead, but they aren't UB either.
Oh, I see, yes obviously if you know your value isn't zero, that's what the NonZero types are for, and these of course don't emit a check because it's unnecessary.
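For anyone following along, a minimal sketch of the two cases (relying on the stable `Div<NonZeroU64> for u64` impl):

```rust
use std::num::NonZeroU64;

// Plain integer division: the compiler inserts a divisor-is-zero check in
// front of the hardware divide, so this is defined to panic, never LLVM UB.
fn div_plain(a: u64, b: u64) -> u64 {
    a / b // panics with "attempt to divide by zero" when b == 0
}

// Division by a NonZeroU64: zero is ruled out by the type, so no runtime
// check is needed and none is emitted.
fn div_nonzero(a: u64, b: NonZeroU64) -> u64 {
    a / b
}
```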
Sure, and if you actually want a branchless integer division for an arbitrary input, which is defined for the entire input domain on x64, then to get it you'll have to pull some trick like reinterpreting a zeroable type as a nonzero one, heading straight through LLVM IR UB on your way to the defined behavior on x64.
By the way: Don't actually do this. The LLVM IR is not defined to do what you wanted, and even if it works today and worked yesterday, it might just stop working tomorrow, or on a different CPU model, or with different optimisation settings.
If what you want is "Whatever happens when I execute this CPU instruction" you can literally write that in Rust today and that will do what you wanted. Invoking UB because you're sure you know better is how you end up with mysterious bugs.
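For instance, a sketch for x86-64 (`raw_div` is just a name I made up): if what you want really is the CPU's `div`, inline asm expresses exactly that, including the architecturally defined fault on a zero divisor, without routing through LLVM's udiv at all.

```rust
use std::arch::asm;

// x86-64 only: executes the hardware `div` directly, so a zero divisor gives
// you the architecture's #DE fault (typically SIGFPE), not LLVM-level UB.
#[cfg(target_arch = "x86_64")]
fn raw_div(dividend: u64, divisor: u64) -> u64 {
    let quotient: u64;
    unsafe {
        asm!(
            "div {d}",
            d = in(reg) divisor,
            inout("rax") dividend => quotient, // low half of dividend in, quotient out
            inout("rdx") 0u64 => _,            // high half of dividend in, remainder clobbered
        );
    }
    quotient
}
```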
This reminds me of people writing very crazy unsafe Rust to try to reproduce the Quake "fast inverse square root", even though you can just write that exact routine in safe Rust: it's guaranteed to do exactly what you meant, IEEE re-interpretation as an integer and all, safely, and it emits essentially the same machine code on x86. Not to mention that's not how you'd calculate an inverse square root quickly today; Quake was a long time ago and your CPU is much better than the ones Carmack wrote that code for.
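Something like this, for reference (the usual 0x5f3759df magic constant; intended for positive finite inputs):

```rust
// The Quake trick in entirely safe Rust: the bit-level reinterpretation is
// just to_bits()/from_bits(), no pointer casts or unsafe needed.
fn fast_inv_sqrt(x: f32) -> f32 {
    let i = x.to_bits();                            // IEEE-754 bits as a u32
    let y = f32::from_bits(0x5f37_59df - (i >> 1)); // magic-constant initial guess
    y * (1.5 - 0.5 * x * y * y)                     // one Newton-Raphson refinement step
}
```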
Sure. That would be an instance of the "usual error". The argument registers are usually caller save, where any unused ones get treated as scratch in the callee, in which case making them all available for returning data as well is zero cost.
There's no reason not to, other than C makes returning multiple things awkward and splitting a struct across multiple registers is slightly annoying for the compiler.
Also, most modern processors will easily forward the store to the subsequent read and have a bunch of tricks for tracking the stack state. So how much does putting things in registers help anyway?
Forwarding isn't unlimited, though, as I understand it. The CPU has limited-size queues and buffers through which reordering, forwarding, etc. can happen. So I wouldn't be surprised if using registers well takes pressure off of that machinery and ensures that it works as you expect for the data that isn't in registers.
(Looked around randomly to find example data for this) https://chipsandcheese.com/2022/11/08/amds-zen-4-part-2-memo... claims that Zen 4's store queue only holds 64 entries, for example, and a 512-bit register store eats up two. I can imagine how an algorithm could fill that queue up by juggling enough data.
It’s limited, but in the argument passing context you’re storing to a location that’s almost certainly in L1, and then probably loading it immediately within the called function. So the store will likely take up a store queue slot for just a few cycles before the store retires.
Due to speculative out-of-order execution, it's not just "a few cycles". The LSU has a hard, small, limit on the number of outstanding loads and stores (usually separate limits, on the order of 8-32) and once you fill that, you have to stop issuing until commit has drained them.
This discussion is yet another instance of the fallacy of "Intel has optimized for the current code so let's not improve it!". Other examples include branch prediction (a correctly predicted branch has a small but not zero cost) and indirect jump prediction. And this doesn't even begin to address implementations that might be less aggressive about making up for bad code (like most RISCs and RISC-likes).
More broadly: processor design has been optimised around C-style antics for a long time, so trying to optimise the generated code away from that could well inhibit processor tricks in such a way that the result is _slower_ than if you stuck with the "looks terrible but is expected & optimised" status quo.
Reminds me of Fortran compilers recognising the naive three-nested-loops matrix multiplication and optimising it to something sensible.
Register allocation decisions routinely result in multi-percent performance changes, so yes, it does.
Also, registers help the MachineInstr-level optimization passes in LLVM, of which there are quite a few.
The bin packing will probably make it slower though, especially in the bool case, since it will create dependency chains. For bools on x64, I don't think there's a better way than first having to get them in a register, shift them and then OR them into the result. The simple way creates a dependency chain of length 64 (which should also incur a 64 cycle penalty), but you might be able to do it in 6 (more like 12 realistically) cycles. But then again, where do these 64 bools come from? There aren't that many registers, so you will have to reload them from the stack. Maybe the Rust ABI already packs bools in structs this tightly, so it's work that has to be done anyway, but I don't know too much about it.
And then the caller will have to unpack everything again. It might be easier to just teach the compiler to spill values into the result space on the stack (in cases where the IR doesn't already store the result after the computation), which will likely also perform better.
Unpacking bools is cheap - to move any bit into a flag is just a single 'test' instruction, which is as good as it gets if you have multiple bools (other than passing each in a separate flag, which is quite undesirable).
Doing the packing in a tree fashion to reduce latency is trivial, and store→load latency isn't free either depending on the microarchitecture (and at the counts where log2(n) latency becomes significant you'll be at the IPC limit anyway). Packing vs store should end up at roughly the same instruction counts too - a store vs an 'or', and the exact same amount of moving between flags and GPRs.
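A sketch of what I mean by tree packing (function names made up), plus the cheap unpack on the other side:

```rust
// Pack eight bools into one u64 with pairwise ORs: dependency depth is
// log2(8) = 3 instead of 7 for the naive left-to-right chain.
fn pack8(b: [bool; 8]) -> u64 {
    let bit = |i: usize| (b[i] as u64) << i;
    let ab = bit(0) | bit(1);
    let cd = bit(2) | bit(3);
    let ef = bit(4) | bit(5);
    let gh = bit(6) | bit(7);
    (ab | cd) | (ef | gh)
}

// Unpacking one bool is a single test of the relevant bit.
fn unpack(packed: u64, i: u32) -> bool {
    packed & (1 << i) != 0
}
```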
Reaching 64 bools might be a bit crazy, but 4-8 seems reasonably attainable from each of many arguments being an Option<T>, where the packing would reduce needed register/stack slot count by ~2.
Where possible it would of course make sense to pass values in separate registers instead of in one, but when the alternative is spilling to stack, packing is still worthy of consideration.
I don't have a strong sense of how much more common owned `Option` types are than references, but it's worth noting that if `T` is a reference, `Option<T>` will just use a pointer and treat the null value as `None` under the hood to avoid needing any tag. There are probably other types where this is done as well (maybe `NonZero` integer types?)
Yeah, `NonZero*`, but also a type like `#[repr(u8)] enum Foo { X }`; according to `assert_eq!(std::mem::size_of::<Option<Foo>>(), std::mem::size_of::<Foo>())`, you need an enum which fully saturates the repr, e.g. `#[repr(u8)] enum Bar { X0, ..., X255 }` (pseudo code), before the niche optimization fails to kick in.
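If anyone wants to check for themselves, a small demo of the niche optimisation (type names are just examples):

```rust
use std::mem::size_of;
use std::num::NonZeroU32;

#[allow(dead_code)]
#[repr(u8)]
enum Foo { X } // 255 unused bit patterns, so Option<Foo> needs no extra tag

fn main() {
    assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>());               // null is the None niche
    assert_eq!(size_of::<Option<NonZeroU32>>(), size_of::<NonZeroU32>()); // zero is the None niche
    assert_eq!(size_of::<Option<Foo>>(), size_of::<Foo>());               // an unused u8 value is the niche
}
```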
Oh, good to know!
Allowing developer control over calling conventions also means disallowing optimization in the case where Function A calls Function B calls Function C calls Function D etc., but along the way one or more of those functions could have had their arguments shuffled into a different convention to reduce overhead. What semantics would preserve such an optimization but still allow control? Would the control just be illusory?
And in practice assembly has the performance disadvantage of not being subject to most compiler optimizations, often including "introspecting on its operation, determining it is fully redundant, and eliminating it entirely". It's not the 1990s anymore.
In the cases where that kind of optimization is not even possible to consider, though, the only place I'd expect inline assembly to be decisively beaten is using profile-guided optimization. That's the only way to extract more information than "perfect awareness of how the application code works", which the app dev has and the compiler dev does not. The call overhead can be eliminated by simply writing more assembly until you've covered the relevant hot boundaries.
If those functions are external you've lost that optimisation anyway. If they're not, the compiler chooses whether to ignore your annotation or not as usual. As is always the answer, the compiler doesn't get to make observable changes (unless you ask it to, fwrong-math style).
I'd like to specify things like extra live out registers, reduced clobber lists, pass everything on the stack - but on the function declaration or implementation, not having to special case it in the compiler itself.
Sufficiently smart programmers beat ahead of time compilers. Sufficiently smart ahead of time compilers beat programmers. If they're both sufficiently smart you get a common fix point. I claim that holds for a jit too, but note that it's just far more common for a compiler to rewrite the code at runtime than for a programmer to do so.
I'd say that assembly programmers are rather likely to cut out parts of the program that are redundant, and they do so with domain knowledge and guesswork that is difficult to encode in the compiler. Both sides are prone to error, with the classes of error somewhat overlapping.
I think compilers could be a lot better at codegen than they presently are, but the whole "programmers can't beat gcc anymore" idea isn't desperately true even with the current state of the art.
Mostly though I want control over calling conventions in the language instead of in compiler magic because it scales much better than teaching the compiler about properties of known functions. E.g. if I've written memcpy in asm, it shouldn't be stuck with the C caller save list, and avoiding that shouldn't involve a special case branch in the compiler backend.
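Inline asm already has the right shape here: only the operands and clobbers you name are assumed touched, rather than the whole C caller-save list; it's the separately declared extern function that can't say anything of the sort. A sketch (x86-64, made-up function name):

```rust
use std::arch::asm;

// With explicit operands and options, the compiler assumes only what's listed
// here is read/written - no blanket C caller-save clobber list like it would
// have to assume for a call to an external asm routine.
#[cfg(target_arch = "x86_64")]
fn add_one(x: u64) -> u64 {
    let out: u64;
    unsafe {
        asm!(
            "lea {out}, [{x} + 1]",
            x = in(reg) x,
            out = out(reg) out,
            options(pure, nomem, nostack),
        );
    }
    out
}
```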
DWARF doesn't encode bespoke calling conventions at all today.