
The Rust calling convention we deserve

JonChesterfield
26 replies
1d17h

Reasonable sketch. This is missing the caller/callee-save distinction and makes the usual error of assigning only a subset of the input registers to output.

It's optimistic about debuggers understanding non-C-like calling conventions which I'd expect to be an abject failure, regardless of what dwarf might be able to encode.

Changing ABI with optimization setting interacts really badly with separate compilation.

Shuffling arguments around in bin packing fashion does work but introduces a lot of complexity in the compiler, not sure it's worth it relative to left to right first fit. It also makes it difficult for the developer to predict where arguments will end up.

The general plan of having different calling conventions for addresses that escape than for those that don't is sound. Peeling off a prologue that does the impedance matching works well.

Rust probably should be willing to have a different calling convention to C, though I'm not sure it should be a hardcoded one that every function uses. Seems an obvious thing to embed in the type system to me and allowing developer control over calling convention removes one of the performance advantages of assembly.

LegionMammal978
10 replies
1d17h

This is missing the caller/callee-save distinction and makes the usual error of assigning only a subset of the input registers to output.

Out of curiosity, what's so problematic about using some input registers as output registers? On the caller's side, you'd want to vacate the output registers between any two function calls regardless. And it occurs pretty widely in syscall conventions, to my binary-golfing detriment.

Is it for the ease of the callee, so that it can set up the output values while keeping the input values in place? That would suggest trying to avoid overlap (by placing the output registers at the end of the input sequence), but I don't see how it would totally contraindicate any overlap.

JonChesterfield
9 replies
1d17h

You should use all the input registers as output registers, unless your arch is doing some sliding window thing. The x64 proposal linked uses six to pass arguments in and three to return results. So returning six integers means three in registers, three on the stack, with three registers that were free to use containing nothing in particular.

jcranmer
8 replies
1d17h

The LLVM calling conventions for x86 only allow returning 3 integer registers, 4 vector registers, and 2 x87 floating point registers (er, stack slots technically because x87 is weird).

Denvercoder9
6 replies
1d10h

Restricting a newly designed Rust ABI to whatever LLVM happens to support at the moment seems unnecessarily limiting. Yeah, you'd need to write some C++ to implement it, but that's not the end of the world, especially compared to getting stuck with arbitrary limits in your ABI for the next decade or two.

anonymoushn
5 replies
1d9h

This sort of thing is why integer division by 0 is UB in Rust on targets where it's not UB in the hardware, because it's UB in LLVM :)

tialaramex
4 replies
1d3h

I stared at this really hard, and I still couldn't figure out what you mean here.

Obviously naively just dividing integers by zero in Rust will panic, because that's what is defined to happen.

So you have to be thinking about a specific case where it's defined not to panic. But what case? There isn't an unchecked_div defined on the integers. The wrapping and saturating variants panic for division by zero, as do the various edge cases like div_floor.

What case are you thinking of where "integer division by 0 is UB in Rust" ?

Marzepansion
3 replies
1d1h

The poster is both correct and incorrect. It definitely is true that LLVM only has two instructions for integer division, udiv and sdiv, and it used to be the case that Rust consequently had UB when encountering division by zero, since those two instructions consider that operation UB.

But Rust has solved this problem by inserting a check before every division that could plausibly divide by zero (it might even be all of them, I don't know the specifics), which tests for zero and defines the consequence.

So as a result divisions aren't just divisions in Rust, they come with an additional check as overhead, but they aren't UB either.
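Roughly, the inserted guard amounts to this (a sketch only; the real check is added during MIR lowering, and the `a / b` below stands for the bare hardware division that LLVM's `sdiv` turns into):

    fn guarded_div(a: i32, b: i32) -> i32 {
        if b == 0 {
            panic!("attempt to divide by zero");
        }
        // Past the guard, the divisor is known non-zero, so the raw division
        // can never hit LLVM's UB case.
        a / b
    }

Signed division gets a similar guard for the `i32::MIN / -1` overflow case, which is also UB at the LLVM level.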

tialaramex
2 replies
1d

Oh, I see. Yes, obviously if you know your value isn't zero, that's what the NonZero types are for, and those of course don't emit a check because it's unnecessary.
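For reference, a minimal sketch of that route, using the `Div` impls on the unsigned integers that take a `NonZero` divisor and are documented not to panic:

    use std::num::NonZeroU32;

    // No zero check is needed here: the type guarantees the divisor is
    // non-zero, so the division cannot panic.
    fn div(x: u32, d: NonZeroU32) -> u32 {
        x / d
    }

    fn main() {
        let d = NonZeroU32::new(3).unwrap(); // the only check happens here
        assert_eq!(div(10, d), 3);
    }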

anonymoushn
1 replies
22h38m

Sure, and if you actually want a branchless integer division for an arbitrary input, which is defined for the entire input domain on x64, then to get it you'll have to pull some trick like reinterpreting a zeroable type as a nonzero one, heading straight through LLVM IR UB on your way to the defined behavior on x64.

tialaramex
0 replies
4h49m

By the way: don't actually do this. The LLVM IR is not defined to do what you wanted, and even if it works today and worked yesterday, it might just stop working tomorrow, or on a different CPU model, or with different optimisation settings.

If what you want is "Whatever happens when I execute this CPU instruction" you can literally write that in Rust today and that will do what you wanted. Invoking UB because you're sure you know better is how you end up with mysterious bugs.
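For completeness, a hedged sketch of the "whatever this CPU instruction does" route on x86-64 (the hardware `div` divides rdx:rax by the operand, and a zero divisor raises #DE, delivered as SIGFPE on Unix, rather than being compiler-level UB):

    use std::arch::asm;

    #[cfg(target_arch = "x86_64")]
    unsafe fn hw_div(a: u64, b: u64) -> u64 {
        let quotient;
        asm!(
            "xor rdx, rdx",   // zero the high half of the dividend (rdx:rax)
            "div {divisor}",
            divisor = in(reg) b,
            inout("rax") a => quotient,
            out("rdx") _,     // remainder, discarded
        );
        quotient
    }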

This reminds me of people writing very crazy unsafe Rust to try to reproduce the "Quake fast inverse square root", even though you can just write that exact routine in safe Rust and it's guaranteed to do exactly what you meant, IEEE re-interpretation as an integer and all, safely and emitting essentially the same machine code on x86. And that's not even mentioning that this isn't how to calculate an inverse square root quickly today, because Quake was a long time ago and your CPU is much better than the ones Carmack wrote that code for.

JonChesterfield
0 replies
1d9h

Sure. That would be an instance of the "usual error". The argument registers are usually caller save, where any unused ones get treated as scratch in the callee, in which case making them all available for returning data as well is zero cost.

There's no reason not to, other than C makes returning multiple things awkward and splitting a struct across multiple registers is slightly annoying for the compiler.

rayiner
6 replies
1d17h

Also, most modern processors will easily forward the store to the subsequent read and have a bunch of tricks for tracking the stack state. So how much does putting things in registers help anyway?

kevingadd
2 replies
1d13h

Forwarding isn't unlimited, though, as I understand it. The CPU has limited-size queues and buffers through which reordering, forwarding, etc. can happen. So I wouldn't be surprised if using registers well takes pressure off of that machinery and ensures that it works as you expect for the data that isn't in registers.

(Looked around randomly to find example data for this) https://chipsandcheese.com/2022/11/08/amds-zen-4-part-2-memo... claims that Zen 4's store queue only holds 64 entries, for example, and a 512-bit register store eats up two. I can imagine how an algorithm could fill that queue up by juggling enough data.

rayiner
1 replies
1d5h

It’s limited, but in the argument passing context you’re storing to a location that’s almost certainly in L1, and then probably loading it immediately within the called function. So the store will likely take up a store queue slot for just a few cycles before the store retires.

FullyFunctional
0 replies
1d1h

Due to speculative out-of-order execution, it's not just "a few cycles". The LSU has a hard, small limit on the number of outstanding loads and stores (usually separate limits, on the order of 8-32), and once you fill that, you have to stop issuing until commit has drained them.

This discussion is yet another instance of the fallacy of "Intel has optimized for the current code so let's not improve it!". Other examples include branch prediction (a correctly predicted branch has a small but non-zero cost) and indirect jump prediction. And this doesn't even begin to address implementations that might be less aggressive about making up for bad code (like most RISCs and RISC-likes).

dwattttt
1 replies
1d16h

More broadly: processor design has been optimised around C-style antics for a long time, so trying to move the generated code away from that could well inhibit processor tricks, such that the result is _slower_ than if you stuck with the "looks terrible but is expected & optimised" status quo.

eru
0 replies
1d14h

Reminds me of Fortran compilers recognising the naive three-nested-loops matrix multiplication and optimising it to something sensible.

pcwalton
0 replies
1d1h

Register allocation decisions routinely result in multi-percent performance changes, so yes, it does.

Also, registers help the MachineInstr-level optimization passes in LLVM, of which there are quite a few.

t0b1
4 replies
1d15h

The bin packing will probably make it slower though, especially in the bool case since it will create dependency chains. For bools on x64, I don't think there's a better way than first having to get them in a register, shift them and then OR them into the result. The simple way creates a dependency chain of length 64 (which should also incur a 64 cycle penalty) but you might be able to do 6 (more like 12 realistically) cycles. But then again, where do these 64 bools come from? There aren't that many registers so you will have to reload them from the stack. Maybe the Rust ABI already packs bools in structs this tightly so it's work that has to be done anyway, but I don't know too much about it.

And then the caller will have to unpack everything again. It might be easier to just teach the compiler to spill values into the result space on the stack (in cases where the IR doesn't already store the result after the computation), which will likely also perform better.

dzaima
3 replies
1d15h

Unpacking bools is cheap - to move any bit into a flag is just a single 'test' instruction, which is as good as it gets if you have multiple bools (other than passing each in a separate flag, which is quite undesirable).

Doing the packing in a tree fashion to reduce latency is trivial, and store→load latency isn't free either depending on the microarchitecture (and at the counts where log2(n) latency becomes significant you'll be at the IPC limit anyway). Packing vs store should end up at roughly the same instruction counts too - a store vs an 'or', and the exact same amount of moving between flags and GPRs.
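(Sketch of the tree-fashion packing, purely illustrative:)

    // Pack 8 bools into one byte with a log2-depth dependency chain:
    // pairs, then quads, then the full byte - three dependent ORs
    // instead of seven.
    fn pack8(b: [bool; 8]) -> u8 {
        let v = b.map(|x| x as u8);
        let p01 = v[0] | (v[1] << 1);
        let p23 = v[2] | (v[3] << 1);
        let p45 = v[4] | (v[5] << 1);
        let p67 = v[6] | (v[7] << 1);
        let q03 = p01 | (p23 << 2);
        let q47 = p45 | (p67 << 2);
        q03 | (q47 << 4)
    }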

Reaching 64 bools might be a bit crazy, but 4-8 seems reasonably attainable from each of many arguments being an Option<T>, where the packing would reduce needed register/stack slot count by ~2.

Where possible it would of course make sense to pass values in separate registers instead of in one, but when the alternative is spilling to stack, packing is still worthy of consideration.

saghm
2 replies
1d14h

Reaching 64 bools might be a bit crazy, but 4-8 seems reasonably attainable from each of many arguments being an Option<T>, where the packing would reduce needed register/stack slot count by ~2

I don't have a strong sense of how much more common owned `Option` types are than references, but it's worth noting that if `T` is a reference, `Option<T>` will just use a pointer and treat the null value as `None` under the hood to avoid needing any tag. There are probably other types where this is done as well (maybe `NonZero` integer types?)
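A few quick size checks of that (the pointer and NonZero cases are documented guarantees; the `Option<u32>` size is just what current rustc happens to produce on a 64-bit target):

    use std::mem::size_of;
    use std::num::NonZeroU32;

    fn main() {
        // Null is the None niche, so the Option adds nothing:
        assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>());
        // Zero is the None niche for the NonZero integer types:
        assert_eq!(size_of::<Option<NonZeroU32>>(), size_of::<u32>());
        // No niche available: a separate tag plus padding doubles the size.
        assert_eq!(size_of::<Option<u32>>(), 8);
    }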

ratmice
1 replies
1d7h

Yeah, `NonZero*`, but also a type like `#[repr(u8)] enum Foo { X }`. According to `assert_eq!(std::mem::size_of::<Option<Foo>>(), std::mem::size_of::<Foo>())`, you need an enum which fully saturates the repr, e.g. `#[repr(u8)] enum Bar { X0, ..., X255 }` (pseudo code), before niche optimization fails to kick in.

saghm
0 replies
1d1h

Oh, good to know!

workingjubilee
1 replies
1d12h

Allowing developer control over calling conventions also means disallowing optimization in the case where Function A calls Function B calls Function C calls Function D etc., but along the way one or more of those functions could have their arguments swapped around to a different convention to reduce overhead. What semantics would preserve such an optimization but still allow control? Would the control just be illusory?

And in practice assembly has the performance disadvantage of not being subject to most compiler optimizations, often including "introspecting on its operation, determining it is fully redundant, and eliminating it entirely". It's not the 1990s anymore.

In the cases where that kind of optimization is not even possible to consider, though, the only place I'd expect inline assembly to be decisively beaten is using profile-guided optimization. That's the only way to extract more information than "perfect awareness of how the application code works", which the app dev has and the compiler dev does not. The call overhead can be eliminated by simply writing more assembly until you've covered the relevant hot boundaries.

JonChesterfield
0 replies
1d3h

If those functions are external you've lost that optimisation anyway. If they're not, the compiler chooses whether to ignore your annotation or not as usual. As is always the answer, the compiler doesn't get to make observable changes (unless you ask it to, fwrong-math style).

I'd like to specify things like extra live out registers, reduced clobber lists, pass everything on the stack - but on the function declaration or implementation, not having to special case it in the compiler itself.

Sufficiently smart programmers beat ahead of time compilers. Sufficiently smart ahead of time compilers beat programmers. If they're both sufficiently smart you get a common fix point. I claim that holds for a jit too, but note that it's just far more common for a compiler to rewrite the code at runtime than for a programmer to do so.

I'd say that assembly programmers are rather likely to cut out parts of the program that are redundant, and they do so with domain knowledge and guesswork that is difficult to encode in the compiler. Both sides are prone to error, with the classes of error somewhat overlapping.

I think compilers could be a lot better at codegen than they presently are, but the whole "programmers can't beat gcc anymore" idea isn't desperately true even with the current state of the art.

Mostly though I want control over calling conventions in the language instead of in compiler magic because it scales much better than teaching the compiler about properties of known functions. E.g. if I've written memcpy in asm, it shouldn't be stuck with the C caller save list, and avoiding that shouldn't involve a special case branch in the compiler backend.

khuey
0 replies
1d17h

It's optimistic about debuggers understanding non-C-like calling conventions which I'd expect to be an abject failure, regardless of what dwarf might be able to encode.

DWARF doesn't encode bespoke calling conventions at all today.

pizlonator
25 replies
1d13h

The main thing you want to do when optimizing the calling convention is measure its perf, not ruminate about what you think is good. Code performs well if it runs fast, not if it looks like it will.

Sometimes, what the author calls bad code is actually the fastest thing you can do for totally not obvious reasons. The only way to find out is to measure the performance on some large benchmark.

One reason why sometimes bad looking calling conventions perform well is just that they conserve argument registers, which makes the register allocator’s life a tad easier.

Another reason is that the CPUs of today are optimized on traces of instructions generated by C compilers. If you generate code that looks like what the C compiler would do - which passes on the stack surprisingly often, especially if you’re MSVC - then you hit the CPU’s sweet spot somehow.

Another reason is that inlining is so successful, so calls are a kind of unusual boundary on the hot path. It’s fine to have some jank on that boundary if it makes other things simpler.

Not saying that the changes done here are bad, but I am saying that it’s weird to just talk about what looks like weird code without measuring.

(Source: I optimized calling conventions for a living when I worked on JavaScriptCore. I also optimized other things too but calling conventions are quite dear to my heart. It was surprising how often bad-looking pass-on-the-stack code won on big, real code. Weird but true.)

weinzierl
10 replies
1d13h

I very much agree with that, especially since - like you said - code that looks like it will perform well doesn't always do so.

That being said I'd like to add that in my opinion performance measurement results should not be the only guiding principle.

You said it yourself: "Another reason is that the CPUs of today are optimized [..]"

The important word is "today". CPUs have evolved and still do, and a calling convention should be designed for the long term.

Sadly, it means that it is beneficial to not deviate too much from what C++ does [1], because it is likely that future processor optimizations will be targeted in that direction.

Apart from that it might be worthwhile to consider general principles that are not likely to change (e.g. conserve argument registers, as you mentioned), to make the calling convention robust and future proof.

[1] It feels a bit strange, when I say that because I think Rust has become a bit too conservative in recent years, when it comes to its weirdness budget (https://steveklabnik.com/writing/the-language-strangeness-bu...). You cannot be better without being different, after all.

workingjubilee
5 replies
1d12h

The Rust calling convention is actually defined as unstable, so 1.79 is allowed to have a different calling convention than 1.80 and so on. I don't think designing one for the long term is a real concern right now.

weinzierl
3 replies
1d12h

I know, but from what I understand there are initiatives to stabilize the ABI, which would also mean stabilizing calling conventions. I read the article in that broader context, even if it does not talk about that directly.

JoshTriplett
2 replies
1d9h

There's no proposal to stabilize the Rust ABI. There are proposals to define a separate stable ABI, which would not be the default ABI. (Such a separate stable ABI would want to plan for long-term performance, but the default ABI could continue to improve.)

zozbot234
0 replies
1d

There is already a separate stable ABI, it's just the C ABI. There are also multiple crates that address the issue of stable ABIs for Rust code. It's not very clear why compiler involvement would be required for this.

weinzierl
0 replies
1d1h

Thanks, I had never considered that a possibility when hearing about "Rust stable ABI", but it makes a lot of sense.

dathinab
0 replies
1d5h

If I remember correctly there is a bit of a difference between an explicit `extern "Rust"` and no explicit calling convention, but I'm not so sure.

Anyway, at least when not using an explicit representation, Rust doesn't even guarantee that the layout of a struct is the same for two repeated builds _with the same compiler and code_. That is very intentional, and I think there is no intent to change that "in general" (though various subsets might be standardized; `Option<&T> where T: Sized` mapping `None` to a null pointer, allowing you to use it in C FFI, is already a de facto standard). Which, as far as I remember, is where explicit `extern "Rust"` comes in, to make sure we can have a prebuilt libstd; it still can change with _any_ compiler version, including patch versions. E.g. a hypothetical 1.100 and 1.100.1 might not have the same unstable Rust calling convention.

flohofwoe
1 replies
1d11h

and a calling convention should be designed for the long term

...isn't the article just about Rust code calling Rust code? That's a much more flexible situation than calling into operating system functions or into other languages. For calling within the same language a stable ABI is by far not as important as on the 'ecosystem boundaries', and might actually be harmful (see the related drama in the C++ world).

weinzierl
0 replies
1d6h

You are right, as Josh Triplett also pointed out above. I was mistaken about the plans to stabilize the ABI.

Ygg2
1 replies
1d4h

means that it is beneficial to not deviate too much from what C++ does

Or just C.

Reminds me of when I looked up SIMD instructions for searching string views. It was more performant to slap a '\0' on the end and use the null-terminated string instructions than to use the string-view search functions.

mananaysiempre
0 replies
23h38m

Huh, I thought they fixed that (the PCMPISTR? string instructions from SSE4.2 being significantly faster than PCMPESTR?), but looks like the explicit-length version still takes twice as many uops on recent Intel and AMD CPUs. They don’t seem to get much use nowadays anyway, though, what with being stuck in the 128-bit realm (there’s a VEX encoding but that’s basically it).

mkj
4 replies
1d13h

And remember that performance can include binary size, not just runtime speed. Current Rust seems to suffer in that regard on small platforms; the calling convention could possibly help there wrt Result returns.

fleventynine
2 replies
1d10h

The current calling convention is terrible for small platforms, especially when using Result<> in return position. For large enums, the compiler should put the discriminant in a register and the large variants on the stack. As is, you pay a significant code size penalty for idiomatic rust error handling.

planede
1 replies
1d9h

There were proposals for optimizing this kind of stuff for C++ in particular for error handling, like:

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p07...

Throwing such values behaves as-if the function returned union{R;E;}+bool where on success the function returns the normal return value R and on error the function returns the error value type E, both in the same return channel including using the same registers. The discriminant can use an unused CPU flag or a register

mananaysiempre
0 replies
23h35m

IBM BIOS and MS DOS calls used the carry flag as a boolean return or error indicator (the 8086 has instructions for manually setting and resetting it). I don’t think people do that these days except in manually-coded assembly, unfortunately (which the relevant parts of both BIOS and DOS also were of course).

pizlonator
0 replies
1d1h

Also a thing you gotta measure.

Passing a lot of stuff in registers causes a register shuffle at call sites and prologues. Hard to predict if that’s better or worse than stack spills without measuring.

workingjubilee
2 replies
1d13h

Your experience is not perfectly transferable. JITs have it easy on this because they've already gathered a wealth of information about the actually-executing-on CPU by the time they generate a single line of assembly. Calls appear on the hot path more often in purely statically compiled code because things like the runtime architectural feature set are not known, so you often reach inlining barriers precisely in the code that you would most like to optimize.

saagarjha
0 replies
1d10h

The people who write JITs also write a bunch of C++ that gets statically compiled.

pizlonator
0 replies
1d4h

LLVM inlines even more than my JIT does.

The JIT has both more and less information.

It has more information about the program globally. There is no linking or “extern” boundary.

But whereas the AOT compiler can often prove that it knows about all of the calls to a function that could ever happen, the JIT only knows about those that happened in the past. This makes it hard (and sometimes impossible) for the JIT to do the call graph analysis style of inlining that llvm does.

One great example of something I wish my jit had but might never be able to practically do, but that llvm eats for breakfast: “if A calls B in one place and nothing else calls B, then inline B no matter how large it is”.

(I also work on ahead of time compilers, though my impact there hasn’t been so big that I brag about it as much.)

CalChris
2 replies
1d4h

"If you generate code that looks like what the C compiler would do - which passes on the stack surprisingly often, especially if you’re MSVC - then you hit the CPU’s sweet spot somehow."

The FA is mostly about x86 and Intel indeed did an amazing amount of clever engineering over decades to allow your ugly x86 code to run fast on their silicon that you buy.

Still, does your point about the empirical benefit of passing on the stack continue to apply with a transition to register-rich ARMv8 CPUs or RISC-V?

pizlonator
0 replies
15h34m

Yes.

If you flatten big structs into registers to pass them you have a bad time on armv8.

I tried. That was an llvm experiment. Ahead of time compiler for a modified version of C.

leni536
1 replies
1d11h

Yep. Whether passing in registers is faster or not also depends on the function body. It doesn't make much sense if the first thing the function does is take the address of the parameter and pass it to some opaque function. Then it needs to be spilled onto the stack anyway.

It would be interesting to see calling convention optimizations based on function body. I think that would be safe for static functions in C, as long as their address is not taken.

__s
0 replies
1d5h

Dynamic calling conventions also won't work with dynamic linking

amelius
0 replies
1d2h

If you want fast, then you probably need to have a different calling convention per call.

yogorenapan
22 replies
1d18h

Tangentially related: Is it currently possible to have interop between Go and Rust? I remember seeing someone achieve it with Zig in the middle but can't for the life of me find it. Have some legacy Rust code (what??) that I'm hoping to slowly port to Go piece by piece.

duped
16 replies
1d17h

It's usually unwise to mix managed and unmanaged memory since the managed code needs to be able to own the memory it's freeing and moving, whereas the unmanaged code needs to reason about when memory is freed or moved. cgo (and other variants) let you mix FFI calls into unmanaged memory from managed code in Go, but you pay a penalty for it.

In language implementations where GC isn't shared by the different languages calling each other you're always going to have this problem. Mixing managed/unmanaged code is both an old idea and actively researched.

It's almost always a terrible idea to call into managed code from unmanaged code unless you're working directly with an embedded runtime that's been designed for it. And when you do, there's usually a serialization layer in between.

neonsunset
14 replies
1d16h

The idea that mixing managed and unmanaged code is an issue is simply not true in programming in general.

It may be an issue in Go or Java, but it just isn't in C# or Swift.

Calling `write` in C# on Unix is as easy as the following snippet and has almost no overhead:

    var text = "Hello, World!\n"u8;
    Interop.Write(1, text, text.Length);

    static unsafe partial class Interop
    {
        [LibraryImport("libc", EntryPoint = "write")]
        public static partial void Write(
            nint fd, ReadOnlySpan<byte> buffer, nint length);
    }

In addition, unmanaged->managed calls are also rarely an issue, both via function pointers and plain C exports if you build a binary with NativeAOT:

    public static class Exports
    {
        [UnmanagedCallersOnly(EntryPoint = "sum")]
        public static nint Sum(nint a, nint b) => a + b;
    }

It is indeed true that more complex scenarios may require some form of bespoke embedding/hosting of the runtime, but that is more of a peculiarity of Go and Java, not an actual technical limitation.

meindnoch
5 replies
1d8h

Swift is not a "managed" (i.e. GC) language.

meindnoch
3 replies
1d4h

I was expecting this pedantic comment... If refcounting makes a language "managed", then C++ with shared_ptr is also "managed".

_______

The charitable interpretation is that OP was likely referring to the issues when calling into a language with a relocating GC (because you need to tell the GC not to move objects while you're working with them), which Swift is not.

pjmlp
1 replies
1d3h

Nope, because that is a library class without any language support.

The pedantic comment is synonymous with proper education instead of street urban myths.

meindnoch
0 replies
1d1h

It is a library class, because C++ is a rich enough language to implement automatic refcounting as a library class, by hooking into the appropriate lifecycle methods (copy ctor, dtor).

neonsunset
0 replies
1d3h

Swift has just as many concerns for its structs and classes passing across FFI, in terms of marshalling/unmarshalling and ensuring that ARC-unaware code either performs manual retain/release calls or adapts them to whatever memory-management mechanism the callee uses.

One of the comments here mentions that Swift has its own stable ABI, which exposes a richer type system, so it does stand out in terms of interop (.NET 9 will add support for it natively (library evolution ABI) without having to go through C calls or C "glue" code on iOS and macOS, maybe the first non-Swift/ObjC platform to do so?).

Object pinning in .NET is only a part of equation and at this point far from the biggest concern (it just works, like it did 15 years ago, maybe it's a matter of fuss elsewhere?).

jcranmer
3 replies
1d16h

That's not the direction being talked about here. Try calling the C# method from C or C++ or Rust.

(I somewhat recently did try setting up mono to be able to do this... it wasn't fun.)

neonsunset
2 replies
1d16h

What you may have been looking for is these:

- https://learn.microsoft.com/en-us/dotnet/core/deploying/nati...

- https://github.com/dotnet/samples/blob/main/core/nativeaot/N...

With that said, Mono has been a staple choice for embedding in game-scripting scenarios in particular, because of the ability to call directly into its methods (provided the caller honors the calling convention correctly), but it has slowly been becoming more of a liability, as you are missing out on a lot of performance by not hosting CoreCLR instead.

For .dll/.so/.dylib's, it is easier and often better to just build a native library with naot instead (the links above, you can also produce statically linkable binaries but it might have issues on e.g. macOS which has...not the most reliable linker that likes to take breaking changes).

This type of library works in almost every scenario a library implemented in C/C++/Rust with C exports does. For example, here someone implemented a hello-world demonstration of using C# to write an OBS plugin: https://sharovarskyi.com/blog/posts/dotnet-obs-plugin-with-n...

Using the exports boils down to just this https://github.com/kostya9/DotnetObsPluginWithNativeAOT/blob... and specifying correct build flags.

duped
1 replies
1d14h

I haven't been looking for those because I don't work with .NET. Regardless, what you're linking still needs callers and callees to agree on calling convention and special binding annotations across FFI boundaries which isn't particularly interesting from the perspective of language implementation like the promises of Graal or WASM + GC + component model.

neonsunset
0 replies
1d13h

There is no free lunch. WASM just means another lowest common denominator abstraction for FFI. I'm also looking forward to WASM getting actually good so .NET could properly target it (because shipping WASM-compiled GC is really, really painful, it works acceptably today, but could be better). Current WasmGC spec is pretty much unusable by any language that has non-primitive GC implementation.

Just please don't run WASM on the server, we're already getting diminishing generational performance gains in hardware, no need to reduce them further.

The exports in the examples follow C ABI with respective OS/ISA-specific calling convention.

pjmlp
2 replies
1d6h

Except that is only true since those attributes were introduced in recent .NET versions, and it doesn't account for COM marshaling issues.

Plenty of .NET code is still using the old ways and isn't going to be rewritten, either for these attributes, or the new CsWinRT, or the new Core COM interop, which doesn't support all COM use cases anyway.

neonsunset
1 replies
1d4h

Code written for .NET Framework is completely irrelevant to the conversation, since that's not what is being evaluated.

You should treat it as dead and move on because it does not impact what .NET can or can’t do.

There is no point to bring up “No, but 10 years ago it was different”. So what? It’s not 2014 anymore.

pjmlp
0 replies
1d3h

My remarks also apply to modern .NET, as those improvements were introduced in .NET 6 and .NET 8 and require a code rewrite to adopt, instead of the old ways, which are also available - something your blind advocacy happened to miss.

Very little code gets written from scratch unless we are talking about startups.

duped
0 replies
1d14h

There are more managed languages than Go, Java, and C#. Swift (and Objective-C with ARC) are a bit different in that they don't use mark-and-sweep/generational GCs for automatic memory management, so it's significantly less of an issue. Compare with Lua, Python, JS, etc., where there's a serialization boundary between the two.

But I stand by what I said. It's generally unwise to mix the two, particularly calling unmanaged code from managed code.

I wouldn't say it's "not a problem" because there are very few environments where you don't pay some cost for mixing and matching between managed/unmanaged code, and the environments designed around it are built from first principles to support it, like .NET. More interesting to me are Graal and WASM (once GC support lands) which should make it much easier to deal with.

zozbot234
0 replies
1d

It's usually unwise to mix managed and unmanaged memory

Broadly stated, you can achieve this by marking a managed object as a GC root whenever it's to be referenced by unmanaged code (so that it won't be freed or moved in that case) and adding finalizers whenever managed objects own or hold refcounted references over unmanaged memory (so that the unmanaged code can reason about these objects being freed). But yes, it's a bit fiddly.

neonsunset
0 replies
1d16h

You have to go through C bindings, but FFI is very far from being Go's strongest suit (if we don't count Cgo), so if that's what interests you, it might be better to explore a different language.

mrits
0 replies
1d16h

I have to use Rust and Swift quite a bit. I basically just landed on sending a byte array of serialized protobufs back and forth with cookie cutter function calls. If this is your full time job I can see how you might think that is lame, but I really got tired of coming back to the code every few weeks and not remembering how to do anything.

apendleton
0 replies
1d18h

If you want to call from Go into Rust, you can declare any Rust function as `extern "C"` and then call it the same way you would call C from Go. Not sure about going the other way.
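The Rust side is roughly this (a minimal sketch; build the crate as a `staticlib` or `cdylib`, and declare the symbol in a cgo comment on the Go side):

    // Expose a C-ABI symbol that cgo can call like any other C function.
    #[no_mangle]
    pub extern "C" fn rust_add(a: i32, b: i32) -> i32 {
        a + b
    }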

Voultapher
0 replies
1d4h

If you want a particularly cursed example, I've recently called Go code from Rust via C in the middle, including passing a Rust closure with state into the Go code as a callback into a Go stdlib function, including panic unwinding from inside the Rust closure https://github.com/Voultapher/sort-research-rs/commit/df6c91....

100k
0 replies
1d17h

Yes, you can use CGO to call Rust functions using extern "C" FFI. I gave a talk about how we use it for GitHub code search at RustConf 2023 (https://www.youtube.com/watch?v=KYdlqhb267c) and afterwards I talked to some other folks (like 1Password) who are doing similar things.

It's not a lot of fun because moving types across the C interop boundary is tedious, but it is possible and allows code reuse.

vrotaru
9 replies
1d12h

There was an interesting approach to this in an experimental language some time ago:

   fn f1 (x, y) #-> // Use C calling conventions

   fn f2 (x, y) -> // use fast calling conventions
The first one was mostly for interacting with C code, and the compiler knew how to call each function.

magicalhippo
5 replies
1d10h

Delphi, and I'm sure others, have had[1] this for ages:

When you declare a procedure or function, you can specify a calling convention using one of the directives register, pascal, cdecl, stdcall, safecall, and winapi.

As in your example, cdecl is for calling C code, while stdcall/winapi on Windows for calling Windows APIs.

[1]: https://docwiki.embarcadero.com/RADStudio/Sydney/en/Procedur...

pjmlp
3 replies
1d6h

Since the Turbo Pascal days actually.

magicalhippo
2 replies
1d3h

I was pretty sure it had it, I just couldn't find an online reference.

pjmlp
1 replies
12h10m

Turbo Pascal for Windows v1.5, on Windows 3.1, the transition step before Delphi came to be.

magicalhippo
0 replies
7h38m

Ah, yeah that makes sense. Thanks!

JonChesterfield
0 replies
1d3h

C for example does this, albeit in compiler extensions, and with a longer tag than #.

dgellow
1 replies
1d12h

Is it similar to Zig’s callconv keyword?

vrotaru
0 replies
1d11h

Guess so. Unfamiliar with Zig. The point is that it's not an "all or nothing" strategy for a compilation unit.

Debugger writers may not be happy, but maybe lldb supports all conventions supported by llvm.

IshKebab
0 replies
1d8h

Terrible taste. Why would you hide such an infrequently used feature behind a single character? In this case you should absolutely use a keyword.

Arnavion
6 replies
1d2h

Tangentially related, there's another "unfortunate" detail of Rust that makes some structs bigger than you want them to be. Imagine a struct Foo that contains eight `Option<u8>` fields, ie each field is either `None` or `Some(u8)`. In C, you could represent this as a struct with eight 1-bit `bool`s and eight `uint8_t`s, for a total size of 9 bytes. In Rust however, the struct will be 16 bytes, ie eight sequences of 1-byte discriminant followed by a `uint8_t`.

Why? The reason is that structs must be able to present borrows of their fields, so given a `&Foo` the compiler must allow the construction of a `&Foo::some_field`, which in this case is an `&Option<u8>`. This `&Option<u8>` must obviously look identical to any other `&Option<u8>` in the program. Thus the underlying `Option<u8>` is forced to have the same layout as any other `Option<u8>` in the program, ie its own personal discriminant bit rounded up to a byte followed by its `u8`. The struct pays this price even if the program never actually constructs a `&Foo::some_field`.

This becomes even worse if you consider Options of larger types, like a struct with eight `Option<u16>` fields. Then each personal discriminant will be rounded up to two bytes, for a total size of 32 bytes with a quarter (or almost half, if you include the unused bits of the discriminants) being wasted interstitial padding. The C equivalent would only be 18 bytes. With `Option<u64>`, the Rust struct would be 128 bytes while the C struct would be 72 bytes.
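The sizes are easy to check (these are what current rustc produces on a typical 64-bit target, none of them guaranteed):

    use std::mem::size_of;

    #[allow(dead_code)]
    struct Foo {
        a: Option<u8>, b: Option<u8>, c: Option<u8>, d: Option<u8>,
        e: Option<u8>, f: Option<u8>, g: Option<u8>, h: Option<u8>,
    }

    fn main() {
        assert_eq!(size_of::<Option<u8>>(), 2);   // 1-byte tag + 1-byte payload
        assert_eq!(size_of::<Foo>(), 16);         // eight independent (tag, value) pairs
        assert_eq!(size_of::<Option<u64>>(), 16); // tag rounded up to 8-byte alignment
    }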

You *can* implement the C equivalent manually of course, with a `u8` for the packed discriminants and eight `MaybeUninit<T>`s, and functions that map from `&Foo` to `Option<&T>`, `&mut Foo` to `Option<&mut T>`, etc, but not to `&Option<T>` or `&mut Option<T>`.

https://play.rust-lang.org/?version=stable&mode=debug&editio...

Arcuru
3 replies
1d2h

You have to implement the C version manually, so it's not that odd you'd need to do the same for Rust?

You've described, basically, a custom type that is 8 Options<u8>s. If you start caring about performance you'll need to roll your own internal Option handling.

Arnavion
2 replies
1d2h

You have to implement the C version manually

There's no "manually" about it. There's only one way to implement it in C, ie eight booleans and eight uint8_ts as I described. Going from there to adding a `:1` to every `bool` field is a simple optimization. Reimplementing `Option` and the bitpacking of the discriminants is much more effort compared to the baseline implementation of using `Option`.

surajrmal
0 replies
4h23m

But it's not any more work than it would take in C. What does it matter how much work it is relative to rust's happy path?

IshKebab
0 replies
22h17m

The alternative is `std::optional` which works exactly the same as Rust's `Option` (without the niche optimisation).

I'm not a C programmer but I imagine you could make something like `std::optional` in C using structs and macros and whatnot.

Aurornis
1 replies
1d1h

You can implement the C equivalent manually of course

But you have to implement the C version manually as well.

It's not really a downside to Rust if it provides a convenient feature that you can choose to use if it fits your goals.

The use case you're describing is relatively rare. If it's an actual performance bottleneck then spending a little extra time to implement it in Rust doesn't seem like a big deal. I have a hard time considering this an "unfortunate detail" to Rust when the presence of the Option<_> type provides so much benefit in typical use cases.

Arnavion
0 replies
1d1h

I answered this in the other subthread already.

zamalek
5 replies
1d10h

Debuggers

Simply throw it in as a Cargo.toml flag and sidestep the worry. Yes, you do sometimes have to debug release code - but there you can use the not-quite-perfect debugging that the author mentions.

Also, why aren't we size-sorting fields already? That seems like an easy optimization, and can be turned off with a repr.

zamalek
1 replies
23h3m

Oh wow, awesome! I was somewhat considering this being my first contribution - glad someone already tackled it!

fl0ki
0 replies
22h15m

If I can suggest, the next big breakthrough in this space would be generalizing niche filling optimization. Every thread about this seems to fizzle out, to the point that I couldn't even find which one is the latest any more.

Today most data-carrying enums decay into the lowest common denominator of a 1-byte discriminant padded by 7 more bytes before any variant's data payload can begin. This really adds up when enums are nested, not just blowing out the calling register set but also making each data structure a swiss cheese of padding bytes.
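A small illustration of that decay (the sizes are what I'd expect from the current layout algorithm on a 64-bit target, not anything guaranteed):

    use std::mem::size_of;

    #[allow(dead_code)]
    enum Inner { A(u64), B(u64) }
    #[allow(dead_code)]
    enum Outer { X(Inner), Y(Inner) }

    fn main() {
        // 1-byte tag padded out to the u64's 8-byte alignment, then the payload:
        assert_eq!(size_of::<Inner>(), 16);
        // Nesting adds another padded tag rather than reusing Inner's spare
        // tag values, so the padding compounds:
        assert_eq!(size_of::<Outer>(), 24);
    }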

Even a few more improvements in that space would have enormous impact, compounding with other optimizations like better calling conventions.

JonChesterfield
1 replies
1d3h

Did you want alignment sorting? In general the problem with things like that is the ideal layout is usually architecture and application specific - if my struct has padding in it to push elements onto different cache lines, I don't want the struct reordered.

zamalek
0 replies
1d

Did you want alignment sorting?

Yep. It will probably improve (to be measured) the 80%. Less memory means less bandwidth usage etc.

if my struct has padding in it to push elements onto different cache lines, I don't want the struct reordered.

I did suggest having a repr for situations like yours. Something like #[repr(yeet)]. Optimizing for false sharing etc. is probably well within 5% of code that exists today, and is usually wrapped up in a library that presents a specific data structure.

sheepscreek
4 replies
1d18h

Very interesting but pretty quickly went over my head. I have a question that is slightly related to SIMD and LLVM.

Can someone explain simply where does MLIR fit into all of this? Does it standardize more advanced operations across programming languages - such as linear algebra and convolutions?

Side-note: Mojo has been designed by the creator of LLVM and MLIR to prioritize and optimize vector hardware use, as a language that is similar to Python (and somewhat syntax compatible).

jcranmer
1 replies
1d17h

Can someone explain simply where does MLIR fit into all of this?

It doesn't.

MLIR is a design for a family of intermediate languages (called 'dialects') that allow you to progressively lower high-level languages into low-level code.

fl0ki
0 replies
23h23m

The ML media cycle is so unhinged that I've seen people simply assume out of hand that MLIR stands for Machine Learning Intermediate Representation.

jadodev
0 replies
1d15h

MLIR includes a "linalg" dialect that contains common operations. You can see those here: https://mlir.llvm.org/docs/Dialects/Linalg/

This post is rather unrelated. The linalg dialect can be lowered to LLVM IR, SPIR-V, or you could write your own pass to lower it to e.g. your custom chip.

fpgamlirfanboy
0 replies
1d16h

Side-note: Mojo has been designed by the creator of LLVM and MLIR to prioritize and optimize vector hardware use, as a language that is similar to Python (and somewhat syntax compatible).

Are people getting paid to repeat this ad nauseum?

quotemstr
4 replies
1d17h

The C calling convention kind of sucks. True, can't change the C calling convention, but that doesn't make it any less unfortunate.

We should use every available caller-saved register for arguments and return values, but in the traditional SysV ABI, we use only one register (sometimes two) for return values. If you return a struct Point3D { long x, y, z }, you spill to the stack even though we could damned well put Point3D in rax, rdi, and rsi.
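Concretely, sketched from the Rust side with the C ABI (the same classification applies to the equivalent C struct):

    // Under the SysV x86-64 ABI, a 24-byte aggregate is classified MEMORY:
    // the caller passes a pointer to the return slot in rdi and the struct is
    // written through it, instead of coming back in registers.
    #[repr(C)]
    pub struct Point3D { pub x: i64, pub y: i64, pub z: i64 }

    pub extern "C" fn origin() -> Point3D {
        Point3D { x: 0, y: 0, z: 0 }
    }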

There are other tricks other systems use. For example, if I recall correctly, in SBCL, functions set the carry flag on exit if they're returning multiple values. Wouldn't it be nice if we used the carry flag to indicate, e.g., whether a Result contains an error?

fch42
3 replies
1d11h

"sucks" is a strong word but with respect to return values, you're right. The C calling conventions, everywhere really, support what C supports - returning one argument. Well, not even that (struct returns ... nope). Kind of "who'd have thought" in C I guess. And then there's the C++ argument "just make it inline then".

On the other hand, memory spills happen. For SPARC, for example, the generous register space (windows) ended up with lots of unused regs in simple functions and a cache-busting huge stack footprint, definitely so if you ever spilled the register ring. Even with all the mov in x86 (and there is always lots of it, at least in compiled C code) to rearrange data to "where it needed to be", it often ended up faster.

When you only look at the callee code (code generated for a given function signature), it's tempting to say "oh it'll definitely be fastest if this arg is here and that return there". You don't know the callers though. There's no guarantee the argument marshalling will end up "pass through" or the returns are "hot" consumed. Say, a struct Point { x: i32, y: i32, z: i32 } as arg/return; if the caller does something like mystruct.deepinside.point[i] = func(mystruct.deepinside.point[i]) in a loop then moving it in/out of regs may be overhead or even prevent vectorisation. But the callee cannot know. Unless... the compiler can see both and inline (back to the C++ excuse). Yes, for function call chaining javascript/rust style it might be nice/useful "in principle". But in practice only if the compiler has enough caller/callee insight to keep the hot working set "passthrough" (no spills).

The lowest hanging fruit on calling is probably to remove the "functions return one primitive thing" that's ingrained in the C ABIs almost everywhere. For the rest ? A lot of benchmarking and code generation statistics. I'd love to see more of that. Even if it's dry stuff.

flohofwoe
2 replies
1d9h

Well, not even that (struct returns ... nope).

C compilers actually pack small struct return values into registers:

https://godbolt.org/z/51q5se86s

It's just limited: on x86-64, GCC and Clang use up to two registers, while MSVC only uses one.

Also, IMHO there is no such thing as a "C calling convention", there are many different calling conventions that are defined by the various runtime environments (usually the combination of CPU architecture and operating system). C compilers just must adhere to those CPU+OS calling conventions like any other language that wants to interact directly with the operating system.

IMHO the whole performance angle is a bit overblown though, for 'high frequency functions' the compiler should inline the function body anyway. And for situations where that's not possible (e.g. calling into DLLs), the DLL should expose an API that doesn't require such 'high frequency functions' in the first place.

fch42
1 replies
1d7h

Also, IMHO there is no such thing as a "C calling convention", there are many different calling conventions [ ... ]

I did not say that. I said "C calling conventions" (plural). Rather aware of the fact that the devil is in the detail here ... heck, if you want it all, back in the bad old days, even the same compiler supported/used multiple ("fastcall" & Co, or on Win 3.x "pascal" for system interfaces, or the various ARM ABIs, ...).

dzaima
0 replies
1d3h

Clang still has some alternative calling conventions via __attribute__((X)) for individual functions with a bunch of options[0], though none just extend the set of arguments passed via GPRs (closest seems to be preserve_none with 12 arguments passed by register, but it also unconditionally gets rid of all callee-saved registers; preserve_most is nice for rarely-taken paths, though until clang-17 it was broken on functions which returned things).

[0]: https://clang.llvm.org/docs/AttributeReference.html#calling-...

dhosek
3 replies
1d15h

I just spent a bunch of time on inspect element trying to figure out how the section headings are set at an angle and (at least with Safari tools), I’m stumped. So how did he do this?

caperfee
0 replies
1d15h

The style is on the `.post-title` element: `transform: skewY(-2deg) translate(-1rem, -0.4rem);`

aaron_seattle2
0 replies
1d15h

    h1, h2, h3, h4, h5, h6 {
        transform: skewY(-2deg) translate(-1rem, 0rem);
        transform-origin: top;
        font-style: italic;
        text-decoration-line: underline;
        text-decoration-color: goldenrod;
        text-underline-offset: 4%;
        text-decoration-thickness: .25ex;
    }

repelsteeltje
2 replies
1d8h

Can someone explain the “Diana’s silk dress cost $89” mnemonic on x86 reference?

repelsteeltje
0 replies
1d8h

Hah thanks!

Suppose I should have known that one, but I never did meaningful assembly programming beyond 6502 and ARM32.

Animats
2 replies
1d17h

Given that the current Rust compiler does aggressive inlining and then optimizes, is this worth the trouble? If the function being called is tiny, it should be inlined. If it's big, you're probably going to spend some time in it and the call overhead is minor.

jonstewart
0 replies
1d17h

Probably? A complex function that’s not a good fit for inlining will probably access memory a few times and those accesses are likely to be the bottlenecks for the function. Passing on the stack squeezes that bottleneck tighter — more cache pressure, load/stores, etc. If Rust can pass arguments optimally in a decent ratio of function calls, not only is it avoiding the several clocks of L1 access, it’s hopefully letting the CPU get to those essential memory bottlenecks faster. There are probably several percentage points of win here…? But I am drinking wine and not doing the math, so…

celeritascelery
0 replies
1d17h

Runtime functions (eg dyn Trait) can’t be inlined for one, so this would help there. But also if you can make calls cheaper then you don’t have to be so aggressive with inlining, which can help with code size and compile times.

AceJohnny2
2 replies
1d18h

In contrast: "How Swift Achieved Dynamic Linking Where Rust Couldn't" (2019) [1]

On the one hand I'm disappointed that Rust still doesn't have a calling convention for Rust-level semantics. On the other hand the above article demonstrates the tremendous amount of work that's required to get there. Apple was deeply motivated to build this as a requirement to make Swift a viable system language that applications could rely on, but Rust does not have that kind of backing.

[1] https://faultlore.com/blah/swift-abi/

HN discussion: https://news.ycombinator.com/item?id=21488415

fl0ki
1 replies
1d18h

It's only fair to point out that Swift's approach has runtime costs. It would be good to have more supported options for this tradeoff in Rust, including but not limited to https://github.com/rust-lang/rfcs/pull/3470

ninkendo
0 replies
1d18h

Notably these runtime costs only occur if you’re calling into another library. For calls within a given swift library, you don’t incur the runtime costs: size checks are elided (since size is known), calls can be inlined, generics are monomorphized… the costs only happen when you’re calling into code that the compiler can’t see.

retox
1 replies
1d18h

Meta: the minimap is quite interesting, it's 'just' a smaller copy of all the content.

edflsafoiewq
0 replies
1d17h

Clever! Should probably have aria-hidden though.

m463
1 replies
1d17h

interesting website - the title text is slanted.

Sometimes people who dig deep into the technical details end up being creative with those details.

eviks
0 replies
1d9h

True, creative, but usually in a quality-degrading way like here (slanted text is harder to read, also due to the underline being too thick, and takes more space) or like with those poorly legible bg/fg color combinations.

dwattttt
1 replies
1d18h

If a non-polymorphic, non-inline function may have its address taken (as a function pointer), either because it is exported out of the crate or the crate takes a function pointer to it, generate a shim that uses -Zcallconv=legacy and immediately tail-calls the real implementation. This is necessary to preserve function pointer equality.

If the legacy shim tail calls the Rust-calling-convention function, won't that prevent it from fixing any return value differences in the calling convention?

JonChesterfield
0 replies
1d18h

Yes. People tend to forget about the return half of the calling convention though so it's an understandable typographical error.

jayachandranpm
0 replies
11h44m

good