
Don't pass structs bigger than 16 bytes on AMD64

pclmulqdq
43 replies
1d13h

The cost of argument passing is rarely well understood, and I'm glad someone wrote this. People routinely pass 24-byte objects by value in places like Google, and the cost of that practice doesn't show up in a profiler because it's spread across every function call.

elromulous
23 replies
1d13h

"in places like Google".

Are you speaking from experience?

As a former googler, I can say with certainty that the guideline for passing any non-primitive is pointer or ref.

string_view might be the only exception I can think of.

e____g
13 replies
1d13h

I saw std::function and std::string (e.g. TotW 117, https://abseil.io/tips/117) being passed by value a lot in newer google3 code. Both are larger than 16 bytes.

eco
7 replies
1d12h

This is done so you can use std::move to take ownership of the allocated memory in these objects rather than do a new allocation. Passing by value rather than by rvalue reference lets your function be more flexible at the call site: you can pass an rvalue/move or just make a copy at the call site, which means the caller (who actually knows whether a copy or a move is more appropriate) gets to control how the memory gets allocated.

An unnecessary memory allocation is much more of a performance hit than suboptimal calling convention.
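
A minimal sketch of that sink-argument pattern (names purely illustrative):

    #include <string>
    #include <utility>

    class Widget {
      std::string name_;

     public:
      // By-value "sink" parameter: the caller decides copy vs. move.
      void set_name(std::string name) { name_ = std::move(name); }
    };

    void demo() {
      Widget w;
      std::string s = "long enough to need a heap allocation, presumably";
      w.set_name(s);             // caller still needs s: one copy, made here
      w.set_name(std::move(s));  // caller is done with s: no allocation at all
    }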

jstimpfle
6 replies
1d9h

In that case, the optimal interface should take std::string&&, no? But it's awkward.

andersa
4 replies
1d9h

Wouldn't this be very annoying to work with, because now you have to explicitly move or copy the string whenever you want to construct one of these objects?

ismailmaj
1 replies
1d5h

The function could accept a universal reference instead of an rvalue reference; this avoids the dance the caller has to do to pass a copy.

IMO it's hard to beat pass by value considering both performance and cognitive load.

mort96
0 replies
1d4h

Yeah, making it accept a universal reference would fix it...

...but that requires the argument to be a type from a template D: so you'd have to write:

    template<typename String = std::string>
    void dosomething(String &&str);
...and that's not quite right either, since you'd want it to be either an rvalue reference or a const lvalue reference.
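
For what it's worth, a sketch of how the unconstrained forwarding-reference version behaves (same dosomething as above):

    #include <string>
    #include <utility>

    // Binds to lvalues (String deduced as std::string&) and rvalues
    // (String deduced as std::string) alike:
    template <typename String = std::string>
    void dosomething(String &&str) {
      std::string owned = std::forward<String>(str);  // copies lvalues, moves rvalues
      (void)owned;
    }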

secondcoming
0 replies
1d4h

Explicitly having to copy or move is a desired coding style IMO.

nitnelave
0 replies
1d8h

It's kinda what Rust forces you to do, except that std::move is implied. Anything taken by value is equivalent to taking by && unless the type is explicitly marked as Copy (i.e. it can be trivially copied and the copies are implicit).

But yeah, in a c++ codebase, good modern practices are often verbose and clunky.

wbkang
0 replies
2h12m

The vanilla by-value interface often does the right thing depending on the context (e.g., if an unnamed return value is passed in, it can be move-constructed; otherwise it's copied). I've seen some codebases provide an overload for a const ref alongside the plain by-value one.

jeffbee
1 replies
1d12h

Google builds non-PIE, non-PIC, static, profile-guided, link-time-optimized, and post-link-optimized binaries and probably DGAF about calling conventions.

pclmulqdq
0 replies
1d12h

I have seen the assembly output of Google code, and I will say that my previous comment still stands.

vitus
0 replies
1d12h

Passing std::function by value is almost definitely wrong these days, with absl::AnyInvocable (if you need to store the callable) and absl::FunctionRef (if you don't). Rough analogues in the standard are std::move_only_function (C++23) and std::function_ref (C++26).

std::string in the case you cited is only really relevant if you std::move() into it and you would otherwise incur a string copy. Yes, it's bigger than 16 bytes (24 bytes), but that pales in comparison to the alternative.

(Taking std::string&& would eliminate the possibility of misuse / accidental copies, but that pattern is generally discouraged by Google's style guide for various reasons.)

Also, just because you see a certain pattern an awful lot even at Google doesn't mean that it's best practice -- there are plenty of instances of protobufs being passed by value...
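
A rough sketch of that split, assuming Abseil is available (signatures illustrative):

    #include "absl/functional/any_invocable.h"
    #include "absl/functional/function_ref.h"

    // Non-owning view of a callable: cheap to pass by value, never stored.
    void for_each_row(absl::FunctionRef<void(int)> fn);

    // Owning, move-only wrapper: fine to std::move into a member and keep.
    class Scheduler {
     public:
      void enqueue(absl::AnyInvocable<void()> task);
    };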

leni536
0 replies
1d10h

Non-trivially-copyable types are never passed in registers, regardless of size.
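
A sketch of what that means in practice, assuming the Itanium C++ ABI (names illustrative):

    // Only 16 bytes, but the user-defined destructor makes it non-trivially
    // copyable, so it is passed by invisible reference anyway:
    struct Handle {
      void *ptr;
      long id;
      ~Handle();  // any non-trivial copy ctor, move ctor, or dtor triggers this
    };

    void take(Handle h);  // effectively compiled as take(Handle *h)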

jtasdlfj234
0 replies
1d1h

Please prove me wrong on these points. My current belief and understanding is that:

1. `foo(T)` is polymorphic over both `T::T(const T&)` and `T::T(T&&)`. This gives you the benefits of both pass-by-move (using `std::move` as needed) and pass-by-copy.

2. Usage of `foo(T&&)` signals a code smell or an anti-pattern; `foo(T)` should be used instead, unless it is perfect forwarding / a universal reference: `template <typename T> foo(T&&)`.

pclmulqdq
3 replies
1d12h

Yes. A lot of 24-byte structs at Google are passed by value.

japanman185
2 replies
1d10h

Indeed they are. I read about it on Hacker News.

pclmulqdq
0 replies
1d2h

I learned about it when I was at Google working on performance.

ductsurprise
0 replies
1d8h

0o...

dataflow
3 replies
1d10h

string_view might be the only exception I can think of.

What about std::span? std::optional? etc.

nemetroid
1 replies
1d8h

std::span should apply, but std::optional is at least as large as the type it wraps.

fyrn_
0 replies
1d7h

Almost always 8 + the size of the type, since alongside the type itself there's a one-byte bool plus 7 padding bytes.
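
A quick check of that claim (the "8 +" holds for 8-byte-aligned payloads; smaller ones pad less; none of this is guaranteed by the standard):

    #include <cstdint>
    #include <optional>

    static_assert(sizeof(std::optional<std::int64_t>) == 16);  // 8 + 1 bool + 7 padding
    static_assert(sizeof(std::optional<std::int32_t>) == 8);   // smaller alignment pads less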

vitus
0 replies
1d

View types like span, string_view, FunctionRef: absolutely pass by value. They're designed to be cheap to copy, and they provide type erasure which can make your API more flexible.

You also usually want to pass smart pointers (unique_ptr, shared_ptr) by value to avoid breaking your ownership model, or pass by raw pointer or reference to the held data. The obvious case that comes to mind where you'd want to write const unique_ptr<T>& is when iterating over a vector<unique_ptr<T>>. Otherwise, pass ptr.get() for an optional param, or *ptr for a required param where you've already checked ptr != nullptr.

Wrapper types like optional, variant, StatusOr? Usually pass by reference, unless the wrapped type is small (<= 8 bytes).

One common pattern where I see wide structs being passed by value is the use of option structs (https://abseil.io/tips/173) that are usually constructed via designated initializers. There is some marginal benefit to being able to std::move() out individual fields but it's not really a big deal either way, as said computation is usually only done once on program initialization.
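
A minimal sketch of that option-struct pattern (names illustrative, C++20 designated initializers assumed):

    #include <string>

    struct Connection {};

    struct ConnectOptions {
      int timeout_ms = 5000;
      bool retry = true;
      std::string proxy;  // pushes the struct well past 16 bytes
    };

    // Wide struct passed by value, but typically only once, at setup time:
    Connection Connect(const std::string &host, ConnectOptions opts) {
      return {};
    }

    Connection c = Connect("example.com", {.timeout_ms = 250, .retry = false});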

medler
0 replies
23h54m

Passing larger objects by value is allowed or even encouraged if it lets you avoid a copy (e.g. if you’re about to move the object)

Cloudef
8 replies
1d11h

Pass by value / pass by ref is quite a bit of mental overhead, as it effectively affects your ABI/API. Zig tries not to force this, so as long as you "pass by value", the compiler can decide to actually pass by reference. It does expose this kind of footgun though: https://github.com/ziglang/zig/issues/5973#issuecomment-1330...

gnulinux
5 replies
1d11h

Oof that's a very nasty bug. Is this still relevant in Zig or is there a workaround in the language? I'm not familiar with Zig, heard some good things about it, but this looks like a showstopper.

slimsag
2 replies
1d10h

still relevant today, but as someone who writes a lot of Zig code I haven't ever really encountered it in the wild. You definitely could, though, and it'd be wildly confusing.

Luckily, the Zig core team has recognized it is an issue and plan to address it before 1.0 :)

Cloudef
1 replies
1d10h

It's rare to hit it, but if you do, having it happen silently is not ideal for sure.

ForkMeOnTinder
0 replies
1d4h

I still think noalias-by-default is the way to fix this.

https://github.com/ziglang/zig/issues/1108

You get all the benefits of Zig being able to choose the function ABI, but if the optimization would have caused a bug, you'll get an immediate panic at the function entrypoint, instead of silently corrupted data.

avgcorrection
0 replies
1d2h

The issue is still Open.

MatthiasPortzel
0 replies
1d2h

It’s a bit confusing when compared with other programming languages. But the Zig docs’ introduction to structs are pretty clear that they can be pass-by-value or pass-by-reference at the compiler’s discretion. If you need a copy, make a copy; if you need a reference, pass a pointer; if you don’t care, the compiler will pick one.

cassepipe
1 replies
1d2h

I think this is the same in the still experimental Carbon

There is also another paradigm where parameters are marked as in, out, or inout, as in D (?) and cpp2, which makes the intent clear.

codethief
0 replies
14h33m

I think this is the same in the still experimental Carbon

Apparently, it's not (anymore?):

https://github.com/ziglang/zig/issues/5973#issuecomment-1801...

loeg
5 replies
1d13h

Does the C++ ABI also spill 24-byte objects to the stack for each call? I guess I don't expect std::string or std::function parameters to be fast but it's still surprising.

pclmulqdq
4 replies
1d12h

Spilling big objects up the call stack is often not nearly as bad. Most of the time this happens, the object being spilled is constructed during the function, so the copy on return is actually elided.

bugbuddy
3 replies
1d11h

That’s only true if the function takes no argument. Otherwise the copy on return is unavoidable.

pclmulqdq
1 replies
1d11h
gpderetta
0 replies
1d8h

I think parent is confused with NRVO typically not being triggered when returning a function argument.

kukkamario
0 replies
1d9h

In the AMD64 ABI, anything larger (that doesn't fit in registers) is returned via what is basically an output-pointer argument. The caller reserves space for the return value and gives the function a pointer, which the function then fills with the return value. Both sides are easy to optimize, and no unnecessary copies are made: the function can construct the return value directly at the given address, and the caller can pass a pointer straight to a variable when the return value is used to initialize it.
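
A sketch of that lowering (illustrative; the hidden parameter is what the ABI calls the sret pointer):

    struct Big { long a, b, c; };  // 24 bytes: too big for the return registers

    // Lowered by the compiler to roughly: void make_big(Big *result)
    Big make_big() {
      return Big{1, 2, 3};  // constructed directly into the caller's storage
    }

    void use() {
      Big b = make_big();  // b itself is the hidden 'result'; no extra copy
    }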

kccqzy
2 replies
23h26m

The tradeoff here is that if you pass a pointer to that 24-byte object instead, when you actually need to use the object you need to deref that pointer. But nothing guarantees that the pointed object is nearby! You could very well cause a cache miss and wait a long time (like 100 nanoseconds) to fetch that 24-byte object from main memory.

If that same object is directly passed it's just on the stack. So it's most likely in the cache.

Remnant44
1 replies
23h7m

Yes, but this concern is orthogonal to calling convention. If the parent function does anything to the object, it will be in cache already. If it does not, but loads the fields into registers, the parent will take the cache miss instead of the function. You neither gain nor lose anything...

kccqzy
0 replies
21h31m

It's usually not about a single parent function and a called function. It's usually a long chain of calls where functions keep passing them around, from the point of construction. Imagine some function first creates that 24-byte object, then passes it around by pointer: ten function calls later, dereferencing that pointer will need access to main memory. Now imagine that first function passes the object by value; ten function calls later the contents of the object may be duplicated on the stack ten times, but there's no cache miss.

It's a tradeoff between reducing memory usage and reducing cache misses.

PaulHoule
0 replies
1d4h

+1 for pointing out that profiling often fails entirely to find widely distributed overhead, like that built into calling conventions.

loeg
21 replies
1d13h

The issue here is the SysV amd64 ABI. You could also just make your language-internal ABI not be SysV? As long as these aren't exposed to SysV C callers, you can use any calling convention you want.

https://llvm.org/docs/LangRef.html#calling-conventions

For those curious, the relevant diff in neatlang is: https://github.com/Neat-Lang/neat/commit/f4ba38cefc1e26631a5.... It looks much more involved than changing the emitted LLVM calling conventions. Possibly the author wants these types exposed with some deterministic calling convention to C programs.

userbinator
19 replies
1d13h

Or ABIs in general, really.

As any Asm programmer can tell you, this is one of the low-hanging fruit where compilers can easily be beaten --- don't blindly follow convention; do what makes the most sense in a specific scenario.

5-
16 replies
1d8h

indeed. another fun thing that c-like compilers generally don't do is multiple entry points in a function.

consider a pair of functions foo and foo0 that differ only in that the former performs an additional action on its argument -- perhaps refcount adjustment or type conversion.

you can then do (in a register-based abi like amd64 sysv):

  foo:
  do the action ;fallthrough
  foo0:
  rest of the function follows
  
and have essentially two functions for the price of one.

zekica
7 replies
1d8h

Won't tail call optimization mostly mitigate this?

rsaxvc
3 replies
1d5h

Rarely if ever. This isn't about tail-calling, it's about function placement in the final image to enable branch elimination. There's no call between foo and foo0, but many of the requirements for a tail call are also required here.

You can structure the code so foo calls foo0, but the compiler and linker have to work together to pull that off, and I don't think GCC and clang do so.

If the functions are being built into split sections, generally no.

If they aren't, and foo isn't called but foo0 is, I've never seen a toolchain remove just foo but not foo0, which would be nice to have.

sim7c00
2 replies
1d4h

I think tail call optimization is specifically for this: it lets a function jump to another function rather than returning and then calling it. Isn't that what's essentially described here? (honest question - i am always doubting my sanity looking at this stuff :DD)

I can imagine that if the higher-level code doesn't fit a specific pattern, the compiler might struggle to recognize an opportunity for optimization and skip it - the higher-level programmer could potentially arrange code in ways the optimizers recognize better.

DSMan195276
1 replies
1d3h

it will let a function jump to another function ... isn't that what's essentially described here?

Not exactly, they're describing doing it without a jump - the first function simply ends at the start of the second function, so the CPU starts running the second function directly after the first with no jump necessary.

Edit: If you're saying a tail call could enable such an optimization, you're right, but it still requires placing the functions in the right spots to eliminate the jump entirely, which is hard.

ithkuil
0 replies
1d3h

Not sure how often it's worth the effort though. Unconditional branches while not free are not quite the same performance trap as conditional branches

amalcon
1 replies
1d3h

Inlining is what can mostly mitigate this. You'd write:

  void foo0();
  void foo() {
    // do something
    return foo0();
  }
  void foo0() {
    // do the rest of the things
  }
If you can convince the compiler to inline foo0 into foo, then you get almost what you want. The compiler technically could even use the same code for both (saving some binary size and thus RAM, thus getting the exact same result), though AFAIK this sort of optimization is unusual.

kccqzy
0 replies
23h30m

Not inlining, but what I've seen in the real world is that the function foo ends with something like a "jmp foo0" so everything is good. (This jmp is almost free with 100% branch prediction.) No need to inline. Just do a proper tail call optimization. Without symbols you can't tell whether they are two functions or just basic blocks in a single function.

pertymcpert
0 replies
1d7h

Yeah it should.

vitiral
2 replies
1d2h

Even more fun is using jmp to chain functions together -- only works when you don't have locals and have complete knowledge of your registers!

dreamcompiler
1 replies
12h31m

CALL and RET are just JMPs that automatically keep track of return locations using a stack. But...

--You don't have to use a stack. A stack is a convenient data structure for this purpose, but there are other ways to do it.

--You don't even need to keep track of return points at all. That's what continuations are about: Nothing ever returns.

sitkack
0 replies
7h28m
bee_rider
2 replies
1d3h

Hah, that’s neat.

I was at a place that had a convention to do

    Function blah(…)
    Argument validation
    Call realBlah(…)
That way internally we could just call realBlah if we needed the functionality of blah without the validation. So I guess we could have used this all over the place.

I guess (although could be wrong, this is the first time I’ve seen it) this strategy is basically incompatible with inlining, though?

kitkat_new
1 replies
1d1h

why would it?

recursive inlining

bee_rider
0 replies
21h43m

There’s apparently some deep and obvious connection between recursion and multiple entry points, but I don’t know what it is.

mananaysiempre
1 replies
1d7h

Fun fact: Fortran has an ENTRY statement specifically for this purpose (obsoleted in the 2008 version).

pklausler
0 replies
1d3h

This is also why “entry” was a reserved word in the original C language. (So was “fortran”.)

PaulHoule
1 replies
1d4h

Yeah, for the little programs I write on AVR-8, it drives me nuts how much meaningless activity (moving the stack pointer around) the C compiler does compared to assembly. For a PoV display engine, for instance, you might be able to reserve a few registers for the interrupt handler and still keep your inner-loop variables entirely in registers.

ajb
0 replies
23h1m

Link time optimisation is supposed to fix that (the meaningless activity). Especially on a bare metal build where you can be certain what needs to call your function. But it wouldn't surprise me if it didn't always.

Unlikely to be able to reserve a register though

sim7c00
0 replies
1d4h

first question that came to my mind you answered :) thanks! i think it's interesting that a lot of stuff adheres to such ABIs, especially since they were conceived quite a while ago and often lean towards compatibility with older CPUs; newer ones with more and wider registers might have features that could improve this without making the structs smaller. i guess making software for specific hardware classes / generations isn't super attractive since it'll be unusable on some machines, but having compilers that _can_ produce it might be cool if you want to super-optimise the code running on your system towards your system's hardware features.

gavinhoward
18 replies
1d13h

This makes me super happy that my C array types are only 16 bytes.

saagarjha
9 replies
1d8h

Where do you put the capacity?

gavinhoward
4 replies
1d2h

My array types are not meant to be changed, so no capacity is needed. I have separate resizable ones.

tialaramex
3 replies
1d

When you say "array types" it seems like people expected you to mean a growable array (C++ std::vector, Rust's Vec, or say Java's ArrayList).

But if you meant an actual array as in Rust's [T; N] then it's weird to talk about them as if they've got some specific size, their size is a parameter (N) of the type.

The size of Rust's [u8; 16] or C's unsigned char [16] is 16 bytes but like, duh. And there's no magic here, [u8; 24] or unsigned char [24] is 24 bytes.

gavinhoward
2 replies
1d

In my code, my array types are a pointer and a length, regardless of the length of the array.

This is so I can use regular C pointers to pass them to system functions that expect pointers. Because C still uses just pointers. But I have the length for bounds checking in my own code.

But my resizable types are much bigger. Probably 32 bytes because they store a destructor too. I pass them by pointer because plain pointers are almost always just one item, meaning no bounds checking is necessary.

Yes, that does mean the actual array is two indirections away, but that style gives me a lot of safety because array indexing is a code smell.
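
A guess at the shape of such a type, as a minimal sketch (names hypothetical, not the actual code):

    #include <assert.h>
    #include <stddef.h>

    // A 16-byte, non-resizable "fat pointer": fits in two registers under SysV.
    typedef struct {
      unsigned char *ptr;
      size_t len;
    } Slice;

    unsigned char slice_get(Slice s, size_t i) {
      assert(i < s.len);  // bounds check in our own code...
      return s.ptr[i];    // ...while s.ptr can still go straight to C APIs
    }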

tialaramex
1 replies
23h23m

This seems like a lot of lost optimization opportunities compared to a language which gets this right out of the box.

Also, in most cases destructors don't have associated local state, so they needn't take up space in each object. All C objects have non-zero size, so an object representing the destructor takes up space even if it has no state. In C++ there's a hack to avoid paying this price, but in C there is not.

gavinhoward
0 replies
22h21m

This seems like a lot of lost optimization opportunities compared to a language which gets this right out of the box.

Well, yeah, but I hate all other languages [1], and I'm willing to pay the price in C.

That same sort of thing allowed me to implement RAII in C, though.

[1]: https://gavinhoward.com/2023/02/why-i-use-c-when-i-believe-i...

alcover
2 replies
1d4h

  struct Array{void *ptr; uint64_t cap;};
  sizeof(struct Array) // 16

norir
1 replies
1d2h

I suspect the OP expected something like:

  struct Array{void *ptr; uint64_t len; uint64_t cap;};
  sizeof(struct Array) // 24
At any rate, a simple way to get to 16 bytes is the following:

  struct Array{void *ptr; uint32_t len; uint32_t cap;};
How often does one really need a general array with more than 4B elements?

alcover
0 replies
18h57m

You are right. My reply was a bit silly.

Now thanks to alignment you could use spare bits in the pointer to afford bigger lengths.

  struct Array{int64_t ptr:60; uint64_t len:34, cap:34;};

sesuximo
0 replies
10h32m

1) Maybe you could store the size and capacity bitwise-or'd on top of each other, if you restrict capacity to a power of two.

2) you could store the capacity before the array payload
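
A minimal C-style sketch of the second suggestion (illustrative; header-before-payload):

    #include <stdlib.h>

    // The capacity lives in a header just before the payload, so the
    // user-visible handle stays a single bare pointer.
    void *array_alloc(size_t cap) {
      size_t *base = (size_t *)malloc(sizeof(size_t) + cap);
      if (!base) return NULL;
      *base = cap;
      return base + 1;  // caller sees only the payload
    }

    size_t array_cap(void *payload) { return ((size_t *)payload)[-1]; }

    void array_free(void *payload) { free((size_t *)payload - 1); }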

chlorion
3 replies
22h45m

I'm a little bit confused on what this means.

By array I guess you mean a dynamically sized, but not dynamically-resizable pointer and size/length pair? So you aren't storing the "capacity" like Rust's Vec would be doing.

You can do the same thing in Rust with Box<[T]> if I am understanding you correctly. Box<[T]> in this specific case is a fat pointer, which is a pointer to the underlying allocation and a length.

One of the issues with Box though, at least as of the last time I looked at it, is that the only way you could create a Box<[T]> without unsafe or nightly was to create a Vec, and then call into_boxed_slice on it. The conversion from Vec to Box actually causes a new allocation to be created if the size and capacity fields in the Vec are not equal. In C it would be possible to reuse the Vec's buffer, but dealloc (and all other alloc-related functions, it seems) in Rust requires passing in information about the layout of the underlying allocation, and size is part of the layout!

In C++ I guess you would want to use std::unique_ptr<T[]>, and manually store the length, which is not great but still works. I'm not sure if unique_ptr is guaranteed to be the size of a pointer or not, so this may or may not work.
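
A minimal sketch of that C++ shape (illustrative, with hand-rolled bounds checking):

    #include <cstddef>
    #include <memory>
    #include <stdexcept>

    // Fixed-size owning slice: unique_ptr<T[]> plus a manually stored length.
    template <typename T>
    class FixedArray {
      std::unique_ptr<T[]> data_;
      std::size_t len_;

     public:
      explicit FixedArray(std::size_t n)
          : data_(std::make_unique<T[]>(n)), len_(n) {}
      std::size_t size() const { return len_; }
      T &at(std::size_t i) {
        if (i >= len_) throw std::out_of_range("FixedArray");
        return data_[i];
      }
    };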

Regular statically-sized arrays are the same size in each language ofc. In Rust and C++ statically-sized arrays do have one benefit though, you get bounds checking on them "for free". Since the length is baked into the type, the accessor methods know the bounds at compile time, so no runtime length information is required to do runtime bounds checking! You can do this in C by hand for each array you define but that is impractical, you could probably use a macro to do this though.

If I did understand what you meant, I think this brings up an interesting topic. Having a first-class dynamically-sized, non-resizable array type can be pretty useful! Rust can already do this awkwardly with Box, but C++ currently doesn't have a way to do this without manually storing the length and doing manual bounds checking that I know of. It would not be terribly difficult to implement, and there probably are libraries that exist for it, but I still think it's interesting.

# comment for Vec::into_boxed_slice https://doc.rust-lang.org/src/alloc/vec/mod.rs.html#1075

# rust std::alloc::dealloc requires std::alloc::Layout https://doc.rust-lang.org/std/alloc/fn.dealloc.html

gavinhoward
2 replies
22h22m

I meant static arrays.

I have to use a pointer+length pair because many system functions in C require a pointer, but I want bounds for bounds checking.

Yes, I do my own arrays in C.

I also have dynamic arrays, but those are not 16 bytes, and I treat them differently.

chlorion
1 replies
20h56m

Do you have your own array library that you use in your projects? If so that's pretty cool and it would be interesting to see.

Edit: i looked at your profile information and have discovered lots of fun stuff

gavinhoward
0 replies
20h32m

Thank you!

As you have probably discovered, my array stuff is just part of a monorepo. But yes. :)

Edit: The code at the Yzena one is not up to date because I've had to keep commits local. But I have 1200+ commits since, and I plan to make them public in April.

flohofwoe
1 replies
1d8h

This will still spill on the stack when compiling with MSVC (the cutoff size is 8 bytes there).

gavinhoward
0 replies
1d2h

Oh, thank you for letting me know.

avgcorrection
1 replies
1d1h

Yeah, it looks like this would be a problem for Rust's `String` (owned), though not `&str` (borrowed).

tialaramex
0 replies
1d1h

Inside Rust it doesn't matter, because Rust gets to pick its own ABI rules.

So this only matters for the FFI case and it's probably usually a bad idea to give a mutable String (as opposed to a read-only &str) to foreign language code.

Likewise for Vec<T> and &[T]. Indeed, underneath, Rust's String is literally Vec<u8> and &str is literally &[u8], but in both cases with the explicit requirement that the bytes are valid UTF-8 text.

jeffbee
10 replies
1d12h

I classify this kind of article as "just enough knowledge to be a pain in the ass". Even if you compile it separately per the instructions, trying to force the compiler to generate ABI-callable functions, you can still undo this error with LTO. Building this program with LTO is dramatically faster in both modes compared to either mode of the non-LTO program.

If your program is performance-sensitive, profile it, peak-optimize it, and only then commit jackassery such as unpacking structs into arguments.

rwmj
6 replies
1d8h

Plus the guy is talking about the supreme importance of performance, but is using reference counting.

FeepingCreature
5 replies
1d5h

(Author) The refcounting version is about as slow as the garbage collected D version on the benchmark. Refcounting operations can often be elided if you optimize a bit.

And this is not a "supreme importance of performance" microoptimization. This single change moved me from place 23 to place 5. That's a 2x speedup!

rwmj
2 replies
1d4h

You won't see the problem with refcounting in microbenchmarks. It manifests when you have multiple threads accessing single pieces of data, or are under memory pressure. Refcounting turns reads into writes, unless steps are taken to avoid that, and/or you store the refcounts separately from the data.

FeepingCreature
1 replies
1d3h

Multiple threads accessing single pieces of data is always slow. The standard answer is "don't do that"; you can do that just as well with GC or manual management.

rwmj
0 replies
9h33m

Confirming you have no idea what you're talking about.

jeffbee
1 replies
1d4h

It only changes your rank in this benchmark because the benchmark fails to use whole program optimization, ie as a benchmark it is meaningless.

FeepingCreature
0 replies
1d3h

WHOPR doesn't fix the issue, because runtime calls (which happen a lot with arrays) shouldn't be inlined; all that achieves is blowing out your instruction cache.

vlovich123
2 replies
1d11h

That would be good advice, except I've yet to see a toolchain that can make this kind of cost visible. It's distributed throughout your codebase, and I've not seen any profiler that can show that impact unless you're lucky and it becomes a hotspot. This applies to nearly all compiler-generated code. Valgrind can measure it (a sampling profiler probably can't), but there's no tooling to highlight diffuse code-gen problems.

bugbuddy
1 replies
1d11h

This is clearly a hot path, and a non-inlined call on a hot path is probably the first thing any engineer worth his salt would attempt to fix. If and how it can be fixed is a different question.

vlovich123
0 replies
1d10h

Just because something is in the hot path does not mean a profiler will be able to highlight this as a hotspot.

bugbuddy
10 replies
1d12h

In the provided example, you can fix it without affecting any caller by changing the parameter type from "struct Vector" to pass-by-reference, "const struct Vector &". A lot of C++ code that I have seen with pointer bugs needlessly used pointers where passing by reference would have worked while being easier and safer to use.

giovannibajo1
5 replies
1d12h

That will still require the compiler to serialize the three registers to the stack, to be able to pass the pointer to the structure to the callee. It seems like the described benefit is avoiding any serialization from registers to stack, which cannot be avoided with pass-by-reference.

fsckboy
2 replies
1d9h

That will still require the compiler to serialize the three registers to the stack, to be able to pass the pointer to the structure to the callee.

why can't it simply pass a pointer to the struct (it's probably already on the stack) without rewriting the struct to the stack? isn't that what a reference is?

kukkamario
1 replies
1d9h

But it often isn't in the stack. This is a vector type so it is often modified and used as part of math operations. Each vector field is probably in some register because it was used to calculate something and then has to be stored back to stack to get valid data for the reference.

fsckboy
0 replies
23h3m

if the current values of the source-code struct fields are in registers, there are two options: either the calling function and struct are so small and so compiler-optimized that there is no memory allocation for the struct at all, or there is an allocation and it's just dirty and not updated. Which means: update it and call, or spill and call.

You want to call a function that is not expecting its arguments to be in registers, and you don't have unlimited registers on this hardware at this time, so I don't understand all the hand-wringing about either option. I guess what I'm saying is that this is all being treated like "because we assume optimization and we know how optimization works, we're entitled to have what's important in registers all the time so things will go faster, so this must be a bug and we have to fix it."

The actual solution is to inline the callee and rely on the compiler, switch to asm and guarantee it by hand, or create a new language with register-calling or data-flow semantics different from what you have now. The conversation taking place here sounds to me like relying on undefined behavior: something we used to do because we knew we could rely on it, but you can't any more.

bugbuddy
1 replies
1d12h

But that would only be possible if you write out all the struct fields you are accessing into function parameters. If the struct is complex with a lot of fields, then you would end up with a messy function signature. Also, I am sure there is a limit to how many parameters you can have before this optimization stops working.

loeg
0 replies
1d10h

But that would only be possible if you write out all the struct fields you are accessing into function parameters. If the struct is complex with a lot of fields, then you would end up with a messy function signature.

This is the approach the article's author took.

kukkamario
1 replies
1d9h

No. Actually that is the whole problem here. That is pretty much exactly what the compiler does thanks to ABI. ABI says that the value must be passed by pointer so it has to store it somewhere to get the pointer, which is exactly the same that would happen if we made that explicit by using const-ref. By changing to use separate arguments for the struct values, arguments can be passed in registers instead.
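
A sketch of the contrast, assuming a three-double vector like the article's (signatures illustrative):

    // 24 bytes: classified MEMORY under SysV, so it's passed on the stack.
    struct Vector { double x, y, z; };

    double dot_struct(struct Vector a, struct Vector b);  // both spill to memory

    // Unpacked: all six doubles travel in XMM registers, no stores needed.
    double dot_flat(double ax, double ay, double az,
                    double bx, double by, double bz) {
      return ax * bx + ay * by + az * bz;
    }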

gpderetta
0 replies
1d7h

on the other hand, if the function is simple enough that the call overhead is significant, a force-inline attribute (e.g. __attribute__((always_inline)) or MSVC's __forceinline) is an option.

FeepingCreature
0 replies
1d5h

When I found this issue, it was in code that had twenty or thirty allocas to pass pointers to my byvals. Every function would start with a separate alloca for every parameter that was passed to a call. I always sort of assumed that LLVM would be good at cleaning that sort of thing up. It turns out... no, it isn't.

1letterunixname
0 replies
1d5h

This was a C99 example, not a C++ one. In many environments, tools cannot be changed arbitrarily because of minimum inertia. If C++ were allowed, then more options are available including moved arguments to reduce copying.

countWSS
7 replies
1d13h

It's common sense: anything passed in registers (preloaded due to speculative execution) will outperform stack writes, and stack wrangling will outperform heap-allocated stuff. That's why ugly spaghetti code with tons of globals runs blazingly fast while your elegant recursive functions and tuple/struct/list arguments are incredibly slow; the first one is much easier to optimize into tight assembly loops.

japanman185
2 replies
1d10h

But what about my academic mental exercise. You mean I learned all these tools and techniques for nothing? How will my manager know if I am a top engineer or not if my code is not impenetrable with ridiculous concepts applied irrationally.

johnnyanmac
0 replies
1d1h

Different scale, different problem. Those "top engineers" at Google are dealing with data that won't even fit on a hard drive (let alone in memory), so stuff like this may as well be pico-optimizations.

The logic makes sense for them. The problem is everyone else copying Google without understanding their own data. Any domain where real-time performance is important needs to put aside the SWE books and pick up a CPU/GPU architecture reference instead.

Culonavirus
0 replies
1d8h

Half of the money in IT right now is flowing from investors hearing "blahblah AI blahblah" and seeing dollar bills, which in turn makes half of the movement in IT happen in shitty python code. shrugs

eru
1 replies
1d10h

Of course, that assumes that your spaghetti code implements the same algorithm as your elegant code.

If the elegant code is O(n) and the spaghetti code is O(n^2) you might notice the difference.

And, of course, there's also maintenance to consider.

In some sense, compiler are there exactly to turn our elegant solutions into spaghetti code.

vitiral
0 replies
1d2h

I think that was implied

mort96
0 replies
1d4h

"Pass parameters in registers, not on the stack" is "common sense".

"Parameters larger than 16 bytes are always passed on the stack" ... isn't nearly as obvious.

gpderetta
0 replies
1d8h

Then again these days some CPUs can rename memory, making stack spills cheaper.

Global objects also hinder compiler optimizations.

smallstepforman
4 replies
1d10h

When I first transitioned to x64, I was so concerned about graphics vec3 objects (3×float) expanding to sizeof() == 16 bytes (instead of 12) that I benchmarked the hell out of my graphics engines. Unsurprisingly, using 16 bytes ended up being faster than 12, due to aligned 8-byte reads (internally and on the GPU). So a vec3 silently became a vec4 (even though a vec4 also exists).

As always, do global, not local benchmarking.
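
For reference, a sketch of the padding involved (these sizes hold on common ABIs):

    struct vec3 { float x, y, z; };               // sizeof == 12
    struct alignas(16) vec3a { float x, y, z; };  // 4 bytes of tail padding

    static_assert(sizeof(vec3) == 12);
    static_assert(sizeof(vec3a) == 16);  // the vec3 that silently became a vec4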

jstimpfle
1 replies
1d9h

Probably not a lot faster? And apart from what you're doing with this data, isn't this also going to depend a lot on the CPU? I can see that with 16 bytes, a lot of accesses could be 2x aligned 8 bytes (or 1x16) instead of 3x4 bytes. Some other accesses, maybe not so much. And there is also the issue of increased cache pressure.

connicpu
0 replies
1d

On the GPU especially, having the data aligned to 16 bytes is huge, just due to the nature of what a GPU is optimized for and the way memory access is designed. The speedup on x86_64 isn't as huge, but the aligned load instructions for SSE/AVX registers are still quite a bit faster than the unaligned ones. It can really add up when you're processing thousands of vectors every frame. The cache lines will only go so far when a single 3D affine transformation matrix (float 4x4) already takes up an entire cache line by itself.

maccard
0 replies
1d5h

It also has the very nice side effect of aligning with SSE sizes, so you can use _mm_load_ps directly, which makes the code neater and is super easy to vectorise.
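
A minimal sketch of that (x86 with SSE assumed):

    #include <immintrin.h>

    struct alignas(16) vec4 { float v[4]; };

    vec4 add(const vec4 &a, const vec4 &b) {
      vec4 r;
      // _mm_load_ps requires 16-byte alignment, which alignas(16) guarantees.
      _mm_store_ps(r.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
      return r;
    }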

gpderetta
0 replies
1d7h

x64 ABIs are also significantly better than x86 ones as well.

mcv
3 replies
1d4h

The article links to a presumably extremely specialised benchmark[0] with some surprising results. To me as a simple dev, at least:

Java (JIT) beats C++ and even Scala.

What is Julia HO and why is it so ridiculously fast?

The speed difference between Python and Pypy is massive. Is there any reason not to use Pypy? Shouldn't it be the standard?

[0] https://github.com/jinyus/related_post_gen/

unstruktured
0 replies
1d3h

pypy and cpython don't mix well with each other so you have to create a whole new universe of packages for pypy. Interop with c libraries also adds complexity to the transition.

In addition, vanilla python is finally starting to tackle its worst performance problems (like the GIL getting removed, for instance), so it might catch up to pypy eventually. There are also several different JITs installable into CPython these days, which is less painful than switching to pypy.

Personally, I am saddened pypy didn't replace standard python, but even in my own career I couldn't find a big speed difference in practice because all the heavy lifting utilizes C, C++, or fortran libraries wrapped by thin python apis.

kortex
0 replies
1d2h

Pypy doesn't work with all libraries (as well as certain function calls in std), and since that's one of the biggest draws of python, it makes it a non-starter much of the time.

enticeing
0 replies
1d3h

Looks like the HO means hand optimized, with special datastructures for this benchmark.

see: https://github.com/jinyus/related_post_gen/#user-content-fn-...

gumby
2 replies
1d7h

That means that arrays, in addition to start and end, also need a pointer to the base of the array object, where the reference count is stored.

Isn’t the size and layout of the structure known at compile time? Why couldn’t the ref count be found by subtraction rather than by being passed?

The standard way to implement this is to use address - 1 for such metadata, but any offset in a contiguous allocation works.

fanf2
1 replies
1d5h

The refcount is at slice.base[-1] and the array spans slice.start to slice.end
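
A sketch of that layout (illustrative; not the language's actual runtime code):

    #include <stdlib.h>

    // One allocation: the refcount sits immediately before the elements,
    // and base points at the payload, so the count is at ((long *)base)[-1].
    typedef struct {
      char *base;
      char *start;  // first element of this slice
      char *end;    // one past the last element
    } Slice;

    Slice slice_new(size_t n) {
      long *mem = (long *)malloc(sizeof(long) + n);  // error handling omitted
      mem[0] = 1;  // initial reference count
      char *payload = (char *)(mem + 1);
      Slice s = {payload, payload, payload + n};
      return s;
    }

    void slice_retain(Slice s) { ((long *)s.base)[-1]++; }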

gumby
0 replies
17h38m

So shouldn’t this be known at compile time? (I’m not familiar with the language in question)

DeathArrow
1 replies
1d7h

As a .NET developer I always pay attention to passing by value vs passing by reference.

sebazzz
0 replies
9h4m

There is extra complexity there depending on the call path. Given the premise of the article, you shouldn't pass structs larger than two object references, or larger than an Int64 (long). You already get to that point quickly. But in .NET there are also things like allocations to worry about.

tdiff
0 replies
17h8m

How do such issues look in a profiler? Is there a robust method to detect performance regressions due to argument sizes?

kragen
0 replies
1d12h

it's perfectly fine to pass and return structs bigger than 16 bytes by value on amd64 even with the sysv amd64 abi; it's just slow. but it's often worth it to make your code clearer. not in this case, of course, but as loeg points out, inside your own language you can just use a custom abi, as for example each c++ compiler, golang, ocaml, and sbcl do

jlarocco
0 replies
1d

The rule of thumb I've always heard for C++ is that anything non-primitive should be passed as a reference (or a pointer if you really have to) unless you have a good reason to pass by value (and normally you won't). Partly because of the ABI, partly to avoid copy or move constructors.

It's a tedious low level detail, but that's what you have to pay attention to in C++ if you want the best performance. And to be clear, it is just a performance optimization - the struct passing code will work fine, just not as fast.

flohofwoe
0 replies
1d9h

FWIW, on MSVC the cutoff size is 8 bytes before structs are passed on the stack. It's an ABI detail that shouldn't be relied upon in portable code.

...but also don't stress it too much for functions that are not called frequently - and for frequently called small functions like in the example, just make sure that the compiler can inline the code (for instance via LTO) this will unlock much more useful optimizations beyond just passing args in registers.

cmovq
0 replies
1d13h

On Windows with the default x64 calling convention, structs bigger than 8 bytes don't get passed in registers [1].

[1]: https://learn.microsoft.com/en-us/cpp/build/x64-calling-conv...

JoeAltmaier
0 replies
4h14m

Such a subtle issue! And I lay the blame on our obsession with abstraction.

Once we give up responsibility of the machine architecture to a compiler, we then have to chase issues like this. What gestures can I make to my compiler, to make it use better machine operations? It's maddening. Like a game of charades.

Sure abstractions gain us tremendously, are an invaluable aid to quickly conceiving of an application and creating something safe and understandable and even beautiful.

But we lost all hope of making it fast, or small, or 'efficient' in any physical way.

My favorite example from my lifetime: on the Dell Triton, our app programmers had a main loop that included a call using complex types. I disassembled the entry point and saw five constructors and destructors being called to coerce the types into something the API needed.

It consumed the entirety of the embedded processor hyper-thread. They were unconcerned, as it worked as designed.

I added a little casting in the call arguments, which reduced the CPU time to something like 15%.