
Don't pass structs bigger than 16 bytes on AMD64

pclmulqdq
43 replies
1d13h

The cost of argument passing is rarely well understood, and I'm glad someone wrote this. People routinely pass 24-byte objects by value in places like Google, and the cost of that practice doesn't show up in a profiler because it's spread across every function call.

elromulous
23 replies
1d13h

"in places like Google".

Are you speaking from experience?

As a former googler, I can say with certainty that the guideline for passing any non-primitive is pointer or ref.

string_view might be the only exception I can think of.

e____g
13 replies
1d13h

I saw std::function and std::string (e.g. TotW 117, https://abseil.io/tips/117) being passed by value a lot in newer google3 code. Both are larger than 16 bytes.

eco
7 replies
1d12h

This is done so you can use std::move to take ownership of the allocated memory in these objects rather than do a new allocation. Passing by value rather than by rvalue reference lets your function be more flexible at the call site: you can pass an rvalue/move or just make a copy at the call site, which means the caller (who actually knows whether a copy or a move is more appropriate) gets to control how the memory gets allocated.

An unnecessary memory allocation is much more of a performance hit than suboptimal calling convention.
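
A minimal sketch of that sink-argument pattern (names purely illustrative):

    #include <string>
    #include <utility>

    class Widget {
      std::string name_;

     public:
      // By-value "sink" parameter: the caller decides copy vs. move.
      void set_name(std::string name) { name_ = std::move(name); }
    };

    void demo() {
      Widget w;
      std::string s = "long enough to need a heap allocation, presumably";
      w.set_name(s);             // caller still needs s: one copy, made here
      w.set_name(std::move(s));  // caller is done with s: no allocation at all
    }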

jstimpfle
6 replies
1d9h

In that case, the optimal interface should take std::string&&, no? But it's awkward.

andersa
4 replies
1d9h

Wouldn't this be very annoying to work with, because now you have to explicitly move or copy the string whenever you want to construct one of these objects?

ismailmaj
1 replies
1d5h

The function could accept a universal reference instead of an rvalue reference; this avoids the dance the caller has to do to pass a copy.

IMO it's hard to beat pass by value considering both performance and cognitive load.

mort96
0 replies
1d4h

Yeah, making it accept a universal reference would fix it...

...but that requires the argument to be a type from a template D: so you'd have to write:

    template<typename String = std::string>
    void dosomething(String &&str);
...and that's not quite right either, since you'd want it to be either an rvalue reference or a const lvalue reference.
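
For what it's worth, a sketch of how the unconstrained forwarding-reference version behaves (same dosomething as above):

    #include <string>
    #include <utility>

    // Binds to lvalues (String deduced as std::string&) and rvalues
    // (String deduced as std::string) alike:
    template <typename String = std::string>
    void dosomething(String &&str) {
      std::string owned = std::forward<String>(str);  // copies lvalues, moves rvalues
      (void)owned;
    }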

secondcoming
0 replies
1d4h

Explicitly having to copy or move is a desired coding style IMO.

nitnelave
0 replies
1d8h

It's kinda what Rust forces you to do, except that std::move is implied. Anything taken by value is equivalent to taking by && unless the type is explicitly marked as Copy (i.e. it can be trivially copied and the copies are implicit).

But yeah, in a c++ codebase, good modern practices are often verbose and clunky.

wbkang
0 replies
2h12m

The vanilla by-value interface often does the right thing depending on the context (e.g., if an unnamed return value is passed in, it can be move-constructed; otherwise it's copied). I've seen some codebases provide an overload for a const ref alongside the plain by-value one.

jeffbee
1 replies
1d12h

Google builds non-PIE, non-PIC, static, profile-guided, link-time-optimized, and post-link-optimized binaries and probably DGAF about calling conventions.

pclmulqdq
0 replies
1d12h

I have seen the assembly output of Google code, and I will say that my previous comment still stands.

vitus
0 replies
1d12h

Passing std::function by value is almost definitely wrong these days, with absl::AnyInvocable (if you need to store the callable) and absl::FunctionRef (if you don't). Rough analogues in the standard are std::move_only_function (C++23) and std::function_ref (C++26).

std::string in the case you cited is only really relevant if you std::move() into it and you would otherwise incur a string copy. Yes, it's bigger than 16 bytes (24 bytes), but that pales in comparison to the alternative.

(Taking std::string&& would eliminate the possibility of misuse / accidental copies, but that pattern is generally discouraged by Google's style guide for various reasons.)

Also, just because you see a certain pattern an awful lot even at Google doesn't mean that it's best practice -- there are plenty of instances of protobufs being passed by value...
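
A rough sketch of that split, assuming Abseil is available (signatures illustrative):

    #include "absl/functional/any_invocable.h"
    #include "absl/functional/function_ref.h"

    // Non-owning view of a callable: cheap to pass by value, never stored.
    void for_each_row(absl::FunctionRef<void(int)> fn);

    // Owning, move-only wrapper: fine to std::move into a member and keep.
    class Scheduler {
     public:
      void enqueue(absl::AnyInvocable<void()> task);
    };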

leni536
0 replies
1d10h

Non-trivially-copyable types are never passed in registers, regardless of size.
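
A sketch of what that means in practice, assuming the Itanium C++ ABI (names illustrative):

    // Only 16 bytes, but the user-defined destructor makes it non-trivially
    // copyable, so it is passed by invisible reference anyway:
    struct Handle {
      void *ptr;
      long id;
      ~Handle();  // any non-trivial copy ctor, move ctor, or dtor triggers this
    };

    void take(Handle h);  // effectively compiled as take(Handle *h)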

jtasdlfj234
0 replies
1d1h

Please prove me wrong on these points. My current belief and understanding is that:

1. `foo(T)` is polymorphic over both `T::T(const T&)` and `T::T(T&&)`. This gives you the benefits of both pass-by-move (using `std::move` as needed) and pass-by-copy.

2. Usage of `foo(T&&)` signals a code smell or an anti-pattern; `foo(T)` should be used instead, unless it is perfect forwarding / a universal reference: `template <typename T> foo(T&&)`.

pclmulqdq
3 replies
1d12h

Yes. A lot of 24-byte structs at Google are passed by value.

japanman185
2 replies
1d10h

Indeed they are. I read about it on Hacker News.

pclmulqdq
0 replies
1d2h

I learned about it when I was at Google working on performance.

ductsurprise
0 replies
1d8h

0o...

dataflow
3 replies
1d10h

string_view might be the only exception I can think of.

What about std::span? std::optional? etc.

nemetroid
1 replies
1d8h

std::span should apply, but std::optional is at least as large as the type it wraps.

fyrn_
0 replies
1d7h

Almost always 8 + the size of the type, since alongside the type itself there's a one-byte bool plus 7 padding bytes.
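
A quick check of that claim (the "8 +" holds for 8-byte-aligned payloads; smaller ones pad less; none of this is guaranteed by the standard):

    #include <cstdint>
    #include <optional>

    static_assert(sizeof(std::optional<std::int64_t>) == 16);  // 8 + 1 bool + 7 padding
    static_assert(sizeof(std::optional<std::int32_t>) == 8);   // smaller alignment pads less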

vitus
0 replies
1d

View types like span, string_view, FunctionRef: absolutely pass by value. They're designed to be cheap to copy, and they provide type erasure which can make your API more flexible.

You also usually want to pass smart pointers (unique_ptr, shared_ptr) by value to avoid breaking your ownership model, or pass by raw pointer or reference to the held data. The obvious case that comes to mind where you'd want to write const unique_ptr<T>& is when iterating over a vector<unique_ptr<T>>. Otherwise, pass ptr.get() for an optional param, or *ptr for a required param where you've already checked ptr != nullptr.

Wrapper types like optional, variant, StatusOr? Usually pass by reference, unless the wrapped type is small (<= 8 bytes).

One common pattern where I see wide structs being passed by value is the use of option structs (https://abseil.io/tips/173) that are usually constructed via designated initializers. There is some marginal benefit to being able to std::move() out individual fields but it's not really a big deal either way, as said computation is usually only done once on program initialization.
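
A minimal sketch of that option-struct pattern (names illustrative, C++20 designated initializers assumed):

    #include <string>

    struct Connection {};

    struct ConnectOptions {
      int timeout_ms = 5000;
      bool retry = true;
      std::string proxy;  // pushes the struct well past 16 bytes
    };

    // Wide struct passed by value, but typically only once, at setup time:
    Connection Connect(const std::string &host, ConnectOptions opts) {
      return {};
    }

    Connection c = Connect("example.com", {.timeout_ms = 250, .retry = false});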

medler
0 replies
23h54m

Passing larger objects by value is allowed or even encouraged if it lets you avoid a copy (e.g. if you’re about to move the object)

Cloudef
8 replies
1d11h

Pass by value / pass by ref is quite a bit of mental overhead, as it effectively affects your ABI/API. Zig tries not to force this, so as long as you "pass by value", the compiler can decide to actually pass by reference. It does expose this kind of footgun though: https://github.com/ziglang/zig/issues/5973#issuecomment-1330...

gnulinux
5 replies
1d11h

Oof that's a very nasty bug. Is this still relevant in Zig or is there a workaround in the language? I'm not familiar with Zig, heard some good things about it, but this looks like a showstopper.

slimsag
2 replies
1d10h

still relevant today, but as someone who writes a lot of Zig code I haven't ever really encountered it in the wild. You definitely could, though, and it'd be wildly confusing.

Luckily, the Zig core team has recognized it is an issue and plan to address it before 1.0 :)

Cloudef
1 replies
1d10h

It's rare to hit it, but if you do, having it happen silently is not ideal for sure.

ForkMeOnTinder
0 replies
1d4h

I still think noalias-by-default is the way to fix this.

https://github.com/ziglang/zig/issues/1108

You get all the benefits of Zig being able to choose the function ABI, but if the optimization would have caused a bug, you'll get an immediate panic at the function entrypoint, instead of silently corrupted data.

avgcorrection
0 replies
1d2h

The issue is still Open.

MatthiasPortzel
0 replies
1d2h

It’s a bit confusing when compared with other programming languages. But the Zig docs’ introduction to structs are pretty clear that they can be pass-by-value or pass-by-reference at the compiler’s discretion. If you need a copy, make a copy; if you need a reference, pass a pointer; if you don’t care, the compiler will pick one.

cassepipe
1 replies
1d2h

I think this is the same in the still experimental Carbon

There is also another paradigm where parameters are marked as in, out, or inout, as in D (?) and cpp2, which makes the intent clear.

codethief
0 replies
14h33m

I think this is the same in the still experimental Carbon

Apparently, it's not (anymore?):

https://github.com/ziglang/zig/issues/5973#issuecomment-1801...

loeg
5 replies
1d13h

Does the C++ ABI also spill 24-byte objects to the stack for each call? I guess I don't expect std::string or std::function parameters to be fast but it's still surprising.

pclmulqdq
4 replies
1d12h

Spilling big objects up the call stack is often not nearly as bad. Most of the time this happens, the object being spilled is constructed during the function, so the copy on return is actually elided.

bugbuddy
3 replies
1d11h

That’s only true if the function takes no argument. Otherwise the copy on return is unavoidable.

pclmulqdq
1 replies
1d11h
gpderetta
0 replies
1d8h

I think parent is confused with NRVO typically not being triggered when returning a function argument.

kukkamario
0 replies
1d9h

In the AMD64 ABI, anything larger (that doesn't fit in registers) is returned via what is basically an output-pointer argument. The caller reserves space for the return value and gives the function a pointer, which the function then fills with the return value. Both sides are easy to optimize, and no unnecessary copies are made: the function can construct the return value directly at the given address, and the caller can pass a pointer straight to a variable when the return value is used to initialize it.
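
A sketch of that lowering (illustrative; the hidden parameter is what the ABI calls the sret pointer):

    struct Big { long a, b, c; };  // 24 bytes: too big for the return registers

    // Lowered by the compiler to roughly: void make_big(Big *result)
    Big make_big() {
      return Big{1, 2, 3};  // constructed directly into the caller's storage
    }

    void use() {
      Big b = make_big();  // b itself is the hidden 'result'; no extra copy
    }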

kccqzy
2 replies
23h26m

The tradeoff here is that if you pass a pointer to that 24-byte object instead, when you actually need to use the object you need to deref that pointer. But nothing guarantees that the pointed object is nearby! You could very well cause a cache miss and wait a long time (like 100 nanoseconds) to fetch that 24-byte object from main memory.

If that same object is directly passed it's just on the stack. So it's most likely in the cache.

Remnant44
1 replies
23h7m

Yes, but this concern is orthogonal to calling convention. If the parent function does anything to the object, it will be in cache already. If it does not, but loads the fields into registers, the parent will take the cache miss instead of the function. You neither gain nor lose anything...

kccqzy
0 replies
21h31m

It's usually not about a single parent function and a called function. It's usually a long chain of calls where functions keep passing them around, from the point of construction. Imagine some function first creates that 24-byte object, then passes it around by pointer: ten function calls later, dereferencing that pointer will need access to main memory. Now imagine that first function passes the object by value; ten function calls later the contents of the object may be duplicated on the stack ten times, but there's no cache miss.

It's a tradeoff between reducing memory usage and reducing cache misses.

PaulHoule
0 replies
1d4h

+1 for pointing out that profiling often fails entirely to find widely distributed overhead, like that built into calling conventions.

loeg
21 replies
1d13h

The issue here is the SysV amd64 ABI. You could also just make your language-internal ABI not be SysV? As long as these aren't exposed to SysV C callers, you can use any calling convention you want.

https://llvm.org/docs/LangRef.html#calling-conventions

For those curious, the relevant diff in neatlang is: https://github.com/Neat-Lang/neat/commit/f4ba38cefc1e26631a5.... It looks much more involved than changing the emitted LLVM calling conventions. Possibly the author wants these types exposed with some deterministic calling convention to C programs.

userbinator
19 replies
1d13h

Or ABIs in general, really.

As any Asm programmer can tell you, this is one of the low-hanging fruit where compilers can easily be beaten --- don't blindly follow convention; do what makes the most sense in a specific scenario.

5-
16 replies
1d8h

indeed. another fun thing that c-like compilers generally don't do is multiple entry points in a function.

consider a pair of functions foo and foo0 that differ only in that the former performs an additional action on its argument -- perhaps refcount adjustment or type conversion.

you can then do (in a register-based abi like amd64 sysv):

  foo:
  do the action ;fallthrough
  foo0:
  rest of the function follows
  
and have essentially two functions for the price of one.

zekica
7 replies
1d8h

Won't tail call optimization mostly mitigate this?

rsaxvc
3 replies
1d5h

Rarely if ever. This isn't about tail-calling, it's about function placement in the final image to enable branch elimination. There's no call between foo and foo0, but many of the requirements for a tail call are also required here.

You can structure the code so foo calls foo0, but the compiler and linker have to work together to pull that off, and I don't think GCC and clang do so.

If the functions are being built into split sections, generally no.

If they aren't, and foo isn't called but foo0 is, I've never seen a toolchain remove just foo but not foo0, which would be nice to have.

sim7c00
2 replies
1d4h

I think tail call optimization is specifically for this: it lets a function jump to another function rather than returning and then calling it. Isn't that what's essentially described here? (honest question - i am always doubting my sanity looking at this stuff :DD)

I can imagine that if the higher-level code doesn't fit a specific pattern, the compiler might struggle to recognize an opportunity for optimization and skip it - the higher-level programmer could potentially arrange code in ways the optimizers recognize better.

DSMan195276
1 replies
1d3h

it will let a function jump to another function ... isn't that what's essentially described here?

Not exactly, they're describing doing it without a jump - the first function simply ends at the start of the second function, so the CPU starts running the second function directly after the first with no jump necessary.

Edit: If you're saying a tail call could enable such an optimization, you're right, but it still requires placing the functions in the right spots to eliminate the jump entirely, which is hard.

ithkuil
0 replies
1d3h

Not sure how often it's worth the effort though. Unconditional branches while not free are not quite the same performance trap as conditional branches

amalcon
1 replies
1d3h

Inlining is what can mostly mitigate this. You'd write:

  void foo0();
  void foo() {
    // do something
    return foo0();
  }
  void foo0() {
    // do the rest of the things
  }
If you can convince the compiler to inline foo0 into foo, then you get almost what you want. The compiler technically could even use the same code for both (saving some binary size and thus RAM, thus getting the exact same result), though AFAIK this sort of optimization is unusual.

kccqzy
0 replies
23h30m

Not inlining, but what I've seen in the real world is that the function foo ends with something like a "jmp foo0" so everything is good. (This jmp is almost free with 100% branch prediction.) No need to inline. Just do a proper tail call optimization. Without symbols you can't tell whether they are two functions or just basic blocks in a single function.

pertymcpert
0 replies
1d7h

Yeah it should.

vitiral
2 replies
1d2h

Even more fun is using jmp to chain functions together -- only works when you don't have locals and have complete knowledge of your registers!

dreamcompiler
1 replies
12h31m

CALL and RET are just JMPs that automatically keep track of return locations using a stack. But...

--You don't have to use a stack. A stack is a convenient data structure for this purpose, but there are other ways to do it.

--You don't even need to keep track of return points at all. That's what continuations are about: Nothing ever returns.

sitkack
0 replies
7h28m
bee_rider
2 replies
1d3h

Hah, that’s neat.

I was at a place that had a convention to do

    Function blah(…)
    Argument validation
    Call realBlah(…)
That way internally we could just call realBlah if we needed the functionality of blah without the validation. So I guess we could have used this all over the place.

I guess (although could be wrong, this is the first time I’ve seen it) this strategy is basically incompatible with inlining, though?

kitkat_new
1 replies
1d1h

why would it?

recursive inlining

bee_rider
0 replies
21h43m

There’s apparently some deep and obvious connection between recursion and multiple entry points, but I don’t know what it is.

mananaysiempre
1 replies
1d7h

Fun fact: Fortran has an ENTRY statement specifically for this purpose (obsoleted in the 2008 version).

pklausler
0 replies
1d3h

This is also why “entry” was a reserved word in the original C language. (So was “fortran”.)

PaulHoule
1 replies
1d4h

Yeah, for the little programs I write on AVR-8, it drives me nuts how much meaningless activity (moving the stack pointer around) the C compiler does compared to assembly. For a PoV display engine, for instance, you might be able to reserve a few registers for the interrupt handler and still keep your inner-loop variables entirely in registers.

ajb
0 replies
23h1m

Link time optimisation is supposed to fix that (the meaningless activity). Especially on a bare metal build where you can be certain what needs to call your function. But it wouldn't surprise me if it didn't always.

Unlikely to be able to reserve a register though

sim7c00
0 replies
1d4h

first question that came to my mind you answered :) thanks! i think it's interesting that a lot of stuff adheres to such ABIs, especially since they were conceived quite a while ago and often lean towards compatibility with older CPUs; newer ones with more and wider registers might have features that could improve this without making the structs smaller. i guess making software for specific hardware classes / generations isn't super attractive since it'll be unusable on some machines, but having compilers that _can_ produce it might be cool if you want to super-optimise the code running on your system towards your system's hardware features.

gavinhoward
18 replies
1d13h

This makes me super happy that my C array types are only 16 bytes.

saagarjha
9 replies
1d8h

Where do you put the capacity?

gavinhoward
4 replies
1d2h

My array types are not meant to be changed, so no capacity is needed. I have separate resizable ones.

tialaramex
3 replies
1d

When you say "array types" it seems like people expected you to mean a growable array (C++ std::vector, Rust's Vec, or say Java's ArrayList).

But if you meant an actual array as in Rust's [T; N] then it's weird to talk about them as if they've got some specific size, their size is a parameter (N) of the type.

The size of Rust's [u8; 16] or C's unsigned char [16] is 16 bytes but like, duh. And there's no magic here, [u8; 24] or unsigned char [24] is 24 bytes.

gavinhoward
2 replies
1d

In my code, my array types are a pointer and a length, regardless of the length of the array.

This is so I can use regular C pointers to pass them to system functions that expect pointers. Because C still uses just pointers. But I have the length for bounds checking in my own code.

But my resizable types are much bigger. Probably 32 bytes because they store a destructor too. I pass them by pointer because plain pointers are almost always just one item, meaning no bounds checking is necessary.

Yes, that does mean the actual array is two indirections away, but that style gives me a lot of safety because array indexing is a code smell.
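
A guess at the shape of such a type, as a minimal sketch (names hypothetical, not the actual code):

    #include <assert.h>
    #include <stddef.h>

    // A 16-byte, non-resizable "fat pointer": fits in two registers under SysV.
    typedef struct {
      unsigned char *ptr;
      size_t len;
    } Slice;

    unsigned char slice_get(Slice s, size_t i) {
      assert(i < s.len);  // bounds check in our own code...
      return s.ptr[i];    // ...while s.ptr can still go straight to C APIs
    }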

tialaramex
1 replies
23h23m

This seems like a lot of lost optimization opportunities compared to a language which gets this right out of the box.

Also, in most cases destructors don't have associated local state, so they needn't take up space in each object. All C objects have non-zero size, so an object representing the destructor takes up space even if it has no state. In C++ there's a hack to avoid paying this price, but in C there is not.

gavinhoward
0 replies
22h21m

This seems like a lot of lost optimization opportunities compared to a language which gets this right out of the box.

Well, yeah, but I hate all other languages [1], and I'm willing to pay the price in C.

That same sort of thing allowed me to implement RAII in C, though.

[1]: https://gavinhoward.com/2023/02/why-i-use-c-when-i-believe-i...

alcover
2 replies
1d4h

  struct Array{void *ptr; uint64_t cap;};
  sizeof(struct Array) // 16

norir
1 replies
1d2h

I suspect the OP expected something like:

  struct Array{void *ptr; uint64_t len; uint64_t cap;};
  sizeof(struct Array) // 24
At any rate, a simple way to get to 16 bytes is the following:

  struct Array{void *ptr; uint32_t len; uint32_t cap;};
How often does one really need a general array with more than 4B elements?

alcover
0 replies
18h57m

You are right. My reply was a bit silly.

Now thanks to alignment you could use spare bits in the pointer to afford bigger lengths.

  struct Array{int64_t ptr:60; uint64_t len:34, cap:34;};

sesuximo
0 replies
10h32m

1) Maybe you could store the size and capacity bitwise-or'd on top of each other, if you restrict capacity to a power of two.

2) you could store the capacity before the array payload
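
A minimal C-style sketch of the second suggestion (illustrative; header-before-payload):

    #include <stdlib.h>

    // The capacity lives in a header just before the payload, so the
    // user-visible handle stays a single bare pointer.
    void *array_alloc(size_t cap) {
      size_t *base = (size_t *)malloc(sizeof(size_t) + cap);
      if (!base) return NULL;
      *base = cap;
      return base + 1;  // caller sees only the payload
    }

    size_t array_cap(void *payload) { return ((size_t *)payload)[-1]; }

    void array_free(void *payload) { free((size_t *)payload - 1); }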

chlorion
3 replies
22h45m

I'm a little bit confused on what this means.

By array I guess you mean a dynamically sized, but not dynamically-resizable pointer and size/length pair? So you aren't storing the "capacity" like Rust's Vec would be doing.

You can do the same thing in Rust with Box<[T]> if I am understanding you correctly. Box<[T]> in this specific case is a fat pointer, which is a pointer to the underlying allocation and a length.

One of the issues with Box though, at least as of the last time I looked at it, is that the only way you could create a Box<[T]> without unsafe or nightly was to create a Vec, and then call into_boxed_slice on it. The conversion from Vec to Box actually causes a new allocation to be created if the size and capacity fields in the Vec are not equal. In C it would be possible to reuse the Vec's buffer, but dealloc (and all other alloc-related functions, it seems) in Rust requires passing in information about the layout of the underlying allocation, and size is part of the layout!

In C++ I guess you would want to use std::unique_ptr<T[]>, and manually store the length, which is not great but still works. I'm not sure if unique_ptr is guaranteed to be the size of a pointer or not, so this may or may not work.
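
A minimal sketch of that C++ shape (illustrative, with hand-rolled bounds checking):

    #include <cstddef>
    #include <memory>
    #include <stdexcept>

    // Fixed-size owning slice: unique_ptr<T[]> plus a manually stored length.
    template <typename T>
    class FixedArray {
      std::unique_ptr<T[]> data_;
      std::size_t len_;

     public:
      explicit FixedArray(std::size_t n)
          : data_(std::make_unique<T[]>(n)), len_(n) {}
      std::size_t size() const { return len_; }
      T &at(std::size_t i) {
        if (i >= len_) throw std::out_of_range("FixedArray");
        return data_[i];
      }
    };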

Regular statically-sized arrays are the same size in each language ofc. In Rust and C++ statically-sized arrays do have one benefit though, you get bounds checking on them "for free". Since the length is baked into the type, the accessor methods know the bounds at compile time, so no runtime length information is required to do runtime bounds checking! You can do this in C by hand for each array you define but that is impractical, you could probably use a macro to do this though.

If I did understand what you meant, I think this brings up an interesting topic. Having a first-class dynamically-sized, non-resizable array type can be pretty useful! Rust can already do this awkwardly with Box, but C++ currently doesn't have a way to do this without manually storing the length and doing manual bounds checking that I know of. It would not be terribly difficult to implement, and there probably are libraries that exist for it, but I still think it's interesting.

# comment for Vec::into_boxed_slice https://doc.rust-lang.org/src/alloc/vec/mod.rs.html#1075

# rust std::alloc::dealloc requires std::alloc::Layout https://doc.rust-lang.org/std/alloc/fn.dealloc.html

gavinhoward
2 replies
22h22m

I meant static arrays.

I have to use a pointer+length pair because many system functions in C require a pointer, but I want bounds for bounds checking.

Yes, I do my own arrays in C.

I also have dynamic arrays, but those are not 16 bytes, and I treat them differently.

chlorion
1 replies
20h56m

Do you have your own array library that you use in your projects? If so that's pretty cool and it would be interesting to see.

Edit: i looked at your profile information and have discovered lots of fun stuff

gavinhoward
0 replies
20h32m

Thank you!

As you have probably discovered, my array stuff is just part of a monorepo. But yes. :)

Edit: The code at the Yzena one is not up to date because I've had to keep commits local. But I have 1200+ commits since, and I plan to make them public in April.

flohofwoe
1 replies
1d8h

This will still spill on the stack when compiling with MSVC (the cutoff size is 8 bytes there).

gavinhoward
0 replies
1d2h

Oh, thank you for letting me know.

avgcorrection
1 replies
1d1h

Yeah, it looks like this would be a problem for Rust's `String` (owned), though not `&str` (borrowed).

tialaramex
0 replies
1d1h

Inside Rust it doesn't matter, because Rust gets to pick its own ABI rules.

So this only matters for the FFI case and it's probably usually a bad idea to give a mutable String (as opposed to a read-only &str) to foreign language code.

Likewise for Vec<T> and &[T]. Indeed, underneath, Rust's String is literally Vec<u8> and &str is literally &[u8], but in both cases with the explicit requirement that the bytes are valid UTF-8 text.

jeffbee
10 replies
1d12h

I classify this kind of article as "just enough knowledge to be a pain in the ass". Even if you compile it separately per the instructions, trying to force the compiler to generate ABI-callable functions, you can still undo this error with LTO. Building this program with LTO is dramatically faster in both modes compared to either mode of the non-LTO program.

If your program is performance-sensitive, profile it, peak-optimize it, and only then commit jackassery such as unpacking structs into arguments.

rwmj
6 replies
1d8h

Plus the guy is talking about the supreme importance of performance, but is using reference counting.

FeepingCreature
5 replies
1d5h

(Author) The refcounting version is about as slow as the garbage collected D version on the benchmark. Refcounting operations can often be elided if you optimize a bit.

And this is not a "supreme importance of performance" microoptimization. This single change moved me from place 23 to place 5. That's a 2x speedup!

rwmj
2 replies
1d4h

You won't see the problem with refcounting in microbenchmarks. It manifests when you have multiple threads accessing single pieces of data, or are under memory pressure. Refcounting turns reads into writes, unless steps are taken to avoid that, and/or you store the refcounts separately from the data.

FeepingCreature
1 replies
1d3h

Multiple threads accessing single pieces of data is always slow. The standard answer is "don't do that"; you can do that just as well with GC or manual management.

rwmj
0 replies
9h33m

Confirming you have no idea what you're talking about.

jeffbee
1 replies
1d4h

It only changes your rank in this benchmark because the benchmark fails to use whole program optimization, ie as a benchmark it is meaningless.

FeepingCreature
0 replies
1d3h

WHOPR doesn't fix the issue, because runtime calls (which happen a lot with arrays) shouldn't be inlined; all that achieves is blowing out your instruction cache.

vlovich123
2 replies
1d11h

That would be good advice, except I've yet to see a toolchain that can make this kind of cost visible. It's distributed throughout your codebase, and I've not seen any profiler that can show that impact unless you're lucky and it becomes a hotspot. This applies to nearly all compiler-generated code. Valgrind can measure it (a sampling profiler probably can't), but there's no tooling to highlight diffuse code-gen problems.

bugbuddy
1 replies
1d11h

This is clearly a hot path, and a non-inlined call on a hot path is probably the first thing any engineer worth his salt would attempt to fix. If and how it can be fixed is a different question.

vlovich123
0 replies
1d10h

Just because something is in the hot path does not mean a profiler will be able to highlight this as a hotspot.

bugbuddy
10 replies
1d12h

In the provided example, you can fix it without affecting any caller by changing the parameter type from "struct Vector" to pass-by-reference, "const struct Vector &". A lot of C++ code that I have seen with pointer bugs needlessly used pointers where passing by reference would have worked while being easier and safer to use.

giovannibajo1
5 replies
1d12h

That will still require the compiler to serialize the three registers to the stack, to be able to pass the pointer to the structure to the callee. It seems like the described benefit is avoiding any serialization from registers to stack, which cannot be avoided with pass-by-reference.

fsckboy
2 replies
1d9h

That will still require the compiler to serialize the three registers to the stack, to be able to pass the pointer to the structure to the callee.

why can't it simply pass a pointer to the struct (it's probably already on the stack) without rewriting the struct to the stack? isn't that what a reference is?

kukkamario
1 replies
1d9h

But it often isn't in the stack. This is a vector type so it is often modified and used as part of math operations. Each vector field is probably in some register because it was used to calculate something and then has to be stored back to stack to get valid data for the reference.

fsckboy
0 replies
23h3m

if the current values of the source-code struct fields are in registers, there are two options: either the calling function and struct are so small and so compiler-optimized that there is no memory allocation for the struct at all, or there is an allocation and it's just dirty and not updated. Which means: update it and call, or spill and call.

You want to call a function that is not expecting its arguments to be in registers, and you don't have unlimited registers on this hardware at this time, so I don't understand all the hand-wringing about either option. I guess what I'm saying is that this is all being treated like "because we assume optimization and we know how optimization works, we're entitled to have what's important in registers all the time so things will go faster, so this must be a bug and we have to fix it."

The actual solution is to inline the callee and rely on the compiler, switch to asm and guarantee it by hand, or create a new language with register-calling or data-flow semantics different from what you have now. The conversation taking place here sounds to me like relying on undefined behavior: something we used to do because we knew we could rely on it, but you can't any more.

bugbuddy
1 replies
1d12h

But that would only be possible if you write out all the struct fields you are accessing into function parameters. If the struct is complex with a lot of fields, then you would end up with a messy function signature. Also, I am sure there is a limit to how many parameters you can have before this optimization stops working.

loeg
0 replies
1d10h

But that would only be possible if you write out all the struct fields you are accessing into function parameters. If the struct is complex with a lot of fields, then you would end up with a messy function signature.

This is the approach the article's author took.

kukkamario
1 replies
1d9h

No. Actually that is the whole problem here. That is pretty much exactly what the compiler does thanks to ABI. ABI says that the value must be passed by pointer so it has to store it somewhere to get the pointer, which is exactly the same that would happen if we made that explicit by using const-ref. By changing to use separate arguments for the struct values, arguments can be passed in registers instead.
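
A sketch of the contrast, assuming a three-double vector like the article's (signatures illustrative):

    // 24 bytes: classified MEMORY under SysV, so it's passed on the stack.
    struct Vector { double x, y, z; };

    double dot_struct(struct Vector a, struct Vector b);  // both spill to memory

    // Unpacked: all six doubles travel in XMM registers, no stores needed.
    double dot_flat(double ax, double ay, double az,
                    double bx, double by, double bz) {
      return ax * bx + ay * by + az * bz;
    }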

gpderetta
0 replies
1d7h

on the other hand, if the function is simple enough that the call overhead is significant, a force-inline attribute (e.g. __attribute__((always_inline)) or MSVC's __forceinline) is an option.

FeepingCreature
0 replies
1d5h

When I found this issue, it was in code that had twenty or thirty allocas to pass pointers to my byvals. Every function would start with a separate alloca for every parameter that was passed to a call. I always sort of assumed that LLVM would be good at cleaning that sort of thing up. It turns out... no, it isn't.

1letterunixname
0 replies
1d5h

This was a C99 example, not a C++ one. In many environments, tools cannot be changed arbitrarily because of minimum inertia. If C++ were allowed, then more options are available including moved arguments to reduce copying.

countWSS
7 replies
1d13h

It's common sense: anything passed in registers (preloaded due to speculative execution) will outperform stack writes, and stack wrangling will outperform heap-allocated stuff. That's why ugly spaghetti code with tons of globals runs blazingly fast while your elegant recursive functions and tuple/struct/list arguments are incredibly slow; the first one is much easier to optimize into tight assembly loops.

japanman185
2 replies
1d10h

But what about my academic mental exercise. You mean I learned all these tools and techniques for nothing? How will my manager know if I am a top engineer or not if my code is not impenetrable with ridiculous concepts applied irrationally.

johnnyanmac
0 replies
1d1h

Different scale, different problem. Those "top engineers" at Google are dealing with data that won't even fit on a hard drive (let alone in memory), so stuff like this may as well be pico-optimizations.

The logic makes sense for them. The problem is everyone else copying Google without understanding their own data. Any domain where real-time performance is important needs to put aside the SWE books and pick up a CPU/GPU architecture reference instead.

Culonavirus
0 replies
1d8h

Half of the money in IT right now is flowing from investors hearing "blahblah AI blahblah" and seeing dollar bills, which in turn makes half of the movement in IT happen in shitty python code. shrugs

eru
1 replies
1d10h

Of course, that assumes that your spaghetti code implements the same algorithm as your elegant code.

If the elegant code is O(n) and the spaghetti code is O(n^2) you might notice the difference.

And, of course, there's also maintenance to consider.

In some sense, compiler are there exactly to turn our elegant solutions into spaghetti code.

vitiral
0 replies
1d2h

I think that was implied

mort96
0 replies
1d4h

"Pass parameters in registers, not on the stack" is "common sense".

"Parameters larger than 16 bytes are always passed on the stack" ... isn't nearly as obvious.

gpderetta
0 replies
1d8h

Then again these days some CPUs can rename memory, making stack spills cheaper.

Global objects also hinder compiler optimizations.

smallstepforman
4 replies
1d10h

When I first transitioned to x64, I was so concerned about graphics vec3 objects (3×float) expanding to sizeof() == 16 bytes (instead of 12) that I benchmarked the hell out of my graphics engines. Unsurprisingly, using 16 bytes ended up being faster than 12, due to aligned 8-byte reads (internally and on the GPU). So a vec3 silently became a vec4 (even though a vec4 also exists).

As always, do global, not local benchmarking.
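
For reference, a sketch of the padding involved (these sizes hold on common ABIs):

    struct vec3 { float x, y, z; };               // sizeof == 12
    struct alignas(16) vec3a { float x, y, z; };  // 4 bytes of tail padding

    static_assert(sizeof(vec3) == 12);
    static_assert(sizeof(vec3a) == 16);  // the vec3 that silently became a vec4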

jstimpfle
1 replies
1d9h

Probably not a lot faster? And apart from what you're doing with this data, isn't this also going to depend a lot on the CPU? I can see that with 16 bytes, a lot of accesses could be 2x aligned 8 bytes (or 1x16) instead of 3x4 bytes. Some other accesses, maybe not so much. And there is also the issue of increased cache pressure.

connicpu
0 replies
1d

On the GPU especially, having the data aligned to 16 bytes is huge, just due to the nature of what a GPU is optimized for and the way memory access is designed. The speedup on x86_64 isn't as huge, but the aligned load instructions for SSE/AVX registers are still quite a bit faster than the unaligned ones. It can really add up when you're processing thousands of vectors every frame. The cache lines will only go so far when a single 3D affine transformation matrix (float 4x4) already takes up an entire cache line by itself.

maccard
0 replies
1d5h

It also has the very nice side effect of aligning with SSE sizes, so you can use _mm_load_ps directly, which makes the code neater and is super easy to vectorise.
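
A minimal sketch of that (x86 with SSE assumed):

    #include <immintrin.h>

    struct alignas(16) vec4 { float v[4]; };

    vec4 add(const vec4 &a, const vec4 &b) {
      vec4 r;
      // _mm_load_ps requires 16-byte alignment, which alignas(16) guarantees.
      _mm_store_ps(r.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
      return r;
    }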

gpderetta
0 replies
1d7h

x64 ABIs are also significantly better than x86 ones as well.

mcv
3 replies
1d4h

The article links to a presumably extremely specialised benchmark[0] with some surprising results. To me as a simple dev, at least:

Java (JIT) beats C++ and even Scala.

What is Julia HO and why is it so ridiculously fast?

The speed difference between Python and Pypy is massive. Is there any reason not to use Pypy? Shouldn't it be the standard?

[0] https://github.com/jinyus/related_post_gen/

unstruktured
0 replies
1d3h

pypy and cpython don't mix well with each other so you have to create a whole new universe of packages for pypy. Interop with c libraries also adds complexity to the transition.

In addition, vanilla python is finally starting to tackle its worst performance problems (like the GIL getting removed, for instance), so it might catch up to pypy eventually. There are also several different JITs installable into CPython these days, which is less painful than switching to pypy.

Personally, I am saddened pypy didn't replace standard python, but even in my own career I couldn't find a big speed difference in practice because all the heavy lifting utilizes C, C++, or fortran libraries wrapped by thin python apis.

kortex
0 replies
1d2h

Pypy doesn't work with all libraries (as well as certain function calls in std), and since that's one of the biggest draws of python, it makes it a non-starter much of the time.

enticeing
0 replies
1d3h

Looks like the HO means hand optimized, with special datastructures for this benchmark.

see: https://github.com/jinyus/related_post_gen/#user-content-fn-...

gumby
2 replies
1d7h

That means that arrays, in addition to start and end, also need a pointer to the base of the array object, where the reference count is stored.

Isn’t the size and layout of the structure known at compile time? Why couldn’t the ref count be found by subtraction rather than by being passed?

The standard way to implement this is to use address - 1 for such metadata, but any offset in a contiguous allocation works.

fanf2
1 replies
1d5h

The refcount is at slice.base[-1] and the array spans slice.start to slice.end
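
A sketch of that layout (illustrative; not the language's actual runtime code):

    #include <stdlib.h>

    // One allocation: the refcount sits immediately before the elements,
    // and base points at the payload, so the count is at ((long *)base)[-1].
    typedef struct {
      char *base;
      char *start;  // first element of this slice
      char *end;    // one past the last element
    } Slice;

    Slice slice_new(size_t n) {
      long *mem = (long *)malloc(sizeof(long) + n);  // error handling omitted
      mem[0] = 1;  // initial reference count
      char *payload = (char *)(mem + 1);
      Slice s = {payload, payload, payload + n};
      return s;
    }

    void slice_retain(Slice s) { ((long *)s.base)[-1]++; }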

gumby
0 replies
17h38m

So shouldn’t this be known at compile time? (I’m not familiar with the language in question)

DeathArrow
1 replies
1d7h

As a .NET developer I always pay attention to passing by value vs passing by reference.

sebazzz
0 replies
9h4m

There is extra complexity there depending on the call path. Given the premise of the article, you shouldn't pass structs larger than two object references, or larger than an Int64 (long). You already get to that point quickly. But in .NET there are also things like allocations to worry about.

tdiff
0 replies
17h8m

How do such issues look in a profiler? Is there a robust method to detect performance regressions due to argument sizes?

kragen
0 replies
1d12h

it's perfectly fine to pass and return structs bigger than 16 bytes by value on amd64 even with the sysv amd64 abi; it's just slow. but it's often worth it to make your code clearer. not in this case, of course, but as loeg points out, inside your own language you can just use a custom abi, as for example each c++ compiler, golang, ocaml, and sbcl do

jlarocco
0 replies
1d

The rule of thumb I've always heard for C++ is that anything non-primitive should be passed as a reference (or a pointer if you really have to) unless you have a good reason to pass by value (and normally you won't). Partly because of the ABI, partly to avoid copy or move constructors.

It's a tedious low level detail, but that's what you have to pay attention to in C++ if you want the best performance. And to be clear, it is just a performance optimization - the struct passing code will work fine, just not as fast.

flohofwoe
0 replies
1d9h

FWIW, on MSVC the cutoff size is 8 bytes before structs are passed on the stack. It's an ABI detail that shouldn't be relied upon in portable code.

...but also don't stress it too much for functions that are not called frequently - and for frequently called small functions like in the example, just make sure that the compiler can inline the code (for instance via LTO) this will unlock much more useful optimizations beyond just passing args in registers.

cmovq
0 replies
1d13h

On Windows with the default x64 calling convention, structs bigger than 8 bytes don't get passed in registers [1].

[1]: https://learn.microsoft.com/en-us/cpp/build/x64-calling-conv...

JoeAltmaier
0 replies
4h14m

Such a subtle issue! And I lay the blame on our obsession with abstraction.

Once we give up responsibility of the machine architecture to a compiler, we then have to chase issues like this. What gestures can I make to my compiler, to make it use better machine operations? It's maddening. Like a game of charades.

Sure abstractions gain us tremendously, are an invaluable aid to quickly conceiving of an application and creating something safe and understandable and even beautiful.

But we lost all hope of making it fast, or small, or 'efficient' in any physical way.

My favorite example from my lifetime: on the Dell Triton, our app programmers had a main loop that included a call using complex types. I disassembled the entry point and saw five constructors and destructors being called to coerce the types into something the API needed.

It consumed the entirety of the embedded processor hyper-thread. They were unconcerned, as it worked as designed.

I added a little casting in the call arguments, which reduced the CPU time to something like 15%.