Maybe the example is too simple, but it does not require `__attribute__((musttail))` for good code gen.
Also if the error handling function is unlikely, you wouldn't care too much about how fast it is to call it?
To me it seems like a structure of the form
  <read data>
  if (unlikely(malformed))
      return error();
  <prep more stuff>
  switch (data_type) {
  case x:
      return handle_x();
  case y:
      return handle_y();
  }
generates a nice jump table quite reliably.
Obviously compilers have been doing tail call elimination forever, but for this trick "generates [tail calls] quite reliably" is not enough: you have to GUARANTEE it (or fail compilation), otherwise this structure does not work (it will blow the stack immediately). That's the point of [[musttail]]: tail call elimination is required, it's the only choice the compiler has.
That means that any such code is not portable across compilers anymore. It is effectively written in a non-standard C dialect, because it requires a language extension to work correctly.
There's no such thing as "standard C" that you can actually write, due to UB and implementation-defined behaviour. There's just (C, compiler version, platform) that defines (if only through the compiler's source code) what will actually happen in any given situation.
so because there are implementation defined behaviors in the standard, language extensions become okay?
Language extensions are a feature, not a bug. They allow C to evolve and C compilers to compete without requiring committee consensus. Good extensions will eventually be picked up by other compilers, and maybe even find their way into the standard.
sure, i get the joke, i just don't like it. it's the same story as browsers. proprietary extensions in the name of progress because technically it's allowed, but also unimplemented standardized features galore, necessitating polyfill libraries and frequent checking of support matrices.
it's just sprawl with popular support.
It is very different. A web browser is an (almost) fully self-contained environment. Same with something like Java. On the other hand, a standard C or C++ compiler+runtime has (by design) very few features. Anything beyond trivial programs has to reach for platform-specific features. Even POSIX is often not enough.
not sure i'm following? i was referring exactly to those "few" language features, e.g.: https://en.cppreference.com/w/cpp/compiler_support
I love it when we discuss C or C++, "Language extensions are a feature, not a bug", but then when we discuss other languages that try to do without C by adding extensions to their language reference, the speech turns into "Language extensions are a weakness, not a feature".
Additionally, "Language extensions are a feature, not a bug" seems only to be valid in the context of C and C++ IF the compilers being discussed are GCC or Clang, because God forbid a commercial C or C++ compiler have language extensions.
Speaking in general about the way these subjects are discussed online, not you in particular.
It is entirely possible to write strictly conforming C.
Yes. But the alternative is assembly language, which is even less portable.
The portable alternative is being explicit with your loops, possibly in combination with gigantic unwieldy switch statements or some regular goto (it is still part of the standard). But that comes at the cost of readability and sometimes performance. Whether it's practical depends on the use case. For something like recursive data structure algorithms, which are relatively small and self-contained, I would say it's perfectly doable. Simple interpreters - maybe. Complex interpreters - here it becomes messy.
A switch-case is the default way to write an interpreter and I'd even argue it's the most readable.
In the context of this article, it's all about the performance. Switch-case generates suboptimal code for the commonly used fast paths of the protobuf parser, because the mere existence of the slow paths is enough to interfere with the performance of the code around them.
See Mike Pall’s posts on the subject—the performance cost is considerable, for two reasons. First, you’re forcing the compiler to do register allocation for the whole interpreter at once, which it can virtually never do a good job of. (This is actually the more important part.)
Second, given the existence of branch target buffers (and the ruinous cost of mispredicted branches), you really want the instruction dispatch to be a single indirect branch at the end of each instruction implementation, and for that standard tools are somewhere between unhelpful (you can write a macro containing switch (*insn++) { case INSN_FOO: goto impl_foo; /* ... */ } but it’s anybody’s guess whether you’re getting a single jump table for all copies of that) and actively obstructive (“tail merging” in older versions of Clang would actively destroy any attempts at copying dispatch code). Granted, sometimes things work out (new Clang versions can sometimes go “looks like you’re writing an interpreter” and turn a vanilla switch in a loop into duplicated dispatch code). Then again, sometimes they don’t, and you can’t actually know.
Yeah, that's the way virtual machines have been written forever: some version of a big switch inside a loop.
This is how VMs have been written since the dawn of time (or using computed gotos, another non-standard addition to C). It has problems though, like the fact that the `switch` branch is extremely unpredictable, and that you get a massive function which is hard to optimize. This [[musttail]] trick is a huge improvement. But yeah, if you have to support compilers that don't have [[musttail]], you essentially need two implementations: the [[musttail]] one and the loop/switch one.

Assembly is the most portable of them all, even across platforms! (x86 to ARM compilation meme)
The typical way to deal with this is to put the __attribute__ into a C macro which expands to nothing on compilers that don't understand GCC/Clang's __attribute__ keyword. The code without the attribute will still compile, and the compiler will most likely still apply the tail call optimization; you just don't get an error if it can't.
Also TBF, hardly any real-world C code is strictly standard compliant. Many C compilers just agree on a common syntax that includes both the C standard and some popular non-standard extensions.
PS: C++ compilers actually ignore unknown attributes since the `[[attribute]]` syntax has been standardized in C++11. In GCC and Clang you'll get a warning in the standard warning set, but not in MSVC.
PPS: C23 also standardized the `[[attribute]]` syntax and also added a way to check for supported attributes:
https://en.cppreference.com/w/c/language/attributes
It will compile, and eventually blow up nicely with a stack overflow OS fault.
Ah, the joys of writing "portable" C code with #ifdef spaghetti, across commercial UNIXes and their own C compilers, 20 years ago.
It only got better because many people, just as they assume Web == Chrome, assume C == GCC, blissfully ignoring everything else.
Nowadays clang is also considered, mostly because a couple of companies wanted to replace GCC, and naturally clang needs to be able to match whatever GCC offers.
Well, if you use recursive code, you better know what you're doing. With or without tail call optimization.
Loops are just a special, limited case of recursion.
(And only necessary in languages that have trouble implementing function calls properly.)
Not at all guaranteed. Stack overflow is undefined behaviour, which means compilers can optimise your program on the assumption that it doesn’t happen.
Ah true, yet another case that one tends to forget about.
The article is pretty clear about this. When it comes to fast lexing and parsing, it is typical for projects to make portability tradeoffs in favor of performance. For example, simdjson is full of assembly.
You say it like it's a bad thing.
Fwiw, many C projects are written in a non-standard C dialect, including the Linux kernel.
Yes, that is correct. You cannot do this trick in standard C, C++ or Rust, it requires some version of [[musttail]]. Strong argument for adding it to the C standard, IMHO.
"you have to GUARANTEE it (or fail compilation)"
I've often pondered the utility of similar flags for other optimizations. This is perhaps the largest one, but there are other situations in certain code where I want to know that my optimization has failed.
A more complicated example is, I've sometimes wanted an assertion that a given function is inlined. We know from hard, repeated experience over decades that letting users annotate functions as inline directly doesn't end up working very well, but I've often wondered about creating an assertion that would fail the compile if it isn't. (Ideally with at least a hint as to why the compiler failed it out, but that's easier said than done.) Obviously you don't want to go slapping this in some major open source library that's going to be compiled by half-a-dozen compilers on dozens of operating systems, but for my own code in my own situation it can be an optimization that is the difference between success and failure and it'd be nice to flag a failure.
(Bear in mind I am not proposing to blindly take any particular action if it triggers. The point is to bring it up to human attention and not have it buried deep in invisible, yet potentially consequential, compiler decisions. The human deciding "I guess this optimization isn't happening" and removing the annotation would be one valid decision, for instance.)
Try `__attribute__((error("not inlined")))` or `warning` on the callee.
I agree, this would be useful. Another one I would like is auto-vectorization: a way to mark a loop with an attribute so that, if it fails to auto-vectorize, the compiler prints the auto-vectorization report for that loop, explaining why it failed. It's such a brittle optimization, but it's crucial for a tiny number of extremely hot loops, so you'd want to know if it broke due to some code change or compiler upgrade. Also, it's just a pain to use the auto-vectorization report normally.
As the disassembly in the post demonstrates, the problem with the fallback path (which is not necessarily the error path) is not how fast the call to it is, it's that the mere existence of that call can force the compiler to create a stack frame and spill registers into it for the whole function, including the fast path.
OK, maybe “force” is not the right word—nobody says the compiler has to have a single stack frame structure for all possible execution paths of a function. Nobody even says it has to use the standard ABI for a no-linkage (static or anonymous-namespace) function (that doesn’t have its address taken). But the reality is, all compilers I’ve seen do, including Clang, so we want a way to tell them to not worry about the ABI and avoid wasting time on preserving registers across the call.
Re your nice jump table, sure it does. But if you try running the result under, say, perf report, and your test bytecode doesn’t describe a short loop, you’ll see one of two things: either you had a branch mispredict on each dispatch; or the compiler went “looks like you’re trying to write an interpreter” and moved the indirect jump to the end of each case (I’ve seen Clang do this). And either way the register allocation in the resulting code probably sucks.
that's what -fvisibility=internal already does, no?
That’s what static could do (if the function’s address is not taken, or given sufficiently powerful dataflow analysis), but C and C++ compilers don’t take advantage of that. Getting that out of -fvisibility=hidden -flto would also be possible, but requires even more nonexistent compiler smarts. (From a quick web search, I can't figure out what internal visibility brings over hidden.)
(Granted, it’s not like this is completely impossible—I seem to remember GHC and MLton can invent custom calling conventions for Haskell and SML respectively. But the popular C or C++ compilers can’t.)
Really? I've always assumed that static gives C/C++ compilers the right to disregard calling conventions altogether. Now that I think about it, this assumption might be unjustified. Was there any strong reason for compilers to keep the calling convention?
I think that "static register" should make the compiler allowed to disregard calling conventions (if optimizations are enabled), but neither "static" nor "register" alone should do so. (In a discussion on an IRC some time ago, I had suggested what "register" alone should do in my opinion, which is something else than this.)
I mean, it does give them that right as far as I know [with, again, the caveat of function pointers—you can’t pass a non-ABI-compliant function to qsort(), however static it is—and an additional one of inline assembly], it’s just that they choose not to exercise it. Why? Not sure. Maybe it’s too difficult (register allocation and spilling is NP-complete and fairly expensive). Maybe it’s too difficult when your compiler starts out as a function-at-a-time one and builds on that. The popular ones don’t do that, is my point. (Would be interesting to check what SDCC does, by the way, on register-starved micros such tricks could make sense.)