Maybe the example is too simple, but it does not require `__attribute__((musttail))` for good code gen.
Also if the error handling function is unlikely, you wouldn't care too much about how fast it is to call it?
To me it seems like a structure of the form
  <read data>
  if (unlikely(malformed))
      return error();
  <prep more stuff>
  switch (data_type) {
  case x:
      return handle_x();
  case y:
      return handle_y();
  }
generates a nice jump table quite reliably.
Obviously compilers have been doing tail call elimination forever, but for this trick "generates [tail calls] quite reliably" is not enough: you have to GUARANTEE it (or fail compilation), otherwise this structure does not work (it will blow the stack immediately). That's the point of [[musttail]]: tail call elimination is required, it's the only choice the compiler has.
That means that any such code is not portable across compilers anymore. It is effectively written in a non-standard C dialect, because it requires a language extension to work correctly.
There's no such thing as "standard C" that you can actually write, due to UB and implementation-defined behaviour. There's just (C, compiler version, platform) that defines (if only through the compiler's source code) what will actually happen in any given situation.
so because there are implementation defined behaviors in the standard, language extensions become okay?
Language extensions are a feature, not a bug. They allow C to evolve and C compilers to compete without requiring committee consensus. Good extensions will eventually be picked up by other compilers, and maybe even find their way into the standard.
sure, i get the joke, i just don't like it. it's the same story as browsers. proprietary extensions in the name of progress because technically it's allowed, but also unimplemented standardized features galore, necessitating polyfill libraries and frequent checking of support matrices.
it's just sprawl with popular support.
It is very different. A web browser is an (almost) fully self-contained environment. Same with something like Java. On the other hand, a standard C or C++ compiler+runtime has (by design) very few features. Anything beyond trivial programs has to reach for platform-specific features. Even POSIX is often not enough.
not sure i'm following? i was referring exactly to those "few" language features, e.g.: https://en.cppreference.com/w/cpp/compiler_support
I love it when we discuss C or C++, "Language extensions are a feature, not a bug", but then when we discuss other languages that try to do without C by adding extensions to their language reference, the speech turns into "Language extensions are a weakness, not a feature".
Additionally, "Language extensions are a feature, not a bug" seems only to be valid in the context of C and C++ IF the compilers being discussed are GCC or Clang, because God forbid a commercial C or C++ compiler have language extensions.
Speaking in general about the way these subjects are discussed online, not you in particular.
It is entirely possible to write strictly conforming C.
Yes. But the alternative is assembly language, which is even less portable.
The portable alternative is being explicit with your loops, possibly in combination with gigantic unwieldy switch statements or some regular goto (it is still part of the standard). But that comes at the cost of readability and sometimes performance. Whether it's practical depends on the use case. For something like recursive data structure algorithms, which are relatively small and self-contained, I would say it's perfectly doable. Simple interpreters - maybe. Complex interpreters - here it becomes messy.
A switch-case is the default way to write an interpreter and I'd even argue it's the most readable.
In the context of this article, it's all about the performance. Switch-case generates suboptimal code for the commonly used fast paths of the protobuf parser, because the mere existence of the slow paths is enough to interfere with the performance of the code around them.
See Mike Pall’s posts on the subject—the performance cost is considerable, for two reasons. First, you’re forcing the compiler to do register allocation for the whole interpreter at once, which it can virtually never do a good job of. (This is actually the more important part.)
Second, given the existence of branch target buffers (and the ruinous cost of mispredicted branches), you really want the instruction dispatch to be a single indirect branch at the end of each instruction implementation, and for that standard tools are somewhere between unhelpful (you can write a macro containing switch (*insn++) { case INSN_FOO: goto impl_foo; /* ... */ } but it’s anybody’s guess whether you’re getting a single jump table for all copies of that) and actively obstructive (“tail merging” in older versions of Clang would actively destroy any attempts at copying dispatch code). Granted, sometimes things work out (new Clang versions can sometimes go “looks like you’re writing an interpreter” and turn a vanilla switch in a loop into duplicated dispatch code). Then again, sometimes they don’t, and you can’t actually know.
Yeah, that's the way virtual machines have been written forever: some version of a big switch inside a loop.
This is how VMs have been written since the dawn of time (or using computed gotos, another non-standard addition to C). It has problems though, like the fact that the `switch` branch is extremely unpredictable, and that you get a massive function which is hard to optimize. This [[musttail]] trick is a huge improvement. But yeah, if you have to support compilers that don't have [[musttail]], you essentially need two implementations: the [[musttail]] one and the loop/switch one.

Assembly is the most portable of them all, even across platforms! (x86 to ARM compilation meme)
The typical way to deal with this is to put the __attribute__ into a C macro which expands to nothing on compilers that don't understand GCC/Clang's __attribute__ keyword. The code without the attribute will still compile, and the compiler will most likely still apply the tail call optimization; you just don't get an error if it can't.
Also TBF, hardly any real-world C code is strictly standard compliant. Many C compilers just agree on a common syntax that includes both the C standard and some popular non-standard extensions.
PS: C++ compilers actually ignore unknown attributes since the `[[attribute]]` syntax has been standardized in C++11. In GCC and Clang you'll get a warning in the standard warning set, but not in MSVC.
PPS: C23 also standardized the `[[attribute]]` syntax and also added a way to check for supported attributes:
https://en.cppreference.com/w/c/language/attributes
It will compile, and eventually blow up nicely with a stack overflow OS fault.
Ah, the joys of writing "portable" C code with #ifdef spaghetti, across commercial UNIXes and their own C compilers, 20 years ago.
It only got better because many people, just as they assume Web == Chrome, assume C == GCC, blissfully ignoring everything else.
Nowadays clang is also considered, mostly because a couple of companies wanted to replace GCC, and naturally clang needs to be able to match whatever GCC offers.
Well, if you use recursive code, you better know what you're doing. With or without tail call optimization.
Loops are just a special, limited case of recursion.
(And only necessary in languages that have trouble implementing function calls properly.)
Not at all guaranteed. Stack overflow is undefined behaviour, which means compilers can optimise your program on the assumption that it doesn’t happen.
Ah true, yet another case that one tends to forget about.
The article is pretty clear about this. When it comes to fast lexing and parsing, it is typical for projects to make portability tradeoffs in favor of performance. For example, simdjson is full of assembly.
You say it like it's a bad thing.
Fwiw, many C projects are written in a non-standard C dialect, including the Linux kernel.
Yes, that is correct. You cannot do this trick in standard C, C++ or Rust, it requires some version of [[musttail]]. Strong argument for adding it to the C standard, IMHO.
"you have to GUARANTEE it (or fail compilation)"
I've often pondered the utility of similar flags for other optimizations. This is perhaps the largest one, but there are other situations in certain code where I want to know that my optimization has failed.
A more complicated example is, I've sometimes wanted an assertion that a given function is inlined. We know from hard, repeated experience over decades that letting users annotate functions as inline directly doesn't end up working very well, but I've often wondered about creating an assertion that would fail the compile if it isn't. (Ideally with at least a hint as to why the compiler failed it out, but that's easier said than done.) Obviously you don't want to go slapping this in some major open source library that's going to be compiled by half-a-dozen compilers on dozens of operating systems, but for my own code in my own situation it can be an optimization that is the difference between success and failure and it'd be nice to flag a failure.
(Bear in mind I am not proposing to blindly take any particular action if it triggers. The point is to bring it up to human attention and not have it buried deep in invisible, yet potentially consequential, compiler decisions. The human deciding "I guess this optimization isn't happening" and removing the annotation would be one valid decision, for instance.)
Try `__attribute__((error("not inlined")))` or `warning` on the callee.
I agree, this would be useful. Another one I would like is auto-vectorization: a way to mark a loop with an attribute so that, if it fails to auto-vectorize, the compiler prints the auto-vectorization report for that loop, explaining why it failed. It's such a brittle optimization, but it's crucial for a tiny number of extremely hot loops, so you'd want to know if it broke due to some code change or compiler upgrade. Also, it's just a pain to use the auto-vectorization report normally.
As the disassembly in the post demonstrates, the problem with the fallback path (which is not necessarily the error path) is not how fast the call to it is, it's that the mere existence of that call can force the compiler to create a stack frame and spill registers into it for the whole function, including the fast path.
OK, maybe “force” is not the right word—nobody says the compiler has to have a single stack frame structure for all possible execution paths of a function. Nobody even says it has to use the standard ABI for a no-linkage (static or anonymous-namespace) function (that doesn’t have its address taken). But the reality is, all compilers I’ve seen do, including Clang, so we want a way to tell them to not worry about the ABI and avoid wasting time on preserving registers across the call.
Re your nice jump table, sure it does. But if you try running the result under, say, perf report, and your test bytecode doesn’t describe a short loop, you’ll see one of two things: either you had a branch mispredict on each dispatch; or the compiler went “looks like you’re trying to write an interpreter” and moved the indirect jump to the end of each case (I’ve seen Clang do this). And either way the register allocation in the resulting code probably sucks.
that's what -fvisibility=internal already does, no?
That’s what static could do (if the function’s address is not taken, or given sufficiently powerful dataflow analysis), but C and C++ compilers don’t take advantage of that. Getting that out of -fvisibility=hidden -flto would also be possible, but requires even more nonexistent compiler smarts. (From a quick web search, I can't figure out what internal visibility brings over hidden.)
(Granted, it’s not like this is completely impossible—I seem to remember GHC and MLton can invent custom calling conventions for Haskell and SML respectively. But the popular C or C++ compilers can’t.)
Really? I've always assumed that static gives C/C++ compilers the right to disregard calling conventions altogether. Now that I think about it, this assumption might be unjustified. Was there any strong reason for compilers to keep the calling convention?
I think that "static register" should make the compiler allowed to disregard calling conventions (if optimizations are enabled), but neither "static" nor "register" alone should do so. (In a discussion on an IRC some time ago, I had suggested what "register" alone should do in my opinion, which is something else than this.)
I mean, it does give them that right as far as I know [with, again, the caveat of function pointers—you can’t pass a non-ABI-compliant function to qsort(), however static it is—and an additional one of inline assembly], it’s just that they choose not to exercise it. Why? Not sure. Maybe it’s too difficult (register allocation and spilling is NP-complete and fairly expensive). Maybe it’s too difficult when your compiler starts out as a function-at-a-time one and builds on that. The popular ones don’t do that, is my point. (Would be interesting to check what SDCC does, by the way, on register-starved micros such tricks could make sense.)