
The return of the frame pointers

adsharma
28 replies
13h41m

I was at Google in 2005 on the other side of the argument. My view back then was simple:

Even if $BIG_COMPANY makes a decision to compile everything with frame pointers, the rest of the community is not. So we'll be stuck fighting an unwinnable argument with a much larger community. Turns out that it was a ~20 year argument.

I ended up writing some patches to make libunwind work for gperftools and maintained libunwind for some number of years as a consequence of that work.

Having moved on to other areas of computing, I'm now a passive observer. But it's fascinating to read history from the other perspective.

starspangled
13 replies
13h13m

So we'll be stuck fighting an unwinnable argument with a much larger community.

In what way would you be stuck? What functional problems does adding frame pointers introduce?

tempay
6 replies
13h7m

It “wastes” a register when you’re not actively using frame pointers. On x86 that can make a big difference, though with the added registers of x86_64 it’s much less significant.

inkyoto
1 replies
7h28m

Wasting a register on comparatively more modern ISAs (PA-RISC 2.0, MIPS64, POWER, aarch64 etc – they are all more modern and have an abundance of general purpose registers) is not a concern.

The actual «wastage» is in having to generate a prologue and an epilogue for each function: 2 instructions to preserve the old frame pointer and set up a new one, and 2 instructions at the point of return to restore the previous frame pointer.

Generally, it is not a big deal, with the exception of the pathological case of a very large number of very small functions calling each other frequently, where the extra 4 instructions per function fill up the L1 instruction cache «unnecessarily».
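To put rough numbers on that pathological case, here is a minimal sketch that treats the cost purely as a static instruction count. The default of 4 extra instructions is the figure quoted in the comment above, not a measurement; the function name is made up for illustration, and real cost depends on the ISA, the compiler, and the cache behaviour of the workload.

```python
def frame_pointer_overhead(body_instructions: int, extra_instructions: int = 4) -> float:
    """Return the added instructions as a fraction of the original function body.

    extra_instructions defaults to the 2 + 2 figure quoted above; the real
    count depends on the ISA and the compiler.
    """
    if body_instructions <= 0:
        raise ValueError("body_instructions must be positive")
    return extra_instructions / body_instructions


# A tiny 4-instruction function doubles in size; a 400-instruction one grows by 1%.
for size in (4, 40, 400):
    print(f"{size:4d} instructions -> +{frame_pointer_overhead(size):.0%}")
```

Under this crude model the relative growth only matters for functions small enough that inlining would normally be expected to remove the call altogether.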

weebull
0 replies
2h43m

Those pathological cases are really what inlining is for, with the exception of any tiny recursive functions that can't be tail call optimised.

charleshn
1 replies
12h45m

It's not just the loss of an architectural register, it's also the added cost to the prologue/epilogue. Even on x86_64, it can make a difference, in particular for small functions, which might not be inlined for a variety of reasons.

Asooka
0 replies
3h5m

If your small function is not getting inlined, you should investigate why that is instead of globally breaking performance analysis of your code.

starspangled
0 replies
12h56m

Right, but I was asking about functional problems (being "stuck"), which sounded like a big issue for the choice.

nlewycky
0 replies
12h59m

It caused a problem when building inline assembly heavy code that tried to use all the registers, frame pointer register included.

adsharma
4 replies
12h35m

I wasn't talking about functional problems. It was a simple observation that big companies were not going to convince Linux distributors to add frame pointers anytime soon and that what those distributors do is relevant.

All of the companies involved believed that they were special and decided to build their own (poorly managed) distribution called "third party code" and having to deal with it was not my best experience working at these companies.

starspangled
3 replies
12h22m

Oh, I just assumed you were talking about Google's Linux distribution and applications it runs on its fleet. I must have mis-assumed. Re-reading... maybe you weren't talking about any builds but just whether or not to oppose kernel and toolchain defaulting to omit frame pointers?

adsharma
2 replies
12h17m

Google didn't have a Linux distribution for a long time (the one everyone used on the desktop was an outdated rpm based distro, we mostly ignored it for development purposes).

What existed was an x86-to-x86 cross-compilation environment, and the libraries involved were manually imported by whichever developers needed a particular library.

My argument was about the cost of ensuring that those libraries were compiled with frame pointers when much of the open source community was defaulting to omit-fp.

dooglius
1 replies
5h25m

Would it not be easier to patch compilers to always assume the equivalent of -fno-omit-frame-pointer?

adsharma
0 replies
39m

That was done in 2005. But the task of auditing the supply chain to ensure that every single shared library you ever linked with was compiled a certain way was still hard. Nothing prevented an intern or a new employee from checking in a library without frame pointers into the third-party repo.

In 2024, you'd probably create a "build container" that all developers are required to use to build binaries or pay a linux distributor to build that container.

But cross compilation was the preferred approach back then. So all binaries had an rpath (a run-time search path for shared libraries) that ignored the distributor-supplied libraries.

Having come from an open source background, I found this system hard to digest. But there was a lot of social pressure to work as a bee in a system that thousands of other very competent engineers were using (quite successfully).

I remember briefly talking to a chrome OS related group who were using the "build your own custom distro" approach, before deciding to move to another faang.

jart
12 replies
12h2m

Please name the individuals who are blocking progress on frame pointers. It's such a clear and obvious win that the rest of us should have the opportunity to persuade them. https://news.ycombinator.com/item?id=34660813

quotemstr
10 replies
8h45m

The clear and obvious win would have been adoption of a universal userspace generic unwind facility, like Windows has --- one that works with multiple languages. Turning on frame pointers is throwing in the towel on the performance tooling ecosystem coordination problem: we can't get people to fix unwind information, so we do this instead? Ugh.

rwmj
9 replies
8h22m

Yes, although the universal mechanisms that have been proposed so far have been quite ridiculous - for example having every program handle a "frame pointer signal" in userspace, which doesn't account for the reality that we need to do frame unwinding thousands of times a second with the least possible overhead. Frame pointers work for most things, and where they don't work (interpreted code) you're often not that interested in performance.

quotemstr
7 replies
5h50m

every program handle a "frame pointer signal" in userspace

Yep. That's my proposal.

which doesn't account for the reality that we need to do frame unwinding thousands of times a second with the least possible overhead

Yes, it does. The kernel has to return to userspace anyway at some point, and pushing a signal frame during that return is cheap. The cost of signal delivery is the entry into the kernel, and after a perf counter overflow, you've already paid that cost. Why would the actual unwinding be any faster in the kernel than in userspace?

Also, so what if a thread enters the kernel and samples the stack multiple times before returning to userspace? While in the kernel, the userspace stack cannot change --- therefore, it's sufficient to delay userspace stack collection until the kernel returns to userspace anyway.
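The deferral argument can be sketched as a toy model. Everything here (the `SampledThread` class and its method names) is hypothetical and only illustrates the bookkeeping: samples taken while the thread is in the kernel are queued, and a single user-stack unwind at the return to userspace serves all of them.

```python
class SampledThread:
    """Toy model of a thread whose user stack is captured lazily."""

    def __init__(self, user_stack):
        self.user_stack = list(user_stack)   # frozen while the thread is in the kernel
        self.pending = []                    # timestamps of samples awaiting a user stack
        self.unwinds = 0                     # how many times we actually walked the stack
        self.records = []                    # (timestamp, stack) pairs that were emitted

    def sample_in_kernel(self, timestamp):
        # Only note that a sample happened; do not touch the user stack yet.
        self.pending.append(timestamp)

    def return_to_userspace(self):
        if not self.pending:
            return
        # One unwind serves every sample taken during this kernel entry.
        stack = tuple(self.user_stack)
        self.unwinds += 1
        self.records.extend((ts, stack) for ts in self.pending)
        self.pending.clear()


thread = SampledThread(["main", "handle_request", "read"])
for ts in (100, 101, 102):
    thread.sample_in_kernel(ts)
thread.return_to_userspace()
print(thread.unwinds, len(thread.records))
```

The model says nothing about the cost of delivering the signal itself, which is the part the rest of the thread disputes.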

You might ask "Don't we have to restore the signal mask after handling the profiling signal?"

Not if you don't define the signal to change the signal mask. sigreturn(2) is optional.

rwmj
6 replies
5h11m

This sounds vastly more complex already than following a linked list. You've also ignored the other cost which is getting the stack trace data out of the program. Anyway I'm keen to see your implementation and test how it works in reality.
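For reference, "following a linked list" looks roughly like the sketch below. It simulates stack memory as a Python dict and assumes the x86-64 layout, where the word at the frame pointer is the caller's saved frame pointer and the next word up is the return address. The addresses are made up, and a real unwinder also has to validate every read.

```python
WORD = 8  # bytes per stack slot on a 64-bit target

def walk_frame_pointers(memory, fp, max_depth=64):
    """Collect return addresses by following the chain of saved frame pointers.

    memory maps an address to the 64-bit word stored there. At each frame,
    memory[fp] holds the caller's frame pointer and memory[fp + WORD] holds
    the return address (the x86-64 convention; other ISAs differ).
    """
    trace = []
    while fp != 0 and len(trace) < max_depth:
        trace.append(memory[fp + WORD])
        next_fp = memory[fp]
        if next_fp != 0 and next_fp <= fp:
            break  # the chain must move toward older frames; stop on corruption
        fp = next_fp
    return trace


# Three nested frames; the outermost one terminates the chain with a saved fp of 0.
memory = {
    0x7000: 0x7020, 0x7008: 0x401234,  # innermost frame
    0x7020: 0x7040, 0x7028: 0x401100,
    0x7040: 0x0,    0x7048: 0x401000,  # outermost frame
}
print([hex(a) for a in walk_frame_pointers(memory, 0x7000)])
```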

quotemstr
5 replies
4h59m

This sounds vastly more complex already than following a linked list.

Efficient things often end up being more complex and supporting more features than brute force approaches. Frame pointers have a hard time letting us interpret managed stack frames, for example, and a simplistic atomic-context in-kernel FP walker will stop traversing the stack if it hits a page that happens not to be resident.
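The non-resident page case can be illustrated with a small simulation. This is only a sketch: residency is modelled as a set of page numbers, the 4 KiB page size and x86-64 frame layout are assumptions, and `walk_resident_only` is a made-up name rather than any kernel function.

```python
PAGE = 4096
WORD = 8

def walk_resident_only(memory, resident_pages, fp, max_depth=64):
    """Frame pointer walk that may not fault: it stops at a non-resident page.

    Returns (trace, complete). complete is False when the walk was cut short.
    """
    trace = []
    while fp != 0 and len(trace) < max_depth:
        # Both words of the frame record must be readable without a page fault.
        if fp // PAGE not in resident_pages or (fp + WORD) // PAGE not in resident_pages:
            return trace, False
        trace.append(memory[fp + WORD])
        fp = memory[fp]
    return trace, fp == 0


# Two frames that live on different pages (page 7 and page 9).
memory = {0x7FF0: 0x9000, 0x7FF8: 0x401234, 0x9000: 0x0, 0x9008: 0x401000}
print(walk_resident_only(memory, {7, 9}, 0x7FF0))  # both pages resident
print(walk_resident_only(memory, {7}, 0x7FF0))     # page 9 paged out: truncated trace
```

A walker running in userspace can simply take the page fault and continue, which is the difference being pointed at here.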

You've also ignored the other cost which is getting the stack trace data out of the program

io_uring would be a good candidate --- no-privilege-transmission data flows. Even if you don't want to use it, you can have userspace batch up a few dozen userspace stack collections and flush them to the perf or ftrace event buffer all at once, at regular intervals. Doing so would amortize whatever reporting overhead you have in mind.
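The batching idea, independent of the transport, might look like the sketch below. `BatchedReporter` and its `sink` callable are hypothetical stand-ins for whatever actually writes to the perf or ftrace buffer; the batch size of 32 is an arbitrary choice in the "few dozen" range mentioned above.

```python
class BatchedReporter:
    """Buffer stack samples and hand them to `sink` in batches."""

    def __init__(self, sink, batch_size=32):
        self.sink = sink              # callable taking a list of samples (hypothetical)
        self.batch_size = batch_size
        self.buffer = []
        self.flushes = 0

    def record(self, stack):
        self.buffer.append(tuple(stack))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []
            self.flushes += 1


received = []
reporter = BatchedReporter(received.extend, batch_size=32)
for i in range(100):
    reporter.record(["main", "worker", f"step_{i % 4}"])
reporter.flush()  # drain the partial final batch
print(reporter.flushes, len(received))
```

With 100 samples and a batch size of 32 the sink is called 4 times instead of 100, which is the amortization being described.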

Anyway I'm keen to see your implementation and test how it works in reality

Ah, that word "reality", which is the last retort of people who've exhausted their technical arguments.

jart
2 replies
3h48m

I propose that a frame pointer daemon be introduced too, for managing the frame pointer signals. We shall modify _start() to open up an io_uring connection to SystemD so that a program may share its .eh_frame data. That way the kernel can still unwind its stack in case apt upgrade changes the elf inode.

quotemstr
1 replies
3h9m

Neither of you has identified anything technically wrong with unwinding via signal and neither of you has proposed a mechanism through which we might support semantically informative unwinding through paged-out code or interpreted languages.

Sarcasm is not a technical argument.

jart
0 replies
2h17m

I don't need to. Fedora and Ubuntu have already changed their policies to restore frame pointers. As far as I can tell, your proposal is no longer on the table. If you aren't willing to accept the decision, then you should at least understand that the onus is on you now to justify why things need to change.

rwmj
0 replies
3h43m

We have to deal with reality if we want to measure and improve software performance today. The current reality is that frame pointers are the best choice. Brendan's article outlines a couple of possible future scenarios where we turn frame pointers off again, but they require work that is not done yet (in one case, advances in CPUs).

loeg
0 replies
51m

Your argument would be more compelling without the swipe in the final sentence.

jart
0 replies
4h20m

Cosmopolitan Libc does frame pointer unwinding once per function call, when the --ftrace flag is passed. https://justine.lol/ftrace/

samatman
0 replies
1h47m

I think this came off somewhat aggressive. I vouched for the comment because flagging it is an absurd overreaction, but I also don't think pointing out isolated individuals would be of much help.

Barriers to progress here are best identified on a community level, wouldn't you say?

But people, please calm down. Filing an issue or posting to the mailing list to make a case isn't sending a SWAT team to people's home. It's a technical issue, one well within the envelope of topics which can be resolved politely and on the merits.

brcmthrowaway
0 replies
1h41m

What area?

rwmj
21 replies
8h31m

I'm glad he mentioned Fedora because it's been a tiresome battle to keep frame pointers enabled in the whole distribution (eg https://pagure.io/fesco/issue/3084).

There's a persistent myth that frame pointers have a huge overhead, because there was a single Python case that had a +10% slow down (now fixed). The actual measured overhead is under 1%, which is far outweighed by the benefits we've been able to make in certain applications.

menaerus
10 replies
6h12m

I believe it's a misrepresentation to say that "actual measured overhead is under 1%". I don't think such a claim can be universally applied because this depends on the very workload you're measuring the overhead with.

FWIW your results don't quite match the measurements from Linux kernel folks who claim that the overhead is anywhere between 5-10%. Source: https://lore.kernel.org/lkml/20170602104048.jkkzssljsompjdwy...

   I didn't preserve the data involved but in a variety of workloads including netperf, page allocator microbenchmark, pgbench and sqlite, enabling framepointer introduced overhead of around the 5-10% mark.

The significance of their results, IMO, is that they measured the impact using PostgreSQL and SQLite. If anything, a DBMS is one of the best ways to really stress a system.

brendangregg
4 replies
6h6m

Those are microbenchmarks.

menaerus
3 replies
6h5m

pgbench is not a microbenchmark.

brendangregg
2 replies
5h55m

From the docs: "pgbench is a simple program for running benchmark tests on PostgreSQL. It runs the same sequence of SQL commands over and over"

While it might call itself a benchmark, it behaves very microbenchmark-y.

The other numbers I and others have shared have been from actual production workloads, not from a simple program that runs the same sequence of commands over and over.

weebull
0 replies
2h55m

Anything running a full database server is not micro.

menaerus
0 replies
4h42m

While pgbench might be a "simple" program, as in a test runner, the workloads it runs are far from simple. It runs TPC-B by default but can also run your own arbitrary script that defines whatever workload you want. It also allows queries to run concurrently, so I fail to understand the reasoning behind calling it "simple" or "microbenchmark-y". It's far from the truth, I think.

babel_
2 replies
5h29m

Those are numbers from 7 years ago, so they're beginning to get a bit stale as people start to put more weight behind having frame pointers and make upstream contributions to their compilers to improve their output. People put it at <1% from much more recent testing by the very R.W.M. Jones you're replying to [0] and separate testing by others like Brendan Gregg [1b], whose post this is commenting on (and included [1b] in the Appendix as well), with similar accounts by others in the last couple years. Oh, and if you use flamegraph, you might want to check the repo for a familiar name.

Some programs, like Python, have reported worse, 2-7% [2], but there is traction on tackling that [1a] (see both rwmj's and brendangregg's replies to sibling comments, they've both done a lot of upstreamed work wrt. frame pointers, performance, and profiling).

As has been frequently pointed out, the benefits from improved profiling cannot be overstated; even a 10% cost to having frame pointers can be well worth it when you leverage that information to target the actual bottlenecks that are eating up your cycles. Plus, you can always disable it in specific hotspots later when needed, which is much easier than the reverse.

Something, something, premature optimisation -- though in seriousness, this information benefits actual optimisation, exactly because we don't have the information and understanding that would allow truly universal claims, precisely because things like this haven't been available, and so haven't been widely used. We know frame pointers, from additional register pressure and extended function prologue/epilogue, can be a detriment in certain hotspots; that's why we have granular control. But without them, we often don't know which hotspots are actually affected, so I'm sure even the databases would benefit... though the "my database is the fastest database" problem has always been the result of endless micro-benchmarking, rather than actual end-to-end program performance and latency, so even a claimed "10%" drop there probably doesn't impact actual real-world usage, but that's a reason why some of the most interesting profiling work lately has been from ideas like causal profilers and continuous profilers, which answer exactly that.

[0]: https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwar... [1a]: https://pagure.io/fesco/issue/2817#comment-826636 [1b]: https://pagure.io/fesco/issue/2817#comment-826805 [2]: https://discuss.python.org/t/the-performance-of-python-with-...

doctorpangloss
0 replies
2h17m

As has been frequently pointed out, the benefits from improved profiling cannot be overstated; even a 10% cost to having frame pointers can be well worth it when you leverage that information to target the actual bottlenecks that are eating up your cycles.

Few can leverage that information because the open source software you are talking about lacks telemetry in the self hosted case.

The profiling issue really comes down to the cultural opposition in these communities to collecting telemetry and opening it for anyone to see and use. The average user struggles to ally with a trustworthy actor who will share the information like profiling freely and anonymize it at a per-user level, the level that is actually useful. Such things exist, like the Linux hardware site, but only because they have not attracted the attention of agitators.

Basically users are okay with profiling, so long as it is quietly done by Amazon or Microsoft or Google, and not by the guy actually writing the code and giving it out for everyone to use for free. It’s one of the most moronic cultural trends, and blame can be put squarely on product growth grifters who equate telemetry with privacy violations; open source maintainers, who have enough responsibilities as is, besides educating their users; and Apple, who have made their essentially vaporous claims about privacy a central part of their brand.

Of course people know the answer to your question. Why doesn’t Google publish every profile of every piece of open source software? What exactly is sensitive about their workloads? Meta publishes a whole library about every single one of its customers, for anyone to freely read. I don’t buy into the holiness of the backend developer’s “cleverness” or whatever is deemed sensitive, and it’s so hypocritical.

adrian_b
0 replies
2h21m

While improved profiling is useful, achieving it by wasting a register is annoying, because it is just a very dumb solution.

The choice Intel made when designing the 8086, to use 2 separate registers for the stack pointer and the frame pointer, was a big mistake.

It is very easy to use a single register as both the stack pointer and the frame pointer, as is standard, for instance, on IBM POWER.

Unfortunately in the Intel/AMD CPUs using a single register is difficult, because the simplest implementation is unreliable since interrupts may occur between 2 instructions that must form an atomic sequence (and they may clobber the stack before new space is allocated after writing the old frame pointer value in the stack).

It would have been very easy to correct this in new CPUs by detecting that instruction sequence and blocking the interrupts between them.

Intel had already done this once early in the history of the x86 CPUs, when they discovered a mistake in the design of the ISA: interrupts could occur between updating the stack segment and the stack pointer. They corrected this by detecting such an instruction sequence and blocking interrupts at the boundary between those instructions.

The same could have been done now, to enable the use of the stack pointer as also the frame pointer. (This would be done by always saving the stack pointer in the top of the stack whenever stack space is allocated, so that the stack pointer always points to the previous frame pointer, i.e. to the start of the linked list containing all stack frames.)

barrkel
1 replies
5h28m

This isn't an argument for a default.

menaerus
0 replies
4h37m

I was not even trying to make one. I was questioning the validity of "1% overhead" claim by providing the counter-example from respectable source.

awaythrow999
4 replies
3h32m

Frame pointers are still a no-go on 32bit so anything that is IoT today.

The reason we removed them was not a myth but comes from the pre-64 bit days. Not that long ago actually.

Even today, if you want to give older 64 bit systems a new life, this kind of optimization still makes sense.

Ideally it should be the default also for security critical systems because not everything needs to be optimized for "observability"

Narishma
3 replies
3h29m

Frame pointers are still a no-go on 32bit so anything that is IoT today.

Isn't that just 32-bit x86, which isn't used in IoT? The other 32-bit ISAs aren't register-starved like x86.

weebull
2 replies
2h49m

It would be, yes. x86 had very few registers, so anything you could do to free them up was vital. Arm 32bit has 32 general purpose registers I think, and RISC V certainly does. In fact there's no difference between 32 and 64 bit in that respect. If anything, 64-bit frame pointers make it marginally worse.

CountSessine
1 replies
2h23m

Sadly, no. 32-bit ARM only has 16 GPR’s (two of which are zero and link), mostly because of the stupid predication bits in the instruction encoding.

That said, I don’t know how valuable getting rid of FP on ARM is - I once benchmarked ffmpeg on 32-bit x86 before and after enabling FP and PIC (basically removing 2 GPRs) and the difference was huge (>10%) but that’s an extreme example.

fanf2
0 replies
28m

Arm32 doesn’t have a zero-value register. Its non-general-purpose registers are PC, LR, SP, FP – tho the link register can be used for temporary values.

brendangregg
2 replies
6h55m

Thanks; what was the Python fix?

rwmj
1 replies
6h35m

This was the investigation: https://discuss.python.org/t/python-3-11-performance-with-fr...

Initially we just turned off frame pointers for the Python 3.9 interpreter in Fedora. They are back on in Python 3.12 where it seems the upstream bug has been fixed, although I can't find the actual fix right now.

Fedora tracking bug: https://bugzilla.redhat.com/2158729

Fedora change in Python 3.9 to disable frame pointers: https://src.fedoraproject.org/rpms/python3.9/c/9b71f8369141c...

edwintorok
1 replies
5h55m

You probably already know, but with OCaml 5 the only way to get flamegraphs working is to either:

* use framepointers [1]

* use LBR (but LBR has a limited depth, and may not work on all CPUs, I'm assuming due to bugs in perf)

* implement some deep changes in how perf works to handle the 2 stacks in OCaml (I don't even know if this would be possible), or write/adapt some eBPF code to do it

OCaml 5 has a separate stack for OCaml code and C code, and although GDB can link them based on DWARF info, perf DWARF call-graphs cannot (https://github.com/ocaml/ocaml/issues/12563#issuecomment-193...)

If you need more evidence to keep it enabled in future releases, you can use OCaml 5 as an example (unfortunately there aren't many OCaml applications, so that may not carry too much weight on its own).

[1]: I hadn't actually realised that Fedora 39 has already enabled FP by default, nice! (I still do most of my day-to-day profiling on an ~CentOS 7 system with 'perf record --call-graph dwarf -F 47 -a'. I was aware that there was a discussion to enable FP by default, but hadn't noticed it has actually been done already.)

namibj
0 replies
1h55m

No, LBR is an Intel-only feature.

tdullien
14 replies
11h17m

As much as the return of frame pointers is a good thing, it's largely unnecessary -- it arrives at a point where multiple eBPF-based profilers are available that do fine using .eh_frame and also manually unwinding high level language runtime stacks: Both Parca from PolarSignals as well as the artist formerly known as Prodfiler (now Elastic Universal Profiling) do fine.

So this is a solution for a problem, and it arrives just at the moment that people have solved the problem more generically ;)

(Prodfiler coauthor here, we had solved all of this by the time we launched in Summer 2021)

weinzierl
4 replies
8h59m

Also I've heard that the whole .eh_frame unwinding is more fragile than a simple frame pointer. I've seen enough broken stack traces myself, but honestly I never tried if -fno-omit-frame-pointer would have helped.

tdullien
3 replies
8h49m

Yes and no. A simple frame pointer needs to be present in all libraries, and depending on build settings, this might not be the case. .eh_frame tends to be emitted almost everywhere...

So it's both similarly fragile, but one is almost never disabled.

The broader point is: For HLL runtimes you need to be able to switch between native and interpreted unwinds anyhow, so you'll always do some amount of lifting in eBPF land.

And yes, having frame pointers removes a lot of complexity, so it's net a very good thing. It's just that the situation wasn't nearly as dire as described, because people that care about profiling had built solutions.

quotemstr
2 replies
8h42m

Forget eBPF even -- why do the job of userspace in the kernel? Instead of unwinding via eBPF, we should ask userspace to unwind itself using a synchronous signal delivered to userspace whenever we've requested a stack sample.

bregma
1 replies
4h2m

Context switches are incredibly expensive. Given the sampling rate of eBPF profilers all the useful information would get lost in the context switch noise.

Things get even more complicated because context switches can mean CPU migrations, making much of your data useless.

quotemstr
0 replies
3h7m

What makes you think doing unwinding in userspace would do any more context switches (by which I think you mean privilege level transitions) than we do today? See my other comment on the subject.

Things get even more complicated because context switches can mean CPU migrations, making much of your data useless.

No it doesn't. If a user space thread is blocked on doing kernel work, its stack isn't going to change, not even if that thread ends up resuming on a different CPU.

int_19h
2 replies
9h14m

PolarSignals is specifically discussed in the linked threads, and they conclude that their approach is not good enough for perf reasons.

tdullien
0 replies
8h49m

Oh nice, I can't find that - can you post a link?

javierhonduco
0 replies
6h0m

Curious to hear more about this. Full disclosure: I designed and implemented .eh_frame unwinding when I worked at Polar Signals.

Tomte
2 replies
11h13m

You mean we don‘t need accessible profiling in free software because there are companies selling it to us. Cool.

tdullien
0 replies
9h33m

Parca is open-source, Prodfiler's eBPF code is GPL, and the rest of Prodfiler is currently going through OTel donation, so my point is: There's now multiple FOSS implementations of a more generic and powerful technique.

brancz
0 replies
10h39m

Parca's user-space code is apache2 and the eBPF code is GPL.

searealist
0 replies
8h30m

I'm under the impression that eh_frame stack traces are much slower than frame pointer stack traces, which makes always-on profiling, such as seen in tcmalloc, impractical.

nemetroid
0 replies
6h45m

If you're sufficiently in control of your deployment details to ensure that BPF is available at all. CAP_SYS_PTRACE is available ~everywhere for everyone.

felixge
0 replies
7h58m

First of all, I think the .eh_frame unwinding y'all pioneered is great.

But I think you're only thinking about CPU profiling at <= 100 Hz / core. However, Brendan's article is also talking about Off-CPU profiling, and as far as I can tell, all known techniques (scheduler tracing, wall clock sampling) require stack unwinding to occur 1-3 orders of magnitude more often than for CPU profiling.

For those use cases, I don't think .eh_frame unwinding will be good enough, at least not for continuous profiling. E.g. see [1][2] for an example of how frame pointer unwinding allowed the Go runtime to lower execution tracing overhead from 10-20% to 1-2%, even though it was already using a relatively fast lookup table approach.

[1] https://go.dev/blog/execution-traces-2024

[2] https://blog.felixge.de/reducing-gos-execution-tracer-overhe...

Joker_vD
13 replies
12h12m

Of course, if you cede RBP to be a frame pointer, you may as well have two stacks, one which is pointed into by RBP and stores the activation frames, and the other one which is pointed into by RSP and stores the return addresses only. At this point, you don't even need to "walk the stack" because the call stack is literally just a flat array of return addresses.
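A toy model of that two-stack layout is sketched below. `SplitStackMachine` is purely illustrative and uses Python lists in place of real memory; the point is only that a backtrace becomes a copy of one array instead of a pointer chase.

```python
class SplitStackMachine:
    """Toy machine with separate stacks for return addresses and for locals."""

    def __init__(self):
        self.return_stack = []   # what RSP would point into: return addresses only
        self.frame_stack = []    # what RBP would point into: activation frames

    def call(self, return_address, frame_locals):
        self.return_stack.append(return_address)
        self.frame_stack.append(dict(frame_locals))

    def ret(self):
        self.frame_stack.pop()
        return self.return_stack.pop()

    def backtrace(self):
        # No walking required: the call stack is already a flat array.
        return list(reversed(self.return_stack))


machine = SplitStackMachine()
machine.call(0x401000, {"argc": 1})
machine.call(0x401100, {"buf": b""})
machine.call(0x401234, {})
print([hex(a) for a in machine.backtrace()])  # innermost return address first
```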

Why do we normally store the return addresses near to the local variables in the first place, again? There are so many downsides.

astrobe_
3 replies
8h42m

You may be ready for Forth [1] ;-). Strangely, the Wikipedia article apparently doesn't put forward that Forth allows access both to the parameter and the return stack, which is a major feature of the model.

[1] https://en.wikipedia.org/wiki/Forth_(programming_language)

mikewarot
1 replies
1h38m

Forth has a parameter stack, return stack, vocabulary stack

STOIC, a variant of Forth, includes a file stack when loading words

samatman
0 replies
1h29m

I'm not sure what you're referring to with "vocabulary stack" here, perhaps the dictionary? More of a linked list, really a distinctive data structure of its own.

samatman
0 replies
1h30m

That does seem like a significant oversight. >r and r>, and cousins, are part of ANSI Forth, and I've never used a Forth which doesn't have them.

naasking
2 replies
11h22m

It simplifies storage management. A single stack is a simple bump pointer that is always in cache and needs only one guard page for overflow. In your proposal you need two guard pages and double the stack manipulations, and you double the chance of a cache miss.

imtringued
0 replies
6h36m

The reduceron had five stacks and it was faster because of it.

Joker_vD
0 replies
10h47m

Yes, two guard pages are needed. No, the stack management stays the same: it's just "CALL func" at the call site, "SUB RBP, <frame_size>" at the prologue and "ADD RBP, <frame_size>; RET" at the epilogue. As for chances of a cache miss... probably, but I guess you also double them up when you enable CET/Shadow Stack so eh.

In exchange, it becomes very difficult for the stack smashing to corrupt the return address.

dan-robertson
2 replies
9h25m

Note the ‘shadow stacks’ CPU feature mentioned briefly in the article, though it’s more for security reasons. It’s pretty similar to what you describe.

rwmj
1 replies
8h21m

Shadow stacks have been proposed as an alternative, although it's my understanding that in current CPUs they hold only a limited number of frames, like 16 or 32?

amluto
0 replies
7h27m

You may be thinking of the return stack buffer. The shadow stack holds every return address.

sweetjuly
1 replies
4h17m

Why do we normally store the return addresses near to the local variables in the first place, again? There are so many downsides.

The advantage of storing them elsewhere is not quite clear (unless you have hardware support for things like shadow stacks).

You'd have to argue that the cost of moving things to this other page and managing two pointers (where one is less powerful in the ISA) is meaningfully cheaper than the other equally effective mitigation of stack cookies/protectors which are already able to provide protection only where needed. There is no real security benefit to doing this over what we currently have with stack protectors since an arbitrary read/write will still lead to a CFI bypass.

weebull
0 replies
2h36m

The advantage of storing them elsewhere is not quite clear (unless you have hardware support for things like shadow stacks).

The classic buffer overflow issue should spring immediately to mind. By having a separate return address stack it's far less vulnerable to corruption through overflowing your data structures. This stops a bunch of attacks which purposely put crafted return addresses into position that will jump the program to malicious code.

It's not a panacea, but generally keeping code pointers away from data structures is a good idea.
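The classic linear overflow can be simulated with plain lists. This is a sketch of the memory layout only: the slot counts and addresses are made up, and it models a contiguous overwrite, not the arbitrary read/write case raised in the parent comment.

```python
def unchecked_copy(stack, start, data):
    """Write data into stack slots from `start` upward with no bounds check."""
    for offset, value in enumerate(data):
        stack[start + offset] = value


BUF_SLOTS = 4
RETURN_ADDRESS = 0x401000
ATTACKER_INPUT = [0x41] * BUF_SLOTS + [0xDEADBEEF]  # one slot too long

# Unified stack: the return address sits directly above the local buffer.
unified = [0] * BUF_SLOTS + [RETURN_ADDRESS]
unchecked_copy(unified, 0, ATTACKER_INPUT)
print(hex(unified[BUF_SLOTS]))   # the saved return address has been replaced

# Split stacks: locals and return addresses live in different regions.
data_stack = [0] * BUF_SLOTS + [0]        # buffer plus one neighbouring local
return_stack = [RETURN_ADDRESS]
unchecked_copy(data_stack, 0, ATTACKER_INPUT)
print(hex(return_stack[0]))      # untouched; only the neighbouring local was clobbered
```

In the split layout the overflow still corrupts adjacent data, so this narrows the attack rather than removing it.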

stefan_
0 replies
7h1m

While here, why do we grow the stack the wrong way so misbehaved programs cause security issues? I know the reason of course, like so many things it last made sense 30 years ago, but the effects have been interesting.

loeg
7 replies
13h19m

Brendan mentions DWARF unwinding, actually, and briefly mentions why he considers it insufficient.

haberman
6 replies
12h39m

The biggest objection seems to be the Java/JIT case. eh_frame supports a "personality function" which is AIUI basically a callback for performing custom unwinding. If the personality function could also support custom logic for producing backtraces, then the profiling sampler could effectively read the JVM's own metadata about the JIT'ted code, which I assume it must have in order to produce backtraces for the JVM itself.

loeg
5 replies
12h2m

This also seems like a big objection:

The overhead to walk DWARF is also too high, as it was designed for non-realtime use.

kouteiheika
2 replies
10h13m

Not a problem in practice. The way you solve it is to just translate DWARF into a simpler representation that doesn't require you to walk anything. (But I understand why people don't want to do it. DWARF is insanely complex and annoying to deal with.)

Source: I wrote multiple profilers.

loeg
1 replies
1h17m

In this thread[1] we're discussing problems with using DWARF directly for unwinding, not possible translations of the metadata into other formats (like ORC or whatever).

[1]: https://news.ycombinator.com/item?id=39732010

kouteiheika
0 replies
38m

I wasn't talking about other formats. I was talking about preloading the information contained in DWARF into a more efficient in-memory representation once when your profiler starts, and then the problem of "the overhead is too high for realtime use" disappears.
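
A minimal sketch of the preloading idea described above, in C. It assumes the DWARF CFI has already been parsed elsewhere and flattened into an array sorted by address. The `unwind_row` layout and the `find_row` and `return_address_slot` helpers are hypothetical names for illustration, not the format of any particular profiler, and a real table would also need rules for saved registers such as RBP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened unwind row: valid from start_pc up to the next row. */
struct unwind_row {
    uint64_t start_pc;   /* first instruction address this row covers */
    int32_t  cfa_offset; /* canonical frame address (CFA) = SP + cfa_offset */
};

/*
 * Binary search over rows sorted by ascending start_pc.
 * Returns the last row whose start_pc <= pc, or NULL if pc precedes the table.
 */
const struct unwind_row *find_row(const struct unwind_row *rows, size_t n,
                                  uint64_t pc)
{
    size_t lo = 0, hi = n;

    assert(rows != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rows[mid].start_pc <= pc)
            lo = mid + 1;   /* the answer is at mid or to its right */
        else
            hi = mid;       /* the answer is strictly left of mid */
    }
    return lo == 0 ? NULL : &rows[lo - 1];
}

/* On x86_64 the return address sits in the 8 bytes just below the CFA. */
uint64_t return_address_slot(const struct unwind_row *row, uint64_t sp)
{
    return sp + (uint64_t)(int64_t)row->cfa_offset - 8;
}
```

For a function that starts with `push %rbp`, the rows would typically be offset 8 at entry and offset 16 after the push. Each sample then costs one binary search per frame instead of interpreting the CFI program.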

menaerus
0 replies
8h41m

From https://fzn.fr/projects/frdwarf/frdwarf-oopsla19.pdf

    DWARF-based unwinding can be a bottleneck for time-sensitive program analysis tools. For instance the perf profiler is forced to copy the whole stack on taking each sample and to build the backtraces offline: this solution has a memory and time overhead but also serious confidentiality and security flaws.

So if I understand this correctly, the problem with DWARF is that building the backtrace online (on each sample) is expensive compared with frame pointers. This can be mitigated by building the backtrace offline, at the expense of copying the stack.

However, the paper also mentions

    Similarly, the Linux kernel by default relies on a frame pointer to provide reliable backtraces. This incurs in a space and time overhead; for instance it has been reported (https://lwn.net/Articles/727553/) that the kernel’s .text size increases by about 3.2%, resulting in a broad kernel-wide slowdown.

and

    Measurements have shown a slowdown of 5-10% for some workloads (https://lore.kernel.org/lkml/20170602104048.jkkzssljsompjdwy@suse.de/T/#u).

haberman
0 replies
2h25m

But that one has at least some potential mitigation. Per his analysis, the Java/JIT case is the only one that has no mitigation:

Javier Honduvilla Coto (Polar Signals) did some interesting work using an eBPF walker to reduce the overhead, but...Java.

javierhonduco
0 replies
5h53m

There's always room for improvement, for example, Samply [0] is a wonderful profiler that uses the same APIs that `perf` uses, but unwinds the stacks as they come rather than dumping them all to disk and then having to process them in bulk.

Samply unwinds significantly faster than `perf` because it caches unwind information.

That being said, this approach still has some limitations, such as that very deep stacks won't be unwound, as the size of the process stack the kernel sends is quite limited.

- [0]: https://github.com/mstange/samply

claytonwramsey
8 replies
11h56m

That's very interesting to me - I had seen the `[unknown]` mountain in my profiles but never knew why. I think it's a tough thing to justify: 2% performance is actually a pretty big difference.

It would be really nice to have fine-grained control over frame pointer inclusion: provided fine-grained profiling, we could determine whether we needed the frame pointers for a given function or compilation unit. I wouldn't be surprised if we see that only a handful of operations are dramatically slowed by frame pointer inclusion while the rest don't really care.

naasking
3 replies
11h18m

2% performance is actually a pretty big difference.

No it's not, particularly when it can help you identify hotspots via profiling that can net you improvements of 10% or more.

pm215
2 replies
8h45m

Sure, but how many of the people running distro compiled code do perf analysis? And how many of the people who need to do perf analysis are unable to use a with-frame-pointers version when they need to? And how many of those 10% perf improvements are in common distro code that get upstreamed to improve general user experience, as opposed to being in private application code?

If you're netflix then "enable frame pointers" is a no-brainer. But if you're a distro who's building code for millions of users, many of whom will likely never need to fire up a profiler, I think the question is at least a little trickier. The overall best tradeoff might end up being still to enable frame pointers, but I can see the other side too.

samatman
0 replies
1h25m

I would say the question here is what should be the default, and that the answer is clearly "frame pointers", from my point of view.

Code eking out every possible cycle of performance can enable a no-frame-pointer optimization and see if it helps. But it's a bad default for libc, and for the kernel.

jart
0 replies
1h49m

It's not a technical tradeoff, it's a refusal to compromise. Lack of frame pointers prevents many groups from using software built by distros altogether. If a distro decides that they'd rather make things go 1% faster for grandma, at the cost of alienating thousands of engineers at places like Netflix and Google who simply want to volunteer millions of dollars of their employers' resources helping distros to find 10x performance improvements, then the distros are doing a great disservice to both grandma and themselves.

rwmj
0 replies
7h10m

You can turn it on/off per function by attaching one of these GCC attributes to the function declaration (although it doesn't work on LLVM):

  __attribute__((optimize("no-omit-frame-pointer")))
  __attribute__((optimize("omit-frame-pointer")))
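
As a rough sketch of how the per-function attribute might be applied: the macro name `KEEP_FRAME_POINTER` and the function `sum_ints` are made up for illustration. The GCC manual describes the `optimize` attribute as intended for debugging rather than production code, so this is probably best treated as an aid for experiments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper macro: expands to the GCC attribute, and to nothing
 * on compilers (such as Clang) that do not implement optimize(). */
#if defined(__GNUC__) && !defined(__clang__)
#define KEEP_FRAME_POINTER __attribute__((optimize("no-omit-frame-pointer")))
#else
#define KEEP_FRAME_POINTER
#endif

/* Under GCC this function is asked to keep its frame pointer even if the
 * rest of the file is built with -fomit-frame-pointer. */
KEEP_FRAME_POINTER
long sum_ints(const int *values, size_t count)
{
    long total = 0;

    assert(values != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

Comparing the disassembly of the function with and without the macro is one way to check whether the prologue actually changed.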

rwmj
0 replies
8h20m

The measured overhead is slightly less than 1%. There have been some rare historical cases where frame pointers have caused performance to blow up but those are fixed.

loeg
0 replies
11h13m

It’s usually a lot less than 2%.

inglor_cz
0 replies
8h48m

The performance cost in your case may be much smaller than 2 per cent.

Don't completely trust the benchmarks on this; they are a bit synthetic and real-world applications tend to produce very different results.

Plus, profiling is important. I was able to speed up various segments of my code by up to 20 per cent by profiling them carefully.

And, at the end of the day, if your application is so sensitive about any loss of performance, you can simply profile your code in your lab using frame pointers, then omit them in the version released to your customers.

5-
8 replies
10h35m

so what is the downside to using e.g. dwarf-based stack walking (supported by perf) for libc, which was the original stated problem?

in the discussion the issue gets conflated with jit-ted languages, but that has nothing to do with the crusade to enable frame pointer for system libraries.

and if you care that much for dwarf overhead... just cache the unwind information in your system-level profiler? no need to rebuild everything.

brancz
4 replies
9h56m

The way perf does it is slow, as the entire stack is copied into user-space and is then asynchronously unwound.

This is solvable, as Brendan calls out: we’ve created an eBPF-based profiler at Polar Signals that essentially does what you said. It optimizes the unwind tables, caches them in BPF maps, and then unwinds synchronously instead of copying the whole stack into user-space.

stefan_
1 replies
6h55m

This conveniently sidesteps the whole issue of getting DWARF data in the first place, which is also still a broken disjointed mess on Linux. Hell, Windows solved this many many years ago.

bregma
0 replies
3h54m

You'd need a pretty special distro to have enabled -fno-asynchronous-unwind-tables by default in its toolchain.

By default on most Linux distros the frame tables are built into all the binaries, and end up in the GNU_EH_FRAME segment, which is always available in any running process. Doesn't sound a broken and disjointed mess to me. Sounds more like a smoothly running solved problem.

Sesse__
1 replies
6h6m

It should also be said that you need some sort of DWARF-like information to understand inlining. If I have a function A that inlines B that in turn inlines C, I'd often like to understand that C takes a bunch of time, and with frame pointers only, that information gets lost.

javierhonduco
0 replies
5h57m

Inlined functions can be symbolized using DWARF line information[0], while unwinding requires DWARF unwind information (CFI), which the x86_64 ABI mandates in every single ELF in the `.eh_frame` section.

- [0] This line information might or might not be present in an executable but luckily there's debuginfod (https://sourceware.org/elfutils/Debuginfod.html)

yxhuvud
1 replies
10h0m

The article explains why DWARF is not an option.

menaerus
0 replies
9h11m

Extremely light on the details, and also conflates it with the JIT which makes it harder to understand the point, so I was wondering about the same thing as well.

weebull
3 replies
2h19m

Just as a general comment on this topic...

The fact that people complain about the performance of the mechanism that enables the system to be profiled, and so performance problems be identified, is beyond ironic. Surely the epitome of premature optimisation.

AtlasBarfed
1 replies
1h45m

So what are these other techniques the 2004 migration from frame pointers assumed would work for stack walking? Why don't they work today? I get that x86_64 has a lot more registers, so there's minimal value in gaining one more register?

loeg
0 replies
55m

In 2004, the assumption made by the GCC developers was that you would be walking stacks very infrequently, in a debugger like GDB. Not sampling stacks 1000s of times a second for profiling.

doubloon
0 replies
2h6m

im sure in ancient mesopotamia there was somebody arguing about you could brew beer faster if you stop measuring the hops so carefully but then someone else was saying yes but if you dont measure the hops carefully then you dont know the efficiency of your overall beer making process so you cant isolate the bottlenecks.

the funny thing is i am not sure if the world would actually work properly if we didn't have both of these kinds of people.

shaggie76
3 replies
5h53m

I thought we'd been using /Oy (Frame-Pointer Omission) for years on Windows and that there was a pdata section on x64 that was used for stack-walking however to my great surprise I just read on MSDN that "In x64 compilers, /Oy and /Oy- are not available."

Does this mean Microsoft decided they weren't going to support breaking profilers and debuggers OR is there some magic in the pdata section that makes it work even if you omit the frame-pointer?

quotemstr
0 replies
3h2m

Microsoft has had excellent universal unwinding support for decades now. I'm disappointed to see someone as prominent as this article's author present as infeasible what Microsoft has had working for so long.

musjleman
0 replies
1h20m

In x64 compilers

The default is omission. If you have a Windows machine, in all likelihood almost no 64 bit code running on it has frame pointers.

OR is there some magic in the pdata section that makes it work even if you omit the frame-pointer

You haven't ever needed frame pointers to unwind using ... unwind information. The same thing exists for linux as `.eh_frame` section.

sesm
3 replies
8h34m

glibc is only 2 MB, so why does Chrome rely on the system glibc instead of statically linking its own version with frame pointers enabled?

nolist_policy
1 replies
6h54m

At the very least Chrome needs to link to the system libGL.so and friends for gpu acceleration, libva.so for video acceleration, and so on. And these are linked against glibc of course.

dooglius
0 replies
5h31m

having/omitting frame pointers doesn't change the ABI; it will work if you compile against glibc-nofp and link against glibc-withfp

javierhonduco
3 replies
5h32m

Overall, I am for frame pointers, but after some years working in this space, I thought I would share some thoughts:

* Many frame pointer unwinders don't account for a problem they have that DWARF unwind info doesn't have: the frame set-up is not atomic. It's done in two instructions, `push %rbp` and `mov %rsp, %rbp`, and if a snapshot is taken while we are at the `push`, we'll miss the parent frame. I think this might be able to be fixed by inspecting the code, but it might only be as good as a heuristic, as there could be other `push %rbp` instructions unrelated to the stack frame. I would love to hear if there's a better approach!

* I developed the solution Brendan mentions which allows faster, in-kernel unwinding without frame pointers using BPF [0]. This doesn't use DWARF CFI (the unwind info) as-is but converts it into a random-access format that we can use in BPF. He mentions not supporting JVM languages, and while it's true that right now it only supports JIT sections that have frame pointers, I planned to implement a full JVM interpreter unwinder. I have left Polar Signals since and shifted priorities but it's feasible to get a JVM unwinder to work in lockstep with the native unwinder.

* In an ideal world, enabling frame pointers should be done on a case-by-case basis. Benchmarking is key, and the tradeoffs that you make might change a lot depending on the industry you are in, and what your software is doing. In the past I have seen large projects enabling/disabling frame pointers without doing an in-depth assessment of losses/gains of performance, observability, and how they connect to business metrics. The Fedora folks have done a superb and rigorous job here.

* Related to the previous point, having a build system that enables you to change this system-wide, including libraries your software depends on can be awesome to not only test these changes but also put them in production.

* Lastly, I am quite excited about SFrame that Indu is working on. It's going to solve a lot of the problems we are facing right now while letting users decide whether they use frame pointers. I can't wait for it, but I am afraid it might take several years until all the infrastructure is in place and everybody upgrades to it.

- [0]: https://web.archive.org/web/20231222054207/https://www.polar...
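
To make the non-atomic set-up problem from the first point concrete, here is a small sketch in C of a frame pointer walker over a simulated stack. Everything in it is an assumption for illustration: the word-indexed `mem` array, the made-up addresses, and the helper names `walk_frames` and `build_demo_stack`. It is not how any real unwinder reads memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Simulated stack: mem[] is indexed by word, and a higher index means a caller.
 * A frame record at index fp holds mem[fp] = caller's fp and
 * mem[fp + 1] = return address. fp == 0 marks the end of the chain.
 * Writes the sampled pc followed by return addresses into out[] and
 * returns how many entries were written.
 */
size_t walk_frames(const uint64_t *mem, size_t mem_len, uint64_t pc,
                   uint64_t fp, uint64_t *out, size_t max_out)
{
    size_t n = 0;

    assert(mem != NULL && out != NULL);
    if (max_out == 0)
        return 0;
    out[n++] = pc;                       /* innermost frame: the sampled PC */
    while (fp != 0 && mem_len >= 2 && fp <= mem_len - 2 && n < max_out) {
        uint64_t next = mem[fp];
        out[n++] = mem[fp + 1];          /* return address into the caller */
        if (next != 0 && next <= fp)     /* chain must move toward callers */
            break;
        fp = next;
    }
    return n;
}

/* Hypothetical call chain main -> A -> B -> C with made-up addresses. */
void build_demo_stack(uint64_t mem[16])
{
    memset(mem, 0, 16 * sizeof mem[0]);
    mem[12] = 0;  mem[13] = 0x1000;  /* A's record: returns into main */
    mem[8]  = 12; mem[9]  = 0x2000;  /* B's record: returns into A */
    mem[4]  = 8;  mem[5]  = 0x3000;  /* C's record: returns into B */
}
```

Sampling inside the body of C (pc 0x4000, fp 4) yields 0x4000, 0x3000, 0x2000, 0x1000. Sampling in C's prologue, after the `push` but before the `mov`, means the frame pointer register still holds B's value (fp 8), so the walk yields 0x4000, 0x2000, 0x1000 and the entry for B is lost.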

rwmj
0 replies
5h4m

On the third point, you have to do frame pointers across the whole Linux distro in order to be able to get good flamegraphs. You have to do whole system analysis to really understand what's going on. The way that current binary Linux distros (like Fedora and Debian) work makes any alternative impossible.

felixge
0 replies
4h57m

Great comments, thanks for sharing. The non-atomic frame setup is indeed problematic for CPU profilers, but it's not an issue for allocation profiling, Off-CPU profiling or other types of non-interrupt driven profiling. But as you mentioned, there might be ways to solve that problem.

brancz
0 replies
1h53m

Great comment! Just want to add we are making good progress on the JVM unwinder!

ReleaseCandidat
3 replies
9h46m

That's one thing Apple did do right on ARM:

The frame pointer register (x29) must always address a valid frame record. Some functions — such as leaf functions or tail calls — may opt not to create an entry in this list. As a result, stack traces are always meaningful, even without debug information.

https://developer.apple.com/documentation/xcode/writing-arm6...

microtherion
2 replies
7h25m

On Apple platforms, there is often an interpretability problem of another kind: Because of the prevalence of deeply nested blocks / closures, backtraces for Objective C / Swift apps are often spread across numerous threads. I don't know of a good solution for that yet.

felixge
1 replies
6h22m

I'm not very familiar with Objective C and Swift, so this might not make sense. But JS used to have a similar problem with async/await. The v8 engine solved it by walking the chain of JS promises to recover the "logical stack" developers are interested in [1].

[1] https://v8.dev/blog/fast-async

astrange
0 replies
3h19m

Swift concurrency does a similar thing. For the older dispatch blocks, Xcode injects a library that records backtraces over thread hops.

tkiolp4
2 replies
7h29m

Are his books (the one about Systems Performance and eBPF) relevant for normal software engineers who want to improve performance in normal services? I don’t work for faang, and our usual performance issues are solved by adding indexes here and there, caching, and simple code analysis. Tools like Datadog help a lot already.

wavemode
0 replies
2h1m

Diving into flame graphs is only worthwhile for optimization if your workload is CPU-bound. Most business software does not have such workloads, and instead (as you yourself have noted) spends most of its time waiting for I/O (database, network, filesystem, etc).

And so, (as you again have noted), your best bet is to just use plain old logging and tracing (like what datadog provides) to find out where the waiting is happening.

polio
0 replies
1h51m

Profiling is a pretty basic technique that is applicable to all software engineering. I'm not sure what a "normal" service is here, but I think we all have an obligation to understand what's happening in the systems we own.

Some people may believe that 100ms latency is acceptable for a CLI tool, but what if it could be 3ms? On some aesthetic level, it also feels good to be able to eliminate excess. Finally, you should learn it because you won't necessarily have that job forever.

eqvinox
2 replies
9h29m

This doesn't detract from the content at all but the register counts are off; SI and DI count as GPRs on i686 bringing it to 6+BP (not 4+BP) meanwhile x86_64 has 14+BP (not 16+BP).

cesarb
1 replies
6h7m

[...] on i686 bringing it to 6+BP (not 4+BP) meanwhile x86_64 has 14+BP (not 16+BP).

That is, on i686 you have 7 GPRs without frame pointers, while on x86_64 you have 14 GPRs even with frame pointers.

Copying a comment of mine from an older related discussion (https://news.ycombinator.com/item?id=38632848):

"To emphasize this point: on 64-bit x86 with frame pointers, you have twice as many registers as on 32-bit x86 without frame pointers, and these registers are twice as wide. A 64-bit value (more common than you'd expect even when pointers are 32 bits) takes two registers on 32-bit x86, but only a single register on 64-bit x86."

brendangregg
0 replies
5h36m

Thanks!

codeflo
2 replies
1h48m

All of this information is static, there's no need to sacrifice a whole CPU register only to store data that's already known. A simple lookup data structure that maps an instruction address range to the stack offset of the return address should be enough to recover the stack layout. On Windows, you'd precompute that from PDB files, I'm sure you can do the same thing with whatever the equivalent debug data structure is on Linux.

loeg
0 replies
48m

It isn't entirely static because of alloca().

fsmv
0 replies
1h46m

[deleted]

rwmj
1 replies
7h13m

That was in 2012. Does it still occur on modern GCC?

There definitely have been regressions with frame pointers being enabled, although we've fixed all the ones we've found in current (2024) Fedora.

jart
0 replies
2h42m

I think so and I vaguely seem to recall -fno-schedule-insns2 being the only thing that fixes it. To get the full power of frame pointers and hackable binary, what I use is:

    -fno-schedule-insns2
    -fno-omit-frame-pointer
    -fno-optimize-sibling-calls
    -mno-omit-leaf-frame-pointer
    -fpatchable-function-entry=18,16
    -fno-inline-functions-called-once

The only flag that's potentially problematic is -fno-optimize-sibling-calls since it breaks the optimal approach to writing interpreters and slows down code that's written in a more mathematical style.

ngcc_hk
1 replies
13h15m

It said GCC. I noted that LLVM is said to have defaulted to keeping frame pointers since 2011. Is this mainly a GCC issue?

bawolff
0 replies
11h12m

It doesn't really matter what the default of the compiler is, but what distros chose.

cesarb
1 replies
6h24m

I disagree with this sentence of the article:

"I could say that times have changed and now the original 2004 reasons for omitting frame pointers are no longer valid in 2024."

The original 2004 reason for omitting frame pointers is still valid in 2024: it's still a big performance win on the register-starved 32-bit x86 architecture. What has changed is that the 32-bit x86 architecture is much less relevant nowadays (other than legacy software, for most people it's only used for a small instant while starting up the firmware), and other common 32-bit architectures (like embedded 32-bit ARM) are not as register-starved as the 32-bit x86.

IshKebab
0 replies
4h52m

That's exactly what they were saying. You're not disagreeing at all.

zzbn00
0 replies
8h24m

Nix (and I assume Guix) is very convenient for this, as it is fairly easy to turn frame pointers on or off for parts or the whole of the system.

tzot
0 replies
6h46m

I am not sure, but I believe -fomit-frame-pointer in x86-64 allows the compiler to use a _thirteenth_ register, not a _seventeenth_.

titzer
0 replies
2h10m

Virgil doesn't use frame pointers. If you don't have dynamic stack allocation, the frame of a given function has a fixed size and can be found with a simple (binary-search) table lookup. Virgil's technique uses an additional page-indexed range that further restricts the lookup to a few comparisons on average (O(log(# retpoints per page))). It combines the unwind info with stackmaps for GC. It takes very little space.

The main driver is in (https://github.com/titzer/virgil/blob/master/rt/native/Nativ...); the rest of the code in the directory implements the decoding of metadata.

I think frame pointers only make sense if frames are dynamically-sized (i.e. have stack allocation of data). Otherwise it seems weird to me that a dynamic mechanism is used when a static mechanism would suffice; mostly because no one agreed on an ABI for the metadata encoding, or an unwind routine.

I believe the 1-2% measurement number. That's in the same ballpark as pervasive array bounds checks. It's weird that the odd debugging and profiling task gets special pleading for a 1% cost but adding a layer of security gets the finger. Very bizarre priorities.

secondcoming
0 replies
2h44m

Can they not be disabled on a per-function basis?

mikewarot
0 replies
1h34m

I started programming in 1979, and I can't believe I've managed to avoid learning about stack frames and all those EBP register tricks until now. I always had parameters to functions in registers, not on the stack, for the most part. The compiler hid a lot of things from me.

Is it because I avoided Linux and C most of my life? Perhaps it's because I used debug, and Periscope before that... and never gdb?

mgaunard
0 replies
7h19m

You don't need frame pointers, all the relevant info is stored in dwarf debug data.

dsign
0 replies
8h32m

I remember when the omission of stack frame pointers started spreading at the beginning of the 2000s. I was in college at the time, studying computer science in a very poor third-world country. Our computers were old and far from powerful. So, for most course projects, we would eschew interpreters and use compilers. Mind you, what my college lacked in money it compensated for with interesting course work. We studied and implemented low level data-structures, compilers, assembly-code numerical routines and even a device driver for Minix.

During my first two years in college, if one of our programs did something funny, I would attach gdb and see what was happening at assembly level. I got used to "walking the stack" manually, though the debugger often helped a lot. Happy times, until all of a sudden, "-fomit-frame-pointer" was all the rage, and stack traces stopped making sense. Just like that, debugging that segfault or illegal instruction became exponentially harder. A short time later, I started using Python for almost everything to avoid broken debugging sessions. So, I lost an order of magnitude or two with "-fomit-frame-pointer". But learning Python served me well for other adventures.

dap
0 replies
13h30m

Good post!

Profiling has been broken for 20 years and we've only now just fixed it.

It was a shame when they went away. Lots of people, certainly on other systems and probably Linux too, have found the absence of frame pointers painful this whole time and tried to keep them available in as many environments as possible. It’s validating (if also kind of frustrating) to see mainstream Linux bring them back.

boulos
0 replies
49m

JIT'ed code is sadly poorly supported, but LLVM has had great hooks for noting each method that is produced and its address. So you can build a simple mixed-mode unwinder, pretty easily, but mostly in process.

I think Intel's DNN things dump their info out to some common file that perf can read instead, but because the *kernels* themselves reuse rbp throughout oneDNN, it's totally useless.

Finally, can any JVM folks explain this claim about DWARF info from the article:

Doesn't exist for JIT'd runtimes like the Java JVM

that just sounds surprising to me. Is it off by default or literally not available? (Google searches have mostly pointed to people wanting to include the JNI/C side of a JVM stack, like https://github.com/async-profiler/async-profiler/issues/215).

benreesman
0 replies
9h26m

Brendan is such a treasure to the community (buy his book it’s great).

I wasn’t doing extreme performance stuff when -fomit-frame-pointer became the norm, so maybe it was a big win for enough people to be a sane default, but even that seems dubious: “just works” profiling is how you figure out when you’re in an extreme performance scenario (if you’re an SG14 WG type, you know it and are used to all the defaults being wrong for you).

I’m deeply grateful for all the legends who have worked on libunwind, gperf stuff, perftool, DTrace, eBPF: these are the too-often-unsung heroes of software that is still fast after decades of Moore’s law free-riding.

But they’ve been fighting an uphill battle against a weird alliance of people trying to game compiler benchmarks and the really irresponsible posture that “developer time is more expensive”, which is only sometimes true and never true if you care about people on low-spec gear, who are already the least-resourced part of the global community.

I’m fortunate enough to have a fairly modern desktop, laptop, and phone: for me it’s merely annoying that chat applications and music players and windowing systems offer nothing new except enshittification in terms of features while needing 10-100x the resources they did a decade ago.

But for half of my career and 2/3rds of my time coding, I was on low-spec gear most of the time, and I would have been largely excluded if people didn’t care a lot about old computers back then.

I’m trying to help a couple of aspiring hackers get started right now, and it’s a real struggle to get their environments set up with limitations like Intel Macs and WSL2 as the Linux option (WSL2 is very cool but it’s not loved up enough by e.g. yarn projects).

If you want new hackers, you need to make things work well on older computers.

Thanks again Brendan et al!

WalterBright
0 replies
12h52m

Guess I'll add it back in to the DMD code generator!

Cold_Miserable
0 replies
2h15m

Not interesting. Enter/leave also does the same thing as your save/restore rbp.

Far more interesting I recall there might be an instruction where rbp isn't allowed.