All my favorite tracing tools

A pretty good overview of open source solutions in the space.

It misses one of the most useful areas for tracing, though: time travel debugging. There are a number of interesting solutions there that take advantage of hardware trace, instrumentation, and deterministic replay. It gets even better with full visualization integration: you can zoom in from a multi-minute trace onto a suspicious 200 ns function, double-click it, and step back to that exact point in your program with memory fully reconstructed as of that moment, so you can debug from there.

Is there a time traveling debugging solution for Java?

Not that I am aware of. They phase in and out of existence every so often because developing the technology is expensive and requires constant maintenance, but nobody wants to pay for tools so they never catch on with enough resources to stay maintained.

As byefruit says above - we (undo.io) sell a Java Time Travel Debugger.

If anybody wants to try it, they should get in touch with us.

Our Java tech is based on an underlying record/replay engine that works at the level of machine instructions / syscalls to record the entire process. On top of that we've added the necessary cleverness to show what that means at Java level (so normal source-level debugging works).

That's different to e.g. Chronon, which I think was a pure Java solution: https://blog.jetbrains.com/idea/2014/03/try-chronon-debugger... It had some flexibility (e.g. only record certain classes) but at the cost of quite considerable slowdown and very large storage requirements.
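To make the record/replay idea above concrete, here is a toy sketch in Python. It is only an illustration of the principle (log every non-deterministic input, then feed the log back), not how Undo's engine works: the real thing operates on machine instructions and syscalls, whereas the `Recorder`, `Replayer`, and `program` names here are hypothetical and work at the function-call level.

```python
import random
import time


class Recorder:
    """Runs the program live and logs every non-deterministic result."""

    def __init__(self):
        self.log = []

    def nondet(self, fn):
        value = fn()
        self.log.append(value)
        return value


class Replayer:
    """Feeds logged results back instead of calling the real source."""

    def __init__(self, log):
        self._events = iter(log)

    def nondet(self, fn):
        # fn is never called during replay; the recorded value is returned.
        return next(self._events)


def program(env):
    # All non-determinism goes through env.nondet; the rest is pure
    # computation, so replaying the log reproduces the run exactly.
    start = env.nondet(time.time)
    rolls = [env.nondet(lambda: random.randint(1, 6)) for _ in range(3)]
    return start, rolls, sum(rolls)


recorder = Recorder()
live_result = program(recorder)
replayed_result = program(Replayer(recorder.log))
print(live_result == replayed_result)
```

Only the four non-deterministic values are stored; everything else is recomputed on replay, which is why a recording can stay small relative to the amount of program state it can reconstruct.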

This would be heavily tied to the JVM you’re using, no? Do you have to keep updating this as it evolves?

The short answer is yes - but not as tightly as you'd think. We don't need a deep awareness of what the JVM is doing, e.g. its internal data structures are largely opaque to us.

When we need to reconstruct state we always have the option of time travelling the process and re-executing to drill down on the details, though that's only required when you're replaying a recording.

(the result is that it's quite feasible to update to new JVMs and to support multiple at once)

Hmm, so what do you do to answer questions like "what code corresponds to this address" or "what object is this allocation"? Run the recording, ask the JVM itself using its introspection interfaces in your replay by forking it?

At lower optimisation levels there's a register allocated by the JVM to refer back to the bytecode, which makes things easy. In principle they could change that with a JVM revision - but in practice they don't, so it's an easy cheat.

We have some ability to walk data structures and then re-compute the program's behaviour by other means, which I probably shouldn't get into here. I think we could fall back on that more-or-less completely if we couldn't retrieve the bytecode pointer directly.

The fact the JVM introduces Safe Points to help it transition between optimisation levels is quite helpful!

Our original intention was to always fork a copy of the JVM back in time to handle Java debug protocol requests but that turned out to be painful and, thankfully, also unnecessary.

Ah, did not realize you all at Undo did a Java implementation as well. I knew about Chronon which was probably the most well-known Java solution (as much as that means) during that spate of new time travel debuggers at the time, but when I looked it up again for my comment it appeared to be defunct after being largely unmaintained for years.

I think undo have one: https://undo.io/products/java

Do you know of anyone who's built that kind of time travel debugging with a trace visualization in the open outside of JavaScript? I know about rr and Pernosco but don't know of trace visualization integration for either of them; that would indeed be very cool. I definitely dream of having systems like this.

At undo.io we're interested in using our time travel capability beyond conventional time travel debugging - a recording file contains everything the program did, without any advance knowledge of what you need to sample, so there's a lot of potential to get other data out of it.

I just read your post and don't think it would take much to integrate with some of the visualisations you posted about, as a first step.

We've played around in the past with a sampling profiler (code here, requires a copy of our product to be useful though it could easily be ported to rr): https://github.com/undoio/addons/tree/master/sample_function... which can output in a format understood by Brendan Gregg's flame graphs (https://www.brendangregg.com/flamegraphs.html)
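For readers who haven't seen it, the input that the flame graph scripts consume is a "folded" text format: one line per unique stack, frames joined by semicolons from root to leaf, followed by a sample count. A minimal sketch of producing it from sampled stacks follows; the `fold_stacks` helper and the sample data are made up for illustration and are not taken from the addon linked above.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled call stacks (root first) into folded-stack lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


# Hypothetical samples: each entry is one sampled stack, outermost frame first.
samples = [
    ["main", "parse", "read_token"],
    ["main", "parse", "read_token"],
    ["main", "render"],
]

folded = fold_stacks(samples)
print("\n".join(folded))
```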

But that's not quite the kind of tracing you're talking about. We also built a printf-style interface to our recording files, which seems closer: https://docs.undo.io/PostFailureLogging.html

Something like that but outputting trace events that can be consumed by Perfetto (say) would not be so hard to add. If we considered modifying the core record/replay engine then even more powerful things become possible.
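As a rough idea of what "trace events that can be consumed by Perfetto" could look like, the sketch below writes the JSON Trace Event format, which the Perfetto UI can open. The function names, timings, and the `complete_event` and `to_trace_json` helpers are hypothetical; a real integration would pull call boundaries out of the recording instead.

```python
import json


def complete_event(name, start_us, duration_us, pid=1, tid=1):
    """Build one 'complete' (ph == 'X') event; times are in microseconds."""
    return {
        "name": name,
        "ph": "X",
        "ts": start_us,
        "dur": duration_us,
        "pid": pid,
        "tid": tid,
    }


def to_trace_json(calls):
    """Serialize (name, start_us, duration_us) tuples as a trace file body."""
    events = [complete_event(name, start, dur) for name, start, dur in calls]
    return json.dumps({"traceEvents": events}, indent=2)


# Hypothetical call timings, as might be extracted while replaying a recording.
calls = [("handle_request", 0, 1500), ("parse_headers", 100, 200)]
trace_json = to_trace_json(calls)
print(trace_json)
```

Saving the output to a `.json` file and loading it in the Perfetto UI should show `parse_headers` nested under `handle_request` on a single track.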

I've seen undo.io several times at cppcon. I've been thoroughly impressed with the demonstrations at the conference and came to this thread specifically to recommend undo.io. I was particularly impressed this year by a demonstration of debugging stack smashing. I recently had to work around stack smashing in protobuf that happens before `main()` even starts, which seems like a perfect case for undo.io to help debug :)

I'm still waiting on the keyserver to be able to run in Kubernetes though

> I was particularly impressed this year by a demonstration of debugging stack smashing

I'm glad you liked it - and that's useful feedback for other demos we give!

> I'm still waiting on the keyserver to be able to run in Kubernetes though

I believe support for that is on the way. If you're already in touch then I imagine you'll get an announcement soon.

Tomorrow Corp does something like this in a variant of C++. But I am not sure it’s very open.

https://youtu.be/72y2EC5fkcE

Green Hills Software TimeMachine + History for C and C++: https://www.ghs.com/products/MULTI_IDE.html

No particularly good publicly visible documentation of the functionality, but it does that and is a publicly purchasable product.

They also had TimeMachine + PathAnalyzer from the early 2000s which was a time travel debug with visualization solution, but they were only about as integrated as most of the solutions you see floating around today.

how long can you time-travel?

is this something like https://www.reddit.com/r/ruby/comments/15o9hc1/timetraveling... ?

Conceptually similar in that you can decide after-the-fact what state you want to see.

But Time Travel Debugging applies that to everything in the program, not just log statements - all function calls, variables, memory locations, etc can be reconstructed after the fact without having to log them explicitly.

Oh, and regarding how long - it depends how long it takes to fill the circular buffer of non-deterministic behaviour.

Serious compute-bound workloads can run for days with a gigabyte of non-deterministic event log. Serious IO-bound workloads burn through it much faster.

For a rule of thumb, think of it consuming a few MB per second, so the length of the time travel is limited by how much of that you can store.
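A quick back-of-the-envelope version of that rule of thumb, as a sketch. The 1 GiB buffer and the 3 MiB/s rate are assumed numbers picked to stand in for "a gigabyte" and "a few MB per second"; real rates depend entirely on the workload.

```python
def replay_window_seconds(buffer_bytes, log_bytes_per_second):
    """How much history fits before the circular event log wraps around."""
    return buffer_bytes / log_bytes_per_second


GIB = 1024 ** 3
MIB = 1024 ** 2

# Assumed figures: a 1 GiB event log filling at 3 MiB/s.
window = replay_window_seconds(1 * GIB, 3 * MIB)
print(f"{window:.0f} s, about {window / 60:.1f} minutes of history")
```

With those assumptions the window is a little under six minutes; a compute-bound program that only logs a few KB per second would stretch the same buffer to days.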

The author mentions dtrace in passing. If you're into "load bearing rants", check out bcantrill's recent rant on bpftrace silently losing events and why dtrace won't do that.

I haven't actually used bpftrace myself, only BCC. I can totally imagine it being more janky than DTrace; BCC is pretty janky even if I also think it's cool. In my eBPF tracing framework I had to add special handling counters to alert you if it ever lost any events, and it's plausible bpftrace didn't do that.

I think if you're working mostly with tracing/sampling specific applications you'll be more of a BCC person, while if you're hired to diagnose problems in a wide variety of applications then you might learn to like bpftrace more.
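The lost-event counter idea mentioned above is simple to sketch: when the buffer between producer and consumer is full, count the drop instead of discarding silently, and report the count when draining. This is a plain Python simulation with a hypothetical `EventBuffer` class, not code from that framework, BCC, or bpftrace.

```python
from collections import deque


class EventBuffer:
    """Bounded event queue that counts what it has to drop."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.dropped = 0

    def push(self, event):
        if len(self.events) >= self.capacity:
            self.dropped += 1  # never drop silently
            return False
        self.events.append(event)
        return True

    def drain(self):
        """Return buffered events plus the number lost since the last drain."""
        drained = list(self.events)
        self.events.clear()
        lost, self.dropped = self.dropped, 0
        return drained, lost


buf = EventBuffer(capacity=4)
for i in range(10):  # producer outpaces the consumer
    buf.push(i)

events, lost = buf.drain()
if lost:
    print(f"warning: lost {lost} events; results are incomplete")
```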

DTrace is a generation behind eBPF. There's a reason why the tracing community has moved on to eBPF and is no longer interested in DTrace.

That's an absurd comment: eBPF and DTrace exist on orthogonal systems, and most using eBPF have never even used DTrace, let alone "moved on" from it. The systems are really quite different, and have different design centers; for the use case of instrumenting the system for purposes of understanding it, there are many regards in which eBPF remains behind DTrace -- one of which I elaborated on in the rant to which the parent is referring.[0]

[0] https://www.youtube.com/watch?v=mqvVmYhclAg#t=12m25s

The "you can feel like lights flickering on" one?

What kind of events were being lost, and under what conditions? I'd like to see if it can be fixed.

If you work on Windows applications, check out Event Tracing for Windows (ETW). The best place to start is Bruce Dawson’s blog:

https://randomascii.wordpress.com/2015/09/24/etw-central/

Isn't ETW a total trainwreck from a developer usability standpoint? Or so my colleagues (and the interwebs) tell me.

I don’t think so. The ability to keep drilling down deeper and deeper and the amazing sort and grouping functionality make it supremely useful.

The other tool I didn’t mention is WinDbg. In my opinion, it’s the greatest debugger on any platform.

Finally found it: https://caseymuratori.com/blog_0025

Ah, I see. When you said developer usability I was thinking of the Windows Performance Analyzer UI, not the Windows Events API.

That’s a great blog post. Thanks for sharing it.

In my opinion, the best way to interact with ETW is through DTrace. Microsoft’s GUIs like WPA-Xperf are so buggy and unreliable that using them feels utterly futile. DTrace on Windows on the other hand is very usable.

If you're working with ETW traces, SuperLuminal [0] (no affiliation just a happy customer) is leaps and bounds ahead of the built-in ETW viewer.

[0] https://superluminal.eu/

Some great tools in here, thanks!

My team and I have been working on building an IDE plugin to add the powers of a traditional debugger to your apps running in production - without the overheads and redeployments associated with a traditional debugger.

People use it to analyze arbitrary variables during runtime to understand what is happening in their code. We charge $0 for it.

You can get started here: https://docs.ctrlb.ai/getting-started-in-2-minutes

I wanted to correlate packets with userspace events from a Python program, so I used a fun trick: Find a syscall which has an early-exit error path and bindings in most languages, and then trace calls to that which have specific arguments which produce an error.

Wow. This is some great engineering. Obviously that's what you'd do, but I'd never think of it in a thousand years!
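A minimal sketch of the trick quoted above, under my own assumptions: the quote doesn't say which syscall was used, so this picks `lseek` on file descriptor -1, which fails immediately with EBADF while still carrying an arbitrary integer in the offset argument. The `emit_marker` name is hypothetical.

```python
import errno
import os


def emit_marker(marker_id):
    """Issue a cheap, always-failing syscall that a tracer can watch for.

    Assumption: lseek(-1, marker_id, SEEK_SET) is used as the marker.
    fd -1 is never valid, so the kernel returns EBADF straight away and
    the call has no side effects.
    """
    try:
        os.lseek(-1, marker_id, os.SEEK_SET)
    except OSError as exc:
        return exc.errno
    return None


result = emit_marker(42)
print(result == errno.EBADF)
```

On the tracing side you would filter for `lseek` calls whose fd argument is -1 and read the marker out of the offset, which gives a kernel timestamp for each userspace event to line up against packet captures.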

I wish the industry had a better answer for deterministically profiling the execution cost of JavaScript. Attempts were made in Chromium by hooking into Linux perf, but that change has since been removed.

If anyone has any tips on how to trace JavaScript (not just profile by time, but deterministically measure the cost of it in CI), I'd love to hear tips!

I wrote Spall, one of the lightweight profilers mentioned in the post. I loved the author's blogpost on implicit in-order forests, it was neat to see someone else's take on trees for big traces, pushed me to go way bigger than I was originally planning!

Thankfully, Eytzinger-ordered 4-ary trees work totally fine at 165+ fps, even at 3+ billion functions, but I like to read back through that post once in a while just in case I hit that perf wall someday.
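For anyone unfamiliar with the term: an Eytzinger (breadth-first) layout stores a tree in a flat array with no pointers, so the children of node `i` in a 4-ary tree sit at `4*i + 1` through `4*i + 4`. The sketch below builds such a tree holding the maximum duration under each node, which is one plausible aggregate for deciding what to draw when zoomed out. It is an illustration with a hypothetical `build_max_tree` helper, not Spall's actual data structure.

```python
def build_max_tree(durations, arity=4):
    """Build an implicit complete tree in breadth-first (Eytzinger) order.

    Returns (tree, first_leaf). Node i has children arity*i + 1 through
    arity*i + arity. Leaves are padded with 0 up to a power of `arity`.
    """
    n_leaves = 1
    while n_leaves < len(durations):
        n_leaves *= arity
    n_internal = (n_leaves - 1) // (arity - 1)

    tree = [0] * (n_internal + n_leaves)
    tree[n_internal:n_internal + len(durations)] = durations

    # Fill internal nodes bottom-up: each holds the max of its children.
    for i in range(n_internal - 1, -1, -1):
        first_child = arity * i + 1
        tree[i] = max(tree[first_child:first_child + arity])
    return tree, n_internal


# Hypothetical event durations in trace order.
tree, first_leaf = build_max_tree([3, 1, 4, 1, 5, 9, 2, 6])
print(tree[0], tree[1:5], first_leaf)
```

Because parent and child positions are pure index arithmetic, a query over a time range touches a handful of contiguous array slots per level, which is friendly to caches at very large event counts.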

Working on timestamp delta-compression at the moment to pack events into much smaller spaces, and hopefully get to 10 billion in 128 GB RAM sometime soon (at least for native builds of Spall).
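A generic sketch of timestamp delta-compression, for illustration: store each timestamp as the difference from the previous one, then encode the differences as variable-length integers so small gaps take a single byte. This is a common textbook scheme with hypothetical function names and is not Spall's on-disk or in-memory format. It assumes timestamps are non-negative and non-decreasing.

```python
def encode_varint(value):
    """LEB128-style encoding: 7 payload bits per byte, high bit = 'more'."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def compress_timestamps(timestamps):
    """Delta-encode non-decreasing timestamps, then varint-pack the deltas."""
    out = bytearray()
    previous = 0
    for ts in timestamps:
        out += encode_varint(ts - previous)
        previous = ts
    return bytes(out)


def decompress_timestamps(data):
    """Inverse of compress_timestamps."""
    timestamps, current, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += value
            timestamps.append(current)
            value, shift = 0, 0
    return timestamps


# Hypothetical nanosecond timestamps of four consecutive events.
timestamps = [1_000_000_000, 1_000_000_120, 1_000_000_450, 1_000_002_000]
packed = compress_timestamps(timestamps)
print(len(packed), "bytes packed vs", 8 * len(timestamps), "bytes raw")
```

Here four 8-byte timestamps shrink to 10 bytes, since only the first value is large and the gaps between neighbouring events are small.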

Thanks for the kick to keep on pushing!

What a great way to recruit! The ending pitch to join Tristan at Anthropic, if I were competent enough in this area, is very alluring! Tristan does a great job covering the content about the types of things one would be working on.

p.s. I think the blog post could use more screengrabs of the traces. Great first pass at it though, and screengrabs can be added over time!