A pretty good overview of open source solutions in the space.
It misses one of the most useful areas for tracing, though: time travel debugging. There are a number of interesting solutions there that take advantage of hardware trace, instrumentation, and deterministic replay. It gets even better with full visualization integration: you can zoom in from a multi-minute trace onto a suspicious 200 ns function, double-click it, and step back to that exact point in your program with memory fully reconstructed as of that time, so you can debug from there.
Is there a time traveling debugging solution for Java?
Not that I am aware of. They phase in and out of existence every so often because developing the technology is expensive and requires constant maintenance, but nobody wants to pay for tools so they never catch on with enough resources to stay maintained.
As byefruit says above - we (undo.io) sell a Java Time Travel Debugger.
If anybody wants to try it, they should get in touch with us.
Our Java tech is based on an underlying record/replay engine that works at the level of machine instructions / syscalls to record the entire process. On top of that we've added the necessary cleverness to show what that means at Java level (so normal source-level debugging works).
That's different to e.g. Chronon, which I think was a pure Java solution: https://blog.jetbrains.com/idea/2014/03/try-chronon-debugger... It had some flexibility (e.g. only record certain classes) but at the cost of quite considerable slowdown and very large storage requirements.
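To make the record/replay idea concrete, here is a toy sketch in Python. It is only an illustration of the principle (log the results of non-deterministic operations, then feed them back on replay) and not how our engine is implemented, which works at the machine instruction / syscall level. The names Recorder, Replayer and program are made up for this example.

```python
import random
import time


class Recorder:
    """Runs live and logs the result of every non-deterministic call."""

    def __init__(self):
        self.log = []

    def call(self, fn, *args):
        result = fn(*args)
        self.log.append(result)
        return result


class Replayer:
    """Returns the logged results in order instead of calling the real function."""

    def __init__(self, log):
        self._events = iter(log)

    def call(self, fn, *args):
        return next(self._events)


def program(env):
    # All non-determinism is routed through env.call; everything else is
    # deterministic, so replaying the log reproduces the same execution.
    started = env.call(time.time)
    rolls = [env.call(random.randint, 1, 6) for _ in range(3)]
    return started, sum(rolls)


recorder = Recorder()
live_result = program(recorder)
replayed_result = program(Replayer(recorder.log))
```

The replayed run never touches the clock or the random number generator, yet it computes the same result as the live run. Any intermediate state can then be recovered by re-executing rather than by storing it.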
This would be heavily tied to the JVM you’re using, no? Do you have to keep updating this as it evolves?
The short answer is yes - but not as tightly as you'd think. We don't need a deep awareness of what the JVM is doing, e.g. its internal data structures are largely opaque to us.
When we need to reconstruct state we always have the option of time travelling the process and re-executing to drill down on the details, though that's only required when you're replaying a recording.
(the result is that it's quite feasible to update to new JVMs and to support multiple at once)
Hmm, so what do you do to answer questions like "what code corresponds to this address" or "what object is this allocation"? Run the recording, ask the JVM itself using its introspection interfaces in your replay by forking it?
At lower optimisation levels there's a register allocated by the JVM to refer back to the bytecode, which makes things easy. In principle they could change that with a JVM revision - but in practice they don't, so it's an easy cheat.
We have some ability to walk data structures and then re-compute the program's behaviour by other means, which I probably shouldn't get into here. I think we could fall back on that more-or-less completely if we couldn't retrieve the bytecode pointer directly.
The fact the JVM introduces Safe Points to help it transition between optimisation levels is quite helpful!
Our original intention was to always fork a copy of the JVM back in time to handle Java debug protocol requests but that turned out to be painful and, thankfully, also unnecessary.
Ah, did not realize you all at Undo did a Java implementation as well. I knew about Chronon, which was probably the most well-known Java solution (as much as that means) during that spate of new time travel debuggers, but when I looked it up again for my comment it appeared to be defunct after being largely unmaintained for years.
I think undo have one: https://undo.io/products/java
Do you know of anyone who's built that kind of time travel debugging with a trace visualization in the open outside of JavaScript? I know about rr and Pernosco but don't know of trace visualization integration for either of them. That would indeed be very cool. I definitely dream of having systems like this.
At undo.io we're interested in using our time travel capability beyond conventional time travel debugging - a recording file contains everything the program did, without any advance knowledge of what you need to sample, so there's a lot of potential to get other data out of it.
I just read your post and don't think it would take much to integrate with some of the visualisations you posted about, as a first step.
We've played around in the past with a sampling profiler (code here, requires a copy of our product to be useful, though it could easily be ported to rr): https://github.com/undoio/addons/tree/master/sample_function... which can output in a format understood by Brendan Gregg's flame graphs (https://www.brendangregg.com/flamegraphs.html).
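For anyone unfamiliar with that format: the flamegraph.pl script takes "folded" stacks, one line per unique stack with frames joined by semicolons and followed by a sample count. A minimal sketch of producing it follows. The sample data and the fold_stacks helper are hypothetical, not taken from the addon linked above.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled call stacks into 'root;...;leaf count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


# Hypothetical samples, each listed from outermost frame to innermost.
samples = [
    ["main", "parse", "read_token"],
    ["main", "parse", "read_token"],
    ["main", "render"],
]

folded = fold_stacks(samples)
print("\n".join(folded))
# main;parse;read_token 2
# main;render 1
```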
But that's not quite the kind of tracing you're talking about. We also built a printf-style interface to our recording files, which seems closer: https://docs.undo.io/PostFailureLogging.html
Something like that but outputting trace events that can be consumed by Perfetto (say) would not be so hard to add. If we considered modifying the core record/replay engine then even more powerful things become possible.
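As a rough idea of what that could look like: Perfetto's UI can open the Chrome JSON trace event format, so a list of function calls with start and end timestamps is enough to get a timeline. This is only a sketch with invented call data. Nothing here is an existing Undo interface, and to_trace_events is a hypothetical helper.

```python
import json


def to_trace_events(calls, pid=1, tid=1):
    """Convert (name, start_us, end_us) tuples into 'complete' (ph="X") events."""
    return {
        "traceEvents": [
            {
                "name": name,
                "ph": "X",               # complete event: start plus duration
                "ts": start_us,          # timestamps are in microseconds
                "dur": end_us - start_us,
                "pid": pid,
                "tid": tid,
            }
            for name, start_us, end_us in calls
        ]
    }


# Hypothetical call intervals, as might be extracted from a recording.
calls = [("main", 0, 1000), ("parse", 100, 400), ("render", 450, 900)]

trace = to_trace_events(calls)
trace_json = json.dumps(trace, indent=2)
```

Writing trace_json to a file and loading it in the Perfetto UI should show the three calls as nested slices on one thread track.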
I've seen undo.io several times at cppcon. I've been thoroughly impressed with the demonstrations at the conference and came to this thread specifically to recommend undo.io. I was particularly impressed this year by a demonstration of debugging stack smashing. I recently had to work around stack smashing in protobuf which happens before `main()` even starts. It seems perfect for undo.io to help debug :)
I'm still waiting on the keyserver to be able to run in Kubernetes though
I'm glad you liked it - and that's useful feedback for other demos we give!
I believe support for that is on the way. If you're already in touch then I imagine you'll get an announcement soon.
Tomorrow Corp does something like this in a variant of C++. But I am not sure it’s very open.
https://youtu.be/72y2EC5fkcE
Green Hills Software TimeMachine + History for C and C++: https://www.ghs.com/products/MULTI_IDE.html
No particularly good publicly visible documentation of the functionality, but it does that and is a publicly purchasable product.
They also had TimeMachine + PathAnalyzer from the early 2000s, which was a time travel debugging solution with visualization, but the two were only about as integrated as most of the solutions you see floating around today.
How long can you time-travel?
Is this something like https://www.reddit.com/r/ruby/comments/15o9hc1/timetraveling... ?
Conceptually similar in that you can decide after-the-fact what state you want to see.
But Time Travel Debugging applies that to everything in the program, not just log statements - all function calls, variables, memory locations, etc can be reconstructed after the fact without having to log them explicitly.
Oh, and regarding how long - it depends how long it takes to fill the circular buffer of non-deterministic behaviour.
Serious compute-bound workloads can run for days with a gigabyte of non-deterministic event log. Serious I/O-bound workloads burn it much faster.
For a rule of thumb, think of it consuming a few MB per second, so the length of the time travel is limited by how much of that you can store.
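To put numbers on that rule of thumb, here is the arithmetic as a small sketch. The 3 MB/s rate and the two-day run are assumptions picked for illustration ("a few MB per second" and "days" above), not measured figures.

```python
def replay_window_seconds(buffer_bytes, log_rate_bytes_per_s):
    """How far back you can travel before the circular buffer wraps."""
    return buffer_bytes / log_rate_bytes_per_s


GB = 10**9
MB = 10**6

# Assumed: an I/O-heavy workload logging 3 MB/s into a 1 GB event log.
io_bound_window = replay_window_seconds(1 * GB, 3 * MB)

# Assumed: a compute-bound workload that takes two days to fill 1 GB.
compute_bound_rate = (1 * GB) / (2 * 24 * 3600)

print(f"I/O-bound window: about {io_bound_window / 60:.1f} minutes")
print(f"Compute-bound log rate: about {compute_bound_rate / 1000:.1f} kB/s")
```

Under those assumptions the I/O-bound case gets roughly five and a half minutes of history per gigabyte, while the compute-bound case implies a log rate of only a few kB per second.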