That sounds like magic, I'm not going to lie. I have to understand how this is possible.
It might be interesting to tie this into something I had a daydream about once and then never bothered to actually do: generate header files from debug info (and then possibly have some LLM tidy it up).
Actually there are a few attempts at this! Here's one for Microsoft Program Database:
https://github.com/wbenny/pdbex
As for using an LLM to tidy it up... it doesn't seem like there has been a ton of success applying LLMs to reverse engineering yet. A part of me wonders if this will wind up being a place where the LLM architecture proves insufficient. I'm not an expert, but if I had to place a bet I'd bet on diffusion models being more interesting for a lot of reverse engineering use cases. That said, it's not really the same thing, but Binary Ninja has a feature called Sidekick that uses an LLM to try to clean up the disassembly; I'm kind of unimpressed, but maybe it is useful to somebody.
The idea is more that you use the LLM for quick guesses about behaviour/names and so on, rather than actually relying on it for reverse engineering per se.
I'll add my attempt here: https://gitlab.com/dvdkon/pdb2hpp
Its output is kind of ugly, constrained by either the PDB format or Microsoft's terrible parser library, but I've successfully used it for calling functions from a proprietary DLL.
I wrote a tool a few years ago which automatically generated and inserted type-aware fuzzers for C APIs from DWARF info: https://github.com/intel/fffc
Generating headers and also mutators that you could then modify to meet type constraints was part of that.
Edit: just to add onto the LLM side, I can see this labelling anonymous structs or similar but I'm not sure that's a good idea. What might be interesting would be to try to get an LLM to verbalize/summarize known type constraints for documentation purposes.
pahole gives you compilable C header files from ELF DWARF information. LLMs seem irrelevant here: either your header files have all the types exported from the executable correctly, so they are usable with the original values, or they aren't correct/complete and having an LLM make up some more doesn't help.
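If you just want the headers, a typical invocation looks something like this (the --compile flag is from my memory of the dwarves tools, double-check your version's man page):

    # emit every type recorded in the binary's DWARF info as compilable C
    pahole --compile ./myprog > myprog_types.h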
Ghidra also has native functionality to export its data structures, which it can create from DWARF structures (Right click -> Export to C header).
Tangentially, I've considered generating debugging symbols for the exported object files, based on the contents of the Ghidra database, in order to improve the debugging experience when using them.
I haven't implemented that feature yet because so far I've managed to get by without it. Also, it sounds like a rather deep rabbit hole to fall into and the one I'm currently inside of is big enough as it is.
How much work is it to figure out which sections of the executable to export?
Would it be realistic to be able to export a modern-ish (2008-2015) Win32 game into objects and then compile/link it into a full executable again with less than a few hours work?
How much work is it to figure out which sections of the executable to export?
As long as you do not cut across a variable or a function, you can export pretty much however you want, you don't have to follow the original object file boundaries. What to export is a separate matter and requires some knowledge about the program. Having debugging symbols makes this much easier, otherwise by the time you've made the Ghidra database accurate enough for exportation you'll usually have an idea of where's what.
Would it be realistic to be able to export a modern-ish (2008-2015) Win32 game into objects and then compile/link it into a full executable again with less than a few hours work?
About the user report in my submission: they first raised an issue in early July and by mid-August they had a fully working, functionally identical relinked executable. To be fair, the COFF exporter had a lot of bugs that needed to be fixed and the i386 analyzer needed some touch-ups, things that somebody else hopefully won't stumble over now.
I don't know how long it would take, but unless you have debugging symbols and are really lucky it will take more than a few hours of work. A skilled reverse-engineer can probably manage to get something executing in that timeframe (even if it crashes halfway during the first loading screen), but it's one of these tasks that you won't know when it will be done until it is done.
As long as you do not cut across a variable or a function, you can export pretty much however you want, you don't have to follow the original object file boundaries.
Would it be possible to export basically the entire program at once and then slice off individual functions one by one?
Do you have any guides/examples of the "Decompilation projects, by splitting a program into multiple object files and reimplementing these Ship of Theseus-style" kind of project?
Would it be possible to export basically the entire program at once and then slice off individual functions one by one?
Yes. The exporters can handle whatever meaningful address selection you can throw at them, including multiple disjoint ranges within the same section. So you can keep carving holes inside your selection until nothing remains of the original program.
Do you have any guides/examples of the Ship of Theseus-style kind of project?
Not quite. My own decompilation project is on hiatus after one version tracking session too many in a row, so I only have one article on this so far [1], and the way I've done it is a bit wonky.
Another user has recently started a decompilation project [2] with a better framework than I've used in that article, but no actual decompilation has taken place there yet. Incidentally, that would also make for a good modding framework, if one decides to not write functionally identical replacement code.
[1] https://boricj.net/tenchu1/2024/05/31/part-11.html (which is humorously titled "A modding framework powered by the tears of CS101 teachers")
Yes. The exporters can handle whatever meaningful address selection you can throw at them, including multiple disjoint ranges within the same section. So you can keep carving holes inside your selection until nothing remains of the original program.
Will this also work without painstakingly reversing things in the binary, say in the case of a giant game executable?
If possible, I would be very interested in a simple tutorial that takes an arbitrary Windows executable, delinks it and replaces a single function, without all the extra steps necessary to run it on the PS1.
It might even be preferable if it worked with MinGW, since I'm on Linux as well.
Will this also work without painstakingly reversing things in the binary, say in the case of a giant game executable?
You can get away with a Ghidra database that isn't accurate, as long as you know what you're doing. Basically, as long as the analyzers manage to identify all of the relocation spots inside your exportation, the rest doesn't matter that much. You can even get away with missing relocation spots inside your exportation, if you don't end up executing that code or accessing that data at run-time (if you do, then exotic undefined behavior ensues).
The most important thing here is getting references right and addresses typed as pointers (the type itself doesn't matter). I'm not going to discuss this in more detail than that, because it would require a deep understanding of the internal algorithms of the extension. Any shortfall between a less-than-accurate Ghidra database and experience will be filled in by luck.
If possible, I would be very interested in a simple tutorial that takes an arbitrary Windows executable, delinks it and replaces a single function, without all the extra steps necessary to run it on the PS1.
It's essentially the same steps regardless of the platform. Select the bits you want in your object file, run the analyzer, invoke the exporter, use the linker to create a new program.
I've made my Ghidra extension as user-friendly as possible; the rest is standard native development stuff (up to the point where you hit exotic undefined behavior and can't figure it out at a glance; hopefully you're well acquainted with your debugger if that happens).
It might even be preferable if it worked with MinGW, since I'm on Linux as well.
Actually, I've created a native port of a proprietary, statically-linked Linux a.out i386 program to Windows with MinGW [1] using my delinker. That was back when I didn't have a COFF object file exporter, so MinGW was the only toolchain for that target that could ingest ELF object files.
That being said, MinGW and MSVC are reportedly only compatible at the C ABI level. Mixing and matching different toolchains can increase the odds of something going wrong, so you're probably better off using the toolchain that the program was originally built with (hopefully it runs on Wine).
PS: remember that you are throwing your CS 101 handbook into the trashbin when you're using a delinker (and its teacher is unlikely to be of much help).
[1] https://boricj.net/atari-jaguar-sdk/2024/01/02/part-5.html
So is this a completely fool-proof process? I.e. I'm asking if it's guaranteed to succeed, or if the analysis is conservative: if some piece/datum/feature is missing in the ELF, will the delinking fail?
Based on what it seems that you're asking, it is not, and cannot be, a foolproof process. Consider:
    int getSpecialArrayElement(char *array, uint64_t key) {
        size_t i = computeOffset(key);
        return array[i];
    }
computeOffset() can be arbitrarily complex (and probably deliberately hard to analyze if obfuscation is desired). There's nothing that prevents this function from accessing arbitrary locations in memory. You don't know if this will be accessing symbols that are already defined in memory by the linker short of exhaustively trying all possible inputs (and computeOffset() may have Turing traps for that).

During delinking we only really care about relocation spots; the actual algorithms of the program are mostly irrelevant.
Assuming it doesn't reference any other global data and only contains relative jumps and branches, computeOffset() won't have any mandatory relocation spots [1] and therefore can be put into an object file as-is. Similarly, getSpecialArrayElement() would only have a relocation for the address of computeOffset() because array is supplied as a parameter, not as a global variable. Furthermore, any data allocated on the heap is transparent during delinking.
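To contrast, here's a hedged sketch of my own (names invented for illustration) where a relocation spot does appear, because the data is a global known to the linker:

    #include <stddef.h>
    #include <stdint.h>

    extern size_t computeOffset(uint64_t key);

    static char table[256];  /* placed in this object file's .bss */

    int getFromGlobalTable(uint64_t key) {
        size_t i = computeOffset(key);  /* call site: PC-relative relocation */
        return table[i];                /* address of table: absolute relocation on i386 */
    }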
From my experience, "normal" everyday programs written in high-level languages typically don't contain nasty surprises while trying to delink them. That is not to say that obfuscated programs won't cause problems, but I haven't attempted delinking one so far [2].
[1] PC-relative relocations can be trimmed if the source and target are part of the same continuous address range being exported, because in that case the value won't change.
[2] I do have a pet peeve against developers who cast raw integer constants as pointers on MIPS, because the code sequence may differ from what the HI16/LO16 relocation pattern can tolerate, and it requires binary patching to fix up (LUI/ADDIU versus LUI/ORI). If the integer is a multiple of 65536, is passed directly as a parameter to a function call and the compiler elided the second instruction, then it can't be fixed in place and must be worked around some way, if the original value can't be kept (like the address of the scratchpad on the PlayStation, as long as you stay on that platform or can map memory there).
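To illustrate footnote [1], the trimming condition boils down to something like this (hypothetical helper of my own, not the extension's code):

    #include <stdbool.h>
    #include <stdint.h>

    /* a PC-relative reference whose source and target both lie inside the
       continuous range being exported keeps the same displacement after
       relinking, so no relocation entry is needed for it */
    bool pcrel_needs_reloc(uint32_t src, uint32_t dst,
                           uint32_t range_lo, uint32_t range_hi) {
        bool src_in = src >= range_lo && src < range_hi;
        bool dst_in = dst >= range_lo && dst < range_hi;
        return !(src_in && dst_in);
    }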
So is this a completely fool-proof process?
That's... complicated to answer.
My analyzers rely on an accurate Ghidra database, at least for the parts you want to export. While I've put a fair amount of effort into logging the various issues that can crop up which require fixing, they can't see what isn't there. In particular, missing references and truncation of variables won't be detected and will result in exotic undefined behavior.
There are ways to track down some of these issues. The best I've found so far is to relink the executable at a different base address and make sure that the original program's address ranges are unmapped; that should lead to segmentation faults when absolute relocation spots are missed, which can then be debugged (but that only works if your target has an MMU). Truncated variables are very tricky to troubleshoot (especially if you don't suspect them), since it's the memory following the truncated variable that gets corrupted. An integer that is mistaken for a pointer can also be very tricky to track down, as the integer's value will vary depending on the address the target symbol gets, leading to erratic program behavior (that's especially an issue for a program loaded very low in the address space).
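For the different-base-address trick, something along these lines should work with a GNU toolchain (illustrative flags and addresses, adjust for your linker and target):

    # link the text segment far away from the original image so that missed
    # absolute relocation spots fault instead of silently reading old addresses
    gcc -no-pie -Wl,-Ttext-segment=0x20000000 -o relinked main.o delinked.o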
That being said, if the Ghidra database is accurate enough, you export back to the same object file format used originally, and you subsequently use it on the same platform with the same toolchain, then you can delink megabytes of program code and data successfully. I consider that if the linker did it, then it should be possible to undo it.
Now, if you start cross-delinking to something that doesn't match the original program's platform and toolchain (like delinking from a Linux i386 ELF executable into a COFF object file and using it with an i386 Windows toolchain), then it's another story. If the exporter can express the relocations then you might end up with a working relocatable object file, but you'll still have potentially mismatched ABIs to contend with. It can be done, but that's not something I would recommend as a first project.
TL;DR Depending on what you do and the accuracy of the Ghidra database, it can range from "it just works" all the way to praying to Cthulhu for mercy.
This looks fantastic and is relevant to some game modding ideas I've had. I love your blog series about decompiling Tenchu too. Thank you for releasing this stuff!
Thanks!
I should get back to this project one of these days. I did one version tracking session too many in a row and had to take a break, plus that delinking side-quest keeps snowballing out of control.
This sounds very interesting, and is tempting me to delve back into a game reverse engineering project I abandoned a few years back.
Do you have a fully worked example of how to use this and then how to make use of its output? Would love to see an end-to-end walkthrough.
There are links to various case studies on my blog inside the README of the repository.
https://github.com/boricj/ghidra-delinker-extension/blob/mas...
This looks really cool.
I don't have a real use for it in my current work but in a past life it would have been so useful.
Hopefully I'll have some time / opportunity to try it out soon.
"... past life ..."?
Are you some kind of resurrectionist?
Oh, great to see this here. I think this is an extremely cool project, and I helped to add MS COFF support. (P.S.: I will note that my initial PR was notably worse than the ELF support that was already present, so if you run into problems with it... probably my fault :P I can see it is being improved, though.)

That said, I haven't done anything big with it yet. The most fun I had was delinking a Hello World executable compiled with Visual Studio 2003, relinking it to Linux x86 with GCC+glibc, and then relinking that to MinGW+msvcrt again.

Doing anything larger than hello world is a bit beyond me yet, though, in part because I'm actually a pretty big n00b when it comes to Ghidra and haven't even really figured out a good way to select the ranges for delinking from a large binary. I should've probably asked someone by now, but oh well. :)
Coincidentally, a derivation for this just got merged into Nixpkgs earlier today, so if you're using NixOS unstable it's possible to install it using ghidra.withExtensions; it is under ghidra-extensions.ghidra-delinker-extension. Only one problem: There was a new version released a few days ago and I didn't rebase my PR, so it is out of date. I will try to push an update soon.
I'm actually a pretty big n00b when it comes to Ghidra and haven't even really figured out a good way to select the ranges for delinking from a large binary.
One way to keep track of things to delink is to use folders and fragments inside a program tree. For example, I have a Ghidra program where I've figured out the name and ranges of the various object files that originally made up the executable. These folders or fragments can then be selected as a whole with right-click > Select Addresses.
The relocation synthesizer analyzer and the exporter can also be scripted, either independently or using the program's tree manager. This removes the need to select the ranges by hand, as well as to manually invoke the analyzer and the exporter.
See also: objcopy.
While objcopy can do many things, it can't undo the work of the linker. If relocations aren't unapplied and a new relocation table generated, these spots inside the new object file will reference the original program's address space, leading to some exotic undefined behavior.
Delinking is a subject with very few resources online, but there are a couple of other tools for it out there:
- https://github.com/endrazine/wcc
- https://github.com/jonwil/unlinkerida
- https://github.com/jnider/delinker
Previous:
Show HN: A Ghidra extension that turns programs back into object files - https://news.ycombinator.com/item?id=38852362 - Jan 2024 (4 comments)
This seems cool. Maybe one day it could help people take apart existing programs and use them how they want in a more accessible way. There is research like Meta's LLM Compiler on decompiling with LLMs; breaking apart programs and replacing parts of them would be an interesting, data-rich space for LLMs to explore and maybe self-improve in! For people training, there are lots of interesting tokens lurking there, and interesting things to do with them!
It is definitely not going to be easy to do for every CPU architecture, but it's not as ridiculous as it seems. Basically, the difference between an "object file" and an executable file or shared library is not very large. Oftentimes, except for Microsoft platforms, they are in the same actual file format, e.g. ELF.
The big thing is relocations. Object files have granular relocations that executable files don't; at least on Windows, executable images just have minimal relocations that point to addresses of code and data that will need to be fixed up if the executable is relocated, but the executable image itself is only able to be relocated with regards to its image base address as a whole unit. In contrast, object files contain symbol-level relocations. To be able to accurately reconstruct this information, you need to annotate the disassembly with somewhat accurate information about symbols.
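To make that concrete, compare the two relocation records from winnt.h (real structures, slightly abridged here):

    /* executable image: a base relocation block merely lists spots to patch
       if the whole image slides from its preferred base; no symbols involved */
    typedef struct _IMAGE_BASE_RELOCATION {
        DWORD VirtualAddress;   /* page RVA this block covers */
        DWORD SizeOfBlock;      /* followed by WORD type/offset entries */
    } IMAGE_BASE_RELOCATION;

    /* object file: each relocation names a symbol table entry, so the linker
       (or a delinker) knows exactly what the patched spot refers to */
    typedef struct _IMAGE_RELOCATION {
        DWORD VirtualAddress;   /* where in the section to patch */
        DWORD SymbolTableIndex; /* which symbol the spot references */
        WORD  Type;             /* how to patch, e.g. IMAGE_REL_I386_DIR32 */
    } IMAGE_RELOCATION;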
The other big difference with an object file is well, it is not linked. None of the symbols are "resolved". This is particularly easy to fix actually: during delinking if the symbol is outside of the current scope then it just needs to be replaced with an unresolved symbol, pretty much. Then when relinking, another object file or library needs to provide the symbol so that it can be linked back up.
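A minimal illustration of my own, assuming a C toolchain: compile this with cc -c and the resulting object file records puts as an undefined symbol plus a relocation at the call site; nm on the object shows it as U until something provides it at link time.

    extern int puts(const char *s);

    int hello(void) {
        return puts("hello from a not-yet-linked object");
    }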
(There are a few smaller differences too, like the lack of an entrypoint, but they don't carry a whole lot of significance.)
This means that the boundaries along which you carve object files out of an executable image or shared object are actually completely arbitrary. It obviously was segmented into translation units when it was originally compiled, but nothing really cares about those boundaries at the linking stage. (Of course, you probably want to try to figure them out if possible, since it will probably be very hard to do a matching decompilation if your object boundaries are incorrect.)
Take this with a grain of salt as despite having literally worked on this problem I feel like I might be messing up some of the details a bit. I've been meaning to write a blog post about object files, though there are actually a couple of good ones already floating around.
On 32-bit and 64-bit Windows (both NT and 9x), EXE, DLL and OBJ files are all COFF. EXE/DLL files stick an MS-DOS stub and the "PE\0\0" header before the COFF header, OBJ files start with the COFF header directly.
This is different from DOS, 16-bit Windows and OS/2, where the standard object file format (Intel OMF) was completely unrelated to the executable formats (MZ for DOS; NE for 16-bit Windows and OS/2 1.x)
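Concretely, the COFF file header they share looks like this (from winnt.h):

    typedef struct _IMAGE_FILE_HEADER {
        WORD  Machine;              /* e.g. 0x014c for i386 */
        WORD  NumberOfSections;
        DWORD TimeDateStamp;
        DWORD PointerToSymbolTable;
        DWORD NumberOfSymbols;
        WORD  SizeOfOptionalHeader; /* 0 for an OBJ, nonzero for a PE image */
        WORD  Characteristics;
    } IMAGE_FILE_HEADER;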
Interestingly, I recently tried to dive in and figure out why exactly Microsoft went with COFF. It seems like Intel OMF was adapted all the way up through OMF386, and Microsoft switched to COFF around 1989, near the beginning of the NT OS/2 project. Fascinatingly, it seems like a big criterion that Microsoft had factored in was that tools for COFF already existed for the Intel i860 RISC processor, so early on in NT's life they seemed to think it was going to be a lot more important.
That said, I do not consider PE to be similar enough to COFF to warrant calling them the same format. Aside from having completely different headers, PE has many additional structures that don't exist in COFF, and even the structures that are shared are fairly different between PE and COFF. PE barely uses anything from COFF, and it doesn't even use it the way that COFF does. Of course ELF objects also have some differences between object file and shared object/executable, but it's at least very clearly the same format the whole time, just utilized a bit differently.
Can you expound on this? Why are the boundaries not important at the linking stage - aren't you linking the wrong code then? Or did I not understand the point here?
The boundaries aren't important during delinking, as long as you do not cut across a variable or function. To the linker, sections are just arrays of bytes. It doesn't care what's inside those, all it has to do is lay them out in memory and apply relocations.
How you decide to slice up the original program is up to you. It doesn't have to follow the boundaries from the original object files.
To keep things short, object files are made of three parts: relocatable section bytes, relocation tables and a symbol table.
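In ELF terms (real types and macros from elf.h), a relocation table is just an array of records like this:

    #include <elf.h>

    /* one relocation spot: patch the section bytes at r_offset, using the
       symbol index and patch type packed together into r_info */
    Elf32_Rel rel = {
        .r_offset = 0x10,                       /* where in the section bytes */
        .r_info   = ELF32_R_INFO(5, R_386_32),  /* symbol #5, absolute 32-bit */
    };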
When a linker is invoked to generate an executable from a bunch of object files, it will lay out their sections in memory, compute the addresses of the symbols in the virtual address space and apply the relocations based on the final addresses of the symbols onto the section bytes.
The trick to delinking is figuring out where those relocations were applied in order to undo them and get back relocatable bytes. Then, you create relocation tables based on what you've just undone as well as a symbol table, package it all and you'll get an object file.
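As a toy sketch of what undoing one absolute i386 relocation could look like (my own naming, not the extension's actual code):

    #include <stdint.h>
    #include <string.h>

    /* the linker wrote symbol_addr + addend at `spot`; recover the addend
       and restore relocatable bytes (REL-style, addend kept inline) */
    uint32_t unapply_abs32(uint8_t *spot, uint32_t symbol_addr) {
        uint32_t linked_value;
        memcpy(&linked_value, spot, sizeof linked_value);
        uint32_t addend = linked_value - symbol_addr;
        memcpy(spot, &addend, sizeof addend);
        /* ...then emit an R_386_32 relocation for this spot against the symbol */
        return addend;
    }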
The really tricky part is the analysis for spotting the relocation spots. I'm leveraging Ghidra to do the bulk of the work, but it still requires some work to convert references into relocation spots (fairly easy on x86, nightmarishly difficult on MIPS) as well as collecting all the required data and serializing the object file itself, hence this extension to automate all of that.