This is a fascinating idea, but (honest question, not a judgement) would the output be reliable? It would be hard to identify hallucinations since recompiling could produce different machine code. Particularly if there is some novel construct that could be a key part of the code. Are there ways of also reporting the LLMs confidence in sections like this when running generatively? It’s an amazing idea but I worry it would stumble invisibly on the parts that are most critical. I suppose it would just need human confirmation on the output
Will be interesting to see if there is some way to train a decompilation module based on who we know developed the application, using their previous code as training data. For example: Super Mario 64 and Zelda 64 were fully decompiled and a handful of other N64 games are in the process. I wonder if we could map which developers worked on these two games (maybe even guess who did what module) and then use that to more easily decompile any other game that had those developers working on it.
If this gets really good, maybe we can dream of having a fully de-obfuscated and open source life. All the layers of binary blobs in a PC can finally be decoded. All the drivers can be open. Why not do the OS as well! We don't have to settle for Linux, we can bring back Windows XP and back port modern security and app compatibility into the OS and Microsoft can keep their Windows 11 junk...at least one can dream! :D
> If this gets really good, maybe we can dream of having a fully de-obfuscated and open source life. All the layers of binary blobs in a PC can finally be decoded. All the drivers can be open. Why not do the OS as well!
Decompilers already exist and are really good. If an LLM can do the same as these existing decompilers, you can bet the lawyers will consider it an equivalent process. The main problem is legal/political, not technical.
I don't know if I'd call the output of modern decompilers "really good", not for native code anyway. They're just a little better than raw disassembly. Even state-of-the-art decompilers struggle to reconstruct control flow, distinguish data from code, and identify the presence of a variable, let alone its type, and they fundamentally lack context. If an LLM could be used even just to reliably reconstruct symbol information it would be game-changing.
I wrote my bachelor thesis on something tangential — basically, some researchers found that it was possible in some very specific circumstances to train a classifier to do author attribution (i.e. figure out who wrote the program) based just on the compiled binaries they produced. I don’t think the technique has been used for anything actually useful, but it’s cool to see that individual coding style survives the compilation process, so much so that you can tell one person’s compiled programs apart from another’s.
Do you mean the whole binary or just the text segment/instructions?
Because I think this gets a lot easier if you can look at the symbol table, strings, and codesigning certificate.
I doubt the code would be identifiable. It wouldn’t be the actual code written, but it would be very similar. But I assume many elements of code style would be lost, and any semblance of code style would be more or less hallucinated.
If it can make tests from the decompiled code, we could reimplement it with our own code style. Might be cool to have a bunch of LLMs working together with feedback loops.
Hey, I am working on my own LLM-based decompiler for Python bytecode (https://github.com/kukas/deepcompyle). I feel there are not many people working on this research direction but I think it could be quite interesting, especially now that longer attention contexts are becoming feasible. If anyone knows a team that is working on this, I would be quite interested in cooperation.
Is there a benefit from using an LLM for Python byte code? Python byte code is high enough level that it's possible to translate it directly to source code from my experience.
My motivation is that the existing decompilers only work for Python versions up to ~3.8. Having a model that could be fine-tuned with every new Python version release might remove the need for a highly specialized programmer who is able to update the decompiler to be compatible with the new version.
It is also a toy example for me to set up a working pipeline and then try to decompile more interesting targets.
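For anyone curious what the training pairs could look like: here is a minimal sketch of my own (not the deepcompyle pipeline) that turns a source snippet into a (bytecode disassembly, source) pair using CPython's own compile/dis machinery. The disassembly text is version-specific, which is exactly why a finetune-per-release approach is appealing.

    import dis
    import io
    import sys

    def make_pair(source: str, name: str = "<snippet>"):
        """Compile a snippet with the running CPython and return
        (bytecode disassembly, original source) as a training pair."""
        code = compile(source, name, "exec")
        buf = io.StringIO()
        dis.dis(code, file=buf)   # output format changes between Python versions
        return buf.getvalue(), source

    if __name__ == "__main__":
        asm, src = make_pair("def add(a, b):\n    return a + b\n")
        print(f"# produced with Python {sys.version_info.major}.{sys.version_info.minor}")
        print(asm)

Running the same script under different interpreter versions immediately gives you the version-labeled pairs you'd need for per-release fine-tuning.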
Why Python? First, Python is a language with a large body of open-source code. Second, I don't think it is used for software that is distributed as binaries?
Closed-source python exists, and it is frequently distributed in compiled binaries (especially in mediocre malware).
As a (supposedly) non-malicious example, the "Nightshade" watermarking tool is distributed as closed-source pre-compiled Python https://nightshade.cs.uchicago.edu/downloads.html
There is [PyLingual](https://pylingual.io/), but it is not open source unfortunately. I am not sure if it is also LLM based.
I found that lots of decompilation work is conducted on C. It seems not many Python projects are compiled into binaries.
As someone who is actively developing a decompiler to reverse engineer old DOS 8086 video games, I'd have a hard time trusting an LLM to do this correctly. My standard is accurate semantic lifting from machine code to C.

Reversing assembly to C is very delicate. There are many patterns that tend to usually map to obvious C constructs... except when they don't. And that assumes the original source was C. Once you bump into routines that were hand-coded assembly and break every established rule in the calling conventions, all bets are off.

I'm somewhat convinced that decompilation cannot be made fully automatic. Instead, a good decompiler is just a lever-arm on the manual work a reverser would otherwise be doing. Corollary: I'm also somewhat convinced that only the decompiler's developers can really use it most effectively, because they know where the "bodies are buried" and where different heuristics and assumptions were made.

Decompilers are compilers with all the usual engineering challenges, plus a hard inference problem tacked on top.
All that said, I'm not a pessimist on this idea. I think it has pretty great promise as a technique for general reversing security analysis where the reversing is done mostly for "discovery" and "understanding" rather than for perfect semantic lifting to a high-level language. In that world, you can afford to develop "hypotheses" and then drill down to validate if you think you've discovered something big.
Compiling and testing the resulting decompilation is a great idea. I do that as well. The limitation here is TEST SUITE. Some random binary doesn't typically come with a high-coverage test suite, so you have to develop your own acceptance criterion as you go along. In other words: write tests for a function whose computation you don't understand (ha). I suppose a form of static-analysis / symbolic-computation might be handy here (I haven't explored that). Here you're also beset with challenges of specifying which machine state changes are important and which are superfluous (e.g. is it okay if the x86 FLAGS register isn't modified in the decompiled version, probably yes, but sometimes no).
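One way I could imagine bootstrapping that acceptance criterion, sketched below under some big assumptions (the routine can be built into shared objects, has a simple int(int,int) signature, and is side-effect free; the file and symbol names are hypothetical): record the original function's outputs on a grid of probe inputs and treat those as golden values for the decompiled version.

    import ctypes
    import itertools

    # Hypothetical artifacts: original.so holds the lifted-but-opaque routine,
    # decompiled.so is built from the candidate C source. Both export `mystery`.
    ORIG = ctypes.CDLL("./original.so")
    CAND = ctypes.CDLL("./decompiled.so")
    for lib in (ORIG, CAND):
        lib.mystery.argtypes = [ctypes.c_int, ctypes.c_int]
        lib.mystery.restype = ctypes.c_int

    probes = list(itertools.product([-1000, -1, 0, 1, 7, 1000], repeat=2))

    mismatches = []
    for a, b in probes:
        want, got = ORIG.mystery(a, b), CAND.mystery(a, b)
        if want != got:
            mismatches.append((a, b, want, got))

    print(f"{len(probes) - len(mismatches)}/{len(probes)} probes agree")
    for a, b, want, got in mismatches:
        print(f"mystery({a}, {b}): original={want}, decompiled={got}")

This only checks return values, not memory effects or FLAGS, which is exactly the "which machine state changes matter" problem you raise.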
In my case I don't have access to the original compiler and even if I did, I'm not sure I could convince it to reproduce the same code. Maybe this is more feasible for more modern binaries where you can assume GCC, Clang, MSVC, or ICC.
At any rate: crazy hard, crazy fun problem. I'm sure LLMs have a role somewhere, but I'm not sure exactly where: the future will tell. My guess is some kind of "copilot" / "assistant" type role rather than directly making the decisions.
(If this is your kind of thing... I'll be writing more about it on my blog soonish...)
I would devise a somewhat loose metric. Consider you assign a percentage as to how much of a binary is decompiled. As in, 0% means the binary is still assembly and 100% means the whole binary is now C code. The ideal decompiler would achieve 100% for any binary.
My prediction is that this percentage will increase with time. It would be interesting to construct data for this metric.
It is important to define the limitations of using LLMs for this endeavor. I would like to emphasize your subtle point. The compiler used for the original binary may not be the same as the one you use. The probability of this increases with time, as compilers improve or the platform on which the binary runs becomes obsolete. This is a problem for validation, as in you cannot directly compare original assembly code with assembly after compiling C code (that came from decompiling).
Perhaps assembly routines could be given a likelihood, as in how sure the LLM is that some C code maps to assembly. Then, routines with hand-coded assembly would have a lower likelihood.
Could you expand on how this metric would be practically defined?
The problem isn’t lifting to C code, but rather lifting to “good C code”. For example, you can do a 1-to-1 translation of each assembly instruction into C code that performs the same machine state changes. This is not usually what you want, as it comes with a lot of extra cruft. When people think “decompiler”, they think of an output that looks like what they would personally write. But that’s very ill-defined. And, personally, I don't know how one would define such a thing.
I'm curious about the decompilation process.
I know compiling is a lossy process and optimization can make things even harder to remap, but if an LLM can recognize patterns correctly, chunk or classify each routine, or even give a more palatable overview of what a part of the code is meant to do step by step, it becomes closer to what Hex-Rays offers with their assembly-to-pseudocode translator. And from that point, it can make serviceable translations to real-world languages.
LLMs won't replace an engineer, but maybe they can help romhackers in identifying bugs or how some values are calculated by a game.
Thanks. We acknowledge that an LLM cannot completely replace human expertise in decompilation, much like GPT-4 has not achieved true human-like intelligence. However, the aim of our llm4decompile project is, much like GPT-4, to offer assistance and enhance productivity in the decompilation process.
As for test suites, it's one of our project's main challenges—figuring out which functions satisfy the expectations of reverse engineers, how to autonomously produce high-coverage test suites, and how to objectively qualify decompilation outcomes without relying solely on human judgment. Looking forward to your advice!
Can this be used for deobfuscation of code? I really hadn’t thought about LLM being a tool during reverse engineering.
I did a little experiment with this here:
https://neugierig.org/software/blog/2023/01/compiling-advent...
Thanks! The model is trained only on O0-O3 optimization levels, with no support for obfuscation. There's still a long way for LLMs to go.
Big LLMs like GPT-4 (and even GPT 3.5 Turbo) can be directly used to beautify obfuscated/minified JS, see e.g. https://thejunkland.com/blog/using-llms-to-reverse-javascrip... and https://news.ycombinator.com/item?id=34503233
I have tried feeding some of the foundation models obfuscated code from some of the competitions.
People might think that the answers would be in the training data already, but I didn't find that to be the case. At least in my small experiments.
The models did try to guess what the code does. They would say things like, "It seems to be trying to print some message to the console". I wasn't able to get full solutions.
It's definitely worth more research, not just as a curiosity, but these kinds of problems are good proxies for other tasks and also excellent benchmarks for LLMs particularly.
The problem is interesting in at least two aspects. First, an ideal decompiler would eliminate proprietary source code. Second, the abundant publicly available C code allows you to simply make a dataset of paired ASM and source code. There is also a lot of variety with optimization level, compiler choice, and platform.
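As a rough illustration of how cheap that pairing is to produce (my own sketch, not the paper's pipeline; it assumes gcc and objdump are on PATH and a hypothetical example.c): compile each C file at several optimization levels and keep the disassembly alongside the source.

    import subprocess
    import tempfile
    from pathlib import Path

    def asm_source_pairs(c_file: Path, opt_levels=("O0", "O1", "O2", "O3")):
        """Yield (optimization level, disassembly, source) triples for one C file."""
        source = c_file.read_text()
        for opt in opt_levels:
            with tempfile.TemporaryDirectory() as tmp:
                obj = Path(tmp) / "out.o"
                subprocess.run(["gcc", f"-{opt}", "-c", str(c_file), "-o", str(obj)],
                               check=True, capture_output=True)
                asm = subprocess.run(["objdump", "-d", str(obj)],
                                     check=True, capture_output=True, text=True).stdout
                yield opt, asm, source

    if __name__ == "__main__":
        for opt, asm, src in asm_source_pairs(Path("example.c")):
            print(opt, len(asm.splitlines()), "lines of disassembly")

Swapping in different compilers or target flags in the same loop gives the variety in toolchain and platform mentioned above.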
What is unclear to me is: why did the authors fine-tune the DeepSeek-Coder model? Can you train an LLM from zero with a similar dataset? How big does the LLM need to be? Can it run locally?
Ideal decompilers do not exist. In some sense they can never exist as compilers are lossy, but even taking a liberal view of “high level understanding of the resulting code” this is essentially the AGI for computer security. Nobody has come close to it!
Most proprietary code runs behind firewalls and won't be affected by this one way or another.
It's basically always better to start training with a pre-trained model rather than random, even if what you want isn't that close to what you start with.
Thanks! Training a language model from scratch is data-intensive; Llama2 was developed using 2 trillion tokens, while our dataset is around 4 billion.
The appropriate size of the model is not straightforward to determine. In our experiments, a 7 billion parameter model achieved 21% executability compared to just 10% for a 1 billion parameter model. However, their re-compilability rates are quite similar.
To run a 1 billion parameter model, a minimum of 2GB GPU memory is necessary, which is feasible on most GPUs. A 7 billion parameter model needs 14GB, suitable for GPUs like the 3090/4090 series. For running a 33 billion parameter model, an A100 GPU (80G) would be the single card option, although technically a MacBook could work, but you won't really want to use it.
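Those figures line up with the usual back-of-the-envelope rule of roughly 2 bytes per parameter for fp16/bf16 weights (ignoring KV cache and activation overhead), for what it's worth:

    def fp16_weight_gib(params_billion: float) -> float:
        """Approximate GPU memory for the weights alone at 2 bytes/parameter."""
        return params_billion * 1e9 * 2 / 2**30

    for size in (1, 7, 33):
        print(f"{size}B params ~ {fp16_weight_gib(size):.1f} GiB of weights")
    # 1B ~ 1.9 GiB, 7B ~ 13.0 GiB, 33B ~ 61.5 GiB (before KV cache/activations)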
I assume it's related to the cost of training vs fine-tuning. It could be also a starting point to validate an idea.
It's interesting the 6b model outperforms the 33b model. I wonder if it means the 33b model needs more training data? It was pretrained on ~1 million C programs, compared to DeepSeek-Coder, which was trained on 2 trillion tokens, which is a few orders of magnitude more data.
I'm also curious about how this compares to non-LLM solutions.
> on ~1 million C programs, compared to [...] 2 trillion tokens, which is a few orders of magnitude more data.
Is that comparable like that? This would assume that the average C program of the set is orders (plural) of magnitude less than 2m tokens in size, which could indeed be true but sounds like an optimistic assumption.
Yes, it's not easy to train a 33B model. An interesting point is that this was naive fine-tuning, meaning we followed the standard way to fine-tune the model. Training a larger model is tricky: not only does the amount of data matter, but everything like data cleaning, learning rate, and decay will affect the final performance.
This has been the dynamic with LLMs for a while. The majority of LLMs are massively undertrained. 7B models are the least "undertrained" mainstream models we have, hence why they have proliferated so much in the LLM fine-tuning community.
I think using higher-level input, e.g. an intermediate language like RzIL[1], could produce better results and is more scalable for making such decompilation multi-platform. As the RzIL text form resembles SMT, it should make it easier for an LLM to "understand" the meaning. Moreover, information from the binary such as symbols, signatures, and debug information (DWARF, PDB, etc.) could enrich the result further. You can download Rizin[2] and try it for yourself by calling `aaa` then `plf` for any chosen function, for architectures supported by RzIL. See the example excerpt for a function with this disassembly:
│ │ 0x140007e51 movsd qword [rdi + 0x50], xmm2
│ │ 0x140007e56 mov qword [rdi + 0x48], 0
│ │ 0x140007e5e call sym.rz_test.exe_ht_pp_free ; sym.rz_test.exe_ht_pp_free
│ │ 0x140007e63 movaps xmm7, xmmword [var_38h]
│ │ 0x140007e68 movaps xmm6, xmmword [var_28h]
│ │ 0x140007e6d mov rbp, qword [var_10h]
│ └─> 0x140007e72 add rsp, 0x48
│ 0x140007e76 pop r15
│ 0x140007e78 pop rdi
└ 0x140007e79 ret
0x140007e6d (set rbp (loadw 0 64 (+ (var rsp) (bv 64 0x68))))
0x140007e72 (seq (set op1 (var rsp)) (set op2 (bv 64 0x48)) (set sum (+ (var op1) (var op2))) (set rsp (var sum)) (set _result (var sum)) (set _popcnt (bv 8 0x0)) (set _val (cast 8 false (var _result))) (repeat (! (is_zero (var _val))) (seq (set _popcnt (+ (var _popcnt) (ite (lsb (var _val)) (bv 8 0x1) (bv 8 0x0)))) (set _val (>> (var _val) (bv 8 0x1) false)))) (set pf (is_zero (mod (var _popcnt) (bv 8 0x2)))) (set zf (is_zero (var _result))) (set sf (msb (var _result))) (set _result (var sum)) (set _x (var op1)) (set _y (var op2)) (set cf (|| (|| (&& (msb (var _x)) (msb (var _y))) (&& (! (msb (var _result))) (msb (var _y)))) (&& (msb (var _x)) (! (msb (var _result)))))) (set of (|| (&& (&& (! (msb (var _result))) (msb (var _x))) (msb (var _y))) (&& (&& (msb (var _result)) (! (msb (var _x)))) (! (msb (var _y)))))) (set af (|| (|| (&& (msb (cast 4 false (var _x))) (msb (cast 4 false (var _y)))) (&& (! (msb (cast 4 false (var _result)))) (msb (cast 4 false (var _y))))) (&& (msb (cast 4 false (var _x))) (! (msb (cast 4 false (var _result))))))))
0x140007e76 (seq (set r15 (cast 64 false (loadw 0 64 (+ (var rsp) (bv 64 0x0))))) (set rsp (+ (var rsp) (bv 64 0x8))))
0x140007e78 (seq (set rdi (loadw 0 64 (+ (var rsp) (bv 64 0x0)))) (set rsp (+ (var rsp) (bv 64 0x8))))
0x140007e79 (seq (set tgt (loadw 0 64 (+ (var rsp) (bv 64 0x0)))) (set rsp (+ (var rsp) (bv 64 0x8))) (jmp (var tgt)))
[1] https://github.com/rizinorg/rizin/blob/dev/doc/rzil.md
[2] https://rizin.re
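If you'd rather drive that from a script than the interactive shell, something along these lines should work. This is only a sketch: it assumes the rzpipe Python bindings expose the same open()/cmd()/cmdj() interface as their r2pipe ancestor, and that `aflj` lists analyzed functions as JSON (both assumptions on my part; the binary path is hypothetical).

    import rzpipe  # pip install rzpipe; assumed to mirror r2pipe's API

    rz = rzpipe.open("/path/to/binary")     # hypothetical target
    rz.cmd("aaa")                            # full analysis, as in the interactive session
    funcs = rz.cmdj("aflj") or []            # assumed: JSON list of discovered functions
    for fn in funcs[:5]:
        print(f"== {fn['name']} @ {hex(fn['offset'])}")
        print(rz.cmd(f"plf @ {fn['offset']}"))   # RzIL lifting of the function, as above
    rz.quit()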
Thanks! The concern is how to uniformly uplift binary code from various architectures and configurations to the same IR like RzIL? Is there a method to automate the disassembly process reliably across these different systems?
What do you mean? The Rizin code does all the hard parts (we add uplifting code for every architecture manually; you can see a list of supported architectures using `rz-asm -L` and check for the `I` letter, which means "IL"). You just need to call the necessary APIs. See, for example, how it's done in one of the integration tests[1]. As for using Rizin from Python, we have rz-bindgen[2][3].
[1] https://github.com/rizinorg/rizin/blob/dev/test/integration/...
For me the huge difference between the re-compilability and re-executability scores is very interesting. GPT4 achieved 8x% on re-compilability (syntactically correct) but an abysmal 1x% on re-executability (semantically correct), demonstrating once again its overgrown mimicry capacity.
> overgrown mimicry
I don't think it shows that. GPT4 was not trained on decompiling binaries back into C. Amazing result for an untrained task.
We are soon going to have robust toolchain detection from binaries, and source recovery with variable and function names.
We're interested in the toolchain detection, could you share a link or reference to it? GPT4 does amazing work; we're also very surprised that it works.
It’s always cool to see different approaches in this area, but I worry its benchmarks are meaningless without a comparison against non-AI-based approaches (like IDA Pro). It would be interesting to see how this model holds up on metrics from previous papers in security.
Thanks! We're working on Ghidra/IDA pro. The problem we face is the right kind of data to test with and how to evaluate it. It's like there's no "standard" benchmark/metrics that everyone uses for decompilation.
As others have said, the standardization of metrics is still something debated, but at the same time, this space has been explored by various top-tier papers that your paper did not cite. For example, DREAM [1], evaluated using the classic metric of goto-emittence. Rev.ng [2], evaluated using Cyclomatic Complexity and gotos. SAILR [3], evaluated using the previous metrics and a Graph Edit Distance score for the structure of the code.
I feel that without a justification for dropping previously established metrics by the peer review process, you weaken your new metrics. However, I still think this is an interesting paper. It just could be made more legit by thoroughly reading/citing previous work in the area and building an argument for why you may go against it.
[1]: https://net.cs.uni-bonn.de/fileadmin/ag/martini/Staff/yakdan... [2]: https://rev.ng/downloads/asiaccs-2020-paper.pdf [3]: https://www.usenix.org/system/files/sec23winter-prepub-301-b...
This is an excellent use case for LLM fine-tuning, purely because of the ease of generating a massive dataset of input / output pairs from public C code
I would also think that generating a very large amount of C code using coding LLMs (using deepseek, for example, + verifying that the output compiles) as synthetic training data would be quite beneficial in this situation. Generally the quality of synthetic training data is one of the main concerns, but in this case, the ability for the code to compile is the crux.
I would think that the primary benefit of this over existing decompiler tools would be the ability to use sensible names for identifiers, break up a project to be a sensible set of modules, and maybe even add realistic / helpful comments. If you're synthesizing code to do that, you'll probably gain on the front of generating code that compiles, at the cost of these advantages.
Pretty wild how well GPT4 is still doing in comparison. It's significantly better than their model at creating compilable code, but is less accurate at recreating functional code. Still quite impressive.
Yes, GPT4 is very impressive, as it's not directly trained on decompilation. We're working on improving our model, please keep watching for updates!
I’d be impressed if it could do C++ as well as C, which this doesn’t.
I have thought about doing something similar for heavily obfuscated JavaScript. Very useful for security research I imagine!
Ideally, with a substantial dataset of obfuscated JavaScript and corresponding raw code, a language model could potentially make good predictions. The first key difficulty, however, is collecting a large-scale dataset and setting up a system for automatic compilation and segmenting out the binary-source pairs.
The approach here is interesting in that it answers a question a lot of people have been asking: “what happens if we pipe a binary into a trained LLM and ask it to decompile it?” The answer is that it doesn’t really work at all right now! This is a surprising result because the design of the paper kind of doesn’t allow for any other conclusion to be drawn. Notably, if the LLM did a really good job in the evaluation they designed it would still be unclear whether it was actually useful, because the test “does it compile and pass a few test cases” is not actually a very good way to test a decompiler.
A couple people here have suggested that the generated decompilation should match the source code exactly, which is a challenging thing to achieve and still hotly debated on whether it is a good metric or not. But the results here show that we’re starting to barely get past the “does it produce code” stage and move towards “does it produce code that looks vaguely correct” status but we’re definitely not there yet. Future steps of “is this a useful tool to drive decompilation” and “does this do better than state of the art” and “is this perfect at decompiling things” are still a long ways away. So it’s good to look at as a negative result as this area continues to attract new interest.
Thanks! Our initial experiments indicate that for simple cases, such as short snippets (tens of lines) of code without external dependencies, the LLM can decompile very well. However, for more complicated examples, it tends to offer speculative solutions, and the utility of these results is challenging to assess. The determination of whether the decompiled output is correct or useful is subjective and lacks a universal standard. One approach we're considering is utilizing GPT-4 as a benchmark to evaluate other models' performance. We're open to further suggestions to refine our evaluation methods.
relevant: https://news.ycombinator.com/item?id=34250872 (G-3PO: A protocol droid for Ghidra, or GPT-3 for reverse-engineering <https://github.com/tenable/ghidra_tools/blob/main/g3po/g3po....>; Jan, 2023; 44 comments)
ed: seems they have this, too, which may be of value for your submission: https://github.com/tenable/awesome-llm-cybersecurity-tools#a...
We found several works on refining Ghidra decompilation results with GPTs; that could be another interesting direction!
Decompilation is somewhat of a default choice for ML in the world of comp-sec.
Searching for vulns and producing patches in source code is a bit problematic, as the databases of vulnerable source code examples and their corresponding patches are neither well-structured nor comprehensive, and sometimes very, very specific to the analyzed code (for higher abstraction type of problems). So, it's not easy to train something usable beyond standard mem safety problems and use of unsafe APIs.
The area of fuzzing is somewhat messy, with sporadic efforts undertaken here and there, but it also requires a lot of preparatory work, and the results might not be groundbreaking unless we reach a point where we can feed an ML model the entire source code of a project, allowing it to analyze and identify all bugs, producing fixes and providing offending inputs. i.e. not yet.
While decompilation is a fairly standard problem, it is possible to produce input-output pairs somewhat at will based on existing source code, using various compiler switches, CPU architectures, ABIs, obfuscations, syscall calling conventions. And train models on those input-output pairs (i.e. in reversed order).
Thanks! But people want an all-in-one solution for decompilation. Given the vast array of architectures and compilation settings, and the fact that this information is usually not known in advance, finding a way to effectively navigate this complexity is quite difficult.
It seems to me that the objdump step (to transform binary to human readable assembly) seems an unnecessary waste of runtime resources.
It should be possible to tokenize directly from the binary.
Thanks! Processing raw binary data directly would be inefficient for the language model, as it's not designed to interpret strings of zeros and ones but for understanding higher-level instructions (like code and natural language).
Let's hope it kills Denuvo ...
Decompilation and deobfuscation are related but distinct tasks
How does it actually compare to non-LLM decompilers IDA, Binja, etc? I only see comparisons with other LLMs.
Thanks! We're working on Ghidra/IDA pro. The problem we face is the right kind of data to test with and how to evaluate it. It's like there's no "standard" benchmark/metrics that everyone uses for decompilation.
It seems the next logical step would be LLMAssistedHacking to turn things up side down…
Basically predicting code token by token except now you don’t even have a large enough context size and worse, you are using RAG
If I read the "re-executability" results in the Results figure right then that's a great idea but it doesn't really work:
https://raw.githubusercontent.com/albertan017/LLM4Decompile/...
To clarify:
> Re-executability provides this critical measure of semantic correctness. By re-compiling the decompiled output and running the test cases, we assess if the decompilation preserved the program logic and behavior. Together, re-compilability and re-executability indicate syntax recovery and semantic preservation - both essential for usable and robust decompilation.
If successful, wouldn’t you be replicating the compiler's machine code 1:1?
In which case that means fully complete code can live in the “latent space” but is distributed as probabilities
Or perhaps more likely would it be replicating the logic only, which can then be translated into the target language
I would guess that any binary that requires a non-deterministic input (key, hash etc…) to compile would break this
Fascinating
This is why round-tripping the code is important.
If you decompile the binary to source, then compile the source back to binary you should get the original binary.
You just need to do this enough times until the loss drops to some acceptable amount.
It's a great task for reinforcement learning, which is known to be unreasonably effective for these types of problems.
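A crude way to turn that round-trip into a numeric loss might look like the sketch below (my own, assuming gcc/objdump are available and the file names are placeholders). It only compares instruction mnemonic streams, so it is a similarity proxy, not a proof of equivalence.

    import difflib
    import subprocess

    def mnemonics(obj_path: str) -> list[str]:
        """Extract the instruction mnemonic stream from an object file."""
        out = subprocess.run(["objdump", "-d", obj_path],
                             check=True, capture_output=True, text=True).stdout
        ops = []
        for line in out.splitlines():
            parts = line.split("\t")           # addr, raw bytes, "mnemonic operands"
            if len(parts) >= 3 and parts[2].strip():
                ops.append(parts[2].split()[0])
        return ops

    def roundtrip_similarity(original_obj: str, decompiled_c: str) -> float:
        subprocess.run(["gcc", "-O2", "-c", decompiled_c, "-o", "recompiled.o"],
                       check=True, capture_output=True)
        return difflib.SequenceMatcher(
            None, mnemonics(original_obj), mnemonics("recompiled.o")).ratio()

    # e.g. print(roundtrip_similarity("target.o", "decompiled.c"))

A score of exactly 1.0 is unlikely unless the toolchain and flags match the original, which is precisely the objection raised below.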
You really can't expect that if you're not using exactly the same version of exactly the same compiler with exactly the same flags, and often not even then.
You try your best, and if you provide enough examples, it will undoubtedly get figured out.
I think you're misunderstanding OP's objection. It's not simply a matter of going back and forth with the LLM until eventually (infinite monkeys on typewriters style) it gets the same binary as before: Even if you got the exact same source code as the original there's still no automated way to tell that you're done because the bits you get back out of the recompile step will almost certainly not be the same, even if your decompiled source were identical in every way. They might even vary quite substantially depending on a lot of different environmental factors.
Reproducible builds are hard to pull off cooperatively, when you control the pipeline that built the original binary and can work to eliminate all sources of variation. It's simply not going to happen in a decompiler like this.
Well, no, but yes.
The critical piece is that this can be done in training. If I collect a large number of C programs from GitHub and compile them (in a deterministic fashion), I can use that as a training, test, and validation set. The output of the ML ought to compile the same way given the same environment.
Indeed, I can train over multiple deterministic build environments (e.g. different compilers, different compiler flags) to be even more robust.
The second critical piece is that for something like a GAN, it doesn't need to be identical. You have two ML algorithms competing:
- One is trying to identify generated versus ground-truth source code
- One is trying to generate source code
Virtually all ML tasks are trained this way, and it doesn't matter. I have images and descriptions, and all the ML needs to do is generate an indistinguishable description.
So if I give the poster a lot more benefit of the doubt on what they wanted to say, it can make sense.
Oh, I was assuming that Eager was responding to klik99's question about how we could identify hallucinations in the output—round tripping doesn't help with that.
If what they're actually saying is that it's possible to train a model to low loss and then you just have to trust the results, yes, what you say makes sense.
I haven't found many places where I trust the results of an ML algorithm. I've found many places where they work astonishingly well 30-95% of the time, which is to say, save me or others a bunch of time.
It's been years, but I'm thinking back through things I've reverse-engineered before, and having something which kinda works most of the time would be super-useful still as a starting point.
Have you ever trained a GAN?
Technically, yes!
A more reasonable answer, though, is "no."
I've technically gone through random tutorials and trained various toy networks, including a GAN at some point, but I don't think that should really count. I also have a ton of experience with neural networks that's decades out-of-date (HUNDREDS of nodes, doing things like OCR). And I've read a bunch of modern papers and used a bunch of Hugging Face models.
Which is to say, I'm not completely ignorant, but I do not have credible experience training GANs.
That's true but a solvable problem. I once tried to reproduce the build of an uncooperative party and it was mainly tedious and boring.
The space of possible compiler arguments is huge, but ultimately what is actually used is mostly on a small surface.
Apart from that, I wrote a small tool to normalize the version string, timestamps, and file paths in the binaries before I compared them. I know there are other sources of non-determinism, but these three things were enough in my case.
The hardest part was the numerous file paths from the build machine. I had not expected that. In hindsight, stripping both binaries before comparison might have helped, but I don't remember why I didn't do that.
What exactly are you suggesting will get figured out?
The mapping from binary to source code.
Even ignoring all sources of irreproducibility, there does not exist a bijection between source and binary artifact irrespective of tool chain. Two different toolchains could compile the same source to different binaries or different sources to the same binary. And you absolutely shouldn't be ignoring sources of irreproducibility in this context, since they'll cause even the same toolchain to keep producing different binaries given the same source.
Exactly, but neither the source nor the binary is what's truly important here. The real question is: can the LLM generate the functionally valid source equivalent of the binary at hand? If I disassemble Microsoft Paint, can I get code that will result in a mostly functional version of Microsoft Paint, or will I just get 515 compile errors instead?
This is what I thought the question was really about.
I assume that an llm will simply see patterns that look similar to other patterns and make associations and assume equivalences on that level, meanwhile real code is full of things where the programmer, especially assembly programmers, modifies something by a single instruction or offset value etc. to get a very specific and functionally important result.
Often the result is code that not only isn't obvious, it's nominally flatly wrong, violating standards, specs, intended function, datasheet docs, etc. If all you knew were the rules written in the docs, the code would be broken and invalid.
Is the llm really going to see or understand the intent of that?
They find matching patterns in other existing stuff, and to the user who can not see the infinite body of that other stuff the llm pulled from, it looks like the llm understood the intent of a question, but I say it just found the prior work of some human who understood a similar intent somewhere else.
Maybe an llm or some other flavor of ai can operate some other way like actually playing out the binary like executing in a debugger and map out the results not just look at the code as fuzzy matching patterns. Can that take the place of understanding the intents the way a human would reading the decompiled assembly?
Guess we'll be finding out sooner or later since of course it will all be tried.
The question was about the reverse mapping.
Except LLMs cannot reason.
LLMs can mimic past examples of reasoning from the dataset. So, it can re-use reasoning that it has already been trained on. If the network manages to generalize well enough across its training data, then it can get close to reproducing general reasoning. But it can't yet fully get there, of course.
Do you have evidence LLMs can indeed generalize outside their training data distribution?
https://twitter.com/abacaj/status/1721223737729581437/photo/...
No. I know only that they can generalize within it, and only to a limited degree, but don't have solid evidence of even that.
Err, no, sorry, it won't. Compilers don't work that way. There's a lot of ways to compile down source to machine code and the output changes from compiler version to compiler version. The LLM would have to know exactly how the compiler worked at which version to do this. So the idea is technically possible but not technically feasible.
Maybe we then need an LLM to tell us if two pieces of compiled code are equivalent in an input-output mapping sense (ignoring execution time).
I'm actually serious; it would be exceedingly easy to get training data for this just by running the same source code through a bunch of different compiler versions and optimization flags.
Why would an llm be the tool for that job?
Without analytical thinking how else would you come to conviction that two functions are identical, for a computationally unfeasible number of possible inputs?
An LLM cannot do this. I don’t even mean this in a formal sense, because your problem is addressed by Rice’s Theorem, which places bounds on what any system (LLM or not) can do here; I mean it in the sense that an LLM isn’t even appropriate to use here because the best it can possibly do is provide you with its best guess at the answer. And while this might be a useful property for decompilation in general that’s not what was being discussed here.
Rice's theorem does NOT prevent a program from giving correct answers to non-trivial properties of programs (including the halting problem or other undecidable problems) for 99.99% of inputs and "I don't know" for 0.01% of inputs. It only states that you cannot write a program that provides a correct and definitive yes-or-no for 100% of inputs.
For a decompiler, being able to decompile even 90% of programs would be awesome. We're not looking for theoretical perfectness.
Right.
A less formidable problem with higher chances of succeeding is from a given binary to figure out first compiler, compiler-version, compiler-flags.
From there you could have a model for every combination or at least a model for the compiler variant and use the other info (version, flags) as input to the model.
Yes, that's a limitation of trying to ensure exact binary reconstruction. Luckily there is also a separate line of work on detecting the compiler version and optimization flags based on a binary – it turns out this is not that hard and it's easy to get a bunch of labeled data for a classifier.
If folks are interested in reading more there's a nice paper by Grammatech on the idea: https://eschulte.github.io/data/bed.pdf (though it's pre-LLM and uses evolutionary algorithms on the initial decompilation to search for a version that recompiles exactly).
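For a feel of how approachable that detection problem is, here's a toy sketch (assuming scikit-learn, and a `samples` list you'd build yourself by compiling known code with known toolchains; the file names and labels are hypothetical): byte n-gram features plus a linear classifier already get surprisingly far.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def bytes_as_text(path: str) -> str:
        """Represent an object file as space-separated hex byte tokens."""
        return " ".join(f"{b:02x}" for b in open(path, "rb").read())

    def train_toolchain_classifier(samples):
        # samples: list of (path_to_object_file, label) pairs you generate yourself,
        # e.g. ("hello_gcc12_O2.o", "gcc-12 -O2")
        texts = [bytes_as_text(p) for p, _ in samples]
        labels = [lbl for _, lbl in samples]
        clf = make_pipeline(
            CountVectorizer(analyzer="word", ngram_range=(1, 3)),  # byte n-grams
            LogisticRegression(max_iter=1000),
        )
        clf.fit(texts, labels)
        return clf

    # clf = train_toolchain_classifier(samples)
    # print(clf.predict([bytes_as_text("unknown.o")]))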
Doesn't that depend on the compiler's version though? Or, for that matter, even the sub-version. Every compiler does things differently.
From the README:
As this is in the metrics section, I guess fully automating this is not part of the research.
According to the project's README, they only seem to be checking mere "re-compilability" and "re-executability" of the decompiled code, though.
The way to do this is to have a formal verification tool that takes the input, the output, and a formal proof that the input matches the semantics of the output, and have the LLM create the formal proof alongside the output. Then you can run the verification tool to check if the LLM’s output is correct according to the proof that it also provided.
Of course, building and training an LLM that can provide such proofs will be the bigger challenge, but it would be a safe way to detect hallucinations.
That would require the tool to prove the equivalence of the two programs, which is generally undecidable. Maybe this could be weakened to preserving some properties of the program.
No, it would not. It would require the LLM to provide a proof for the program that it outputs, which seems reasonable in the same way that a human decompiling a program would be able to provide a record of his/her reasoning.
The formal verifier would then merely check the provided proof, which is a simple mechanical process.
This is analogous to a mathematician providing a detailed proof and a computer checking it.
What is impossible due to undecidability is for two arbitrary programs, to either prove or disprove their equivalence. However, the two programs we are talking about are highly correlated, and thus not arbitrary at all with respect to each other. If an LLM is able to provide a correct decompilation, then in principle it should also be able to provide a proof of the correctness of that decompilation.
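As a microscopic illustration of the "mechanical checking" half (nothing to do with LLM-generated proofs; just the z3 SMT solver, assuming the z3-solver package is installed): checking that two different expressions of the same computation agree for every 32-bit input. Whole-program equivalence is of course vastly harder than this toy claim.

    from z3 import BitVec, prove

    x = BitVec("x", 32)

    # Two "programs": a decompiler might emit x * 2 where the binary used x << 1.
    original = x << 1
    decompiled = x * 2

    # prove() reports "proved" iff the claim holds for all 32-bit x,
    # otherwise it prints a counterexample.
    prove(original == decompiled)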
Yes, and then someone needs to check that proof. It’s not particularly clear if that decompilation proof would be any more helpful than just doing the lifting by hand.
No, nobody needs to check the proof; that's the whole point of formal theorem proving, the machine checks it for you.
That doesn’t mean that it’s impossible, right? Just that no tool is guaranteed to give an answer in any case. And those cases might be 90%, 10% or it-doesn’t-matter-in-practice %
What if there are hallucinations in the verification tool?
I think the idea is that you'd have two independently developed systems, one LLM decompiling the binary and the other LLM formally verifying. If the verifier disagrees with the decompiler you won't know which tool is right and which is wrong, but if they agree then you'll know the decompiled result is correct, since both tools are unlikely to hallucinate the same thing.
No, the idea is that the verifier is a human-written program, like the many formal-verification tools that already exist, not an LLM. There is zero reason to make this an LLM.
It makes sense to use LLMs for the decompilation and the proof generation, because both arguably require creativity, but a mere proof verifier requires zero creativity, only correctness.
Then it's not a formal verification tool. Generative models are profoundly unfit for that purpose.
There may be bugs, but not hallucinations. Bugs are at least reproducible, and the source code of the verification tool is much, much smaller than an LLM, so has a much higher chance of its finite number of bugs to be found, whereas with an LLM it is probably impossible to remove all hallucinations.
To turn your question around: What if the compiler that compiles your LLM implementation “hallucinates”? That would be the closer parallel.
Good luck formally proving Linux.
The goal is to prove that the source code matches the machine code, not to prove that the code implements some intended higher-level semantics. This has nothing to do with formally proving the correctness of the Linux kernel.
Generators' nature is to hallucinate.
One man’s hallucination is another’s creativity.
Well, we need to remember that "hallucination" here is not a concept but a figure of speech for the output of a stochastic parroting machine. So what you mentioned would be a digitally induced hallucination out of some dancing matrix multiplications / electrons on silicon.
LLMs are by nature probabilistic, which is why they work reasonably well for "imprecise" domains like natural language processing. Expecting one to do decompilation, or disassembly for that matter, is IMHO very much a "wrong tool for the job" --- but perhaps it's just an exploratory exercise for the "just use an LLM" meme that seems to be a common trend these days.
The bigger argument against the effectiveness of this approach is that existing decompilers can already do a much better job with far less processing power.
In the future efficient rule based compilers and decompiler may be generated by AI systems trained on inputs and outputs of what we use today.
This effort is an exploration to find a radically different AI way that may give superior results.
Yes. For all the reasons you give above, AI for this job is not practical today.
One could as well use differential fuzzing.
I'm amazed that there are so many good responses above and only this one mentions fuzzing. In the context of security, inputs might be non-linear things like adjacent memory, so I don't see any way to be confident about equivalency without substantial fuzzing.
Honestly, I just don't see a way to formally verify this at all. It sounds like it could be a very useful tool, but I don't see a way for it to be fully confident. But, heck, just getting you 90% of the way towards understanding it with LLMs is still amazing and useful in real life.
Even if it isn't fully reliable, often it's only necessary to modify a few functions for most changes one wants to make to a binary.
You'd therefore only need to recompile those few functions.
They detail how they measure this in the README. This is directed at all the sibling comments as well!
TLDR they recompile and then re-execute (including test suites). From the results table it looks like GPT4 still "outperforms" their model in recompilation, but their recompiled code has a much better re-execution success rate (less hallucinations). But, that re-execution rate is still pretty lacking (around 14%), even if better than GPT4.
Sounds like it should be able to split the code into functions with inferred API, then you should be able to fuzz these functions in binary and source versions.
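A bare-bones version of that per-function differential fuzzing could look like this, with heavy assumptions baked in (both versions built as shared objects, an inferred int f(int) signature, no side effects; file and symbol names are placeholders):

    import ctypes
    import random

    ORIG = ctypes.CDLL("./original_fn.so")      # hypothetical: the extracted binary function
    DECOMP = ctypes.CDLL("./decompiled_fn.so")  # hypothetical: recompiled from LLM output
    for lib in (ORIG, DECOMP):
        lib.f.argtypes = [ctypes.c_int]
        lib.f.restype = ctypes.c_int

    random.seed(0)
    for _ in range(100_000):
        x = random.randint(-2**31, 2**31 - 1)
        a, b = ORIG.f(x), DECOMP.f(x)
        if a != b:
            print(f"divergence at f({x}): original={a}, decompiled={b}")
            break
    else:
        print("no divergence found (evidence, not proof, of equivalence)")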