This article provides an excellent overview of the latest developments in the trade-off between optimizer speed and quality of optimization.
In particular, copy-and-patch compilation is still the fastest approach because it stitches together pre-compiled code templates, though it leaves little room for optimization.
Cranelift uses e-graphs to represent equivalences over its IR. This allows for more optimizations than the copy-and-patch approach.
Of course, the most optimized output is going to come from a more traditional compiler toolchain like LLVM or GCC. But for users who want to get "fast enough" output as quickly as possible, newer compiler techniques provide a promising alternative.
Is there any literature or guidelines on what to do if I'm willing to spend effectively unlimited CPU cycles in return for a more optimized final output?
Superoptimizers: https://en.wikipedia.org/wiki/Superoptimization
Also, program distillation: https://www.researchgate.net/publication/220989887_Distillat...
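For a concrete sense of what a superoptimizer does, here is a toy sketch (my own illustration, not Massalin's tool or any production superoptimizer): enumerate straight-line programs over a tiny made-up instruction set, shortest first, and keep the first one whose behaviour matches a reference function on a set of test inputs. Real superoptimizers add a proper equivalence check on top of testing.

```rust
#[derive(Clone, Copy, Debug)]
enum Op {
    AddSelf, // x = x + x
    Inc,     // x = x + 1
    ShlOne,  // x = x << 1
}

// Interpret a straight-line program on a single register.
fn run(prog: &[Op], input: i64) -> i64 {
    prog.iter().fold(input, |x, op| match op {
        Op::AddSelf => x.wrapping_add(x),
        Op::Inc => x.wrapping_add(1),
        Op::ShlOne => x.wrapping_shl(1),
    })
}

fn main() {
    // Reference behaviour we want to reproduce: f(x) = 4x + 1.
    let reference = |x: i64| 4 * x + 1;
    let tests: Vec<i64> = (-8..8).collect();
    let ops = [Op::AddSelf, Op::Inc, Op::ShlOne];

    // Breadth-first enumeration by length: the first match is a shortest program.
    // (Massalin's superoptimizer combined fast probabilistic tests with a real
    // equivalence check; this toy version stops at testing.)
    let mut frontier: Vec<Vec<Op>> = vec![vec![]];
    for _len in 0..=4 {
        for prog in &frontier {
            if tests.iter().all(|&x| run(prog, x) == reference(x)) {
                println!("shortest equivalent program: {:?}", prog);
                return;
            }
        }
        // Extend every candidate by one more instruction.
        let mut next = Vec::new();
        for prog in &frontier {
            for &op in &ops {
                let mut longer = prog.clone();
                longer.push(op);
                next.push(longer);
            }
        }
        frontier = next;
    }
    println!("no program of length <= 4 found");
}
```

The exhaustive search is exactly why the effort explodes with program size: it is only practical for tiny, hot code sequences.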
Isn't it a bit weird that this isn't just the standard? Like imagine if Chrome was optimized with such a superoptimizer; let the optimizer spend a couple hours every month or so when cutting a new release. Surely that has to be worth it?
I'm not a compiler engineer, but I think it's a diminishing returns issue. Modern optimising compilers are already impressive, and they get more impressive year on year, but only quite slowly. We don't see compilers producing code that runs 30% faster than the code generated last year, for instance.
Such improvements can happen with compiler updates, but not when the starting point is a compiler that's already of decent quality. I recall some GPU driver updates from ATi (years ago) delivered pretty drastic performance improvements, but I believe that's because the original drivers were rather primitive.
(Perhaps a drastic improvement to autovectorisation could give a 30% boost, or better, but this would apply only to certain programs.)
You could grant a compute budget 100x that of a typical build setup, but no one has built a production-ready compiler to take advantage of that, and if they did I suspect the improvements would be unimpressive. Such a compiler may also run into a 'complexity ceiling' issue, as it would presumably be even more complex than today's ordinary optimising compilers, which are already enormous.
As Filligree says, superoptimisers tend to only be practical for very short programs. They can't be applied to monstrous codebases like Chromium.
The ability of compilers to make code faster improves twelve (12) times more slowly than Moore's law: compiler optimizations double a program's speed roughly every 18 years, versus every 18 months for hardware [1].
[1] https://proebsting.cs.arizona.edu/law.html
This might seem discouraging, but it isn't: one can keep reaping the benefits of code optimization for twelve times as long after Moore's law stops working.
But maybe you can superoptimize some hot sections and record the superoptimizer's findings somewhere. Then the compiler can validate those optimizations and keep applying them to that particular piece of code for the rest of the program's life, as long as the preconditions still hold.
I am pretty sure the existing release process already takes much more than 2 hours.
Doing a full-fledged release-this-to-millions-of-users build of Chrome or Firefox already takes on the order of 24 hours (or at least it did when last I checked a few years ago).
Chrome already uses PGO: https://blog.chromium.org/2020/08/chrome-just-got-faster-wit...
I think the wide variety of hardware and complexities of running on a real system sometimes make this problem more difficult.
Even within a given processor microarchitecture (say, just Zen 2, or just Haswell), different CPUs will be running at different frequencies, have different cooling solutions, and be running different microcode releases, all of which will affect which program is the fastest. And this is without considering cache pressure or memory latency, which is also dependent on any other programs the user happens to be running.
Running a superoptimizer for a linear algebra program that runs on your Cray supercluster can give clear gains. Doing the same for every combination of user hardware seems less feasible - you may find output that is a clear win on the machines you tested, but it may well lose out on other machines.
Is it just a couple of hours? Even just ordering the optimisation passes is already an NP-complete problem, so I could easily imagine superoptimisers taking trillions of years for a large program...
Yes, it is weird.
One thing that would help is if we explicitly reimagined programming as characterizing the search space for the compiler, just as the e-graphs work in the article separates generating alternatives from finding the best alternative.
Have a look at Unison, which uses constraint programming to do optimal code generation.
https://unison-code.github.io/
From what I understand, the big advantage of the e-graphs approach is that the quality of the output is (within limits) a function of the time and memory given.
The more memory, the more nodes can be generated in the e-graph; and the more time for the search, the better the selected node.
It might never be as fast as copy-and-patch or produce output as good as LLVM or GCC, but this flexibility is valuable in itself.
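As a rough illustration of that knob, here is a minimal sketch using the Rust `egg` crate (a general-purpose e-graph library; this is not Cranelift's implementation): the node and time limits are explicit parameters, and a larger budget lets saturation explore more equivalent forms before extraction picks the cheapest one under a cost model.

```rust
use egg::{rewrite as rw, AstSize, Extractor, Rewrite, Runner, SymbolLang};

fn main() {
    // A few toy rewrite rules over a generic symbolic language.
    let rules: Vec<Rewrite<SymbolLang, ()>> = vec![
        rw!("commute-mul"; "(* ?a ?b)" => "(* ?b ?a)"),
        rw!("mul-two";     "(* ?a 2)"  => "(<< ?a 1)"),
        rw!("add-zero";    "(+ ?a 0)"  => "?a"),
    ];

    let start = "(+ (* x 2) 0)".parse().unwrap();

    // Resource limits: more nodes / more time => a larger explored e-graph.
    let runner = Runner::default()
        .with_node_limit(10_000)
        .with_time_limit(std::time::Duration::from_millis(100))
        .with_expr(&start)
        .run(&rules);

    // Extraction picks the cheapest representative (AST size as the cost model).
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (cost, best) = extractor.find_best(runner.roots[0]);
    println!("best = {} (cost {})", best, cost);
}
```

Raising or lowering those limits is exactly the "pay more, get more" dial described above.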
When we first reviewed the equality saturation paper, we thought there was one major pro and one major con: [pro] phase-ordering invariance; [con] at the time there was no (believable) way to extract the transformed graph in a non-hand-wavy way.
Personally, I think e-graphs should be combined with Massalin-style superoptimization (they're natural "duals"), turning the whole exercise into a hill-climbing process. You can tune the total effort by the set of passes used, the amount of time spent driving the graph to saturation, and the method (and time) for graph extraction.
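To make the hill-climbing framing concrete, here is a hand-wavy sketch (my own illustration, with made-up names, not the commenter's actual design): candidates are programs, rewrites or superoptimizer discoveries are moves, and a move is accepted only if it lowers an estimated cost, with the iteration budget as the effort knob.

```rust
// `P` is whatever the candidate representation is (an extracted term, a code
// fragment, ...); `neighbors` stands in for "apply one rewrite or splice in
// one superoptimizer finding".
fn hill_climb<P: Clone>(
    start: P,
    cost: impl Fn(&P) -> u64,
    neighbors: impl Fn(&P) -> Vec<P>,
    budget: usize, // the total-effort knob
) -> P {
    let mut best = start;
    let mut best_cost = cost(&best);
    for _ in 0..budget {
        let mut improved = false;
        for cand in neighbors(&best) {
            let c = cost(&cand);
            if c < best_cost {
                best = cand;
                best_cost = c;
                improved = true;
                break; // greedy: take the first downhill move
            }
        }
        if !improved {
            break; // local optimum under this rewrite set
        }
    }
    best
}

fn main() {
    // Toy instance: "programs" are strings, cost is length, and the only
    // rewrite strips a redundant "+0" suffix.
    let out = hill_climb(
        "x*2+0+0".to_string(),
        |p: &String| p.len() as u64,
        |p: &String| p.strip_suffix("+0").map(|s| vec![s.to_string()]).unwrap_or_default(),
        16,
    );
    println!("{out}"); // prints "x*2"
}
```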
Can you memoize across invocations so that the time spent optimizing is cumulative across all builds?
Speaking as an e-graph rookie - I believe so, with caveats.
You can certainly store the chosen extractions/optimizations and re-use any that don't change. For example, if a long chain of rewrites yields A ==> B ==> ... ==> Z, you could short-circuit that and assume that Z is still what you want to replace A with (there's a rough sketch of this caching idea below). Perhaps each time you compile, you could rewrite only the 'changed' sections of the program, along with a randomly selected portion to explore for further optimization. (Although you may just want to go all the way and write a tool that constantly adds to your equivalence DB in the background, with compilation just a matter of running your extraction heuristics.)
You probably wouldn't want to store every possible rewrite, though, as the space complexity or retrieval cost is likely a bigger deal than the time complexity (you wouldn't load 2+2=4 from a database, you'd just do the addition). But if you restore only a subset of equivalences that are known to be useful, there's no simple way to keep the optimizer from re-exploring the "rejected candidates" (without shutting off the eclass entirely) - and it's not even clear to me that you'd want to, as a changed program context may mean that some other candidate M becomes a better replacement for A than Z.
(Boring prerequisites for this approach also include: consistent serialization of terms; avoiding storage of infinite chains of equivalences, like `A ==> A + 0 ==> A + 0 + 0 ==> ...` (and possibly of exponential ones like those associativity rules can produce); and strong heuristics to capture "high-value" equivalences that are both useful and cheaper to look up than to re-derive.)
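Here is a rough sketch of the cross-build caching idea mentioned above (all names are hypothetical): key on a canonical serialization of the input term and store only the final A ==> Z extraction, falling back to the full saturate-and-extract pipeline on a cache miss. A real system would also re-validate that the cached rewrite's preconditions still hold before trusting it.

```rust
use std::collections::HashMap;

// Stand-in for a canonicalized, serialized IR fragment; a real compiler would
// hash a normalized term, not raw source text.
type Term = String;

// Placeholder for the real, expensive saturate-and-extract pipeline.
fn expensive_optimize(t: &Term) -> Term {
    t.replace("*2", "<<1")
}

struct RewriteCache {
    // input term -> best extraction found so far (the short-circuited A ==> Z)
    best_known: HashMap<Term, Term>,
}

impl RewriteCache {
    fn new() -> Self {
        Self { best_known: HashMap::new() }
    }

    // Return a cached result if we have one; otherwise run the optimizer and
    // remember the answer. A real system would also re-check that the cached
    // rewrite's preconditions still hold before applying it.
    fn optimize_cached(&mut self, term: &Term, optimize: impl Fn(&Term) -> Term) -> Term {
        if let Some(hit) = self.best_known.get(term) {
            return hit.clone(); // cache hit: skip the whole rewrite chain
        }
        let best = optimize(term);
        self.best_known.insert(term.clone(), best.clone());
        best
    }
}

fn main() {
    let mut cache = RewriteCache::new();
    // The first build pays the cost; later builds of unchanged terms are lookups.
    println!("{}", cache.optimize_cached(&"x*2".to_string(), expensive_optimize));
    println!("{}", cache.optimize_cached(&"x*2".to_string(), expensive_optimize));
}
```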
We actually take a fairly unconventional approach to e-graphs: we have a few linear passes and we do all rewrites eagerly. So we use e-graphs to provide a general framework for the fixpoint problem into which we plug all our rewrites, but we don't have the usual "apply as much CPU time as you want to get better results" property of conventional equality saturation.
I gave a talk about this approach, aegraphs (acyclic e-graphs), here: slides (https://cfallin.org/pubs/egraphs2023_aegraphs_slides.pdf), video (https://vimeo.com/843540328)
(disclosure: Cranelift tech lead 2020-2022 and main author of the e-graphs mid-end, as well as regalloc, isel and its custom DSL, and other bits, along with the excellent team)
I found the video to be great! It was informative and understandable even though I'm only casually interested in compilers and don't have a lot of real-world experience writing them.
Agreed - especially when developing, I'd assume the speed of the optimizer matters more than the quality of optimization. I wonder, though: if LLVM spent less time on that "phase ordering" problem, what would the trade-off between these two factors look like?
LLVM doesn't really spend any compile time solving the phase-ordering problem, since its pass pipelines are static. There have been proposals to dynamically adjust the pipeline based on various factors, but those are a ways off from happening.
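For what "static pipeline" means in practice, here is a conceptual sketch (not LLVM's actual pass-manager API; the `Pass` trait and pass names are made up): a static pipeline runs a fixed list of passes once in a predetermined order, while a dynamic alternative keeps re-running passes until nothing changes, trading compile time for less sensitivity to ordering.

```rust
trait Pass {
    // Returns true if it changed the "IR" (just a string placeholder here).
    fn run(&self, ir: &mut String) -> bool;
}

struct RemoveAddZero;
impl Pass for RemoveAddZero {
    fn run(&self, ir: &mut String) -> bool {
        let new = ir.replace("+0", "");
        let changed = new != *ir;
        *ir = new;
        changed
    }
}

struct MulToShift;
impl Pass for MulToShift {
    fn run(&self, ir: &mut String) -> bool {
        let new = ir.replace("*2", "<<1");
        let changed = new != *ir;
        *ir = new;
        changed
    }
}

// Static pipeline: a fixed order decided ahead of time; no compile-time search
// over orderings. This is the cheap option a static pass pipeline takes.
fn run_static(ir: &mut String, passes: &[Box<dyn Pass>]) {
    for p in passes {
        p.run(ir);
    }
}

// Dynamic alternative: keep re-running the passes until nothing changes,
// trading extra compile time for less sensitivity to pass ordering.
fn run_to_fixpoint(ir: &mut String, passes: &[Box<dyn Pass>]) {
    loop {
        let mut changed = false;
        for p in passes {
            if p.run(ir) {
                changed = true;
            }
        }
        if !changed {
            break;
        }
    }
}

fn main() {
    let passes: Vec<Box<dyn Pass>> = vec![Box::new(RemoveAddZero), Box::new(MulToShift)];
    let mut a = "x*2+0".to_string();
    run_static(&mut a, &passes);
    let mut b = "x*2+0".to_string();
    run_to_fixpoint(&mut b, &passes);
    println!("static: {a}, fixpoint: {b}");
}
```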