This article provides an excellent overview of the latest developments in the trade-off between optimizer speed and quality of optimization.
In particular, copy-and-patch compilation is still the fastest approach because it stitches together pre-compiled code templates, though it leaves little room for optimization.
Cranelift uses e-graphs to represent equivalences over its IR. This allows for more optimizations than the copy-and-patch approach.
Of course, the most optimized output is going to come from a more traditional compiler toolchain like LLVM or GCC. But for users who want to get "fast enough" output as quickly as possible, newer compiler techniques provide a promising alternative.
Is there any literature or guidelines on what to do if I'm willing to spend effectively unlimited CPU cycles in return for a more optimized final output?
Superoptimizers: https://en.wikipedia.org/wiki/Superoptimization
Also, program distillation: https://www.researchgate.net/publication/220989887_Distillat...
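For a concrete sense of what a superoptimizer does, here is a toy sketch (my own illustration, not Massalin's tool or any production superoptimizer): enumerate straight-line programs over a tiny made-up instruction set, shortest first, and keep the first one whose behaviour matches a reference function on a set of test inputs. Real superoptimizers add a proper equivalence check on top of testing.

```rust
#[derive(Clone, Copy, Debug)]
enum Op {
    AddSelf, // x = x + x
    Inc,     // x = x + 1
    ShlOne,  // x = x << 1
}

// Interpret a straight-line program on a single register.
fn run(prog: &[Op], input: i64) -> i64 {
    prog.iter().fold(input, |x, op| match op {
        Op::AddSelf => x.wrapping_add(x),
        Op::Inc => x.wrapping_add(1),
        Op::ShlOne => x.wrapping_shl(1),
    })
}

fn main() {
    // Reference behaviour we want to reproduce: f(x) = 4x + 1.
    let reference = |x: i64| 4 * x + 1;
    let tests: Vec<i64> = (-8..8).collect();
    let ops = [Op::AddSelf, Op::Inc, Op::ShlOne];

    // Breadth-first enumeration by length: the first match is a shortest program.
    // (Massalin's superoptimizer combined fast probabilistic tests with a real
    // equivalence check; this toy version stops at testing.)
    let mut frontier: Vec<Vec<Op>> = vec![vec![]];
    for _len in 0..=4 {
        for prog in &frontier {
            if tests.iter().all(|&x| run(prog, x) == reference(x)) {
                println!("shortest equivalent program: {:?}", prog);
                return;
            }
        }
        // Extend every candidate by one more instruction.
        let mut next = Vec::new();
        for prog in &frontier {
            for &op in &ops {
                let mut longer = prog.clone();
                longer.push(op);
                next.push(longer);
            }
        }
        frontier = next;
    }
    println!("no program of length <= 4 found");
}
```

The exhaustive search is exactly why the effort explodes with program size: it is only practical for tiny, hot code sequences.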
Isn't it a bit weird that this isn't just the standard? Like imagine if Chrome was optimized with such a superoptimizer; let the optimizer spend a couple hours every month or so when cutting a new release. Surely that has to be worth it?
I'm not a compiler engineer, but I think it's a diminishing returns issue. Modern optimising compilers are already impressive, and they get more impressive year on year, but only quite slowly. We don't see compilers producing code that runs 30% faster than the code generated last year, for instance.
Such improvements can happen with compiler updates, but not when the starting point is a compiler that's already of decent quality. I recall some GPU driver updates from ATi (years ago) delivered pretty drastic performance improvements, but I believe that's because the original drivers were rather primitive.
(Perhaps a drastic improvement to autovectorisation could give a 30% boost, or better, but this would apply only to certain programs.)
You could grant a compute budget 100x that of a typical build setup, but no one has built a production-ready compiler to take advantage of that, and if they did I suspect the improvements would be unimpressive. Such a compiler may also run into a 'complexity ceiling' issue, as it would presumably be even more complex than today's ordinary optimising compilers, which are already enormous.
As Filligree says, superoptimisers tend to only be practical for very short programs. They can't be applied to monstrous codebases like Chromium.
The ability of compilers to make code faster improves twelve (12) times more slowly than Moore's law: compiler optimizations double a program's speed roughly every 18 years, versus every 18 months for hardware [1].
[1] https://proebsting.cs.arizona.edu/law.html
This might seem discouraging, but it isn't: one can keep reaping the benefits of code optimization for twelve times as long after Moore's law stops working.
But maybe you can superoptimize some hot sections and record the superoptimizer's findings somewhere. Then the compiler can validate those optimizations and keep applying them to that particular piece of code for the rest of the program's life, as long as the preconditions still hold.
I am pretty sure the existing release process already takes much more than 2 hours.
Doing a full-fledged release-this-to-millions-of-users build of Chrome or Firefox already takes on the order of 24 hours (or at least it did when last I checked a few years ago).
Chrome already uses PGO: https://blog.chromium.org/2020/08/chrome-just-got-faster-wit...
I think the wide variety of hardware and complexities of running on a real system sometimes make this problem more difficult.
Even within a given processor microarchitecture (say, just Zen 2, or just Haswell), different CPUs will be running at different frequencies, have different cooling solutions, and be running different microcode releases, all of which will affect which program is the fastest. And this is without considering cache pressure or memory latency, which is also dependent on any other programs the user happens to be running.
Running a superoptimizer for a linear algebra program that runs on your Cray supercluster can give clear gains. Doing the same for every combination of user hardware seems less feasible - you may find output that is a clear win on the machines you tested, but it may well lose out on other machines.
Is it just a couple of hours? Even just ordering the optimisation passes is already an NP-complete problem, so I could easily imagine superoptimisers taking trillions of years for a large program...
Yes, it is weird.
One thing that would help is if we explicitly reimagined programming as characterizing the search space for the compiler, just as the e-graphs work in the article separates generating alternatives from finding the best alternative.
Have a look at Unison, which uses constraint programming to do optimal code generation.
https://unison-code.github.io/
From what I understand, the big advantage of the e-graphs approach is that the quality of the output is (within limits) a function of the time and memory given.
The more memory, the more nodes can be generated in the e-graph; and the more time for the search, the better the selected node.
It might never be as fast as copy-and-patch or produce output as good as LLVM or GCC, but this flexibility is valuable in itself.
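As a rough illustration of that knob, here is a minimal sketch using the Rust `egg` crate (a general-purpose e-graph library; this is not Cranelift's implementation): the node and time limits are explicit parameters, and a larger budget lets saturation explore more equivalent forms before extraction picks the cheapest one under a cost model.

```rust
use egg::{rewrite as rw, AstSize, Extractor, Rewrite, Runner, SymbolLang};

fn main() {
    // A few toy rewrite rules over a generic symbolic language.
    let rules: Vec<Rewrite<SymbolLang, ()>> = vec![
        rw!("commute-mul"; "(* ?a ?b)" => "(* ?b ?a)"),
        rw!("mul-two";     "(* ?a 2)"  => "(<< ?a 1)"),
        rw!("add-zero";    "(+ ?a 0)"  => "?a"),
    ];

    let start = "(+ (* x 2) 0)".parse().unwrap();

    // Resource limits: more nodes / more time => a larger explored e-graph.
    let runner = Runner::default()
        .with_node_limit(10_000)
        .with_time_limit(std::time::Duration::from_millis(100))
        .with_expr(&start)
        .run(&rules);

    // Extraction picks the cheapest representative (AST size as the cost model).
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (cost, best) = extractor.find_best(runner.roots[0]);
    println!("best = {} (cost {})", best, cost);
}
```

Raising or lowering those limits is exactly the "pay more, get more" dial described above.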
When we first reviewed the equality saturation paper, we thought there was one major pro and one major con: [pro] phase-ordering invariance; [con] at the time there was no (believable) way to extract the transformed graph in a non-hand-wavy way.
Personally, I think e-graphs should be combined with Massalin-style superoptimization (they're natural "duals"), turning the whole exercise into a hill-climbing process. You can tune the total effort by the set of passes used, the amount of time spent driving the graph to saturation, and the method (and time) for graph extraction.
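To make the hill-climbing framing concrete, here is a hand-wavy sketch (my own illustration, with made-up names, not the commenter's actual design): candidates are programs, rewrites or superoptimizer discoveries are moves, and a move is accepted only if it lowers an estimated cost, with the iteration budget as the effort knob.

```rust
// `P` is whatever the candidate representation is (an extracted term, a code
// fragment, ...); `neighbors` stands in for "apply one rewrite or splice in
// one superoptimizer finding".
fn hill_climb<P: Clone>(
    start: P,
    cost: impl Fn(&P) -> u64,
    neighbors: impl Fn(&P) -> Vec<P>,
    budget: usize, // the total-effort knob
) -> P {
    let mut best = start;
    let mut best_cost = cost(&best);
    for _ in 0..budget {
        let mut improved = false;
        for cand in neighbors(&best) {
            let c = cost(&cand);
            if c < best_cost {
                best = cand;
                best_cost = c;
                improved = true;
                break; // greedy: take the first downhill move
            }
        }
        if !improved {
            break; // local optimum under this rewrite set
        }
    }
    best
}

fn main() {
    // Toy instance: "programs" are strings, cost is length, and the only
    // rewrite strips a redundant "+0" suffix.
    let out = hill_climb(
        "x*2+0+0".to_string(),
        |p: &String| p.len() as u64,
        |p: &String| p.strip_suffix("+0").map(|s| vec![s.to_string()]).unwrap_or_default(),
        16,
    );
    println!("{out}"); // prints "x*2"
}
```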
Can you memoize across invocations so that the time spent optimizing is cumulative across all builds?
Speaking as an e-graph rookie - I believe so, with caveats.
You can certainly store the chosen extractions/optimizations and re-use any that don't change. For example, if a long chain of rewrites yields A ==> B ==> ... ==> Z, you could short-circuit that and assume that Z is still what you want to replace A with (there's a rough sketch of this caching idea below). Perhaps each time you compile, you could rewrite only the 'changed' sections of the program, along with a randomly selected portion to explore for further optimization. (Although you may just want to go all the way and write a tool that constantly adds to your equivalence DB in the background, with compilation just a matter of running your extraction heuristics.)
You probably wouldn't want to store every possible rewrite, though, as the space complexity or retrieval cost is likely a bigger deal than the time complexity (you wouldn't load 2+2=4 from a database, you'd just do the addition). But if you restore only a subset of equivalences that are known to be useful, there's no simple way to keep the optimizer from re-exploring the "rejected candidates" (without shutting off the eclass entirely) - and it's not even clear to me that you'd want to, as a changed program context may mean that some other candidate M becomes a better replacement for A than Z.
(Boring prerequisites for this approach also include: consistent serialization of terms; avoiding storage of infinite chains of equivalences, like `A ==> A + 0 ==> A + 0 + 0 ==> ...` (and possibly of exponential ones like those associativity rules can produce); and strong heuristics to capture "high-value" equivalences that are both useful and cheaper to look up than to re-derive.)
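Here is a rough sketch of the cross-build caching idea mentioned above (all names are hypothetical): key on a canonical serialization of the input term and store only the final A ==> Z extraction, falling back to the full saturate-and-extract pipeline on a cache miss. A real system would also re-validate that the cached rewrite's preconditions still hold before trusting it.

```rust
use std::collections::HashMap;

// Stand-in for a canonicalized, serialized IR fragment; a real compiler would
// hash a normalized term, not raw source text.
type Term = String;

// Placeholder for the real, expensive saturate-and-extract pipeline.
fn expensive_optimize(t: &Term) -> Term {
    t.replace("*2", "<<1")
}

struct RewriteCache {
    // input term -> best extraction found so far (the short-circuited A ==> Z)
    best_known: HashMap<Term, Term>,
}

impl RewriteCache {
    fn new() -> Self {
        Self { best_known: HashMap::new() }
    }

    // Return a cached result if we have one; otherwise run the optimizer and
    // remember the answer. A real system would also re-check that the cached
    // rewrite's preconditions still hold before applying it.
    fn optimize_cached(&mut self, term: &Term, optimize: impl Fn(&Term) -> Term) -> Term {
        if let Some(hit) = self.best_known.get(term) {
            return hit.clone(); // cache hit: skip the whole rewrite chain
        }
        let best = optimize(term);
        self.best_known.insert(term.clone(), best.clone());
        best
    }
}

fn main() {
    let mut cache = RewriteCache::new();
    // The first build pays the cost; later builds of unchanged terms are lookups.
    println!("{}", cache.optimize_cached(&"x*2".to_string(), expensive_optimize));
    println!("{}", cache.optimize_cached(&"x*2".to_string(), expensive_optimize));
}
```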
We actually take a fairly unconventional approach to e-graphs: we have a few linear passes and we do all rewrites eagerly. So we use e-graphs to provide a general framework for the fixpoint problem into which we plug all our rewrites, but we don't have the usual "apply as much CPU time as you want to get better results" property of conventional equality saturation.
I gave a talk about this approach, aegraphs (acyclic e-graphs), here: slides (https://cfallin.org/pubs/egraphs2023_aegraphs_slides.pdf), video (https://vimeo.com/843540328)
(disclosure: Cranelift tech lead 2020-2022 and main author of the e-graphs mid-end, as well as regalloc, isel and its custom DSL, and other bits, along with the excellent team)
I found the video to be great! It was informative and understandable even though I'm only casually interested in compilers and don't have a lot of real-world experience writing them.
Agreed - especially when developing, I'd assume the speed of the optimizer matters more than the quality of optimization. I wonder, though: if LLVM spent less time on that "phase ordering" problem, what would the trade-off between these two factors look like?
LLVM doesn't really spend any compile time solving the phase-ordering problem, since its pass pipelines are static. There have been proposals to dynamically adjust the pipeline based on various factors, but those are a ways off from happening.
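For what "static pipeline" means in practice, here is a conceptual sketch (not LLVM's actual pass-manager API; the `Pass` trait and pass names are made up): a static pipeline runs a fixed list of passes once in a predetermined order, while a dynamic alternative keeps re-running passes until nothing changes, trading compile time for less sensitivity to ordering.

```rust
trait Pass {
    // Returns true if it changed the "IR" (just a string placeholder here).
    fn run(&self, ir: &mut String) -> bool;
}

struct RemoveAddZero;
impl Pass for RemoveAddZero {
    fn run(&self, ir: &mut String) -> bool {
        let new = ir.replace("+0", "");
        let changed = new != *ir;
        *ir = new;
        changed
    }
}

struct MulToShift;
impl Pass for MulToShift {
    fn run(&self, ir: &mut String) -> bool {
        let new = ir.replace("*2", "<<1");
        let changed = new != *ir;
        *ir = new;
        changed
    }
}

// Static pipeline: a fixed order decided ahead of time; no compile-time search
// over orderings. This is the cheap option a static pass pipeline takes.
fn run_static(ir: &mut String, passes: &[Box<dyn Pass>]) {
    for p in passes {
        p.run(ir);
    }
}

// Dynamic alternative: keep re-running the passes until nothing changes,
// trading extra compile time for less sensitivity to pass ordering.
fn run_to_fixpoint(ir: &mut String, passes: &[Box<dyn Pass>]) {
    loop {
        let mut changed = false;
        for p in passes {
            if p.run(ir) {
                changed = true;
            }
        }
        if !changed {
            break;
        }
    }
}

fn main() {
    let passes: Vec<Box<dyn Pass>> = vec![Box::new(RemoveAddZero), Box::new(MulToShift)];
    let mut a = "x*2+0".to_string();
    run_static(&mut a, &passes);
    let mut b = "x*2+0".to_string();
    run_to_fixpoint(&mut b, &passes);
    println!("static: {a}, fixpoint: {b}");
}
```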