
I wrote a new JIT compiler for PostgreSQL

pinaraf
12 replies
5d5h

Author here. Thanks for submitting my article on hackernews. I'll do my best to answer any question.

winternewt
2 replies
5d2h

Is there a fundamental difference between copy-and-patch with C and what compilers do when they target intermediate representations? It seems to me that traditional compilation methods are also "copy and patch", just with an intermediate language other than C.

tetha
1 replies
5d

I think conceptually, there is no real difference. In the end, a compiler outputting machine code uses very small stencils, like "mov _ _", which are rather simple to patch.

Practically, though, it's an enormous difference: the copy-and-patch approach reuses the years of work that went into clang/gcc for platform support, per-platform optimizations, and so on. The approach enables a much larger pool of people ("people capable of writing C" vs. "people capable of writing assembly / machine code") to implement very decent JIT compilers.
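
For illustration, a stencil written in C can look roughly like this; the extern symbols are the "holes", and the JIT later memcpy's the compiled machine code and patches their relocations with query-specific values (a sketch with made-up names, not the article's actual code):

    extern char copyjit_hole_const;        /* hole: patched with a constant       */
    extern long copyjit_hole_next(long);   /* hole: patched with the next stencil */

    long stencil_add_const(long value)
    {
        /* &copyjit_hole_const shows up as a relocation in the object file;
           the JIT overwrites it with the actual constant at patch time */
        value += (long)&copyjit_hole_const;
        /* tail call into the continuation; the call target is also a relocation */
        return copyjit_hole_next(value);
    }

Compile that once with optimizations, keep the object code and its relocations, and at run time all that's left is a memcpy plus a few patches.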

pinaraf
0 replies
4d19h

The real difference is in the possible optimizations. If you consider the full scope of JIT compilation in, for instance, a web browser or the JVM, you could use copy-and-patch as a tier-0 compiler and, once really hot paths are identified, trigger a complete compiler with all the optimizer steps. Some optimizations are more complicated to implement with copy-and-patch, especially if you can't use all the tricks described in the paper (for instance they use the ghccc calling convention to get much finer register allocation, but from the documentation I don't think that's going to work for PostgreSQL).

But as you say, yes, this opens the door to people who can write C and read assembly (unless you're perfect and never need to step through your compiled code in gdb), and it makes the job so much faster and easier. Writing several machine-code emitters is painful, and implementing the required optimization strategies for each ISA is quickly out of reach.

pgaddict
2 replies
5d5h

Would be a great topic for pgconf.eu in June (pgcon moved to Vancouver). Too bad the CfP is over, but there's the "unconference" part (but the topics are decided at the event, no guarantees).

mattashii
1 replies
5d3h

Did you mean pgconf.dev in May (which has the unconference), or pgconf.eu in October (which doesn't have an unconference, but the CfP will open sometime in the - hopefully near - future)?

pgaddict
0 replies
5d2h

Yeah, I meant May. Sorry :-( Too many conferences around that time, I got confused.

That being said, submitting this into the pgconf.eu CfP is a good idea too. It's just that it seems like a nice development topic, and the pgcon unconference was always a great place to discuss this sort of stuff. There are a couple more topics in the JIT area, so having a session or two to talk about those and how to move that forward would be beneficial.

o11c
1 replies
4d12h

Is copy-and-patch really a new idea, or just a new name for an old idea?

When I learned programming (and interpreters particularly) around 2010, I thought it was well-known that you could memcpy chunks of executable code that your compiler produced if you were careful ... the major gotcha was that the NX bit was just starting to take off at the time (Even on Linux, most people still assumed 32-bit distros and might be surprised that their CPUs even supported 64-bit. At some point I ended up with a netbook that didn't support 64-bit code at all ...).
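
With NX in the picture, the trick just needs an explicit executable mapping; on Linux it's roughly something like this (a sketch, error handling omitted):

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>

    typedef long (*jit_fn)(long);

    /* Copy pre-built machine code into a fresh mapping, then flip the mapping
       to executable before handing back a callable pointer. */
    static jit_fn install_code(const unsigned char *code, size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memcpy(buf, code, len);
        mprotect(buf, len, PROT_READ | PROT_EXEC);
        return (jit_fn)buf;
    }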

Unfortunately I ended up spending too much time on the rest of the code to actually look deeply enough into it to build something useful.

can3p
1 replies
5d4h

Nice post, thanks! Do I read it right that using JIT results in the worst max times? What could be the reason, in your opinion?

pinaraf
0 replies
5d4h

Two parts: I ran the benchmark on a laptop and didn't spend enough time pinning its runtime power management to a fixed state; I'll run a real pgbench on my desktop once I've implemented all the opcodes it requires. And since JIT compilation has a minimum cost (about 300µs in my tests), on such short runtimes that overhead can quickly outweigh the benefits.

frizlab
0 replies
5d5h

Not a question, but I love this. I’m eager to see its evolution.

aengelke
0 replies
4d4h

Thanks for the blog post, it's always nice to see performance improvements in Postgres! I'm curious: how much time is spent in LLVM on some real queries, and how is LLVM configured (i.e., which passes, which back-end optimization level, etc.)? In our experience [1], LLVM can be reasonably fast when tuned for compile time, with no IR optimizations and the -O0 back-end pipeline, but it is obviously still 10-20x slower compared to other approaches.

Also, in our experience, copy-and-patch-generated code tends to be quite slow and hard to optimize (we tried some things [2, Sec. 5], but there's still quite a gap; see Fig. 3 for a database evaluation). Do you have some numbers on the run-time slowdown compared to LLVM? Any future plans for implementing multi-tiering, i.e., dynamically switching from the quickly compiled code to LLVM-optimized code?

[1]: https://home.in.tum.de/~engelke/pubs/2403-cgo.pdf [2]: https://home.in.tum.de/~engelke/pubs/2403-cc.pdf

adzm
10 replies
5d4h

I'm still surprised there isn't a query/plan cache for PostgreSQL. I could easily see these two approaches working in harmony once there is one: frequent queries could end up being cached and more aggressively optimized, with the cache offsetting the compilation cost. Of course that adds a whole new layer of complexity and trouble.

williamdclt
3 replies
5d4h

(The article goes a bit over my head, so my apologies if I'm a bit off-topic.)

There is a form of query plan caching in PG: for prepared statements, if PG determines that the actual parameter values won't affect the query plan much, it uses a "generic plan" and reuses the same plan for every execution of the prepared statement (https://www.postgresql.org/docs/current/sql-prepare.html, see "Notes").

pinaraf
1 replies
5d4h

Indeed, and right now it's the only possible way, since it stays within a single session; doing otherwise would be very hard.

cbsmith
0 replies
5d1h

Unless you count stored procs...

SigmundA
0 replies
5d3h

Yes, it's manual and per-session. DBs like MSSQL have that as well, but it's very rarely used anymore because MSSQL got automatic plan caching about 20 years ago, which basically eliminates any advantage to manually preparing. It's actually better, since the cache can be shared across all sessions.

pinaraf
2 replies
5d4h

Honestly I thought the same as you, then I wrote this, and now I understand it's going to be really hard to do. To put it very simply: there are pointers to query parts "leaking" everywhere across the execution engine. Removing them will require a significant overhaul of the execution engine, the planner, and who knows what else. Even in a single session, two compiled queries will have different compiled code because of that (both LLVM and my copyjit have to inject the addresses of various structs into the asm code).
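
Concretely, "injecting addresses" means writing them straight into the copied machine code; simplified, it's something like this (a sketch, not the actual copyjit code):

    #include <stdint.h>
    #include <string.h>

    /* After memcpy'ing a stencil into executable memory, write a runtime
       address (say, of an ExprState) into one of its holes. The offset comes
       from the relocations recorded when the stencil was built. */
    static void patch_hole(unsigned char *code, size_t hole_offset, const void *target)
    {
        uint64_t addr = (uint64_t)(uintptr_t)target;
        memcpy(code + hole_offset, &addr, sizeof addr);  /* 64-bit immediate */
    }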

adzm
1 replies
5d4h

Just going to say, I'm blown away by how simple this JIT is though. Really quite a beautiful JIT approach.

pinaraf
0 replies
5d4h

Same for me, that's why I did this after finding that research paper. With the proper compiler settings and a few small tricks you can strip out some parts and already end up faster than the interpreter (because you remove some branches and a few memory accesses), and it's even possible to create "super-stencils" covering typical opcode sequences and optimizing them further. Or the opposite, "sub-stencils", in order to do some loop unrolling for instance.
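
To give a rough idea of what a super-stencil means (made-up names, not the real executor opcodes): instead of one stencil per step, a single stencil covers a short sequence so the compiler can optimize across it:

    extern char hole_attnum;      /* hole: which column to fetch        */
    extern char hole_const;       /* hole: the constant to compare with */
    extern long hole_next(long);  /* hole: the next stencil             */

    /* Fused "fetch a column, compare it to a constant" super-stencil. */
    long stencil_fetch_then_compare(long *slot_values)
    {
        long value = slot_values[(long)&hole_attnum];
        return hole_next(value == (long)&hole_const);
    }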

aeyes
2 replies
5d3h

The plan cache on Oracle in combination with prepared statements where the optimizer can't peek into the parameters has been really problematic for me in the past. I usually had to go in and either add hints or force a plan.

Even simple queries like SELECT * FROM t WHERE x = TRUE; could turn into a nightmare depending on the distribution of the x values in the table.

With Postgres I rarely encountered such problems but I must admit that I haven't used Postgres with prepared statements.

I have seen some queries with slow planning time (>100ms) where a cache could have been useful but I don't remember ever really needing to optimize one.

SigmundA
1 replies
5d3h

Is x a parameter? Because it doesn't look like it. MSSQL has parameter sniffing and will build multiple plans based on the incoming parameters; I would be surprised if Oracle doesn't do the same. Sniffing parameters can actually be problematic sometimes, and it can be disabled with a hint!

aeyes
0 replies
5d2h

I just made up something simple but yes, I had these problems with bind variables in prepared statements.

SELECT * FROM t WHERE x = :var;

But I haven't used Oracle in years and back in the day, there was no bind variable peeking.

weliveindetail
7 replies
5d4h

There is one way to make the LLVM JIT compiler more usable, but I fear it’s going to take years to be implemented: being able to cache and reuse compiled queries.

Actually, it's been implemented in LLVM for years :) https://github.com/llvm/llvm-project/commit/a98546ebcd2a692e...

SigmundA
4 replies
5d3h

Since PG uses one process per connection and the LLVM JIT code is process-specific, the code can't be shared amongst all connections to the DB.

Plans themselves suffer from this since they are in memory data structures not designed to be shared amongst different processes.

DBs like MSSQL don't have this issue since they use a single process with threads, which is also why they can handle more concurrent connections without an external pooler. MSSQL can also serialize plans to a non-process-specific representation and store them in the DB for things like plan locking.

hans_castorp
1 replies
5d2h

Plans themselves suffer from this since they are in memory data structures not designed to be shared amongst different processes.

Oracle uses a process-per-connection model as well (at least on Linux), and they are able to share execution plans across connections. They put all the plans into the "global" shared memory.

SigmundA
0 replies
5d2h

Looks like you can change that with THREADED_EXECUTION to make it act like it does on Windows with a single process and threads:

On UNIX, starting with Oracle Database 12c Release 2 (12.2), Oracle Database can use an operating system process or an operating system thread to implement each background task such as database writer (DBW0), log writer (LGWR), shared server process dispatchers, and shared servers.

The use of operating system threads instead of processes allows resource sharing and reduces resource consumption.

On Windows, each background process is implemented as a thread inside a single, large process.

https://docs.oracle.com/en/database/oracle/oracle-database/1...

Processes on Windows are typically much more expensive than on Unix, so using threads has always been preferred to multi-process; perhaps that's why MSSQL only has that option, with an almost fully recreated internal process model that you can list and kill, etc.

Even Oracle says it helps with resource usage, even on Unix/Linux. It also looks like Oracle has had some kind of shared mode for a long time, where it basically has a built-in pooler to keep the actual OS process count down, not 1:1 like PG.

Sharing plans can obviously be done using shared memory, but it's not as simple as just creating some C++ object model (which I believe is what PG has internally) for the plan; it needs a process-agnostic data format that is then executed, probably by deserializing it into an executable model from shared memory. Fully JITted code is even trickier than just a set of logical plan operations. With threads you just share executable code.

hinkley
0 replies
4d22h

Way back on Oracle 9i, we had a mystery stall problem. We couldn’t saturate the network, the CPU, or the fiber channel links. We were stuck at ~50% and stumped. Some fuckery was going on and we had to call in a professional.

Turned out 9i could only run queries that currently resided in the query cache, and some idiot (who was now my boss) had fucked up our query builder code so that we were getting too many unique queries. Not enough bind variables.

So it’s clear Oracle was using a shared cache back then, but like other people here, I’m scratching my head how this would work with Postgres’s flavor of MVCC. Maybe share query plans when the transaction completes?

I feel like that would get you 90% of the way but with some head of queue nastiness.

asah
0 replies
4d4h

naive q: could one process perform the compilation, store the output in postgresql shmem, then have each pg worker copy that code locally?

Can plans be shared? I might be mistaken, but I thought each worker backend can (e.g.) be assigned to a different partition in a partitioned table, with very different table indexes, statistics, etc.

pinaraf
1 replies
5d4h

Yeah, well, sorry, I should have been more explicit here: the issue is with PostgreSQL, not LLVM. The JIT compiler has to inject direct memory addresses, making the generated code specific to your query and process.

weliveindetail
0 replies
4d20h

Interesting, because we store relocatable objects. And process symbols can be resolved by name if you really want. It might be yet another performance trade-off though.

miohtama
0 replies
4d20h

Back in the 386 era, there was a concept of self-modifying code (in assembly). Similar to the stencils presented here, but because the code was a singleton, a copy was rarely made.

E.g. Doom on DOS used this optimisation technique, because otherwise you could not cram enough performance out of the tight rendering loops on old CPUs.

HackerThemAll
0 replies
3d21h

Interesting work, please continue it. Great times ahead. Thanks!