The page is the result of this exchange on Twitter:
https://twitter.com/gorilla0513/status/1784756577465200740
I was surprised to receive a reply from you, the author. Thank you :) Since I'm a novice with both compilers and databases, could you tell me what the advantages and disadvantages are of using a VM with SQLite?
https://twitter.com/DRichardHipp/status/1784783482788413491
It is difficult to summarize the advantages and disadvantages of byte-code versus AST for SQL in a tweet. I need to write a new page on this topic for the SQLite documentation. Please remind me if something does not appear in about a week.
There are three approaches:
1. interpreted code
2. compiled then interpreted bytecode
3. compiled machine code
The further up, the simpler.
The further down, the faster.
Performance analysis indicates that SQLite spends very little time doing bytecode decoding and dispatch. Most CPU cycles are consumed in walking B-Trees, doing value comparisons, and decoding records, all of which happens in compiled C code. Bytecode dispatch uses less than 3% of the total CPU time, according to my measurements.
So at least in the case of SQLite, compiling all the way down to machine code might provide a performance boost of 3% or less. That's not very much, considering the size, complexity, and portability costs involved.
A key point to keep in mind is that SQLite bytecodes tend to be very high-level (create a B-Tree cursor, position a B-Tree cursor, extract a specific column from a record, etc). You might have had prior experience with bytecode that is lower level (add two numbers, move a value from one memory location to another, jump if the result is non-zero, etc). The ratio of bytecode overhead to actual work done is much higher with low-level bytecode. SQLite also does these kinds of low-level bytecodes, but most of its time is spent inside of high-level bytecodes, where the bytecode overhead is negligible compared to the amount of work being done.
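You can see these high-level opcodes for yourself by prefixing a query with EXPLAIN in the sqlite3 shell. The listing below is only illustrative; the exact opcodes, addresses, and operands vary with the SQLite version, the schema, and the build options:

    sqlite> CREATE TABLE users(id INTEGER PRIMARY KEY, name TEXT);
    sqlite> EXPLAIN SELECT name FROM users WHERE id = 5;
    addr  opcode         p1    p2    p3    p4    p5  comment
    ----  -------------  ----  ----  ----  ----  --  -------------------
    0     Init           0     7     0           0   Start at 7
    1     OpenRead       0     2     0     2     0   root=2 iDb=0; users
    2     Integer        5     1     0           0   r[1]=5
    3     SeekRowid      0     6     1           0   intkey=r[1]
    4     Column         0     1     2           0   r[2]=users.name
    5     ResultRow      2     1     0           0   output=r[2]
    6     Halt           0     0     0           0
    7     Transaction    0     0     1     0     1
    8     Goto           0     1     0           0

Opcodes like OpenRead (open a B-Tree cursor), SeekRowid (position the cursor), and Column (extract a field from the current record) are the high-level operations described above; each one does a substantial amount of work in compiled C code, so the dispatch overhead per opcode is small by comparison.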
Looks like SQLite bytecode is similar to the IBM RPG language from the "IBM i" platform, which is used directly, without translation from SQL :)
Edit: PRG->RPG.
I'm curious. Would you have a pointer to the documentation of PRG language?
Oooops, it is IBM RPG, not PRG! My bad!
There are some links in Wikipedia.
I never used it, but only read a big thread about SQL vs RPG on a Russian-speaking forum, where there was a ton of examples from a person who works with the IBM i platform at some big bank. The basic operations look like SQLite bytecodes: open a table, move the cursor after the record with key value "X", get some fields, update some field, plus loops, ifs, basic arithmetic, etc.
For those that aren’t familiar, IBM i is ye olde AS/400.
To be fair, RPG has been continuously updated and the version I’ve seen has Java integration.
It’s somewhat amusing that a single developer using Python can run circles around entire teams of RPG programmers. Technology marches on. :)
This logic doesn't seem sound? The fact that the application now spends 3% on bytecode dispatch doesn't tell us anything about how long it would instead spend on, e.g., interpreting SQL.
He’s saying that in the best-case scenario that 3% would go to 0, so the real-world gain would probably be even less.
Unfortunately you can’t do performance analysis this way, but I think the overall point, that it’s only a small fraction, probably still stands, as you’d expect the I/O work to dominate.
The reason you can’t do the analysis the way you explicitly stated (which, to be fair, is what was implied) is that when you lower the code to machine code you typically get rid of branches in the execution path. Since branches slow down execution by a disproportionate amount relative to how long they themselves take, it’s easy to get a >3% boost by getting rid of code paths that appear to leave only 3% of room.
What the author also failed to mention is that they’ve gone on many optimization hunts, eking out less than a percent here or there to get a cumulative boost, so 3% isn’t anything to sneeze at.
That being said, the broader point, which is valid, is that the maintenance cost isn’t worth it for the performance profile that SQLite targets, not to mention that JITs open up an attack vector for security exploits. So the cost vs benefit is squarely on the interpreted side for now.
Edit: Forgot to mention that performance tuning a JIT to work well is also hard - for small queries you’ll spend more time compiling the bytecode than you would just executing it. That’s why all the big JS engines take a tiered approach where each tier optimizes a bit more.
A while back I was musing about whether it was possible to come up with something resembling the instruction set of a CPU for an abstract relational engine. Is this basically what SQLite is doing with bytecodes?
No. The SQLite bytecode is much higher-level than the instruction set of a CPU.
It is possible to construct a worst case scenario for bytecode execution, for example very complex expressions in the WHERE and/or SELECT clauses that compute values, and a query plan that performs a full table scan over a table with say 100 million rows that is cached in RAM (or maybe use generate_series, whatever works best).
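A sketch of what such a query could look like, using the generate_series table-valued function mentioned above (assuming the build includes it; the expression is arbitrary, chosen only so that expression evaluation dominates):

    SELECT sum(value*value + value/3.0 - (value+7)*(value-7))
      FROM generate_series(1, 100000000);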
Computing the expressions should dominate execution time, right?
Then, to compare against the best possible case, we can write a custom C program that uses SQLite internals to perform the same task (full scan of the table, extract values from each row), does not use the bytecode VM, and computes the complex expressions in regular C code (e.g. a function that accepts floats and returns a float, or whatever).
Then comparing the two implementations will tell us how much faster SQLite could be if it had a "perfect" JIT.
Part of the benefit of compiling bytecode (or anything) is specializing code to the context (types, values, etc.) in which it appears. While I don't doubt your analysis, it could be the case that the compiled C code in question is full of branches that could be folded away when specialized to the context of the query, such as the structure of the rows or the types and values of columns.
Basically all of what you are saying about high-level bytecodes applies to dynamic languages, too. But they benefit greatly from specializing each bytecode given static and dynamic context, and from shortcutting dataflow through local variables.
That's right. "Bytecode" is a spectrum. SQLite's bytecode is higher-level than e.g. WASM, JVM, Python, etc. (Notably, because the original source code is higher-level.)
4. lots of small building blocks of static machine code, precompiled and shipped with the DB software binary, later iterated and looped through based on the query plan the optimizer came up with. Oracle does this with their columnar/vector/SIMD-processing In-Memory Database option (it's not like LLVM, as it doesn't compile/link/rearrange the existing binary building blocks; it just jumps and loops through the existing ones in the required order)
Edit: It's worth clarifying that the entire codebase does not run like that, not even the entire plan tree - just the scans and tight vectorized aggregation/join loops on the columns/data ranges that happen to be held in RAM in a columnar format.
You might call those interpreter operations.
Isn't what you're describing just an interpreter, bytecode or otherwise? An interpreter walks through a data structure in order to determine which bits of precompiled code to execute in which order and with which arguments. In a bytecode interpreter the data structure is a sequential array, in a plain interpreter it's something more complicated like an AST.
Is this just the same as "interpret the query plan" or am I missing something?
That’s a really cool idea! Are there any writeups or articles about this in more detail?
You can also compile to bytecode (when building), and then compile that bytecode to the machine code (when running). That way, you can take advantage of the exact processor that is running your program.
This is the approach taken by .NET and Java, and I presume most other runtime environments that use bytecode.
There is also the option of native code generation from bytecode at install time as opposed to at runtime.
Aka missing out on all the delicious profiling information that approaches with later translation enjoy. There's no simple best answer.
I'm probably wrong, but I always thought that for platforms like .NET and the JVM, AOT delivers the same performance as JIT in most cases, since a lot of information is available before runtime, unlike JS, where the VM always needs to be ready to ditch the optimized bytecode and go back to the whiteboard.
For simple functions which are not called repeatedly you have to invert this list - interpreted code is fastest and compiling the code is slowest.
The compiled code would still be faster if you excluded the compilation time. It's just that the overhead of compiling it is sometimes higher than the benefit.
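Roughly, if t_compile is the one-time cost of compiling, t_exec the per-call cost of the compiled code, and t_interp the per-call cost of interpreting (names here are just for illustration), compiling only pays off once

    n_calls * t_interp  >  t_compile + n_calls * t_exec
    i.e.  n_calls  >  t_compile / (t_interp - t_exec)

so for code that only runs once or twice, the interpreter usually wins.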
I wonder: in actual business scenarios, isn't the SQL fully known before an app goes into production? So couldn't it make sense to compile it all the way down?
There are situations where an analyst enters SQL into the computer interactively. But in those cases the overhead of compiling does not really matter, since this is done infrequently and there is only a single user asking for the SQL to be run.
There is also JIT bytecode.
Yes, the timing of that compilation of bytecode and machine code can be either AOT or JIT.
For example, Java compiles source to bytecode AOT, and the JVM compiles bytecode to machine code JIT.
APL interpreters are tree-walking, but they're backed by C procedures. Then GCC optimizes quite well and you get excellent performance! With stuff like vector instructions.
Getting on par with GCC/Clang with your own JIT is a pretty hefty undertaking.
You can also parse the code into an augmented syntax tree or code DOM and then directly interpret that. This approach eliminates the intermediate code and bytecode machine at the cost of slower interpretation. It's slower due to memory cache and branch prediction issues rather than algorithmic ones.
Optimizing runtimes (usually bytecode) can beat statically compiled code because they can profile the code over time, like the JVM does.
... which isn't going to close the cold startup execution gap, which all the benchmarks/lies/benchmarks will measure, but it is a legitimate thing.
I believe Intel CPUs actually sort of are #2. All your x86 is converted to micro-ops, sometimes optimized on the fly (I remember Intel discussing "micro-op fusion" a decade or so ago).
I am amazed that the author (D. Richard Hipp) made an effort to find and respond to a tweet that was neither (1) directed/"@tted" at him nor (2) written in his native language of English (the original tweet is in Japanese[1]).
[1] https://twitter.com/gorilla0513/status/1784623660193677762
He might have a Google Alert for sqlite.org.