SQLite 3.45 released with JSONB support

jkljsfdasdf
21 replies
1d20h

Embarrassing question tbh but with all the cloud-native sqlite stuff like cloudflare d1 and fly LiteFS I'm seriously thinking of switching from postgres to sqlite.

Does anyone have a compare/contrast sort of thing between the two?

sgarland
5 replies
1d15h

For the love of all that is holy, if you do, only use STRICT tables. By default [0], SQLite will happily accept that not only can an INTEGER column store “1234” (helpfully silently casting it first), but “abcd” can also be stored in the column as-is.

There are other horrors in the link.

[0]: https://www.sqlite.org/quirks.html
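
A quick illustration of the difference (table names are made up; STRICT needs SQLite 3.37+):

    -- ordinary table: 'abcd' goes into an INTEGER column as-is
    CREATE TABLE loose (n INTEGER);
    INSERT INTO loose VALUES ('abcd');        -- succeeds; typeof(n) is 'text'

    -- STRICT table: the same insert is rejected
    CREATE TABLE tight (n INTEGER) STRICT;
    INSERT INTO tight VALUES ('abcd');        -- error: cannot store TEXT value in INTEGER column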

abhibeckert
4 replies
1d14h

I actually kinda like the fact that whatever data you write to the table will actually be written.

I semi-regularly run into a serious data loss bug that gets "fixed" with an ALTER TABLE query. Maybe converting VARCHAR to TEXT or INT to BIGINT... of course it doesn't really "fix" your problem, because the data has already been lost/truncated.

What's a real world situation where completely the wrong type could be written to a column? Especially in modern software with good type safety checks/etc to ensure you don't have malicious data inserted into your database? If I ever did have that happen... at least the data hasn't been lost. You can run a simple script to clean up the "horrific" data.

ncruces
2 replies
1d8h

I don't find a DB that losslessly stores what I told it to store regardless of types worrying at all.

So in fact, AFAIC, the misfeature of SQLite is not that it's typeless; rather it's that it has this notion of NUMERIC affinity that's anything but lossless.

E.g. SQLite has a decimal extension that allows you to work with decimal numbers represented as TEXT, which makes it appropriate for handling money without rounding issues. However, if you have a column whose declared type is DECIMAL, MONEY, NUMBER, NUMERIC or whatever, it will have NUMERIC affinity. Then if you store a textual decimal number to it, SQLite will decide it looks like a FLOAT and convert it, losing precision.

Your only solution is to use BLOB affinity (declare no type), which is what I do, most of the time.
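
A small sketch of the affinity trap being described here (table and column names are made up):

    -- DECIMAL gets NUMERIC affinity, so a textual decimal is coerced to a float
    CREATE TABLE prices (amount DECIMAL);
    INSERT INTO prices VALUES ('1.0000000000000001');
    SELECT amount, typeof(amount) FROM prices;    -- 1.0|real: precision silently lost

    -- no declared type means BLOB affinity, and the text is stored verbatim
    CREATE TABLE amounts (amount);
    INSERT INTO amounts VALUES ('1.0000000000000001');
    SELECT amount, typeof(amount) FROM amounts;   -- 1.0000000000000001|text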

sgarland
1 replies
1d5h

This is precisely the issue. Databases should not guess at what you want, nor be helpful and make your query work with incorrect types specified.

Schema is rigid; that’s the point. If the input is incorrect, log an error.

ncruces
0 replies
1d2h

It's a bit hard to take your objection very seriously when the following spits out the same thing in SQLite as in PostgreSQL:

    CREATE TABLE tbl (x integer, y real);
    INSERT INTO tbl VALUES ('001', ' 2.5 ');
    SELECT * FROM tbl;

    1|2.5
Numeric affinity was "invented" in SQLite to make it more compatible with PostgreSQL (et al.).

sgarland
0 replies
1d5h

What's a real world situation where completely the wrong type could be written to a column?

* App with incorrect schema definition is deployed

* App written in TS or Python has type checks disabled

* App written in JS does math, and its fun parsing system decides that 1 + “1” == “11”

I’ve seen all of these.

You can run a simple script to clean up the "horrific" data.

Depends on scale, and the tables / columns. If there are billions of rows, and the columns with incorrect data aren’t indexed… that’s a bad time.

starttoaster
4 replies
1d17h

For me, the biggest trade-offs for sqlite are just that you need to think about how you're going to store and back up the database a lot more. Specifically in container orchestration environments like kubernetes, I think sqlite presents a couple of challenges. With MySQL, you can set up a replicated database server instance outside of the cluster that you just connect to over the network, and you can use standard MySQL tools like mysqldump to back them up. Kubernetes isn't ideal for stateful workloads so that tends to be one of the more sane solutions there.

With SQLite you need to set up a persistent volume to keep the database around between container restarts, and you need to think of a clever way to then back up that sqlite database to somewhere like S3, likely using an sqlite3 command with a VACUUM statement and then an `aws s3 cp` command, which requires AWS credentials. Overall, a lot of additional work and privileges on the application container, at least in container orchestration environments. In lieu of all that, maybe you trust your persistent volume provisioner enough to try to do an online snapshot, but that always sketches me out / I don't trust the backup enough to rely on it.
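
For the VACUUM-based snapshot, the SQL side is a single statement (the path below is made up; shipping the resulting file to S3 is still a separate step with its own credentials):

    -- write a consistent copy of the live database to a new file,
    -- without taking the application offline
    VACUUM INTO '/backups/app-2024-01-15.db';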

Of course you can use a public cloud sqlite service like Cloudflare D1, but I haven't used that solution enough to say if it would be flexible enough to work with, say, an on-prem application server, or if it only works with Cloudflare workers. I'm sure I could find that out in the documentation but I've exhausted my mental stamina for the day with leafing through documentation pages.

xiaomai
3 replies
1d15h

Backing up sqlite databases is straightforward. `.backup` is a command that you use in sqlite for this purpose. Since you already have a volume for the database, you can backup to that same volume (if offloading to S3/etc is too much work).

yencabulator
0 replies
18h11m

Note that sqlite `.backup` does not back up PRAGMA values. Some people use `PRAGMA user_version` for schema versioning (not a good idea, because of this trap).

starttoaster
0 replies
1d13h

Yeah, my point was pretty specific to container environments, that using sqlite forces you to add a bunch of sqlite handling logic to your application. Whereas with MySQL (and other similar RDBMS’), you can have your application just worry about its own logic, and handle MySQL backups completely separately from it.

starttoaster
0 replies
20h38m

Since you already have a volume for the database, you can backup to that same volume (if offloading to S3/etc is too much work).

I think I just accidentally didn't see this part of your reply or something last night. But backing up a database to the same place the primary one is stored is decidedly _not_ a real backup solution. Nobody should believe that offers them any of the same forms of protection that a real backup would give them. At the absolute best, you're protected from your database getting malformed by your application server. But if you lose that volume for any reason, your backups are just gone. Imagine explaining to your boss in such a scenario that the backup solution you came up with keeps the backups on the same Linux partition as the primary running database. They would fire me. I would fire me.

mixmastamyk
4 replies
1d19h

It sounds great until you need a centralized billing database and then you might want to just stick with postgres rather than run two kinds of database.

Does anyone have ideas on how to solve that? Not to mention complicating migrations. Unfortunately sqlite-based product docs seem to end right before getting to the hard stuff. Or perhaps I missed them.

robertlagrant
3 replies
1d17h

We use Alembic migrations with SQLite. No complaints.

graemep
2 replies
1d16h

Does Alembic provide a simple solution to the limitations of SQLite's alter table?

robertlagrant
1 replies
1d8h

You're right. Sorry. I shouldn't post when I'm tired. This is listed as a limitation in Alembic, from memory.

Edit: I looked[0]. While it is a SQLite limitation, Alembic does seem to have a way to work around it.

[0] https://alembic.sqlalchemy.org/en/latest/batch.html#working-...

graemep
0 replies
1d5h

I was asking, not correcting you!

Django ORM also uses the same copy-and-move method, but it warns you against doing it on production databases.

That makes me reluctant to use SQLite for a use case it is otherwise well suited to: multi-tenant applications.

Alembic does not have such dire warnings, but still looks problematic with regard to constraints?

Glench
2 replies
1d16h

copying and pasting from a different thread:

I use SQLite/Litestream for https://extensionpay.com! Serves about 120m requests per month (most of those are cached and don't hit the db), but it's been great!

I was convinced that SQLite could be a viable db option from this great post about it called Consider SQLite: https://blog.wesleyac.com/posts/consider-sqlite

Using SQLite with Litestream helped me to launch the site quickly without having to pay for or configure/manage a db server, especially when I didn't know if the site would make any money and didn't have any personal experience with running production databases. Litestream streams to Backblaze B2 for literally $0 per month which is great. I already had a Backblaze account for personal backups and it was easy to just add B2 storage. I've never had to restore from backup so far.

There's a pleasing operational simplicity in this setup — one $14 DigitalOcean droplet serves my entire app (single-threaded still!) and it's been easy to scale vertically by just upgrading the server to the next tier when I started pushing the limits of a droplet (or doing some obvious SQLite config optimizations). DigitalOcean's "premium" intel and amd droplets use NVMe drives which seem to be especially good with SQLite.

One downside of using SQLite is that there's just not as much community knowledge about using and tuning it for web applications. For example, I'm using it with SvelteKit and there's not much written online about deploying multi-threaded SvelteKit apps with SQLite. Also, not many example configs to learn from. By far the biggest performance improvement I found was turning on memory mapping for SQLite.
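
For reference, the kind of config optimizations I'm talking about look roughly like this (illustrative values only, assuming WAL mode suits the workload):

    PRAGMA journal_mode = WAL;        -- readers no longer block the single writer
    PRAGMA synchronous = NORMAL;      -- fewer fsyncs; generally considered safe with WAL
    PRAGMA mmap_size = 268435456;     -- ~256 MB of memory-mapped I/O
    PRAGMA cache_size = -64000;       -- ~64 MB page cache (negative value = KiB)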

Happy to answer any questions you might have!

onetoo
1 replies
1d11h

Out of curiosity, have you ever needed to restore a litestream backup?

Glench
0 replies
5h38m

Nope!

graemep
0 replies
1d16h

The SQLite omitted features page is a good place to start:

https://www.sqlite.org/omitted.html

and the gotchas page it links to.

It is a lot less feature-rich than Postgres, so there are things you will miss: nothing like the range of types, I do not think it has transactional DDL (which is nice to have for migrations), and there are various other things like exclusion constraints and the different index types.

On the other hand SQLite may do all you want and not having to run and configure a separate server is a huge deployment advantage.

alfor
0 replies
1d17h

In my tests sqlite was around 10X faster than postgres.

That means that a single cheap server is capable of going very, very far in normal web workloads.

I think we often add complexity (virtualisation, cloud, a separate db server, horizontal scaling) when efficient and simple tech is able to go very, very far.

jitl
10 replies
1d22h

From the original forum post [0] announcing this improvement:

But if you modify your application to start storing JSONB instead of text JSON, you might see a 3-times performance improvement, at least for the JSON-intensive operations. JSONB is also slightly smaller than text JSON in most cases (about 5% or 10% smaller) so you might also see a modest reduction in your database size if you use a lot of JSON.

I for one am excited about these improvements (specifically the disk use reduction) since we store a lot of JSON here at Notion Labs, and we’re increasing our use of SQLite.

[0]: https://sqlite.org/forum/forumpost/fa6f64e3dc1a5d97
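
A minimal sketch of what switching to JSONB storage looks like at the SQL level (table and column names are made up):

    -- store the binary JSONB form instead of JSON text
    CREATE TABLE docs (id INTEGER PRIMARY KEY, body BLOB);
    INSERT INTO docs (body) VALUES (jsonb('{"tags": ["a", "b"], "count": 3}'));

    -- the JSON functions and -> / ->> operators accept JSONB directly,
    -- skipping the text parse on every call
    SELECT id FROM docs WHERE body ->> '$.count' > 2;

    -- convert back to readable text only at the boundary
    SELECT json(body) FROM docs WHERE id = 1;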

emptysea
9 replies
1d22h

Curious how you're using SQLite at Notion, do you have anything public?

jbverschoor
7 replies
1d20h

Probably one per workspace/tenant ?

safetytrick
6 replies
1d19h

Who is doing this and where can I read more? What are the tradeoffs?

I imagine that you get a dataset that is significantly smaller, but it is much trickier to keep a dataset in memory the way you could with MySQL.

It's like having a free implicit index on the customer (because you had to look up the sqlite db file before you could start querying).

I spend a lot of time thinking about tenancy and how to handle it. Tenancy is such a common problem.

Performance is the number one reason tickets are hard to estimate. The second in my experience is security.

Time and tenancy are the number one opportunities for SQL to just be better (I always need tenancy and my Order By or at least one constraint can typically be satisfied with time).

jbverschoor
2 replies
1d8h

Salesforce uses a (postgres) database per tenant ('org'). Imo db per tenant is the way to go for most SaaS problems.

The ease of backup/restore, the extra layer of separation, the ability to 'easily' move something off cloud -> on premises or local

safetytrick
0 replies
20h2m

I forget that this has been a very popular strategy for a long time with more traditional databases.

mhitza
0 replies
1d7h

Also it should, in theory, make it nicer to do gradual rollouts/migrations.

yawaramin
0 replies
1d15h

aomix
0 replies
1d18h

I wonder if Turso https://turso.tech/ supports that use case. They support 10k databases in the pricing tier just above the free one.

abhibeckert
0 replies
1d15h

I'm doing it, though I haven't written anything up. Happy to share my opinion though, with a bit more experience than you have.

The databases I'm working with are pretty small - ballpark 4MB of data per "tenant". So, I guess, a single large database server with half a terabyte of RAM could keep well over a hundred thousand tenants in memory at the same time (I don't have anywhere near that many tenants, so I haven't tested that... and honestly if I did have that many I'd probably split them up between different servers).

Without getting into too much detail - "tenant" isn't really a good fit for how we split them up. Our business is largely based on events that happen at a specific date, with maybe a few months of activity before that date. We have an sqlite database for each event (so ~4MB per event). Once the event passes, it's essentially archived and will almost never be accessed. But it won't actually never be accessed so we can't delete it.

I haven't run into any performance issues so far, just with regular sqlite databases on the filesystem. I expect the kernel is doing its thing and making sure "hot" databases are in RAM, as with any other frequently accessed file on the disk.

My understanding (it's a theoretical problem I haven't actually encountered...) is SQLite only really struggles when you have a bunch of simultaneous writes. Our business model doesn't have that. The most actively written table is the one where we record credit card payments... and unfortunately we don't make tens of thousands of sales per second.

If we did have that "problem" I'm sure we could allocate some of our billions of dollars per day in profits to finding a way to make it work... my gut instinct would be to continue to use SQLite with some kind of cache in front of it. All writes would go to something faster than SQLite, then be copied to SQLite later. Reads would check the write cache first, and SQLite if the cache misses.

My experience working with a single large database is you end up with a lot of stale data that is almost never needed. When a table has a hundred million rows, with indexes on multiple columns, even the simplest operation, like adding a new row, can get slow. My approach with SQLite eliminates that - I'll often have just hundreds of rows in a table and access is blazingly fast. When I need to access another database that hasn't been touched in a long time (years possibly), having to wait, what, an entire millisecond, for the SSD to load that database off the filesystem into memory isn't a big deal. No user is going to notice or complain.

Obviously that's more challenging with some data sets and if you're constantly accessing old data, those milliseconds will add up to significant iowait and things will fall over. I definitely don't use SQLite for all of my databases... but in general if you're doing enough writes for SQLite's simultaneous write performance issue to be a problem... then chances are your data set is going to get very large, very quickly, and you're going to have performance headaches no matter what database you're using.

Finding some way to divide your database is an obvious performance win... and SQLite makes that really easy.

jitl
0 replies
1d16h

Nothing public.

We’ve used SQLite in our native apps (including desktop) for years, like you’d expect. We’re considering how we could use it in the browser in a few ways now that OPFS and the ecosystem there are stabilizing. We’re also looking at some use cases server side, but not one-db-per-tenant.

I don’t think SQLite’s single-writer model would mesh well with Notion’s collaborative features. I’m actually very curious if the one-db-per-tenant concept turns out to be a good idea or a fad. To me it seems like a small app can very happily fit all their users on a single Postgres instance with much less orchestration effort, and a large app demanding of its database would hit the single-write lock thing.

If you want to know more, think about joining?? :) https://notion.so/careers or @jitl on Twitter

stabbles
7 replies
1d22h

Fix a couple of harmless compiler warnings that appeared in debug builds with GCC 16.

Some projects use -Werror, only ever test with older GCC, and builds fail with anything recent. SQLite on the other hand anticipates the new compiler warnings of GCC 3 major versions in the future, that's impressive!

mgaunard
6 replies
1d22h

GCC 16!? Are they from the future?

stefanos82
5 replies
1d21h

I'm sure they mistyped '6' over '3' from numpad; it can happen to any of us.

mdaniel
4 replies
1d19h

While I find your explanation plausible, who has the muscle memory to press "g", "c", "c", spacebar, lift hand to numpad, "1", "6(no 3 sire!)", hand back to home row?

mrcarruthers
1 replies
1d19h

It's faster for me than looking down to figure out where the numbers are on the top row

abhibeckert
0 replies
1d14h

If you used the top row more often, you wouldn't need to look down...

jwiz
0 replies
1d17h

To be fair, you can be typing "gcc" with the left hand, while the right hand moves to the numpad.

Efficiency.

bdhcuidbebe
0 replies
1d16h

I would use both my hands

simonw
5 replies
1d21h

If anyone wants to try this out on macOS here's the fastest way I've found to try a new SQLite version there: https://til.simonwillison.net/sqlite/sqlite-version-macos-py...

Short version:

    cd /tmp
    wget 'https://www.sqlite.org/2024/sqlite-amalgamation-3450000.zip'
    unzip sqlite-amalgamation-3450000.zip
    cd sqlite-amalgamation-3450000
    gcc -dynamiclib sqlite3.c -o libsqlite3.0.dylib -lm -lpthread
    DYLD_LIBRARY_PATH=$PWD python3 -c "import sqlite3; print(sqlite3.sqlite_version)"
That prints "3.45.0" for me.

If you have https://datasette.io/ installed you can then get a web UI for trying it out by running:

    DYLD_LIBRARY_PATH=$PWD datasette

mmebane
1 replies
1d18h

FWIW, this works for me with Python 3.12 from Homebrew, but not Python 3.12 from python.org. _sqlite3.cpython-312-darwin.so in Homebrew's Python appears to dynamically link /opt/homebrew/opt/sqlite/lib/libsqlite3.0.dylib, but the version in python.org's Python statically links the sqlite3 library.

EDIT: Python 3.9.6 from Xcode doesn't work either. It has _sqlite3.cpython-39-darwin.so which dynamically links /usr/lib/libsqlite3.dylib, but that dylib doesn't exist on my system, and I don't know enough about macOS internals to tell where it's coming from. The _sqlite3 so doesn't seem big enough to have it statically linked.

EDIT2: Xcode's Python works when launching via the real path instead of using the /usr/bin/python3 alias, I assume because /usr/bin is SIP-protected or something.

simonw
0 replies
1d15h

Thanks for that, I'll add a note to my TIL.

westurner
0 replies
1d16h

If you change the version, URL, and the sha256 in conda-forge/sqlite-feedstock//recipe/meta.yaml and send a PR, it should build and then deploy the latest version so that you can just `mamba install -y sqlite libspatialite sqlite-utils` without also mamba installing gcc or clang. https://github.com/conda-forge/sqlite-feedstock/blob/main/re... https://github.com/conda-forge/sqlite-feedstock/blob/main/re...

sgbeal
0 replies
1d21h

If anyone wants to try this out on macOS here's the fastest way I've found to try a new SQLite version ...

https://sqlite.org/fiddle

is always updated as part of the release process and is updated periodically between releases.

csdvrx
0 replies
1d21h

and the easiest way for those who can wait for the next update will be to get the binaries from https://cosmo.zip/pub/cosmos/bin/datasette and https://cosmo.zip/pub/cosmos/bin/sqlite3

radarsat1
5 replies
1d17h

I find it really odd that the decision was made to store ints and floats as text in JSONB. It seems to defeat a lot of use cases for it as far as I can tell. There are few solutions for storing and retrieving/querying unstructured numerical data.

malkia
2 replies
1d17h

I don't know the rationale, but I would assume exact preservation of data may be the case.

Does JSON (YAML/others) have well defined handling of ints/floats? Especially ints with more than 52 bits set?

nycdotnet
0 replies
1d17h

This sounds right. This JSONB seems to be oriented not around semantic parsing, just structural parsing. https://sqlite.org/draft/jsonb.html

Note too that JSON doesn’t really have ints or floats - the number type in JSON doesn’t specify a max size. Most implementations of course do rely on the native number types of their platform, so this implementation choice for SQLite allows them to keep a simpler implementation that sidesteps a lot of complexity that would come from deeper parsing.

See the number section on json.org https://www.json.org/json-en.html

ncruces
0 replies
1d9h

The rationale was to design a format that can serve as a (flattened) parse tree for the JSON.

JSON handling functions in SQLite took in textual JSON and mostly spat out JSON text. So their structure was: (1) parse JSON, (2) massage the in-memory representation, (3) serialize JSON.

If you can come up with a format that can serve as the in-memory representation, and persist that to disk, your functions can skip (1) and (3), and focus on (2).

Still, many times you'll need JSON text at the boundary, so making (1) and (3) fast is needed too.

Parsing and formatting numbers can also be an unnecessary expense, if you do that more often than you actually need it.

nick_
1 replies
1d16h

IIRC Sqlite stores all data types as text. It seems like a very weird decision to me.

ncruces
0 replies
1d9h

This is wrong. Integers (in SQLite columns) are stored as varints and floats as IEEE 754 doubles. Numbers in JSONB are stored as text.

yencabulator
0 replies
18h1m

It's still so weird to me that SQLite returns its internal format to callers.

    create table foo (bar);
    insert into foo (bar) values ('{"answers": [42]}');
    select jsonb_extract(bar, '$.answers') from foo;

    ;#42

rmrf100
0 replies
1d16h

This is great.

nalgeon
0 replies
1d22h

If you find the official release notes a bit dry, I've made an interactive version:

https://antonz.org/sqlite-3-45

mdaniel
0 replies
1d19h

I thought I recognized this but its submission URL was goofy; previously discussed:

JSONB has landed - https://news.ycombinator.com/item?id=38540421 - Dec 2023 (205 comments)

elpocko
0 replies
1d22h

The internal JSONB format also uses slightly less disk space than text JSON.

ckok
0 replies
1d12h

The only thing missing from the JSON support now seems to be something like a GIN index on JSONB fields, to make querying on any member of the field efficient.
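
For a single known path there is at least a partial workaround with an expression index, though unlike GIN it only covers that one expression (a sketch with made-up names, assuming a docs(body) table holding JSON):

    -- index one extracted path; queries must use the same expression to benefit
    CREATE INDEX docs_user_idx ON docs (body ->> '$.user');
    SELECT rowid FROM docs WHERE body ->> '$.user' = 'alice';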

ado__dev
0 replies
1d22h

Very welcome improvement. I overlooked SQLite for far too long relegating it to just a "toy database, not meant for real world apps". Boy was I wrong.

Retr0id
0 replies
1d19h

Trying to store JSON-like data in a way that's both compact and fast to operate on directly is a challenge. IIUC this is something SQLite has wanted to introduce for a while, but it took them some time to find a viable approach.