
SQLite: Wal2 Mode

miroljub
34 replies
7h42m

In wal2 mode, the system uses two wal files instead of one. The files are named "<database>-wal" and "<database>-wal2", where "<database>" is of course the name of the database file. When data is written to the database, the writer begins by appending the new data to the first wal file. Once the first wal file has grown large enough, writers switch to appending data to the second wal file. At this point the first wal file can be checkpointed (after which it can be overwritten). Then, once the second wal file has grown large enough and the first wal file has been checkpointed, writers switch back to the first wal file. And so on.

Looks so logical that I don't understand why WAL mode wasn't implemented like this from the get-go. Probably an optimization wrongly dismissed as premature?

Anyways, looking forward to this mode reaching general availability.
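
For context, the wal2 branch documentation selects the mode with a journal_mode pragma. A minimal Python sketch, assuming the underlying SQLite library was built from the wal2 branch (stock releases don't include it) and with "app.db" as a placeholder filename:

    import sqlite3

    # Requires an SQLite library built from the wal2 branch. Stock builds
    # silently ignore unknown journal modes, so check the returned value.
    con = sqlite3.connect("app.db")
    mode = con.execute("PRAGMA journal_mode=wal2").fetchone()[0]
    if mode != "wal2":
        raise RuntimeError(f"wal2 unsupported by this build (got {mode!r})")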

WolfOliver
11 replies
7h12m

I'm not much into the details, but I'm wondering: why doesn't it always write to "<database>-wal"? Once that's full, rename it to "<database>-wal-0" and start writing into "<database>-wal" again; once that's full again, rename it to "<database>-wal-1", and so on?

wongarsu
8 replies
6h17m

When doing backups, managing file permissions, etc., it's really convenient to only have a fixed number of known files to worry about.

Switching to WAL already makes handling SQLite databases much less convenient, since you now have three files instead of one, and need a filesystem snapshotting mechanism to reliably back them up (so you don't have one state in the database and another in the wal). Making the filenames and number of files less predictable would make that mode not worth it for many use cases.

dezgeg
4 replies
6h9m

Even the mechanism predating the original WAL (the rollback journal) uses two files. I don't think there is any way to use SQLite crash-safely with just a single file.

Besides, even if the database is a single file, it's still necessary to use filesystem snapshotting for a live backup, or you're likely to get an inconsistent copy.

kijin
3 replies
5h43m

With smallish databases, just pipe the .dump command to gzip. No need to snapshot an entire filesystem just to back up a few hundred megabytes (what I would consider "smallish"). Backup and restore speeds aren't a significant concern at those sizes, anyway, and you get human-readable backups that can be inspected with standard Unix tools.
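
For illustration, a rough Python equivalent of piping .dump through gzip, using the standard sqlite3 module's iterdump() (filenames are placeholders):

    import gzip
    import sqlite3

    # Write a compressed SQL-text backup, roughly equivalent to:
    #   sqlite3 app.db .dump | gzip > backup.sql.gz
    con = sqlite3.connect("app.db")
    with gzip.open("backup.sql.gz", "wt", encoding="utf-8") as out:
        for statement in con.iterdump():   # yields CREATE/INSERT statements
            out.write(statement + "\n")
    con.close()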

deberon
1 replies
5h30m

We would back up double-digit GB MySQL databases by piping mysqldump into gzip as well. Like you I’m sure there’s a size limit where that doesn’t work. I never found out if it was CPU or memory constrained though.

sgarland
0 replies
5h8m

Where I’m at, the DBAs shifted from pipes to temporary files. They said at the double-digit TB size, it would occasionally fail otherwise. This was with Postgres / barman / pigz, FWIW.

Never saw it myself, so I have no idea what the cause was.

maskros
0 replies
3h48m

Or use "VACUUM INTO" to get a backup that is is even faster (and easier) to restore.

sureglymop
0 replies
2h22m

You can't use a filesystem snapshotting mechanism... Even when not using wal mode, you can't just cp the .db file somewhere (in some circumstances, yes).

Instead, sqlite provides an online backup api specifically for creating backups. This also takes wal mode into account.
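
The API in question is SQLite's Online Backup API, which Python's sqlite3 module exposes as Connection.backup(). A minimal sketch (filenames are placeholders):

    import sqlite3

    # Copy a live database, including pages still sitting in the -wal file,
    # into a separate file via the online backup API.
    src = sqlite3.connect("app.db")
    dst = sqlite3.connect("backup.db")
    with dst:
        src.backup(dst)   # pages are copied incrementally; writers aren't blocked
    dst.close()
    src.close()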

lupusreal
0 replies
3h51m

need a filesystem snapshotting mechanism to reliably back them up (so you don't have one state in the database and another in the wal)

VACUUM INTO?

formerly_proven
0 replies
4h17m

Pretty much every journaling mechanism requires snapshots for reliable backups. This includes either of SQLite's modes (undo journal and WAL).

slau
0 replies
7h3m

Presumably because you don't want to keep a full directory of WALs up to infinity. This uses WAL_size*2 on the disk and keeps it from growing beyond that.

ajnin
0 replies
6h42m

This is to handle the case where transactions keep getting written to the wal file while it is being "checkpointed", preventing it from ever being deleted. You only need to alternate between two files, one checkpointed, one written to, for this.

athoscouto
10 replies
5h55m

Checkpointers do not block writers, and writers do not block checkpointers. This means that, even following the checkpoint, the wal file cannot be overwritten or deleted, and so all subsequent transactions must also be appended to the wal file. The work of the checkpointer is not wasted - SQLite remembers which parts of the wal file have already been copied into the db file so that the next checkpoint does not have to do so again

Probably because of this.

but it does mean that the wal file may grow indefinitely if the checkpointer never gets a chance to finish without a writer appending to the wal file. There are also circumstances in which long-running readers may prevent a checkpointer from checkpointing the entire wal file - also causing the wal file to grow indefinitely in a busy system.

Wal2 mode does not have this problem. In wal2 mode, wal files do not grow indefinitely even if the checkpointer never has a chance to finish uninterrupted.

I don't get how wal2 fixes the long-running reader problem though. Maybe they were just referring to the former problem?

jacoblambda
8 replies
4h31m

I don't get how wal2 fixes the long-running reader problem though. Maybe they were just referring to the former problem?

Because with a single wal file you can't checkpoint it during a read since said file may change out from under you.

With two wal files, the one you are actively appending to can be treated like in wal1 mode but the one that isn't being appended to is immutable for the time being just like the main database.

This means you can treat the actual db file and the immutable wal file together as one immutable database file with some underlying abstraction. That abstraction then allows you to perform the checkpoint operation since the abstraction can keep all that immutable data accessible in some form or another while reworking the data structure of the db file.

Then once the checkpoint is complete, the abstraction can clear the now-redundant immutable wal file, become transparent, and just present the underlying single DB file.

And now once the wal file you are actively appending to reaches a sufficient size, you "lock" that one, rendering it immutable, and switch over to appending to the cleared wal file you were previously checkpointing. With this you can now checkpoint again without blocking reads or writes.

willvarfar
4 replies
3h30m

once the wal file you are actively appending to reaches a sufficient size, you "lock" that one

so there might eventually be wal3 and wal4 files and so on?

jacoblambda
2 replies
3h13m

Nope. If the wal you are checkpointing isn't done yet, you just wait to "lock" and switch files until that operation is complete.

Checkpointing can be considered "lock free" since the operation will always eventually complete. How long it takes will depend on the wal file being checkpointed into the db but it'll eventually complete in some finite amount of time.

Because you know that any given checkpointing operation has to eventually complete, you can simply keep appending to the current "append" wal file and then tackle those changes when you finish the current checkpoint op (at which point the wal file you just finished checkpointing is free to take the appends).

athoscouto
1 replies
3h4m

Wouldn't that require the reader to finish? Making the reader block the checkpoint from wrapping up?

jacoblambda
0 replies
2h12m

Not if you abstract those reads. It's not like the application is directly opening a file handle. Instead it goes through the sqlite library which accesses not only the db and the wal file but also a non-persistent shm (shared memory) file used for maintaining a consistent view of the write ahead log.

When a reader is reading, it puts a shared lock on the specific data it is reading in the shm file. The checkpointer respects that lock and may (potentially) continue working elsewhere in the db file, slowly updating the indices for checkpointed data in the shm file.

The checkpointer won't change the underlying data that the reader has a lock on but they may have created a new location for it. When the reader is finally done reading, the checkpointer can quickly grab an exclusive lock and update the header in the shm for that data to point to the new destination (and then release said lock). Since the checkpointer never holds this lock for very long, the reader can either block when trying to get a shared lock or it can retry the lock a few moments later. Now that the header in the shm only points to the new location, the checkpointer can safely do whatever it needs to with the data in the old location.

Slowly rinse repeat this until the checkpointer has gotten through the entire write ahead log. At that point there should be no remaining references in the shm to data within the wal file.

Now the wal file can be "unlocked" and if the other wal file is large enough, it can be locked, writes switch over to the other wal, and the cycle repeats anew.

Edit: Importantly, this requires that all readers be on a snapshot that includes at least one commit from the "new" wal file. So compared with wal1, wal2 allows you to have long running readers as long as they start past the last commit of the "previous" wal file.

riwsky
0 replies
3h9m

Think of them more as “walblue” and “walgreen”

liuliu
2 replies
3h12m

But the wal also keeps "history", so one reader transaction can see x0 from wal1 while another reader can see x1 from wal1. So wal1 cannot be merged into the main db, because otherwise x0 would be lost?

jacoblambda
1 replies
1h4m

Yes that is correct. You can't checkpoint data after the oldest snapshot. But to my knowledge there's no way to force a read snapshot to continue to exist.

So while you can make multiple back to back reads that use the same snapshot, I believe there's no guarantee that the snapshot will still exist when the next read is opened unless the previous read is also still open (in which case an error is returned).

That seems to set an upper bound on how long a reader can block a checkpoint (unless the reader is intentionally staging reads to block the checkpoint).

Theoretically you could implement checkpoints that flatten everything between snapshots into single commits but the complexity and overhead probably isn't worth it given that the only real blocker for wal2 is an edge case that is nigh impossible to encounter unless you intentionally try to trigger it.

liuliu
0 replies
24m

Opening a BEGIN transaction forces reads from a particular snapshot in SQLite. There are some complications around lock upgrades in that case: BEGIN a, read, BEGIN b, read, then a write from transaction a fails because the read lock cannot be upgraded to a write lock. Other than that, in WAL mode, it will work exactly as expected:

BEGIN a, read x0
BEGIN b, write x1, END b
BEGIN c, read will return x1
Back to a transaction, read again, return x0 still.
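
A small Python sketch of the behaviour described, using two connections to a WAL-mode database (filenames and values are illustrative; isolation_level=None gives explicit BEGIN/COMMIT control):

    import sqlite3

    # Two connections to the same WAL-mode database.
    a = sqlite3.connect("demo.db", isolation_level=None)
    b = sqlite3.connect("demo.db", isolation_level=None)
    a.execute("PRAGMA journal_mode=WAL")
    a.execute("CREATE TABLE IF NOT EXISTS t(k INTEGER PRIMARY KEY, v TEXT)")
    a.execute("INSERT OR REPLACE INTO t VALUES (1, 'x0')")

    a.execute("BEGIN")                                         # transaction "a"
    print(a.execute("SELECT v FROM t WHERE k=1").fetchone())   # ('x0',) snapshot taken here

    b.execute("BEGIN")                                         # transaction "b"
    b.execute("UPDATE t SET v='x1' WHERE k=1")
    b.execute("COMMIT")

    print(b.execute("SELECT v FROM t WHERE k=1").fetchone())   # ('x1',) new snapshot sees the write
    print(a.execute("SELECT v FROM t WHERE k=1").fetchone())   # ('x0',) "a" still sees its snapshot
    a.execute("COMMIT")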

stavros
0 replies
4h48m

I suspect that having two WAL files means you can freely checkpoint the one that isn't currently being written to, but that's just a guess.

tomashubelbauer
4 replies
7h19m

Now I can't help but wonder if there should be a `waln` mode where the WAL files round-robin instead of alternating between just two, potentially allowing for a much more intense write cadence.

rmbyrro
2 replies
7h9m

That would make reading slower, since readers need to read from WAL as well.

nbevans
1 replies
6h53m

There is a .shm (shared memory) file that all SQLite connections to the database would use to determine which page ranges are in the WAL versus main DB file. So that overhead already exists when WAL is enabled.

rmbyrro
0 replies
5h8m

I'm not familiar with its inner workings, but if there were n WAL files, I'd expect the overhead to be a function of n.

avereveard
0 replies
7h8m

I'll give this thread two or three more replies before it reimplements logrotate from first principles.

forinti
3 replies
7h23m

I'm just speculating here, but in a normal database you would have different processes writing the wal files to the database or archives.

You don't have that with sqlite, so I don't see an obvious advantage for this, except if they now spawn a process or thread to do this concurrently.

Edit: so I read the doc (shame on me) and it has nothing to do with speed. Its purpose is to prevent a wal file from growing too large.

hahn-kev
1 replies
7h15m

I believe SQLite supports accessing a single database file from multiple processes or threads at once

arcbyte
0 replies
4h46m

There can be many readers simultaneously, but only ever one writer at a time.

WolfOliver
0 replies
7h11m

I guess it is to make it crash-safe. E.g. in case the server dies in the middle of a write, the database would be corrupt after restart; with the wal file it can recover.

quietbritishjim
0 replies
1h20m

It may simply have been a design oversight at the time. However, the page notes the following difference of WAL2 from WAL, which could be an issue in some cases:

In wal mode, a checkpoint may be attempted at any time. In wal2 mode, the checkpointer has to wait until writers have switched to the "other" wal file before a checkpoint can take place.

vdaea
15 replies
7h24m

In wal2 mode, the system uses two wal files instead of one. The files are named "<database>-wal" and "<database>-wal2",

Heh, I wonder how many people will delete the "wal" file thinking that, since they switched to wal2, the wal file must be a leftover.

worksonmine
9 replies
6h16m

If people just randomly delete files they don't fully understand on a production system maybe they should be bitten.

bsaul
4 replies
5h11m

in the case of sqlite though, the technology is often used as a standalone file format. So it is very tempting to consider the ".sqlite" file to be the one containing all the data, and all the rest to be temporary files that don't matter much.

IMHO this (having a variable number of files containing the data, depending on your configuration) is the only real design quirk of this technology.

resoluteteeth
1 replies
4h33m

in the case of sqlite though, the technology is often used as a standalone file format. So it is very tempting to consider the ".sqlite" file to be the one containing all the data, and all the rest to be temporary files that don't matter much.

If you're using it as a standalone file format you presumably shouldn't leave .sqlite files with associated wal files lying around in places where users are going to get confused by them, either by sticking to the rollback journal mode or by using some other method

fauigerzigerk
0 replies
3h13m

If you use sqlite as a standalone file format for an app that has user managed files then it is hard to avoid this confusion. Rollback journal mode creates a temporary file as well.

Also, a scheduled backup process might come along at any moment and non-atomically copy the database file and any -journal or -wal files.

Ideally, user visible files should survive copying at random points in time without corruption and without losing too much recent data.

Having read "How To Corrupt An SQLite Database File"[1], I'm still not quite sure how to achieve this.

[1] https://www.sqlite.org/howtocorrupt.html

worksonmine
0 replies
4h17m

The default mode for sqlite isn't WAL though so turning that on is a choice. End-users should have a safe mode of backup if they're not expected to understand the tech.

I know reading the manual isn't very common and people are lazy, but getting burned can be a useful and necessary lesson.

another2another
0 replies
4h59m

Indeed sqlite's original mission was "to be a replacement for fopen()", but as more features are being added it looks like that initial simplicity can't be maintained.

paulddraper
1 replies
3h14m

But in development...

worksonmine
0 replies
2h20m

...it can be excused because it's expected to be purged regularly and the developer may not even know databases very well.

macintux
0 replies
26m

~25 years ago I rebooted a Solaris server, only to discover that some previous person in my role apparently had, while scrounging around for things to delete to free up disk space on the root partition, removed the kernel file.

It happens.

(I'm really, really hoping the "previous" person wasn't me.)

IshKebab
0 replies
20m

"If people just randomly don't look at the road and crash then maybe they should die."

You make mistakes. Do you want them to be as painful as possible?

ceeam
3 replies
7h13m

Would it be a problem, though? The inode of the wal file you delete will still be open and will be processed when the DB closes normally. Just guessing, never tried that.

andix
2 replies
6h49m

There are cases where the wal file is not merged on shutdown of the application. I think a corrupted database can also prevent merging the wal file automatically. A corrupted database can often be repaired, but it needs to be done manually.

I've been bitten badly by that issue once. I just mounted the .db file into a docker container and didn't realize that SQLite creates wal files. On a non-graceful shutdown of the application the wal file was not merged into the db and the container was deleted. And around a day of changes were lost.

Conclusion: Sqlite databases should be placed into their own folder, so it's obvious that it's not always just one file.

ricardobeat
1 replies
6h33m

This is configurable, and for small things you might disable WAL completely.

When using WAL, if you’re copying or backing up the database it’s possible to force a checkpoint, then you can copy the .db file alone knowing exactly up to when it contains data.
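
For example, a TRUNCATE checkpoint copies every frame from the wal into the main database file and resets the wal, after which the .db file alone holds the current data. A sketch (filename is a placeholder; it can still report busy if a reader or writer is active):

    import sqlite3

    con = sqlite3.connect("app.db")
    # Returns a row of (busy, wal_frames, checkpointed_frames).
    busy, wal_frames, ckpt_frames = con.execute(
        "PRAGMA wal_checkpoint(TRUNCATE)"
    ).fetchone()
    if busy:
        raise RuntimeError("checkpoint could not complete; try again later")
    con.close()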

andix
0 replies
6h23m

I had to learn all that the hard way ;)

fbdab103
0 replies
45m

As opposed to the `-journal` file already created?

runiq
5 replies
6h12m

So it's basically double buffering, but for databases? That makes sense.

jbverschoor
3 replies
6h7m

Triple

runiq
2 replies
4h56m

Hmm, I don't think I follow—what is the third buffer?

stavros
0 replies
4h44m

The main database file, the WAL, and the second WAL. Though I do agree with you that I also think about this as double-buffering, as I think of the database file as the screen.

masklinn
0 replies
4h41m

Database actual?

The buffering analogy doesn’t really work tho, because all three sources (db file, wal being flushed, and wal being written to) are read sources.

gregw2
0 replies
1h17m

In a high-level sense, yes! But it kind of depends on which part of the design pattern you mean by "double buffering for databases".

With double-buffering (2d/3d graphics) you are literally writing the final pixel-level data to the back buffer.[1]

In a database WAL scenario, to further analogize, it's more like you are writing the 2d/3d graphics commands to the buffer and executing them later. Because that is part of the point of the WAL -- it results in reduced disk writes because only the log file needs to be flushed to disk to guarantee a transaction is committed, rather than every data file/byte(/pixel) changed by the transaction.[2][3] (The WAL content is loosely a bit more like 3D (or 2D) vertex buffer objects/display lists [4] if you are familiar with those.)

Swapping the two WAL files though and alternating writing to each is yes like double buffering.

A third similar design pattern (to WALs) is used in operating systems' journaling filesystems[5] and actually was a contribution from OSes adopting database WAL techniques back in the 1990s.

Apologies if you know all this.

[1] https://en.wikipedia.org/wiki/Multiple_buffering#Double_buff...

[2] https://www.postgresql.org/docs/15/wal-intro.html

[3] https://en.wikipedia.org/wiki/Write-ahead_logging

[4] https://en.wikipedia.org/wiki/Display_list

[5] https://en.wikipedia.org/wiki/Journaling_file_system

grogers
5 replies
4h14m

How common is an automatically expanding WAL across other DBs? I'm most familiar with MySQL which uses a fixed size WAL (unless something changed recently). That of course comes with other tradeoffs like potentially blocking writes if checkpointing falls behind. But I'm curious if SQLite is an oddity in this respect compared to other DBs. Since it is used in embedded contexts it might prefer to save less data on disk in the common case, but with the edge case that wal2 mode now fixes.

mike_hearn
4 replies
3h26m

For example, RocksDB/LevelDB have done this from the very beginning. If a WAL file gets too long then a new one is created and the old one is sorted and written to an sstable asynchronously.

ddorian43
3 replies
1h28m

and the old one is sorted and written to an sstable asynchronously

This doesn't happen.

mike_hearn
2 replies
31m

No? Am I mis-remembering? Their wiki says:

The logfile is a sequentially-written file on storage. When the memtable fills up, it is flushed to a sstfile on storage and the corresponding logfile can be safely deleted.

[...]

Background compaction threads are also used to flush memtable contents to a file on storage. If all background compaction threads are busy doing long-running compactions, then a sudden burst of writes can fill up the memtable(s) quickly, thus stalling new writes. This situation can be avoided by configuring RocksDB to keep a small set of threads explicitly reserved for the sole purpose of flushing memtable to storage.

ddorian43
1 replies
27m

The in-memory memtable gets converted to sstable.

The WAL is ONLY read after crashing, to fill a new memtable.

Your comment looked like "WAL is sorted and converted to sstable":

If a WAL file gets too long then a new one is created and the old one is sorted and written to an sstable asynchronously.

mike_hearn
0 replies
19m

Seems a bit pedantic. The memtable is (when fully flushing writes) a derivative of the WAL. Or vice-versa if you like. They hold equivalent data, organized differently (yes yes modulo tombstones). Anyway you're right, I was being lazy in not writing out memtable explicitly.

raj555
3 replies
5h40m

Tangential point: one thing that always bothered me about WAL is that it is supposed to exist to help maintain data integrity, recover from crashes, etc., but that file itself is written (committed to disk reliably) in batches and not after every change to the database, apparently to gain performance. Doesn't that defeat the purpose? How haven't things broken down despite this? Not specific to sqlite but databases in general. Never found an answer to this.

formerly_proven
0 replies
4h11m

You only need to achieve durability on a COMMIT.

eatonphil
0 replies
4h35m

You can still batch and block before returning from a request to maintain durability. This improves throughput at the expense of latency.

Since SQLite is single-writer I'm not sure if it does this. But this (batch yet block) is how I understand Postgres works.

Of course you can turn off the blocking too by setting the Postgres fsync configuration to an interval rather than synchronous.

bawolff
0 replies
5h19m

I think that depends on the setting of PRAGMA synchronous.

I'm not an expert on this, but I think the idea is to separate durability from db corruption. When synchronous = normal (instead of full), you can potentially lose (committed) data in WAL mode if a power failure happens at just the right moment; however, your database won't be corrupt. No data will be half written. Each transaction will either be fully there or fully missing.

https://www.sqlite.org/pragma.html#pragma_synchronous
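
For what it's worth, a common pairing is WAL with synchronous=NORMAL, which trades a little durability on power loss for fewer fsyncs while keeping the database uncorrupted. A sketch (filename is a placeholder):

    import sqlite3

    con = sqlite3.connect("app.db")
    con.execute("PRAGMA journal_mode=WAL")
    # NORMAL in WAL mode: commits are not individually fsync'd, so a power cut
    # can lose the most recent transactions, but the database stays consistent.
    con.execute("PRAGMA synchronous=NORMAL")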

alberth
3 replies
3h3m

Bedrock

Bedrock is the more interesting branch.

It’s WAL2 + CONCURRENT

It’s also the branch Expensify used to scale to 4M QPS on a single node (6 years ago).

https://sqlite.org/src/timeline?r=bedrock

https://use.expensify.com/blog/scaling-sqlite-to-4m-qps-on-a...

pstuart
1 replies
2h49m

That certainly sounds appealing. Do you have any insights into if and when that might land on trunk?

alberth
0 replies
2h46m

I'm not sure it ever will.

The primary use case of this branch is to make SQLite into a more "client/server"-like architecture, which deviates from the predominant target use of SQLite (embedded).

Though I too would love a client/server version of SQLite.

HALtheWise
0 replies
1m

After investigating, it looks like there are actually two separate things here: "bedrock" for accessing the sqlite database over the network and "page-level locking" for allowing concurrent write transactions.

https://sqlite.org/src/doc/754ad35c/README-server-edition.ht...

I assume you could use the latter without the former.

spiffytech
1 replies
5h35m

Wal2 mode was included in the benchmarks for the HC-tree backend research:

https://sqlite.org/hctree/doc/hctree/doc/hctree/threadtest.w...

pstuart
0 replies
2h40m

Yikes! WAL2 + begin-concurrent = minimal TPS. I was expecting the opposite.

cjs_ac
1 replies
6h39m

As far as I can tell, this feature has not yet been released.

sgbeal
0 replies
6h5m

As far as I can tell, this feature has not yet been released.

It's been off-trunk since its inception in Oct. 2017 and there's been no discussion within the project of merging it into trunk (why that is, I cannot speculate). It is actively maintained for use with the bedrock branch, as can be seen in the project's timeline:

https://sqlite.org/src/timeline?r=wal2

yashthakker
0 replies
6h17m

Great information! I'm looking forward to seeing this mode in action. Do you know when it will be available for general use?

no_wizard
0 replies
49m

I wonder how this will affect all the new-ish distributed SQLite systems like Litestream.

maxbond
0 replies
5h8m

Very similar to the left-right primitive.

https://docs.rs/left-right/latest/left_right/

My understanding is that this technique is older than the linked implementation (though independently rediscovered), but notably, this implementation was written to support a different high concurrency SQL database (for some definition of that) called Noria.

magnat
0 replies
6h37m

Microsoft SQL Server uses a similar architecture [1], but instead of using separate log files, it allocates Virtual Log Files (VLFs) inside a physical (on-disk) log file. VLFs are allocated from a ring buffer and apparently there can be several thousand of them before things start to break.

[1] https://learn.microsoft.com/en-us/sql/relational-databases/s...

kevincox
0 replies
4h0m

I wonder how this would compare to just punching a hole in the already applied section of the WAL (1) file. It seems that this should allow the filesystem to free the disk space similarly to if the file had been deleted while allowing the writers to keep writing. I guess eventually it would become a problem as you hit max logical file size but it seems that you can likely bet on getting a chance to truncate it before hitting exabytes.