
Postgres Incremental Backup

theanirudh
28 replies
4d12h

A very much needed feature. Had a nightmare scenario in my previous startup where Google Cloud just killed all our servers and yanked our access. We got back access in an hour or so, but we had to recreate all the servers. At that point we were taking Postgres base backups (to Google Cloud Storage) daily at 2:30 AM. The incident happened at around 15:00, so we had to replay the WAL for a period of about 12.5 hours. That was the slowest part and it took about 6-7 hours to get the DB back up. After that incident we started taking base backups every 6 hours.

teaearlgraycold
16 replies
4d11h

Google Cloud just killed all our servers

we were taking Postgres base backups (to Google Cloud Storage)

Rule #1 of backups - do not host backups in the same location as the primary

scoot
11 replies
4d9h

That's incorrect. You definitely do want backups in the same location as production if possible to enable rapid restore. You just don't want that to be your only copy.

The canonical strategy is the 3-2-1 rule: three copies, two different media, one offsite; but there are variations, so I'd consider this the minimum.

aembleton
9 replies
4d9h

What other media should you store backups on? Tape? A paper printout?

shiroiuma
3 replies
4d9h

Either papyrus or clay tablets if you want it to last.

More seriously, perhaps the "2 different media" means don't use, for instance, the same brand and/or model of hard drive for your multiple backups.

fatihpense
2 replies
4d7h

Papyrus doesn't last :) You want clay tablets, buried in the ground. Looking at Sumerian tablets, that would give you 5000 years.

shiroiuma
1 replies
4d7h

I thought papyrus lasted a really long time, as long as you sealed it in huge stone tombs in the desert.

I think we should build a big library in a lava tube on the Moon to store all the most important data humanity has generated (important works of art and literature, Wikipedia, etc.). That's probably our best hope of really preserving so much knowledge.

tim333
0 replies
3d19h

Some lasted at least 3000 years https://www.britannica.com/topic/Ebers-papyrus

wongarsu
1 replies
4d3h

In the original version that means tape, yes. It's the point most startups skip, but it has some merit. A hacker or smart ransomware might infect all your backup infrastructure, but most attackers can't touch the tapes sitting on a shelf somewhere. Well, unless they just wait until you overwrite them with a newer backup.

EvanAnderson
0 replies
4d1h

Don't forget to test the tapes, ideally in an air-gapped tape drive. One attack scenario I posed in a tabletop exercise was to silently alter the encryption keys on the tape backups, wait for a few weeks/months, then zero the encryption keys at the same time the production data was ransomed. If the tape testing is being done on the same servers where the backups are being taken, you might never notice your keys have been altered.

(The particular Customer I was working with went so far as to send their tapes out to a third party who restored them and verified that the output of reports matched production. It was part of a DR contract and was very expensive, but, boy, the peace of mind was nice.)

scoot
1 replies
4d9h

Historically tape, but in practice these days it means "not on the same storage as your production data". For example, in addition to a snapshot on your production system (rapid point-in-time recovery if the data is hosed), keep a local copy on deduplicated storage (recovery if the production volume is hosed) and an offsite copy derived from replicated deltas (disaster recovery if your site is hosed).

The same principle can be applied to cloud hosted workloads.

tetha
0 replies
4d7h

As an example, for postgres, we have:

Backups on a pgbackrest node directly next to the postgres cluster. This way, if an application decides a good migration would include TRUNCATE and DROP TABLE or terrible UPDATEs, a restore can be done in some 30-60 minutes for the larger systems.

This dataset is pushed to an archive server at the same hoster. This way, if e.g. all our VMs die because someone made a bad change in terraform, we can relatively quickly restore the pgbackrest dataset from the morning of that day, usually in an hour or two.

And this archive server both mirrors and is mirrored by archive servers at entirely different hosters, geographically far apart. This way, even if a hoster cuts a contract right now without warning, we'd lose at most 24 hours of archives, which can be up to 48 hours of data (excluding things like offsite replication for important data sets).
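
(To make the first tier concrete, a minimal pgbackrest sketch might look like the one below. The stanza name, paths, and retention are invented for illustration rather than taken from the setup above, and a dedicated backup node would additionally set pg1-host so it can reach the database server.)

    # /etc/pgbackrest/pgbackrest.conf (illustrative values only):
    #   [global]
    #   repo1-path=/var/lib/pgbackrest
    #   repo1-retention-full=2
    #
    #   [main]
    #   pg1-path=/var/lib/postgresql/16/main

    # Periodic full backups plus cheaper incrementals in between:
    pgbackrest --stanza=main backup --type=full
    pgbackrest --stanza=main backup --type=incr

    # Restore after a bad migration (with the cluster stopped):
    pgbackrest --stanza=main restore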

EvanAnderson
0 replies
4d1h

Depending on the size of your data corpus a few USB disks w/ full disk encryption could be a cheap insurance policy. Use a rotating pool of disks and make sure only one set is connected at once.

Force the attacker to resort to kinetic means to completely wipe out your data.

JeffSnazz
0 replies
4d6h

You definitely do want backups in the same location as production if possible to enable rapid restore.

This is a distant second priority to having any reliable backup at all.

ddorian43
3 replies
4d11h

The egress fees will be bigger than your db cost.

silon42
2 replies
4d10h

Yes, maybe (some kind of diff/sync could help), but this means using such a cloud is bad IT practice.

theanirudh
1 replies
4d6h

Yes, the egress fees on base backups alone were higher than the cost of the DB VMs. If we had replicated the WAL as well, it would have been way higher. In the post, the example DB was 4.3 GB, but the WAL created was 77 GB.

sgarland
0 replies
4d5h

tticvs
4 replies
4d10h

Did you have any recourse against Google Cloud? Did you ever find out why they did that?

theanirudh
2 replies
4d6h

I have forgotten the exact reason but it had something to do with not having a valid payment method. Some change on Google Cloud's end triggered it - they were billing initially with the Singapore subsidiary, and when they changed it to the India one, something had to be done from our end. We hardly got any notices, and we had around 100k USD in credits at the time. Got it resolved by reaching out to some high level executive contact we got via our investor. Their normal support is pretty useless.

idlephysicist
0 replies
3d14h

Got it resolved by reaching out to some high level executive contact we got via our investor.

Oh man that is my nightmare. Nothing says "broken system" like having to circumvent the system to get something done.

booi
0 replies
4d

i'm surprised the solution here isn't... moving out of google cloud. that is terrible

shawabawa3
0 replies
4d5h

I've read about this happening a lot with google cloud

If your payments fail for whatever reason google will happily kill your entire account after a few weeks with nothing other than a few email warnings (which obviously routinely get ignored)

zilti
3 replies
4d8h

We simply take incremental ZFS snapshots
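
(For illustration, that approach usually looks something like the sketch below; the dataset and host names are made up, and pg_wal needs to live in the same atomic snapshot as the rest of the data directory for the copy to be crash-consistent.)

    # Take an atomic snapshot of the dataset holding the data directory.
    zfs snapshot tank/pgdata@2024-01-15

    # Ship only the delta since the previous snapshot to another machine.
    zfs send -i tank/pgdata@2024-01-14 tank/pgdata@2024-01-15 \
      | ssh backup-host zfs receive backup/pgdata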

WolfOliver
2 replies
4d6h

do you need to stop the db for the backup in order to ensure consistency of the snapshot?

sgarland
0 replies
4d5h

Nope, CoW is wonderful. Postgres will start up in crash recovery mode if you recover from a snapshot, but as long as you don’t have an insane amount of WAL to chew through, it’s fine.

kevincox
0 replies
4d6h

You shouldn't because a filesystem snapshot should be equivalent to hard powering off the system. So any crash-safe program should be able to be backed up with just filesystem snapshots.

There will likely be some recovery process after restoring / rollback as it is effectively an unclean shutdown but this is unlikely to be much slower than regular backup restoration.

winrid
1 replies
4d9h

wow! how big was the WAL? what kinda IOPS/disks are you using?

theanirudh
0 replies
4d6h

Don't remember the size, but the disk we were using had the highest IOPS available on Google Cloud. That was one of the reasons why we had to restore from GCS, since these disks wouldn't persist if the VM shut down. I think it's called Local SSDs [0]. We were aware of this limitation and had 2 standbys in place, but we didn't ever consider the situation where Google Cloud would lock us out of our account without any warning.

0 - https://cloud.google.com/compute/docs/disks/local-ssd

MuffinFlavored
10 replies
4d14h

I wonder what stuff like this means in terms of electrical impact around the world at scale.

I'm being dramatic obviously but like... less CPU cycles, less network bandwidth, etc. etc.

If adopted by the top 100 "large" database operations, is it as much as, like, 100 people giving up their gas/petrol/diesel cars?

peterfirefly
4 replies
4d14h

flockonus
3 replies
4d12h

Glad I've learned about this fascinating effect, but it doesn't seem reasonable that such a niche improvement on Postgresql backup would have an impact on the price of electricity or another resource.

fastball
2 replies
4d12h

No, but you could imagine a situation where a feature added to a database makes people more likely to use that database, even in places where maybe a "lighter weight" database (e.g. SQLite) would've been good enough, therefore increasing electricity usage overall.

peterfirefly
0 replies
4d7h

That + many more people will take backups (and most of those will be unnecessary).

flockonus
0 replies
3d22h

Still, imho, it's hardly at a scale where the effect would be visible, if you consider how many servers exist vs. the electricity demand reduction.

One example I'd find reasonable: perhaps the Apple Silicon M family has a shot at it, taking the energy draw difference (less than half of Intel's equivalent) multiplied by its adoption at scale... maybe that.

A more straightforward one that definitely has a chance of achieving the effect: LED lightbulbs. Huge consumption reduction, huge scale.

mosselman
1 replies
4d11h

I did a thought experiment with made up numbers. I just woke up so forgive me if my numbers don’t add up. At least they can serve as ballpark numbers.

Let’s say a full backup takes 3 hours and an incremental one 3 minutes.

Let’s say I want at most 3 hours of lag in my backups. That would mean 3 hours of backing up every 3 hours. That is about 720 hours per month for full backups, or 720 minutes (12 hours) for incremental backups. Let’s say we still do a full one once a day; then that is still about 102 hours per month in total.

Even if you do 1 incremental per hour you’d end up at 36 hours a month. If 3 minutes were true for the every-3-hours scenario, the every-hour scenario would also mean shorter backups, so maybe 1 minute each, which adds up to 12 hours.

All this to say that it might be more efficient.

scoot
0 replies
4d9h

Let’s say I want at most 3 hours of lag

AKA Recovery Point Objective (RPO)

devnonymous
1 replies
4d13h

I'm willing to bet that beyond a certain threshold there's no use for an incremental backup. You would either use (delayed) replication[1] to have an 'incremental' copy or take full backups. I don't think any large database users are rushing to use incremental backups when they have full redundancy.

[1] for example https://about.gitlab.com/blog/2019/02/13/delayed-replication...

rhaas
0 replies
4d4h

I agree that for many use cases, standbys and in particular time-delayed standbys are a better option than backups. That said, having off-line backups in tamper-proof storage is a good call, too.

fbdab103
0 replies
4d11h

I think about this every time I see my corporate laptop spinning up because of bloated malware detection software, Teams, OneDrive, etc all seemingly unoptimized and burning resources across hundreds of millions of machines worldwide.

lucw
4 replies
4d14h

question: how does pgbackrest do incremental backup right now if postgres doesn't yet support it?

pilif
1 replies
4d14h

By keeping the whole WAL.

The incremental backup would be just the WAL segments excluding the base backup.

With this new feature, the WAL segments you need to store are much, much smaller.

aflukasz
0 replies
4d7h

By keeping the whole WAL.

Not exactly. Pgbackrest can do incremental backups that are restored by applying file-system-level diffs (stored in the incremental backup) to the base backup. WALs are only involved at the very last step, when the restore process is finalized by applying the WALs that were generated during the backup step itself. But that's not a pgbackrest limitation, that's how any Postgres restore process works.

EDIT: by "how any Postgres restore process works" I mean any restore of a backup taken from a live database that is being written to while the backup is taking place.

brand
1 replies
4d14h

pgbackrest works purely at the file level, by looking at checksums. It’s rather primitive by comparison.

aflukasz
0 replies
4d7h

At filesystem level, yes, but worth noting that since v2.46 (from May 2023) it can also work with block level granularity: https://pgbackrest.org/configuration.html#section-repository...

lfittl
2 replies
4d11h

(OP here, happy to see this on HN!)

If you're interested in this topic, Robert Haas (the author/committer of the feature) also wrote two posts after I published this, which talk more about incremental backup:

http://rhaas.blogspot.com/2024/01/incremental-backup-what-to...

http://rhaas.blogspot.com/2024/01/incremental-backups-evergr...

skrause
1 replies
4d10h

A side note: Your RSS feed https://pganalyze.com/feed.xml misses most of your blog articles.

I'm actually subscribed to your blog in my RSS reader, but I completely missed this article (and many others) when it was published because it didn't show up in the RSS feed.

lfittl
0 replies
4d10h

Good point - it's set up this way since we intentionally don't syndicate the 5mins of Postgres episodes to Planet Postgres, but I could see the benefit of having a complete feed that includes everything, for those subscribing via RSS readers. Will see what I can do :)

genman
1 replies
4d15h

To summarize: this is a possible new feature for the upcoming release. No, you don't have only full backups with Postgres right now - it is possible to perform incremental backups using the Write-Ahead Log (WAL), but with WAL you have to replay all the changes one by one, while the new feature is a real diff, meaning that backups will be smaller and faster to apply.
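
(For anyone who wants to see the shape of the new workflow, here is a minimal sketch. Directory names are placeholders, and it assumes the tooling described in the post: the summarize_wal setting, pg_basebackup --incremental, and pg_combinebackup.)

    # Enable the WAL summarizer so the server tracks which blocks change.
    # In postgresql.conf:  summarize_wal = on

    # 1. Take a regular full base backup.
    pg_basebackup -D /backups/full

    # 2. Later, take an incremental backup relative to it by pointing at its
    #    manifest; only blocks changed since the full backup are copied.
    pg_basebackup -D /backups/incr1 --incremental=/backups/full/backup_manifest

    # 3. To restore, reconstruct a complete data directory from the chain.
    pg_combinebackup /backups/full /backups/incr1 -o /restores/latest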

sroussey
0 replies
4d15h

By watching which blocks the WAL changed since the last backup, it saves a list of blocks to back up in the incremental. Neat.

Now if your db works append-only, this may not be as big a savings as for someone who updates the same records a lot (as at incremental backup time it only needs to save the final contents of those blocks).

This could be used in replication topologies and scenarios. Maybe PG 18.

endisneigh
1 replies
4d15h

Incremental backup (without the useful wal summarizer in this case) is an interesting thing to try to implement from scratch.

I’ll have to read how the summarizer works specifically

scoot
0 replies
4d9h

Incremental backup (without the useful wal summarizer in this case) is an interesting thing to try to implement from scratch.

Changed block tracking is one option, or failing that, segmenting files at logical boundaries, checksumming those segments, and comparing the checksums with those of the previous backup's segments to see if a segment has changed (with a strategy for checksum collision avoidance).
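
(A toy sketch of the checksum-compare idea, using fixed 8 MiB segments rather than logical boundaries and skipping collision avoidance entirely; all paths are invented.)

    # Split one data file into fixed-size segments and hash each segment.
    mkdir -p /tmp/seg
    split -b 8M -d /data/relation-file /tmp/seg/part.
    ( cd /tmp/seg && sha256sum part.* ) > manifest.new

    # Segments whose hash line differs from the previous run's manifest are
    # new or changed and need to be copied into this incremental backup.
    comm -13 <(sort manifest.old) <(sort manifest.new)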

donor20
1 replies
4d15h

Fantastic! I’m assuming you can wal replay on the incremental to get right where you want

Twisell
0 replies
4d13h

I don't think so, because the smaller size is achieved by ignoring (= summarizing) changes between snapshots.

To do WAL replay (= Point In Time Recovery, PITR) you'll still need the complete WAL files, which take up more space (and which have already been available in PG for something like ten years, I think).

PS: Usually you'll also take a full pg_basebackup snapshot on a regular basis, to avoid replaying too much and to discard older snapshots.

So maybe the gain would be to combine these processes: keep WAL for short-term PITR, but apply incremental backups to the snapshot to reduce long-term backup size.
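
(A rough sketch of what that combination could look like, assuming the new pg_combinebackup plus a conventional WAL archive; the paths and target timestamp are placeholders.)

    # Rebuild a complete data directory from the full + incremental chain.
    pg_combinebackup /backups/full /backups/incr1 -o /restores/pitr

    # Then replay archived WAL up to the desired point in time.
    # In /restores/pitr/postgresql.conf:
    #   restore_command = 'cp /wal-archive/%f "%p"'
    #   recovery_target_time = '2024-01-15 14:30:00'
    touch /restores/pitr/recovery.signal
    pg_ctl -D /restores/pitr start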

Anyway, this is a really cool new option!

timetraveller26
0 replies
4d5h

I use borg for my daily backups, which saves me a ton of space; nevertheless, this is a very welcome feature!

feverzsj
0 replies
4d8h

So it's just manual replication. Maybe just use async replication in the first place.

aeyes
0 replies
4d16h

Awesome feature and a great demo as well.

I have been missing this for the last... 15 years. I hope that it will make it into the final release.

I wonder how much time combining the backups will take if you have 100 or 500 GB; the tests didn't address this. I might have to spin up an instance to try it out.