A very much needed feature. Had a nightmare scenario in my previous startup where Google Cloud just killed all our servers and yanked our access. We got access back in an hour or so, but we had to recreate all the servers. At that point we were taking Postgres base backups (to Google Cloud Storage) daily at 2:30 AM. The incident happened at around 15:00, so we had to replay the WAL for a period of about 12.5 hours. That was the slowest part, and it took about 6-7 hours to get the DB back up. After that incident we started taking base backups every 6 hours.
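For anyone who hasn't done it, the replay step is roughly the following. A minimal sketch for PostgreSQL 12+; the bucket name, layout and gsutil-based restore_command are illustrative assumptions, not exactly what we ran:

    # Sketch of the point-in-time-recovery flow: restore the 2:30 AM base
    # backup, then let Postgres replay the archived WAL up to the incident.
    # Bucket name, paths and the gsutil restore_command are placeholders.
    import pathlib
    import subprocess

    PGDATA = pathlib.Path("/var/lib/postgresql/data")
    BUCKET = "gs://example-db-backups"  # hypothetical GCS bucket

    # 1. Pull the latest base backup into an empty data directory.
    subprocess.run(
        ["bash", "-c", f"gsutil cat {BUCKET}/base/latest.tar.gz | tar -xz -C {PGDATA}"],
        check=True,
    )

    # 2. Tell Postgres how to fetch archived WAL segments during recovery.
    with open(PGDATA / "postgresql.auto.conf", "a") as conf:
        conf.write(f"restore_command = 'gsutil cp {BUCKET}/wal/%f %p'\n")

    # 3. recovery.signal makes the server start in archive recovery mode.
    (PGDATA / "recovery.signal").touch()

    # 4. Start the server. Replaying ~12.5 hours of WAL is the slow part;
    #    more frequent base backups shrink exactly this window.
    subprocess.run(["pg_ctl", "-D", str(PGDATA), "start"], check=True)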
Rule #1 of backups - do not host backups in the same location as the primary
That's incorrect. You definitely do want backups in the same location as production if possible to enable rapid restore. You just don't want that to be your only copy.
The canonical strategy is the 3-2-1 rule: three copies, two different media, one offsite; but there are variations, so I'd consider this the minimum.
What other media should you store backups on? Tape? A paper printout?
Either papyrus or clay tablets if you want it to last.
More seriously, perhaps the "2 different media" means don't use, for instance, the same brand and/or model of hard drive for your multiple backups.
Papyrus doesn't last :) You want clay tablets, buried in the ground. Looking at Sumerian tablets that would give you 5000 years.
I thought papyrus lasted a really long time, as long as you sealed it in huge stone tombs in the desert.
I think we should build a big library in a lava tube on the Moon to store all the most important data humanity has generated (important works of art and literature, Wikipedia, etc.). That's probably our best hope of really preserving so much knowledge.
Some lasted at least 3000 years https://www.britannica.com/topic/Ebers-papyrus
In the original version that means tape, yes. It's the point most startups skip, but it has some merit. A hacker or smart ransomware might infect all your backup infrastructure, but most attackers can't touch the tapes sitting on a shelf somewhere. Well, unless they just wait until you overwrite them with a newer backup.
Don't forget to test the tapes, ideally in an air-gapped tape drive. One attack scenario I posed in a tabletop exercise was to silently alter the encryption keys on the tape backups, wait a few weeks/months, then zero the encryption keys at the same time the production data was ransomed. If the tape testing is being done on the same servers where the backups are being taken, you might never notice your keys have been altered.
(The particular Customer I was working with went so far as to send their tapes out to a third party who restored them and verified that the output of reports matched production. It was part of a DR contract and was very expensive but, boy, the peace of mind was nice.)
Historically tape, but in practice these days it means "not on the same storage as your production data". For example: in addition to a snapshot on your production system (rapid point-in-time recovery if the data is hosed), keep a local copy on deduplicated storage (recovery if the production volume is hosed) and an offsite copy derived from replicated deltas (disaster recovery if your site is hosed).
The same principle can be applied to cloud hosted workloads.
As an example, for postgres, we have:
Backups on a pgbackrest node directly next to the postgres cluster. This way, if an application figures a good migration would include TRUNCATE and DROP TABLE or terrible UPDATEs, a restore can be done in some 30-60 minutes for the larger systems.
This dataset is pushed to an archive server at the same hoster. This way, if e.g. all our VMs die because someone made a bad change in Terraform, we can relatively quickly restore the pgbackrest dataset from the morning of that day, usually in an hour or two.
And this archive server mirrors, and is mirrored by, archive servers at entirely different hosters, geographically far apart. This way, even if a hoster cuts a contract right now without warning, we'd lose at most 24 hours of archives, which can be up to 48 hours of data (excluding things like offsite replication for important data sets).
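The tier-1 restore itself is basically one pgbackrest invocation. A minimal sketch, with the stanza name, data directory and recovery target as placeholders rather than our real values:

    # Sketch of the "restore from the local pgbackrest node" case after a
    # bad migration. Stanza, PGDATA and the target timestamp are made up.
    import subprocess

    STANZA = "main"
    PGDATA = "/var/lib/postgresql/17/main"

    # The cluster may already be down; don't fail the script on that.
    subprocess.run(["pg_ctl", "-D", PGDATA, "stop", "-m", "fast"], check=False)

    # --delta rewrites only the files that differ, which is what keeps the
    # restore in the 30-60 minute range even on the larger systems.
    subprocess.run(
        [
            "pgbackrest", f"--stanza={STANZA}", "--delta",
            "--type=time", "--target=2024-05-01 09:55:00",  # just before the bad deploy
            "--target-action=promote",
            "restore",
        ],
        check=True,
    )

    subprocess.run(["pg_ctl", "-D", PGDATA, "start"], check=True)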
Depending on the size of your data corpus, a few USB disks w/ full disk encryption could be a cheap insurance policy. Use a rotating pool of disks and make sure only one set is connected at once.
Force the attacker to resort to kinetic means to completely wipe out your data.
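A rotation run is short enough to script. A rough sketch, assuming the disks were LUKS-formatted once up front; the device, mapper name and source path are placeholders:

    # Sketch of one rotation run onto a LUKS-encrypted USB disk. Assumes
    # the disk was formatted beforehand with `cryptsetup luksFormat`.
    import subprocess

    DEVICE = "/dev/sdb1"        # whichever disk of the pool is plugged in today
    MAPPER = "usbbackup"
    MOUNT = "/mnt/usbbackup"
    SOURCE = "/srv/backups/"    # e.g. the backup repo or dump directory

    subprocess.run(["cryptsetup", "open", DEVICE, MAPPER], check=True)  # prompts for passphrase
    subprocess.run(["mount", f"/dev/mapper/{MAPPER}", MOUNT], check=True)

    # rsync keeps the copy incremental so a rotation run stays short.
    subprocess.run(["rsync", "-a", "--delete", SOURCE, MOUNT], check=True)

    subprocess.run(["umount", MOUNT], check=True)
    subprocess.run(["cryptsetup", "close", MAPPER], check=True)         # back on the shelf it goes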
This is a distant second priority to ensuring any reliable backup.
The egress fees will be bigger than your DB cost.
Yes, maybe (some kind of diff/sync could help), but that means using such a cloud is bad IT practice.
Yes, the egress fees on base backups alone were higher than the cost of the DB VMs. If we replicated the WAL as well, it would be way higher. In the post, the example DB was 4.3 GB, but the WAL created was 77 GB.
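Back-of-the-envelope with the numbers from the post (the per-GB egress rate below is an assumed ballpark, not an official GCS price; real egress pricing varies by destination and tier):

    # Rough egress arithmetic for the post's example: a 4.3 GB database
    # that generated 77 GB of WAL. The $/GB rate is an assumption.
    BASE_BACKUP_GB = 4.3
    WAL_GB = 77.0
    EGRESS_USD_PER_GB = 0.12   # assumed ballpark rate

    print(f"base backup only:  ${BASE_BACKUP_GB * EGRESS_USD_PER_GB:.2f}")
    print(f"base backup + WAL: ${(BASE_BACKUP_GB + WAL_GB) * EGRESS_USD_PER_GB:.2f}")
    # ~$0.52 vs ~$9.76 per cycle: shipping the WAL stream off-cloud is
    # what makes the egress bill dwarf the cost of a small DB VM.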
The joys of WAL bloat [0]. UUIDv4s?
[0]: https://www.2ndquadrant.com/en/blog/on-the-impact-of-full-pa...
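A toy model of the mechanism, if it helps: after a checkpoint, the first write to each page logs a full 8 kB page image in the WAL, and random keys touch far more distinct index pages than sequential ones. All numbers below are invented, only the ratio matters:

    # Toy illustration of full-page-write amplification with random keys.
    import random

    LEAF_PAGES = 50_000      # pretend PK index size, in 8 kB leaf pages
    INSERTS = 20_000         # rows inserted between two checkpoints
    ENTRIES_PER_PAGE = 300   # rough index entries per leaf page

    # Sequential keys append to the rightmost leaf, so few distinct pages.
    sequential_pages = len({i // ENTRIES_PER_PAGE for i in range(INSERTS)})
    # UUIDv4 keys land on effectively random leaf pages.
    random_pages = len({random.randrange(LEAF_PAGES) for _ in range(INSERTS)})

    print(f"sequential keys: ~{sequential_pages * 8 / 1024:.1f} MB of full-page images")
    print(f"random UUIDv4:   ~{random_pages * 8 / 1024:.1f} MB of full-page images")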
Did you have any recourse against Google Cloud? Did you ever find out why they did that?
I have forgotten the exact reason, but it had something to do with not having a valid payment method. Some change on Google Cloud's end triggered it - they were billing initially with the Singapore subsidiary, and when they changed it to the India one, something had to be done from our end. We hardly got any notices, and we had around 100k USD in credits at the time. Got it resolved by reaching out to a high-level executive contact we got via our investor. Their normal support is pretty useless.
Oh man that is my nightmare. Nothing says "broken system" like having to circumvent the system to get something done.
I'm surprised the solution here isn't... moving out of Google Cloud. That is terrible.
I've read about this happening a lot with Google Cloud.
If your payments fail for whatever reason, Google will happily kill your entire account after a few weeks with nothing other than a few email warnings (which obviously routinely get ignored).
We simply take incremental ZFS snapshots
Do you need to stop the DB for the backup in order to ensure consistency of the snapshot?
Nope, CoW is wonderful. Postgres will start up in crash recovery mode if you recover from a snapshot, but as long as you don’t have an insane amount of WAL to chew through, it’s fine.
You shouldn't need to, because a filesystem snapshot should be equivalent to hard powering off the system. So any crash-safe program should be able to be backed up with just filesystem snapshots.
There will likely be some recovery process after restoring/rolling back, as it is effectively an unclean shutdown, but this is unlikely to be much slower than a regular backup restoration.
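A minimal sketch of that flow, assuming the whole data directory (including pg_wal) lives on a single dataset so one snapshot is crash-consistent; the pool, dataset and host names are placeholders:

    # Sketch of incremental ZFS snapshot shipping for a Postgres data
    # directory. Assumes all of PGDATA, including pg_wal, sits on this
    # one dataset so a single snapshot is atomic and crash-consistent.
    import datetime
    import subprocess

    DATASET = "tank/pgdata"
    PREVIOUS = "tank/pgdata@2024-05-01"  # last snapshot already shipped
    current = f"{DATASET}@{datetime.date.today().isoformat()}"

    # CoW snapshots are atomic and essentially free to take.
    subprocess.run(["zfs", "snapshot", current], check=True)

    # Ship only the delta since the previous snapshot to the backup host.
    subprocess.run(
        ["bash", "-c",
         f"zfs send -i {PREVIOUS} {current} | ssh backup-host zfs recv -F backup/pgdata"],
        check=True,
    )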
Wow! How big was the WAL? What kind of IOPS/disks were you using?
Don't remember the size, but the disk we were using had the highest IOPS available on Google Cloud. That was one of the reasons we had to restore from GCS, since these disks don't persist if the VM shuts down. I think they're called Local SSDs [0]. We were aware of this limitation and had 2 standbys in place, but we never considered the situation where Google Cloud would lock us out of our account without any warning.
0 - https://cloud.google.com/compute/docs/disks/local-ssd