No amount of downtime - scheduled or otherwise - is acceptable for a service like Knock
doubt.jpeg
If you have a complex system, you have incidents and you have downtime. A 15-minute downtime window announced in advance is fine for approximately 100% of SaaS businesses. You're not a hospital and you're not a power station. So much fake work gets done because people think their services are more important than they are. The engineering time you invested in this, if put into the product or into making the rest of your dev team faster, would likely have made your users much happier. Especially if you can queue your notifications up and catch up after the downtime window.
If you have enterprise contracts with SLAs that define payouts for a 15-minute downtime window, then I guess you could justify it, but most people don't. And like I mentioned, you likely already have a handful of incidents of the same or longer duration in practice anyway.
This is especially relevant with database migrations, where the difference in effort between a "little downtime" migration and a "zero downtime" one is usually significant. In this case though, seeing as this was a one-time thing (newer versions of PostgreSQL on RDS allow it out of the box), it is especially hard to justify in my opinion, as opposed to something that would be reused across many versions or many databases powering the service.
What? As a customer, this would piss me off to no end and honestly be a dealbreaker for something like payments or general hosting.
It's pushing dysfunction onto your customers, and if your customers are technically experienced, they'd know it's a completely avoidable problem.
Frankly I do not recall a single service without downtime, including the banks I use. Yes, I'd be mightily upset if said downtimes lasted for days. 15 minutes - I do not give a flying hoot as long as it is not too often.
I suspect it's likely that the services the other posters use _do_ have downtime; it just happens at hours when they don't notice it.
I would literally have no idea if Gmail went down from 1-2 am any day of the week. Hell, I wouldn't notice if it was down every day from 1-2 am.
If you've got planned maintenance that requires downtime then you are always scheduling it at the times when your traffic is at its lowest. How much you avoid hard downtime is a function of how much money you're willing to spend on the maintenance.
Or how much revenue will be reduced by downtime.
If they're technically experienced, they know every 9 costs exponentially more money, and probably agree that it's a good tradeoff.
It’s funny to me as a physician to see “you’re not a hospital” as an example of a system that cannot tolerate downtime. Epic, probably the biggest EHR provider in the US, has planned downtime for upgrades at least monthly, for 30-60 min each.
So, the ER just shuts down for that hour?
Doesn't Epic cover everything from patient admission to medical imaging?
Hospitals expect to be able to run off paper for hours.
If anything, hospitals are much more reasonable in terms of uptime expectations than people's expectations of cat video availability.
We used to get 4 hour downtime windows to completely redo switch stacks for every switch closet in a location over a single week for example.
I designed control panel modifications and programmed an upgrade to a hospital diesel generation system so they could transfer from diesel back to utility without an outage, and have planned transfer of load to diesel without turning the lights out.
We had three windows at 1 am where any new critical patients would be diverted to a different hospital. The first we used for major maintenance to the breakers in the switchgear, the second we used for modifications to the bus work, and the last outage was to test the operation of the new control system.
They do a transfer to diesel every month and the whole hospital is aware of it in case it results in a blackout.
Fine as long as there is a workaround or the impact has been assessed.
OP here: It’s true that all services have downtime for one reason or another. We discussed taking an outage window, but one thing that we kept coming back to was how we might trial run the upgrade with production data. Having a replica on PG 15 that was up to date with production data was invaluable for verifying our workloads worked as expected. Using a live replica makes it possible to trial run in production with minimal impact.
A key learning for me from this migration was how nice it can be to track and mitigate all of the risks you can think of for a project like this. The risk of an in-place upgrade in the end seemed higher than the risks associated with the route we chose, outage windows notwithstanding.
As a bonus, if we need this approach in the future, this blog post should give us a head start, saving us many weeks of work. We hope it helps other teams in similar situations do the same.
1. You snapshot your RDS database (or use one of the existing ones I hope you have)
2. You restore that snapshot into a database running in parallel without live traffic.
3. You run the test upgrade there and check how long it takes.
4. You destroy the test database and announce a maintenance window for the same duration the test took + buffer.
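For concreteness, here is roughly what those four steps could look like scripted against the RDS API with boto3. All identifiers, the instance class, and the engine version are placeholders, and the waiter-based timing is only approximate (in practice you'd poll instance status or events to time the upgrade precisely):

```python
# Rough sketch of the snapshot -> restore -> test-upgrade -> teardown flow.
# Identifiers, versions, and instance class below are made up.
import boto3

rds = boto3.client("rds")

# 1. Snapshot the production instance (or reuse an existing automated snapshot).
rds.create_db_snapshot(
    DBInstanceIdentifier="prod-db",
    DBSnapshotIdentifier="prod-db-test-snap",
)
rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier="prod-db-test-snap")

# 2. Restore the snapshot into a parallel instance that takes no live traffic.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="prod-db-upgrade-test",
    DBSnapshotIdentifier="prod-db-test-snap",
    DBInstanceClass="db.r6g.2xlarge",  # match production sizing for realistic timing
)
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="prod-db-upgrade-test")

# 3. Run the major-version upgrade on the test instance and measure how long it takes.
#    (The waiter may return before the upgrade actually starts; poll describe_events
#    or the instance status for accurate start/end times.)
rds.modify_db_instance(
    DBInstanceIdentifier="prod-db-upgrade-test",
    EngineVersion="15.4",
    AllowMajorVersionUpgrade=True,
    ApplyImmediately=True,
)
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="prod-db-upgrade-test")

# 4. Tear the test instance down; announce a window of (measured duration + buffer).
rds.delete_db_instance(
    DBInstanceIdentifier="prod-db-upgrade-test",
    SkipFinalSnapshot=True,
)
```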
I agree it's a good project to exercise some "migration" muscle, it just doesn't seem like the payoff is there when, like I mentioned above, AWS now supports this out of the box since you upgraded to a version compatible with their native zero-downtime approach.
I think the only way this makes sense is if you do it for the blog post and use it for hiring and marketing, signaling your engineering practices and that you care about reliability.
By the way, I realize how I come across, and let me tell you I say this having myself done projects like this where looking back I think we did them more because they were cool than because they made sense. Live and learn.
We actually did those steps as part of our overall assessment, and you're right that we could have taken an outage window for that long and called it a day. We decided the tradeoff wasn't worth it for our situation, but taking the outage window is definitely a viable option.
I'm sympathetic to your comment that 15 minutes of planned downtime is fine for approximately 100% of SaaS companies. That's probably true here too, and maybe the work of doing this kind of upgrade was a waste in that regard. But, in considering the kind of product experience we would want for ourselves, zero downtime seems better than some downtime. The opportunity cost of feature work over the same window is real, but so is the reputation we hope to build as a platform that "just works", even if the lengths we go to so that our customers don't have to think about it seem crazy.
This part can definitely make sense, and if nothing else it can foster an engineering culture of "we care", which is great. I just wanted to show the other side, but from your answers it seems like the team weighed the options. It's definitely a cool project to work on. Thanks a lot for engaging with a random grumpy guy on HN!
Random comment, but just wanted to say I really appreciate your blog post, and also the informative and helpful discussion between you and vasco here. This could have easily devolved into defensiveness on either side, but instead I learned a lot from both of your responses - I feel like these kinds of interactions are HN at its best. Thanks!
Except that there will be competitors who don't have a downtime every month.
And who are thus placing my needs ahead of their own.
Because your outage is my outage as well.
Unreasonable customers are best sent to competitors. Let them be their problem. All revenue is not equal.
Who said anything about downtime every month? Most companies I know do major DB version upgrades once every 2 years max, often less frequently.
It depends on what you are comparing. It's all about opportunity costs.
A service with some short and pre-announced downtimes is better than one that fails randomly every once in a while. It's also better than one that runs extremely old versions of their software, with old bugs and vulnerabilities.
You are right that when you 'sell' the downtime to customers you have to tell them what they are getting in return.
If Google Docs were down for 15 minutes while you were trying to, say, get a CV together or refer to some notes, it would be pretty frustrating. SaaS is replacing the desktop, so the expectation is similar: I can access my data whenever I want. And 2 am might be OK, except many SaaS products have global customers.
That's why you announce your planned downtime long in advance, and put plenty of notice where customers can see it, even if they ignore emails etc.
Still sucks. But yeah I guess a countdown banner might be helpful there.
Google has a lot of downtime. Never got a 503 on google.com? Or seen Docs/Meet down?
The real problem with downtime is when all systems are down at the same time.
If Jira is down fifteen minutes a day that rarely affects me. I have other tasks in my work queue that I can substitute. Worst case with multiple outages there’s always documentation I promised someone. But when the entire Atlassian suite goes tits up at the same time, it gets harder for me to keep a buffer of work going. Getting every app in your enterprise using the same storage array is a good way to go from 5% productivity loss to 95%.
Someone once said to me: if you can't handle planned downtime, how are you going to handle unplanned downtime?
Except that there is no way to upgrade a Postgres instance on RDS with a planned 15-minute downtime window. You can't control when the reboots happen: you start the process and the cutover might kick in one, two, or three hours later.
If you have replicas they'll upgrade in parallel and will reboot at random times for even more fun.
So unless you can afford random unavailability in a timeframe which can last several hours (depending on DB size) the logical replication approach is the only way to do upgrades on RDS.
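For anyone unfamiliar with what "the logical replication approach" means in practice: you stand up a new instance already running the target version, replicate into it with a publication/subscription pair, and cut over once it catches up. A minimal sketch with psycopg2 follows; the connection strings and object names are placeholders, the target instance is assumed to already have the schema loaded (e.g. via pg_dump --schema-only), and on RDS the source needs logical replication enabled in its parameter group:

```python
# Minimal sketch of the publication/subscription setup behind a
# logical-replication upgrade. Connection strings and names are placeholders.
import psycopg2

# On the old (source) instance: publish all tables.
src = psycopg2.connect("postgresql://admin:secret@old-pg11.example.internal:5432/app")
src.autocommit = True
with src.cursor() as cur:
    cur.execute("CREATE PUBLICATION upgrade_pub FOR ALL TABLES;")

# On the new (target) instance, running the new major version with the schema
# already in place: subscribe to the publication. This creates a replication
# slot on the source and starts the initial copy plus ongoing streaming.
dst = psycopg2.connect("postgresql://admin:secret@new-pg15.example.internal:5432/app")
dst.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction block
with dst.cursor() as cur:
    cur.execute("""
        CREATE SUBSCRIPTION upgrade_sub
        CONNECTION 'host=old-pg11.example.internal dbname=app user=replicator password=secret'
        PUBLICATION upgrade_pub;
    """)

# Cutover happens later: wait for replication lag to reach ~0, stop writes to
# the source, sync sequence values (logical replication does not carry them),
# repoint the application at the new instance, then drop the subscription.
```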
The bigger the instance, the harder the problem.
15 minutes to migrate a large DB? It takes days just to run an alter column on our DB.