No amount of downtime - scheduled or otherwise - is acceptable for a service like Knock
doubt.jpeg
If you have a complex system, you have incidents and you have downtime. A 15-minute downtime window announced in advance is fine for approximately 100% of SaaS businesses. You're not a hospital and you're not a power station. So much fake work gets done because people think their services are more important than they are. The engineering time you invested in this, if put into the product or into making the rest of your dev team faster, would likely have made your users much happier. Especially if you can queue your notifications up and catch up after the downtime window.
If you have enterprise contracts with SLAs that define payouts for a 15-minute downtime window, then I guess you could justify it, but most people don't. And like I mentioned, you likely already have a handful of incidents of the same or longer duration in practice anyway.
This is especially relevant with database migrations, where the difference in effort between a "little downtime" migration and a "zero downtime" one is usually significant. In this case though, seeing as this was a one-time thing (newer versions of PostgreSQL on RDS allow it out of the box), it is especially hard to justify in my opinion, as opposed to something that would be reused across many versions or many databases powering the service.
What? As a customer, this would piss me off to no end and honestly be a dealbreaker for something like payments or general hosting.
It's pushing dysfunction onto your customers, and if your customers are technically experienced, they'd know it's a completely avoidable problem.
Frankly I do not recall a single service without downtime, including the banks I use. Yes, I'd be mightily upset if said downtimes lasted for days. 15 minutes - I do not give a flying hoot as long as it is not too often.
I suspect it's likely that the services the other posters use _do_ have downtime; it just happens at hours when they don't notice it.
I would literally have no idea if Gmail went down from 1-2 am any day of the week. Hell, I wouldn't notice if it was down every day from 1-2 am.
If you've got planned maintenance that requires downtime then you are always scheduling it at the times when your traffic is at its lowest. How much you avoid hard downtime is a function of how much money you're willing to spend on the maintenance.
Or how much revenue will be reduced by downtime.
If they're technically experienced, they know every 9 costs exponentially more money, and probably agree that it's a good tradeoff.
It’s funny to me as a physician to see “you’re not a hospital” as an example of a system that cannot tolerate downtime. Epic, probably the biggest EHR provider in the US, has planned downtime for upgrades at least monthly, for 30-60 min each.
So, the ER just shuts down for that hour?
Doesn't Epic cover everything from patient admission to medical imaging?
Hospitals expect to be able to run off paper for hours.
If anything, hospitals are much more reasonable in terms of uptime expectations than people's expectations of cat video availability.
We used to get 4 hour downtime windows to completely redo switch stacks for every switch closet in a location over a single week for example.
I designed control panel modifications and programmed an upgrade to a hospital diesel generation system so they could transfer from diesel back to utility without an outage, and have planned transfer of load to diesel without turning the lights out.
We had three windows at 1 am where any new critical patients would be diverted to a different hospital. The first we used for major maintenance to the breakers in the switchgear, the second we used for modifications to the bus work, and the last outage was to test the operation of the new control system.
They do a transfer to diesel every month and the whole hospital is aware of it in case it results in a blackout.
Fine as long as there is a workaround or the impact has been assessed.
OP here: It’s true that all services have downtime for one reason or another. We discussed taking an outage window, but one thing that we kept coming back to was how we might trial run the upgrade with production data. Having a replica on PG 15 that was up to date with production data was invaluable for verifying our workloads worked as expected. Using a live replica makes it possible to trial run in production with minimal impact.
A key learning for me from this migration was how nice it can be to track and mitigate all of the risks you can think of for a project like this. The risk of an in-place upgrade in the end seemed higher than the risks associated with the route we chose, outage windows notwithstanding.
As a bonus, if we need this approach in the future, this blog post should give us a head start, saving us many weeks of work. We hope it helps other teams in similar situations do the same.
1. You snapshot your RDS database (or use one of the existing ones I hope you have)
2. You restore that snapshot into a database running in parallel without live traffic.
3. You run the test upgrade there and check how long it takes.
4. You destroy the test database and announce a maintenance window for the same duration the test took + buffer.
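For concreteness, here is roughly what those four steps could look like scripted against the RDS API with boto3. All identifiers, the instance class, and the engine version are placeholders, and the waiter-based timing is only approximate (in practice you'd poll instance status or events to time the upgrade precisely):

```python
# Rough sketch of the snapshot -> restore -> test-upgrade -> teardown flow.
# Identifiers, versions, and instance class below are made up.
import boto3

rds = boto3.client("rds")

# 1. Snapshot the production instance (or reuse an existing automated snapshot).
rds.create_db_snapshot(
    DBInstanceIdentifier="prod-db",
    DBSnapshotIdentifier="prod-db-test-snap",
)
rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier="prod-db-test-snap")

# 2. Restore the snapshot into a parallel instance that takes no live traffic.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="prod-db-upgrade-test",
    DBSnapshotIdentifier="prod-db-test-snap",
    DBInstanceClass="db.r6g.2xlarge",  # match production sizing for realistic timing
)
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="prod-db-upgrade-test")

# 3. Run the major-version upgrade on the test instance and measure how long it takes.
#    (The waiter may return before the upgrade actually starts; poll describe_events
#    or the instance status for accurate start/end times.)
rds.modify_db_instance(
    DBInstanceIdentifier="prod-db-upgrade-test",
    EngineVersion="15.4",
    AllowMajorVersionUpgrade=True,
    ApplyImmediately=True,
)
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="prod-db-upgrade-test")

# 4. Tear the test instance down; announce a window of (measured duration + buffer).
rds.delete_db_instance(
    DBInstanceIdentifier="prod-db-upgrade-test",
    SkipFinalSnapshot=True,
)
```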
I agree it's a good project to exercise some "migration" muscle, it just doesn't seem like the payoff is there when, like I mentioned above, AWS now supports this out of the box since you upgraded to a version compatible with their native zero-downtime approach.
I think the only way this makes sense is if you do it for the blog post and use it for hiring and marketing, signaling your engineering practices and that you care about reliability.
By the way, I realize how I come across, and let me tell you I say this having myself done projects like this where looking back I think we did them more because they were cool than because they made sense. Live and learn.
We actually did those steps as part of our overall assessment, and you're right that we could have taken an outage window for that long and called it a day. We decided the tradeoff wasn't worth it for our situation, but taking the outage window is definitely a viable option.
I'm sympathetic to your comment that 15 minutes of planned downtime is fine for approximately 100% of SaaS companies. That's probably true here too, and maybe the work of doing this kind of upgrade was a waste in that regard. But, in considering the kind of product experience we would want for ourselves, zero downtime seems better than some downtime. The opportunity cost of feature work over the same window is real, but so is the reputation we hope to build as a platform that "just works", even if the lengths we go to so that our customers don't have to think about it seem crazy.
This part can definitely make sense, and if nothing else it can foster an engineering culture of "we care", which is great. I just wanted to show the other side, but from your answers it seems like the team weighed the options. It's definitely a cool project to work on. Thanks a lot for engaging with a random grumpy guy on HN!
Random comment, but just wanted to say I really appreciate your blog post, and also the informative and helpful discussion between you and vasco here. This could have easily devolved into defensiveness on either side, but instead I learned a lot from both of your responses - I feel like these kinds of interactions are HN at its best. Thanks!
Except that there will be competitors who don't have a downtime every month.
And who are thus placing my needs ahead of their own.
Because your outage is my outage as well.
Unreasonable customers are best sent to competitors. Let them be their problem. All revenue is not equal.
Who said anything about downtime every month? Most companies I know do major DB version upgrades once every 2 years max, often less frequently.
It depends on what you are comparing. It's all about opportunity costs.
A service with some short and pre-announced downtimes is better than one that fails randomly every once in a while. It's also better than one that runs extremely old versions of their software, with old bugs and vulnerabilities.
You are right that when you 'sell' the downtime to customers you have to tell them what they are getting in return.
If Google Docs were down for 15 minutes while you were trying to, say, get a CV together or refer to some notes, it would be pretty frustrating. SaaS is replacing the desktop, so the expectation is similar: I can access my data whenever I want. And 2 am might be OK, except many SaaS products have global customers.
That's why you announce your planned downtime long in advance, and put plenty of notice where customers can see it, even if they ignore emails etc.
Still sucks. But yeah I guess a countdown banner might be helpful there.
Google has a lot of downtime. Never got a 503 on google.com? Or seen Docs/Meet down?
The real problem with downtime is when all systems are down at the same time.
If Jira is down fifteen minutes a day that rarely affects me. I have other tasks in my work queue that I can substitute. Worst case with multiple outages there’s always documentation I promised someone. But when the entire Atlassian suite goes tits up at the same time, it gets harder for me to keep a buffer of work going. Getting every app in your enterprise using the same storage array is a good way to go from 5% productivity loss to 95%.
Someone once said to me: if you can't handle planned downtime, how are you going to handle unplanned downtime?
Except that there is no way to upgrade a Postgres instance on RDS with a planned 15-minute downtime window. You can't control when the reboots happen: you start the process and the cutover might kick in one, two, or three hours later.
If you have replicas they'll upgrade in parallel and will reboot at random times for even more fun.
So unless you can afford random unavailability in a timeframe which can last several hours (depending on DB size) the logical replication approach is the only way to do upgrades on RDS.
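For anyone unfamiliar with what "the logical replication approach" means in practice: you stand up a new instance already running the target version, replicate into it with a publication/subscription pair, and cut over once it catches up. A minimal sketch with psycopg2 follows; the connection strings and object names are placeholders, the target instance is assumed to already have the schema loaded (e.g. via pg_dump --schema-only), and on RDS the source needs logical replication enabled in its parameter group:

```python
# Minimal sketch of the publication/subscription setup behind a
# logical-replication upgrade. Connection strings and names are placeholders.
import psycopg2

# On the old (source) instance: publish all tables.
src = psycopg2.connect("postgresql://admin:secret@old-pg11.example.internal:5432/app")
src.autocommit = True
with src.cursor() as cur:
    cur.execute("CREATE PUBLICATION upgrade_pub FOR ALL TABLES;")

# On the new (target) instance, running the new major version with the schema
# already in place: subscribe to the publication. This creates a replication
# slot on the source and starts the initial copy plus ongoing streaming.
dst = psycopg2.connect("postgresql://admin:secret@new-pg15.example.internal:5432/app")
dst.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction block
with dst.cursor() as cur:
    cur.execute("""
        CREATE SUBSCRIPTION upgrade_sub
        CONNECTION 'host=old-pg11.example.internal dbname=app user=replicator password=secret'
        PUBLICATION upgrade_pub;
    """)

# Cutover happens later: wait for replication lag to reach ~0, stop writes to
# the source, sync sequence values (logical replication does not carry them),
# repoint the application at the new instance, then drop the subscription.
```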
The bigger the instance, the harder the problem.
15 minutes to migrate a large DB? It takes days just to run an alter column on our DB.