
We migrated our PostgreSQL database with 11 seconds downtime

Edwinr95
67 replies
19h46m

I'm quite negatively surprised that a government service is moving from their own platform to AWS for such an important service.

conception
35 replies
19h38m

AWS has a lot of pre-audited compliance built into their services. Being able to inherit their certification for services can save an organization a lot of time and effort.

lnxg33k1
29 replies
19h31m

It's not an organisation, it's a blucking government, it handles citizen data, and it's sending that data to a company from a foreign country, because it can't hire some system administrators? A GOVERNMENT? What are they doing? Still looking for their product market fit and can't afford the headcount? Is it a joke?

EDIT: If they are looking for money I'd like to participate a bit in the seed round.

kunwon1
15 replies
19h27m

AWS has a G-Cloud for UK just like they have one for US, no?

bboygravity
10 replies
19h10m

Why can't the UK government build their own cloud?

It's just completely insane to me that they would make the gov internet infrastructure completely (geopolitically) dependent on another country AND just literally give all their (citizens') data away AND pay for that "privilege"?!

I mean if the government can't host the government's websites using tech from the government's country, maybe it would be better to just forget about the whole cyberweb thing altogether? Just turn it off?

ris
3 replies
18h49m

I don't think you have any idea just how much it costs to run infrastructure at the reliability levels provided by AWS, and just how much investment it would require to get the ball rolling on this.

A lot of people have a very unrealistic picture of what government budgets are like.

blibble
2 replies
18h32m

I don't think you have any idea just how much it costs to run infrastructure at the reliability levels provided by AWS

my $12/year VPS does better than us-east-1

hdlothia
1 replies
17h47m

Where do you get a 12 dollar a year VPS? Hetzner charges me 4 bucks a month and it feels like a steal.

blibble
0 replies
17h18m
willsmith72
0 replies
18h52m

you want every government to build their own cloud? what in the world? the whole world is interlinked, should they also manufacture their own government laptops in the UK?

vdaea
0 replies
18h59m

Why should they build their own cloud, seeing that it costs more money?

shagmin
0 replies
14h16m

I've always wondered how beholden the world is to Microsoft. I was once surprised to learn the US military (and probably virtually all others) don't have their own OS to avoid being tied to a particular company.

robertlagrant
0 replies
19h1m

They'd still be outsourcing to a firm to do this. They wouldn't hire a load of people to do it in-house. See also Fujitsu in the recently-popular Horizon scandal, or the NHS National Programme for IT debacle[0].

[0] https://en.wikipedia.org/wiki/NHS_Connecting_for_Health

kakoni
0 replies
8h29m

You know that UK's National Health Service did a deal with Palantir for "federated data platform"?

0xbadcafebee
0 replies
11h55m

Why can't the UK government build their own cars? Their own boots? Their own pens, paper? How wasteful and pathetic that they wouldn't make all those things themselves. If it's possible to do it yourself, by golly, you should do it yourself, and there's absolutely no reason in the entire world to purchase those things from someone else instead.

NomDePlum
1 replies
10h38m

No. It is only relatively recently (~5/6 years) AWS have had any data centres in the UK.

That blocked use of AWS for a lot of UK departments due to data sovereignty concerns.

kunwon1
0 replies
3h53m

I see. I don't use AWS much; I saw this [1] and assumed it was like the US GovCloud.

[1] https://aws.amazon.com/government-education/g-cloud-uk/

travem
0 replies
19h18m

AWS has US based GovCloud regions: AWS GovCloud (US-East) and AWS GovCloud (US-West). It does not have a UK specific GovCloud region that I am aware of.

snoman
0 replies
19h23m

AWS has government specific regions (called GovCloud). Many services or features make it to GovCloud later than other regions because of the certification requirements.

dijit
5 replies
19h14m

Sysadmins are cheaper than many people seem to think.

I had a person I trust a lot telling me that "if we go with a bare metal provider like GCore we'd have to hire someone"; his reason for bringing that up was that the cost difference would be justified by not having to hire someone.

However, a GCore €400,000/y bill becomes roughly €6,000,000 if you were to use a public cloud, even with scaling up and down when not in use (we are an extreme case of needing a lot of dumb unreliable compute that's geographically distributed).

I can hire a lot of sysadmins for that money, but I probably don't even need one because public clouds also need devops staff to manage the complexity anyway.

DylanDmitri
3 replies
18h7m

The risk is hiring a team of ineffective sysadmins, especially if your organization can’t assess sysadmin competence.

dijit
2 replies
17h59m

That would indeed be a risk, but the circular logic of this means no new company could ever have any competence outside of its founders. Which feels shortsighted.

Anyway, I am a former sysadmin. I am confident that I can identify competence in the requisite areas.

criley2
1 replies
17h32m

Governments tend to be far less competent at determining technical competence. Due to a wide variety of factors, governments tend to be completely uncompetitive in salary for technical positions meaning they're already hiring from the lower end of the pool (not including a few altruistic folks willing to forgo their market value).

At a company if a department isn't working out you just restructure and move on, but in the government, that team is going to retire in your org and collect pension from you, and there's very little you can do about that.

dijit
0 replies
8h10m

Everything you said seems to also apply for developers and the staff that would manage cloud resources.

Lack of cost control or effective use of a cloud provider leads to spiralling uncontrollable costs.

rcxdude
0 replies
8h28m

yeah, every company I know of that uses cloud has a team responsible for managing it anyway, and they don't seem much smaller than the team needed to manage on-prem. I don't really think this 'savings' exists in most cases.

solatic
2 replies
10h52m

because it can’t hire some system administrators?

Spoken like someone who has never worked in the public sector. Hiring can easily take 6+ months or more due to an ever-increasing list of requirements that government HR is required to fulfill, not least of which is passing a security clearance which takes even more time. The best people on the market rarely have the patience for this. Once your employees do get hired - on-boarding can take another few/several/more months, getting various permissions, technical documentation, etc. Everything is out-of-date because making changes requires committee consensus, in a culture that is risk-averse, because nobody notices when you out-perform (after all, the requirements were also set by a committee that doesn't know who you are) but something going wrong is grounds for termination. Public sector work over-relies on hiring contractors precisely to shift blame for failure to the contractors. Managed database services are excellent tools to shift this kind of catastrophic risk of data loss to a contractor/vendor (who is managing the database).

Governments not owning their data isn't due to technical or budgetary limitations - it's strictly cultural.

NomDePlum
0 replies
10h42m

Fully agree with this. I'd also add that a lot of IT is buy, not build, in general. That includes support. This is particularly true for the public sector and has been the case since well before AWS existed.

Outsourcing the complexity to run and maintain a secure reliable database cluster really is making good use of the managed service model.

Dylan16807
0 replies
10h36m

Hiring can easily take 6+ months or more

Do you think this move didn't take even longer to plan?

to shift blame

That reason is much more plausible.

everfrustrated
1 replies
18h20m

The UK relies entirely on the mercy of the USA for its nuclear deterrent (Trident)

For the UK at least, that ship has _long_ since sailed....

blibble
0 replies
18h7m

Trident has UK-built warheads and is operationally independent of the US

foofie
0 replies
19h7m

(...) its a blucking government, it handles citizen data, and its sending them to a company of foreign country, because it can’t hire some system administrators? A GOVERNMENT? What are they doing?

This is a very good question, and bears repeating.

It's not a massive database either: 400GB with 1k inserts/second.

belter
0 replies
19h9m

If you are worried about that, start with your government's use of Microsoft Office and Windows, both of which send MBs of data per minute to a US-based company.

sph
4 replies
19h28m

How would you feel if the US government ran on servers from a European company, one which also works very hard to avoid paying taxes on US soil?

All those reasons to go AWS hold for a private company, not for a government service of a first world country and G7 member. AWS has a lot of compliant services, but it's not like they're doing rocket science that one of the top 5 richest countries in the world cannot afford to develop or contract within its borders.

The simple reason is that the UK has been on a long trend of selling out to the highest bidder, whether they are US tax-avoiding companies, Chinese firms, or businesses managed by Russian oligarchs. We have chosen AWS for the same reason the Post Office chose Fujitsu.

freedomben
3 replies
19h25m

I would be surprised if they aren't deploying to the London data center, so I would think it is within the UK

esskay
0 replies
17h46m

eu-west-2 is a bit misleading; most of it is nowhere near London, they've got DCs right up into the Midlands. One of their newer ones, for example, is out in Didcot, Oxfordshire, and they've also got a few up towards Peterborough. All classed as 'London' despite being a fair distance away from it.

dijit
0 replies
19h12m

There's no GovCloud in the UK; unless there are specific terms, the terms of service state that you are licensing either the Irish entity or the American entity to have access and dominion over your data.

I had to spend a lot of time writing my privacy policy (perks of being CTO... yay), and part of that privacy policy was an admission that we transfer ownership of data to a US company (by using public cloud) despite using european datacenters.

This is because our agreement is with a US entity.

arpinum
0 replies
18h58m

The blogpost shows a connection string to eu-west-1 in Ireland

0xbadcafebee
8 replies
19h36m

That sentence was a little confusing. You're not happy that the government is hiring experts to run an important service?

dtnewman
4 replies
19h14m

Yes. RDS is a very reasonable choice if you are a tech company, let alone a govt org. The alternative isn’t “let’s host this ourselves” it is “let’s host this with Oracle at a much higher cost”.

15457345234
2 replies
15h24m

The alternative - at government scale - is absolutely 'let's host this ourselves' and that's what they should be doing, to ensure that institutional expertise remains. They should also own and operate their own datacentres which should be physically secure, not shared with commercial ventures and guarded by the armed forces, not civilian security.

0xbadcafebee
1 replies
11h43m

Why doesn't the government manufacture their own cars? They're going to lose institutional expertise in building cars! They should also own and operate their own manufacturing facilities which should be physically secure, not shared with some 'civilian commercial venture'.

By golly, the government can't do business if it isn't a datacenter operations company, a software vendor, and a car manufacturer.

15457345234
0 replies
10h6m

By golly, the government can't do business if it isn't a datacenter operations company, a software vendor, and a car manufacturer.

I mean, precisely, which is why some countries are rapidly turning into failed states. Too much buck-passing and outsourcing.

tobias_irmer
0 replies
10h27m

It isn't? AWS is crazy expensive and you don't have as much control over things as you may occasionally need. The best decision we took in the past few years with regards to infrastructure was moving away from AWS and doing everything ourselves.

On RDS we had inexplicable spikes in cost, deteriorating support and no real support for any of our issues. When we tried using DMS, it just didn't work as expected, even after spending two days on the phone with their support.

15457345234
2 replies
15h21m

is hiring experts

'moving to AWS' (or any cloud provider) is not 'hiring experts'; it's just outsourcing the risk to an entity that you, in the event of a genuine crisis, have no leverage over beyond 'we're going to stop paying you (once we migrate away from you, which will take ten years)'

0xbadcafebee
1 replies
11h41m

AWS are not experts at providing computing services? Holy cow. This is news to me! I thought they were the most popular and highly regarded computing infrastructure and PaaS company in the world, managing both hardware and software and providing subject matter experts to work with customers on architecture and implementation, along with official partners and a marketplace of turn-key products.

Boy, am I embarrassed! I need to start building my own datacenter right away, all my shit is on AWS!!!

rcxdude
0 replies
8h24m

They may be experts, but you sure aren't hiring them. You are renting something from them.

jazzyjackson
6 replies
19h41m

I don't know what it's like in the UK, but it may be the case that government has a hard time a{ttract,fford}ing talent to administer everything in house. Not that AWS is great for cost saving, but if it's between paying 50k/year for cloud services and not being able to find an engineer who will competently do the job for less than 50k, then the cloud is your only move really.

swozey
2 replies
19h16m

They require various clearances (digging into your life and past relationships to a miserable degree), don't allow someone to have ever smoked pot, and pay half or less of what you can make in the private sector here (USA).

Everyone I know working FedRAMP jobs is prior military/g-level.

robertlagrant
1 replies
19h5m

They wouldn't need that. And having been SC cleared in the UK, and known a few DV-cleared ones, at least in the UK they don't care if you've smoked pot. They just care that if you have, that you don't mind your family knowing one day. They don't want people who can be blackmailed.

swozey
0 replies
18h44m

Here it's like this: Don't ever lie to them, "no matter what it is they'll find out."

So, some people don't lie, say they smoked pot in high school and none of them make it to the next step.

I had a Twitter convo last year (or pre-X, whenever) with the CTO of some org I can't remember (I don't think CENTCOM, something much smaller) and he mentioned that they've lightened up quite a bit, or at least his program, which was a software engineering group, was more lenient. He was looking for engineers via Twitter on his official account.

So maybe that's loosening up here thankfully.

solatic
0 replies
10h46m

and not being able to find an engineer

Remember it's not just about being able to find one single engineer - then they become key-person risk. You need multiple engineers to be able to handle the loss of that engineer, either temporarily (vacation) or permanently (suddenly hit by a bus). Then you end up having a team of DBAs. Then you have functional rather than feature teams. Then you need multiple managers to align to get anything done, and have internal politics.

Being able to consume databases as a product has non-trivial value.

prmoustache
0 replies
11h55m

Once you're past the emerging-startup stage, running in the cloud involves as many engineers and as much complexity as running on-prem if you want to follow best practices.

The "let's be managed and only hire developers" idea is a huge myth. All large organizations employ tons of "cloud engineers" or "devops" (depending on what they want to call them) who are just sysadmins with a different name and a bigger paycheck.

Having actual datacenters doesn't add a ton of complexity, and datacenters themselves are often managed by people who don't even earn an engineer's paycheck. The main difference between being on-prem vs cloud is that you have to plan (how many servers/storage/network equipment you have to buy and replace in the following year) and pay for stuff (like space and racks) more in advance, plus take into account delays in delivery. This is where cloud makes the job much faster for companies, but given the slow pace at which government stuff usually happens, I don't think this is a problem for them.

justsomehnguy
0 replies
19h3m

who will competently do the job for less than 50k, then the cloud is your only move really

Well, there is the other way, but, as we know, that would never ever happen.

pixelesque
1 replies
19h22m

As other comments point out, their own platform was (at least in terms of DB) already running on AWS, just using a different account.

foofie
0 replies
19h5m

As other comments point out, their own platform was (at least in terms of DB) already running on AWS, just using a different account.

That changes nothing. It just means this unjustifiable nonsense has been going on for a while.

cpursley
1 replies
19h42m

And an American one, at that (we’re talking government services here, not some SaaS). Are there really no native UK cloud providers?

cameronh90
0 replies
19h39m

None that are remotely comparable. The small "cloud" providers we do have were just reselling vSphere last time I looked into it.

ris
0 replies
19h39m

GOV.UK PaaS also runs on AWS (for as long as it continues to exist)

pmcp
0 replies
19h16m

As somebody who worked for the European Commission, and a European national government, I agree with your sentiment, but the harsh reality is that government divisions generally work on a shoestring budget when it comes to decisions like these. I wouldn't be surprised if this was a "best effort given the circumstances" move.

overstay8930
0 replies
18h2m

If you saw how non-tech companies run datacenters, well let's just say they're not exactly working with NATO like the big 3 cloud providers do when designing their DCs and backbone.

Honestly you should be frightened when you see someone NOT using a cloud provider, because it is hard work to properly run and secure a datacenter. Even Equinix fucks up HARD regularly and they are considered the gold standard (shout out to those I saw at 350 E Cermak over the weekend).

otteromkram
0 replies
19h38m

I'd argue that AWS is much better suited than self-hosting because it's such an important service.

Downtime becomes negligible and global reach vastly increases with comparably little cost.

okasaki
0 replies
19h39m

All UK businesses run on Oracle and Microsoft, so I'm not sure why you're surprised. They have us by the balls.

msla
0 replies
8h21m

It incentivizes public-private cooperation: If the government cracks down on Amazon, Amazon turns off the government's AWS accounts and deletes the data. The government finds that subpoenaing a wiped hard drive is utterly nugatory, and thereby learns humility.

mobilemidget
0 replies
2h32m

"The PaaS team offered us the ability to migrate databases using AWS Database Migration Service (DMS)."

And I wouldn't be surprised if they got some kickback, discount, etc. in some way to promote AWS on their blog. Not claiming it's so, but I would not be surprised at all. It reads as one big advertisement.

kingkongjaffa
0 replies
19h39m

This is an AWS RDS PostgreSQL database and it lives in the PaaS’ AWS account. Our apps that run in the PaaS talk to this database. We are going to call this database our ‘source database’.

It already was. Read the article.

ivix
0 replies
18h48m

I'm surprised that you're surprised. Why on earth would government not be migrating to the cloud?

_joel
0 replies
18h49m

There's an absolute ton of stuff on AWS. There used to be G-Cloud, which allowed smaller clouds to tender for government contracts, but there was a big pull to AWS, at least from my experience with it.

NomDePlum
0 replies
18h9m

Why?

I've worked on a number of UK government projects, including some with particularly sensitive security and data requirements.

Having some knowledge of their on-prem data centres and the UK Cloud offering they have also used, moving to AWS has so many operational, security and resilience benefits that aren't available elsewhere. It's not a free lunch by any means and certainly needs thought and governance, but the procurement simplification benefits alone make going to the public cloud a no-brainer for a lot of government services.

It is worth knowing that even the on-prem data centres are usually operated by 3rd parties such as HP, BT and IBM. There was an initiative to have "Crown-managed" data-centers but it's not particularly scalable.

Ozzie_osman
22 replies
18h13m

We did a similar migration (somewhat larger database) with ~20 seconds of downtime and much less work... using the magic of AWS RDS Blue-Green Deployments [1]. Surprised they aren't mentioned in the thread yet.

Basically, you spin up a new Blue/Green deployment with any desired changes (in our case, we were upgrading Postgres major version from 13 to 15). While your blue configuration continues to serve traffic, AWS uses logical replication to keep the "green" deployment in sync. You can keep modifying (or testing) the "green" deployment (e.g. you could load test it if you wanted to), as long as you don't do any writes to it (writes still have to go to your live, blue configuration, and are replicated to green).

When you're ready, you run the "switch" command, and AWS does a few things for you: run checks to ensure blue/green are in sync, stops writes and connections, waits a few seconds to ensure replication is caught up, renames your database, then allows connections/writes again. We had less than 20 seconds of downtime, by our count. And, we had a primary and several read replicas and AWS successfully switched the full configuration over with no hiccups. You don't even need to switch your configuration because AWS swaps the database URLs for you. Green becomes blue, blue becomes old blue, and when you're ready, you delete "old blue".
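If you're scripting it rather than clicking through the console, the whole flow is roughly the sketch below, using the boto3 RDS client; the deployment name, source ARN and parameter group are placeholders, so check the limitations for your own setup.

  import boto3

  rds = boto3.client("rds")

  # Create the green environment alongside the live (blue) one.
  create = rds.create_blue_green_deployment(
      BlueGreenDeploymentName="pg-13-to-15",                     # placeholder name
      Source="arn:aws:rds:eu-west-1:111111111111:db:my-source",  # placeholder ARN
      TargetEngineVersion="15.4",
      TargetDBParameterGroupName="pg15-params",                  # placeholder group
  )
  bg_id = create["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"]

  # Poll describe_blue_green_deployments until the green side is AVAILABLE,
  # test against green as much as you like, then flip:
  rds.switchover_blue_green_deployment(
      BlueGreenDeploymentIdentifier=bg_id,
      SwitchoverTimeout=300,  # seconds; the switchover aborts if it can't finish in time
  )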

Highly recommend! They do have some restrictions (for instance, not sure if it would work if you're switching accounts, etc).

1. https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/blue-...

cljacoby
8 replies
16h23m

I also used RDS Blue/Green deployment to apply a MySQL major engine version upgrade from 5.7 to 8.0. With respect to downtime it worked fantastically, I think we measured 13 seconds of observable downtime from the API.

However we did learn the hard way that RDS Blue/Green cannot be used to apply arbitrary changes. In our case, we discovered RDS Blue/Green can only be used to move up engine versions, not down.

We discovered on MySQL 8.0 one of our stored procedures had very occasional failures, and considered the option of using RDS Blue/Green again to move back down to 5.7. Turns out that's not an option.

0xbadcafebee
6 replies
11h34m

Database changes are typically one-way. If your new change includes creating or modifying a table, such that there are new additional columns, and you populate those with data, then downgrading would destroy the changed columns and the data in them. Hence you can't downgrade once you upgrade or you'd potentially be breaking things. To downgrade safely you'd need to backup or snapshot the old database, and then restore your database back to the backup/snapshot, but that's not blue/green.

fulafel
5 replies
10h51m

DB schema migration script frameworks (at least in Python, Ruby & Java lands) do typically support both upgrade and downgrade directions. People skip implementing and testing the downgrade side if the development model doesn't need it but the problem of what happens to the data is controlled by what you put in the "down" migration script.

I'd guess if you can't throw the data away, you won't do a down migration, you'll do an up migration that changes the db to save that data in your preferred way before undoing or reworking the previous schema change.
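In Alembic terms the asymmetry is easy to see; a minimal sketch with illustrative table/column names:

  # An Alembic revision file (revision identifiers omitted for brevity).
  from alembic import op
  import sqlalchemy as sa

  def upgrade():
      # Additive change: safe, nothing is lost.
      op.add_column("users", sa.Column("last_login", sa.DateTime(), nullable=True))

  def downgrade():
      # The "down" direction exists, but running it discards every value written
      # to the column since the upgrade; what happens to the data is decided
      # entirely by what you put here.
      op.drop_column("users", "last_login")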

eropple
1 replies
4h10m

> DB schema migration script frameworks (at least in Python, Ruby & Java lands) do typically support both upgrade and downgrade directions.

They do, and in every shop I've ever been in these are considered a trap precisely because they don't consider data loss.

Always roll forward. If you have to change migration history, restore a backup and lament past you's hubris.

evanelias
0 replies
2h31m

This is solved more cleanly in declarative schema management systems, where you have a schema repo of CREATE statements, and the tool can auto-generate the correct DDL. You never need to write any migrations at all, up or down. If you need to roll back, you use `git revert` and then auto-generate from there. The history is in Git, and you can fully leverage Git like a proper codebase.

A key component is that the schema management tool must be able to detect and warn/error on destructive changes -- regardless of whether it's a conceptual revert or just a bad change (i.e. altering a column's data type in a lossy way). My declarative tool Skeema [1] has handled this since the first release, among many other safety features.

That all said, schema changes are mostly orthogonal to database version upgrades, so this whole subthread is a bit different than the issue discussed several levels above :) The root of the blue/green no-rollback-after-upgrade issue discussed above is that MySQL logical replication officially supports older-version-primary -> newer-version-replica, but not vice versa. Across different release series, the replication format can change in ways that the older version replicas do not understand or support.

[1] https://github.com/skeema/skeema

yellowsir
0 replies
10h20m

But now you have two up paths, and migrations are critical; I would avoid it where possible!

patmorgan23
0 replies
4h55m

Database migrations are always "fail forward": if there's an error you figure out what it was and fix it.

magicalhippo
0 replies
2h38m

Our in-house schema migration tool supports downgrading, but it won't remove non-empty tables or columns etc.

For us this isn't a big deal though, because we're writing our software so it should be able to function as expected on a DB with a newer schema. This makes upgrades much easier to handle as users can run new and old software side-by-side.

Mavvie
0 replies
15h6m

I wonder if that could be because MySQL 8's replication is backwards compatible but MySQL 5.7's isn't forwards compatible. If so, it makes sense that you're only able to move forward.

yeswecatan
3 replies
13h38m

How did you go about stopping and restarting applications which reach out to the database? We have a number of tasks running in ECS which can take a minute to spin down and a few minutes to spin back up.

todd3834
0 replies
10h50m

You aren’t supposed to have to change anything in the application code. The same database URL should work.

ThePowerOfFuet
0 replies
11h46m

Presumably you don't stop them, and they throw errors during the cutover.

Ozzie_osman
0 replies
7h7m

For our web service, we didn't stop anything. They had a few seconds of errors though it seems like some sessions were just buffered or paused and experienced high latency.

We also had background worker services. For the very high throughput ones, we spun down the # of tasks to a bare minimum for <5 minutes and let the queue build up, rather than have a massive amount of errors and retries. For the other ones where throughput wasn't high, we just let them be, and during the downtime they errored and retried and the retries mostly succeeded.

troublebucket
3 replies
17h22m

Has anyone encrypted the storage on a previously unencrypted RDS using Blue/Green?

wisosim
0 replies
15h7m

We did the exact thing not too long ago, but we couldn't do it using blue/green. We were able to minimize downtime quite a bit, but it was on the order of minutes, not seconds. I wrote a little bit about the process here. I spent a lot of time prepping and running the migration, so happy to share any details if it's helpful.

https://phizzle.space/dbadmin/aws/postgres/2023/12/30/rds-en...

d1str0
0 replies
17h3m

Good question. This was a pain point for my small team (me, myself, and I) a little while back. We had an unencrypted db deployed with CDK, and then tried to get it encrypted without losing data.

bgschulman31
0 replies
16h51m

We recently did this on my team over Christmas this year. We opted not to use Blue/Green for this but instead spun up an encrypted snapshot and began replication from the old database to the new database using MySQL's tools. Once traffic on our platform was sufficiently low, we cut connections to the database, waited for replica lag to reach 0 and relaunched the servers with the new database's host info. Our downtime was around a minute.

sgarland
1 replies
17h56m

+1 for B/G. That said, I imagine they couldn’t use it due to the cross-account shift. I’ve used it for both MySQL (with much higher QPS than TFA, by two orders of magnitude) and Postgres, and they both went flawlessly.

Read the docs, especially the limitations. Re-read them. Do a test run in the dev environment, under load. Do it again in staging.

Or just YOLO into prod ‘cause it’ll probably be fine, I guess.

Ozzie_osman
0 replies
17h38m

We definitely did a dry run with a parity configuration a couple nights before. It gave us a lot more confidence.

shayonj
0 replies
1h26m

+1 for Route53 Groups and B/G setups. We did something similar with PG upgrades, no downtime with AWS R53 groups & retry inflight queries with a custom Rails ActiveRecord transaction patch.

Trade off: For a few seconds some requests were slower.

DNS Groups w/ retries is a nifty mechanism for these things.

Tool used: https://github.com/shayonj/pg_easy_replicate

sharadov
0 replies
1h47m

Blue/green is new, so I'm guessing most folks don't know about it. It's the way to get close to zero-downtime upgrades.

I'm waiting for them to support upgrades from RDS Postgres to Aurora.

benmorris
0 replies
2h8m

I used this about 2 months ago going from MySQL 5.7 -> 8.0. Really awesome feature.

Topgamer7
13 replies
18h53m

We tried to work with DMS to migrate MySQL to Postgres, and it was a nightmare. Support was useless, and would often just not get back to us without prodding, then give us canned responses unrelated to our queries. The whole thing is nigh on un-debuggable.

Stay away.

NomDePlum
4 replies
18h26m

Surprised by that. I've used AWS DMS quite a lot to do both on-prem to AWS and AWS (MySQL) to AWS Postgres migrations, plus long-term ongoing replication. Whilst there are some complexities/gotchas, it's always been more than up to the task. It takes a little bit of validation/testing to understand, but it's very well documented too.

What sort of issues did you hit? In all honesty I'm not sure I've been more impressed by another AWS service.

tibbon
1 replies
17h34m

One issue we hit was that any schema changes totally messed it up. I don't have my notes in front of me, but we were constantly hitting data that wouldn't migrate, or things that suddenly broke whenever anything changed.

NomDePlum
0 replies
10h57m

Interesting; we managed several schema changes when using AWS DMS for replicating data over a long period between MySQL and Postgres clusters.

We treated these carefully and tested as we made them but never had any real issues with them. From memory DMS could cope with adding columns pretty transparently. One setting we invested in configuring and understanding was to allow DMS to attempt to recover from replication failures. This allowed it to error on DDL changes and attempt to recover. This usually involved restarting the task, but it would do this transparently as part of the recovery.

Topgamer7
1 replies
17h26m

It seemed to try to load data into the wrong table on postgres. That was the one that immediately comes to mind. Honestly poor support is what really killed it for us. But we had other technical problems with it.

We burned 3 weeks just trying to get support to provide a sensible response. I never got the sense anyone replying to us knew any more than the surface level about the infrastructure of how DMS worked.

NomDePlum
0 replies
10h48m

We kept our table mappings pretty much mirrors of the source tables. Any data transformation we managed on the target cluster, not through AWS DMS.

I've used it fairly frequently over a number of years so maybe the issues we hit on the learning curve have dimmed.

We also deliberately kept things as simple as possible at first and focused on DMS's ability to move the data from source to target and really only tweaked settings that increased security or reliability. We stayed well away from any of the data transformation or more complex functionality.

rjh29
1 replies
18h41m

Tried it for mysql and it was flat out broken and silently corrupted data.

Topgamer7
0 replies
17h28m

For us it seemed to be trying to load data from the wrong table on mysql, into the wrong table on postgres.

mey
1 replies
18h36m

I worked on migrating our MySQL system to PostgreSQL using pgloader ( https://pgloader.io/ ).

There were some hiccups, things that needed clarification in the documentation, and some additional processes that needed to be done outside of the system to get everything we needed in place, but it was an amazing help. Not sure the project would've been possible without it.

Data mapping from PostgreSQL to PostgreSQL as in the article isn't nearly as bad as going between systems. We took a full extended outage and didn't preload any data. There were many dry runs and a lot of validation beforehand, but the system wasn't so mission critical that we couldn't afford to shut it off for a couple of hours.

Topgamer7
0 replies
17h29m

We also ended up using pgloader. It's not without its friction either. For example, the concurrency/number-of-rows setting seems broken out of the gate, and it's like playing darts with a blindfold on to get it to run without running out of memory. But being able to read the source and the GitHub activity meant at least I could debug my problems, or find others who had the same issue.

Would recommend pgloader.

troublebucket
0 replies
17h21m

+1. DMS is very half baked: silent, undebuggable errors, and tons of unsupported LOB and CLOB data types. Built-in Postgres logical replication is way easier.

tibbon
0 replies
17h35m

We also tried to use DMS for a few things (Rackspace to AWS migration, replication out to data lakes, etc.) and it was consistently undersupported and buggy, and ate months of time before we went to other solutions. While a lot of AWS support has been good, not for DMS. It feels entirely half baked.

extesy
0 replies
17h59m

I can confirm that. Both DMS reliability and support quality were terrible.

Spivak
0 replies
18h33m

This was my experience as well. We thought it was a nice managed way to move pg to pg with minimal setup but we ran into so many issues we just did logical replication with bespoke fixes for things it didn't handle well.

alerter
11 replies
9h2m

Interesting, though I have no idea why the government is using AWS in the first place. This isn't a startup hacking away trying to find PMF, or dealing with unpredictable marketing-driven traffic spikes. We know we need these services running long term, and can make solid predictions about usage patterns.

We could build a public sector cloud and/or adopt a sensible on-prem approach. This requires funding, coordination and technical leadership, but would save the taxpayer an enormous amount over the long term.

Public sector IT is a disaster in general ofc, but I know there are good engineers working there.

j-krieger
1 replies
7h43m

Disclaimer: I work in the public sector.

If you really think the public sector could build anything closely resembling a cloud, you are dreaming. Imagining one cloud working for the entire public sector, we are entering delusional territory.

Public sector projects are insanely expensive, take ages to develop and often flat out fail.

Not only that, we are starved for engineers even now. If we ran such a custom cloud, we would shoot ourselves in the foot by not being able to hire from a pool of experienced engineers.

alerter
0 replies
5h35m

Oh I totally understand that. Public sector IT is littered with failed projects, massive overspends and lasting functional problems. There are numerous examples from the last few years alone - hundreds of millions spent on broken software. I used to work in public affairs and followed a lot of those projects closely.

I don't think this means "government = bad at tech" though. You sometimes see smaller in-house teams do really good work. The biggest issue seems to be with contracting and procurement policy. For example, on the Police Scotland i6 program they hired a bunch of consultancies to write a tender document, and then hired CGI for the final project. That turned out to be a copy of the Spanish system, which turned into a huge and expensive disaster as the requirements differed.

Feels like government has a similar problem to a lot of legacy non-tech companies. They don't have much technical leadership, don't understand the problem, and decide to hand it off to the lowest bidder. Doesn't help that they are often legally obliged to act this way. But the underlying engineering problems aren't unsolvable, and don't need to become a huge mess every time. (Your point about recruitment is fair though)

chasd00
1 replies
3h57m

The government likely hired a consulting firm and AWS was part of the solution proposed and bought.

plugin-baby
0 replies
1h14m

This doesn’t sound like how GDS operates.

berkes
1 replies
7h16m

As a startup, my reason for using "cloud" (PaaS) is not to catch spikes, but because of focus. Every hour that I spend running around with HDDs, screwdrivers (or the cloud version thereof - storage, ansible, etc) is an hour that I'm not spending on stuff my customers need.

Why would this be any different for a government?

We don't expect our government to build their own cars, but to buy them from Volkswagen or Renault. Even when the government has a clear need for transport. Why do we then insist they build their own IT infrastructure?

0dayz
0 replies
4h41m

Because the government is not a corporation and its obligations are different from the private market's.

rjzzleep
0 replies
8h54m

I don't know about the UK, but AWS has had GovCloud in the US for a long time, and to be honest, compared to a lot of infrastructure I have seen there, it's a blessing. On the flip side, I've met some really amazing infrastructure and ops people in a German gov healthcare institution running an in-house DC, where the problem was neither the tech nor the people, but 100% the management and their processes, and their desire to be the bottleneck for every single interaction between the infrastructure and engineering teams.

ris
0 replies
7h34m

This isn't a startup hacking away trying to find PMF, or dealing with unpredictable marketing-driven traffic spikes. We know we need these services running long term, and can make solid predictions about usage patterns.

If you think government needs and demands are predictable, you don't follow politics (particularly uk politics in the last decade).

And then there are these things like pandemics that completely come out of left field. Being able to scale things on demand over the pandemic was one of the key demonstrators for use of the public commercial cloud by the public sector.

jamietanna
0 replies
8h42m

The UK government as a whole operates across the big names in the public cloud as well as some on-Prem/colo.

I'd very much recommend watching https://youtube.com/watch?v=mpY1lxkikqM&pp=ygUOUmljaGFyZCB0b... from September about Gov.UK's various iterations and some of the migrations across cloud that they've had to do.

One thing about (at least UK) government is that procurement requirements mean they go to market for quotes around usage every few years. If, say, Oracle Cloud was 1/10th the price, it would likely win the deal, and so they would have to migrate to Oracle for the duration of the contract, and then potentially do the same with another cloud if that was cheaper.

StupidOne
0 replies
8h47m

I'm sure a lot of countries in the EU did this. First hand I know they did this in Croatia, as I was one of the developers who had to use it to deploy on it. The worst thing I have ever seen in my life. And I have worked on a lot of legacy apps written in VB.NET, Web Forms, old SharePoint, Basic, and even one where the whole app was one big mess of stored procedures.

AWS, Azure and GC are at least written with thought for the end users (us, developers), while the government cloud was architected, designed and built by the lowest bidder whose first goal was to cut and cut and cut their costs whenever possible.

MrBuddyCasino
0 replies
8h57m

If you think the public sector could improve its operational efficiency by building, operating and using its own cloud, I've got a bridge to sell you.

londons_explore
9 replies
18h58m

Note that the enemy of low/zero downtime migrations like this is long running queries.

Ie. a single update query which takes 30 mins.

You either have to kill and roll back that query, or suffer 30 mins of unavailability.

As far as I know, there is no way to migrate a currently in progress query.

callalex
3 replies
10h52m

Well you’ve certainly introduced a teaching moment to me! What are the nature of writes you deal with that last 30+ minutes? What kind of data/what kind of people are involved with such DB writes where you need to rely on the DB engine to work so hard instead of something more split up by queues at a higher layer?

londons_explore
1 replies
9h56m

It's usually badly engineered things. However, just because it's badly engineered doesn't mean it's fine for it to break :-P

Things like a super complex n^3 complexity query to find all users who were 3 hops away from a known fraudster, where a 'hop' means 'shared any piece of account data in common'

chaz6
0 replies
5h46m

Would that be more suited to a graph database such as Neo4j?

londons_explore
0 replies
9h51m

Postgres queries are streaming. I.e. if the query result has 1 billion rows but the client only has enough RAM for 1 million rows, then the query will be slowed down until the client is ready to accept more rows.

If the client is slowly processing through the result set, then the query can take many minutes/hours.
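With psycopg2 that looks roughly like the sketch below: the named (server-side) cursor streams rows, so a slow consumer keeps the query open for as long as the loop runs (DSN and table are placeholders).

  import time
  import psycopg2

  conn = psycopg2.connect("dbname=app")  # placeholder DSN

  with conn:
      with conn.cursor(name="big_report") as cur:  # named => server-side cursor
          cur.itersize = 10_000                    # rows fetched per round trip
          cur.execute("SELECT * FROM events")      # hypothetical table
          for row in cur:
              time.sleep(0.01)  # a slow consumer holds the query open on the server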

Amezarak
3 replies
17h47m

It's hard for me personally to imagine a 30-minute update query that is not written extremely inefficiently, or else a one-time huge data migration.

There are a lot of the former in the wild to be sure. I've had a lot of fun turning minutes-hours into milliseconds. :)

londons_explore
2 replies
17h9m

Data scientists... And when you look at the query, it'll be 800 lines of SQL...

hobs
1 replies
12h36m

Heh, that's not so bad - 2 jobs ago I had to fix a KPI generation process that was 70k lines of dynamic sql, that unwrapped to up to 1m LOC :)

chaz6
0 replies
5h46m

Did you ever run into a hard limit of statement size?

necovek
0 replies
14h28m

For a software engineering project, you probably want to limit your transactions to much less than that (SET statement_timeout is your friend). If you've got extremely long transactions, you can probably avoid doing the switch-over while they run (hopefully they are not a random occurrence but the result of a scheduled job or similar).
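Something like the following, either per session or baked into the application role (the role name is hypothetical):

  import psycopg2

  conn = psycopg2.connect("dbname=app")  # placeholder DSN
  with conn, conn.cursor() as cur:
      # Cap any single statement in this session at 30 seconds...
      cur.execute("SET statement_timeout = '30s'")
      # ...or make it the default for the application role.
      cur.execute("ALTER ROLE app_user SET statement_timeout = '30s'")  # hypothetical role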

In combination with transaction time limit and fail-over configuration (where you fail the old primary), you can control the slowdown (instead of downtime, eg. with pgbouncer) very precisely.

I would be more concerned with the DNS TTL being respected in the entire stack (and external caching DNS servers you rely on), tbh.

But it is usually not critical to avoid a dozen seconds of downtime for an app, so whatever is simpler for you should be your go to solution.

ClassicOrgin
7 replies
18h7m

Minor detail, but did anyone else notice they are using eu-west-1? Kind of weird for the UK to be hosting their sites in another country (Ireland). I'm sure this isn't super sensitive data, but still.

lozenge
3 replies
18h4m

The London AWS region was missing a lot of features for a long time. It's easier to just go with a big region.

justinclift
2 replies
16h28m

"Lets go with the easier approach" is a bit worrying when it means potentially breaking security assumptions / legal assurances.

Though I guess it could be explained away by having something in the website's "Terms and Conditions for use". ;)

juggertao
1 replies
12h11m

It's a gov website. What are they going to do, sue themselves?

another2another
0 replies
8h28m

Have a public enquiry, with lots of tea.

kitd
0 replies
10h5m

I think one of the bits of carry-over legislation in the Withdrawal Agreement after Brexit was the UK continuing to comply with GDPR. So using an Irish DC is still ok.

esskay
0 replies
17h49m

They've been on AWS since before eu-west-2 was added, and it's still not at the scale of eu-west-1. Capacity issues are still a thing (but getting much better), and only recently have the two regions reached something close to feature parity.

callalex
0 replies
10h44m

It made a lot more sense before…you know…the thing that made no sense.

mbb70
6 replies
19h31m

Interesting to compare this to https://knock.app/blog/zero-downtime-postgres-upgrades discussed here https://news.ycombinator.com/item?id=38616181

A lot of the discussion boiled down to 'this is a lot of complexity to avoid a few minutes of downtime'. I guess this is the proof: just use AWS Database Migration Service, swap the DNS entries to go live, and live with 11 seconds of downtime.

ris
2 replies
19h29m

There is no "just" about it. The absolute key takeaway is in "what we learned":

We chose to use DMS because it was well supported by the GOV.UK PaaS and we could also get support from AWS. If we were doing a PostgreSQL to PostgreSQL database migration in the future, we would invest more time in trying alternative tools such as pglogical. DMS potentially added more complexity, and an unfamiliar replication process than what we may have found with other tools. This backs up what AWS say themselves on PostgreSQL to PostgreSQL migrations.

The message here is not "just use DMS".

T-Winsnes
1 replies
17h7m

Even AWS's own docs say to use the native tools when migrating from Postgres to Postgres[1]. They don't go into the details much and point to pg_dump rather than pglogical, but it's interesting to see that they don't recommend using DMS for it.

[1] https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source...

ris
0 replies
7h28m

They do, but those recommendations are buried quite deep in the documentation, well behind all the marketing guff that suggests that DMS is all things to all people, and wonderful magic that is ideal for all situations.

postpawl
1 replies
19h1m

There are a lot of gotchas with using DMS (which seems to use pglogical under the hood). Since it’s not hardware-level replication, you can run into issues with large rows/columns/tables and it doesn’t really handle foreign keys. It may not handle some special data types at all. You also need to update the sequences after the migration or you’ll get errors about duplicate primary keys. You can also have issues if you don’t have proper primary keys, because it doesn’t always copy the entire row at once.
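The sequence fix-up (run against the target just before writes move over) is basically one setval per serial column, along the lines of the sketch below; DSN, table and column names are illustrative.

  import psycopg2

  conn = psycopg2.connect("dbname=target")  # placeholder DSN
  with conn, conn.cursor() as cur:
      # Bump the serial/identity sequence past the highest id that was copied over,
      # otherwise the first INSERT on the new primary collides with existing rows.
      cur.execute("""
          SELECT setval(
              pg_get_serial_sequence('notifications', 'id'),
              COALESCE((SELECT MAX(id) FROM notifications), 1)
          )
      """)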

If the databases are within the same AWS account, it’s likely easier to use hardware-level replication with global database or snapshots to do migrations if you’re ok with 4-5 mins of downtime.

Twisell
0 replies
11h32m

There are many options available with PostgreSQL; you could also do a physical full backup + WAL-level replication to keep downtime low.

What might have oriented their choice is that they wanted to upgrade from major version 11 to 15 during the migration process. That is only possible using logical replication. Otherwise you'd have to chain the upgrade process through each major version (and possibly the OS, because 11 is EOL on some architectures), and that is neither trivial nor quick.
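The native logical replication route is roughly a publication on the old cluster and a subscription on the new, already-upgraded one. It assumes the schema has been loaded on the target first (logical replication doesn't copy DDL); the DSNs and the replication user below are placeholders.

  import psycopg2

  # Old (source) cluster: publish every table.
  src = psycopg2.connect("host=old-db dbname=app")  # placeholder DSN
  src.autocommit = True
  src.cursor().execute("CREATE PUBLICATION upgrade_pub FOR ALL TABLES")

  # New cluster, already running the target major version: subscribe.
  # The initial table copy runs first, then changes stream continuously.
  dst = psycopg2.connect("host=new-db dbname=app")  # placeholder DSN
  dst.autocommit = True  # CREATE SUBSCRIPTION can't run inside a transaction
  dst.cursor().execute(
      "CREATE SUBSCRIPTION upgrade_sub "
      "CONNECTION 'host=old-db dbname=app user=replicator' "  # placeholder connection info
      "PUBLICATION upgrade_pub"
  )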

btown
0 replies
19h14m

Has anyone used https://cloud.google.com/database-migration/docs/postgres/qu... to do something like this? Does it work similarly to AWS DMS?

globular-toast
4 replies
10h10m

Is that a picture of them doing it?

What is this weird fantasy that programming looks like this? It's always the same: a couple of people, always of diverse backgrounds, physically interacting, usually pointing and laughing at something, with perhaps another group, representing different demographics, looking on with a mixture of curiosity and awe.

Do an image search for programming to see more examples of this.

I guarantee this work was the result of guys sitting at desks, possibly in the dark, thinking. Are we collectively embarrassed that this is what programming looks like? Or are these just fantasies of management types who have never written a line in their lives?

johneth
1 replies
9h1m

Give it a rest, mate.

globular-toast
0 replies
8h53m

Nah, this is the thing that leads to open offices etc. I want people to understand what programming actually looks like and these silly pictures need to go.

draaglom
0 replies
8h54m

The primary goal of most large organisations' dev blog is to attract potential candidates to the top of the recruitment funnel.

This picture appears to be (presumably) the actual team doing a team building activity.

This is both a reasonable choice given the inferred instrumental goals of the blog, and an honest reflection of (one part of) the team's life, assuming it's a photo of the real team.

beejiu
0 replies
7h1m

They are clearly not programming. It's a team building game called the marshmallow challenge.

igammarays
3 replies
19h8m

Now all we need is for Amazon to come out with a "government-as-a-service" product for your budding nation state.

ris
1 replies
18h45m

FWIW GOV.UK Notify is part of a suite of services offered by GDS to UK public sector bodies (along with GOV.UK Pay and GOV.UK PaaS) that was originally known as "Government As A Platform".

robin_reala
0 replies
5h32m

…and the platform is all open source, so people are free to fork / steal. https://github.com/orgs/alphagov/repositories?q=paas

hinkley
0 replies
18h24m

In partnership with pinkerton.com

DeathArrow
3 replies
11h29m

Seems bad that government services depend on cloud providers.

callalex
2 replies
10h47m

Why? Do you deny the benefits of shared hosting in its entirety?

DeathArrow
1 replies
9h30m

No. But I wish government agencies depended less on other entities. A government agency isn't in business to make money. They should be concerned less with economic efficiency and more with accomplishing their goals.

They have a fiduciary duty towards the citizens.

mattsan
0 replies
6h2m

Out of all the entities to depend on, AWS is tried and battle-tested with the US gov.

Additionally, the pandemic has shown that government services actually need to scale to arbitrary amounts, seemingly at random.

Don't you think that in this case using AWS is better, simply to guarantee availability? Imagine taxpayers' money being spent on servers, and then a pandemic happens and, while the economy is screeching to a halt, more taxpayers' money is needed to expand the cluster, money that would really be more useful elsewhere.

ComputerGuru
3 replies
18h9m

The second change was to create a DNS record in AWS Route53 for `database.notifications.service.gov.uk` with a 1 second TTL (time to live) [..] our migration script just needed to update the DNS weighting in AWS to 100% of results being sent to the target database location and wait 1 second for the TTL to expire. Then, when our apps next try to query our database they will be querying our target database.

Wait. Their (or Python's default?) DB ORM blocks while it performs a DNS lookup for each and every query!? It doesn't cache resolved addresses for any length of time? No connections are pooled and reused?
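(For reference, the flip the quoted post describes amounts to a single weighted-record update, roughly as in the sketch below; the hosted zone id and endpoints are placeholders.)

  import boto3

  r53 = boto3.client("route53")

  def weighted_cname(set_id: str, endpoint: str, weight: int) -> dict:
      return {
          "Action": "UPSERT",
          "ResourceRecordSet": {
              "Name": "database.notifications.service.gov.uk",
              "Type": "CNAME",
              "SetIdentifier": set_id,
              "Weight": weight,
              "TTL": 1,  # the 1 second TTL mentioned in the post
              "ResourceRecords": [{"Value": endpoint}],
          },
      }

  # Send 100% of lookups to the target database, 0% to the source.
  r53.change_resource_record_sets(
      HostedZoneId="Z0PLACEHOLDER",
      ChangeBatch={"Changes": [
          weighted_cname("target", "target-db.example.eu-west-1.rds.amazonaws.com", 100),
          weighted_cname("source", "source-db.example.eu-west-1.rds.amazonaws.com", 0),
      ]},
  )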

necovek
2 replies
14h38m

It would probably be the OS' `getaddrinfo` or `gethostname` that does this: Python rarely reimplements system level calls, which means it relies on the system's configuration.

If the TTL of 1s was respected, results would be cached for 1s, but it's not uncommon for DNS query libraries and especially caching DNS servers to not fully respect TTLs anyway; tbh, that might explain some of the downtime they've seen.

ComputerGuru
1 replies
14h21m

I didn't mean it was directly implementing the networking call to the dns server -- just that it wasn't directly caching the result.

getaddrinfo(3) and getnameinfo(3) (guessing that's what you meant) don't implement caching, at least not explicitly in the spec and not normally in practice. On Windows, DNS results are cached by the OS but on Linux that would be distro-dependent behavior and usually requires setting up a local caching dns server (Ubuntu uses unbound out-of-the-box, iirc. Other choices include ncsd and dnsmasq).

Even if they implemented caching at the syscall level, this still assumes no connection stays open for more than 1s or is reused except per query. It seems like a big assumption (at least I hope it is, because I certainly wouldn't want my app to initialize a new db connection, let alone perform a DNS lookup, for every query).

juggertao
0 replies
12h16m

They mentioned they had a script which terminated all connections to the old database then changed the password.

But on the app side you typically don't cache DNS, that creates other problems like stale DNS.

sharadov
1 replies
13h29m

DMS is a terrible migration tool, I spent almost a month fighting with various migration issues before I gave up.

It would not migrate text and json types. Even AWS support could not offer a solution.

We got in early testing AWS Blue/Green and that has made close to zero downtime upgrades a reality.

slyall
0 replies
12h47m

If you think DMS is a bad migration tool then try using it for ongoing replication to an external destination.

Completely broken.

pedrokost
1 replies
10h50m

We recently completed the migration of a self-hosted 3 TB PostgreSQL database from version 12 to 16, transitioning from Ubuntu 18 to Ubuntu 22. Concurrently, we had to upgrade various extensions, most notably Timescale, for which a compatible version did not exist across all scenarios. We performed the upgrade by updating a replica in the following sequence:

- Start: PG12, Ubuntu 18, TS2.9

- Step 1: Set up a read-only replica with PG12 on Ubuntu 22, maintaining TS2.9.

- Step 1.5: Enter maintenance mode and halt all services.

- Step 2: Detach the read-only replica, upgrading from PG12 to PG15 on Ubuntu 22 with TS2.9.

- Step 3: Upgrade from PG15 with TS2.9 to TS2.13 on Ubuntu 22.

- Step 4: Upgrade from PG15 to PG16 on Ubuntu 22 with TS2.13.

- Step 4.5 : Reconnect services to the new database server, resume all services, and exit maintenance mode.

All the database upgrade steps were well-tested and automated using Ansible. Nonetheless, we did encounter an issue that had not arisen during testing. This extended our downtime to approximately half an hour, which, for our use case, was perfectly acceptable.

Employing logical replication could have mitigated the last-minute surprise. So we will consider this approach for our next upgrade cycle.

pbrooko
0 replies
8h24m

We recently followed an almost identical upgrade path (however at the time PG16 wasn't yet supported by Timescale, so stopped at PG15 + TS 2.12).

We did look into using logical replication to reduce the downtime of the upgrade, but because database schema and DDL commands aren't replicated, it seems that it isn't recommended with Timescale in the loop. (I suppose the underlying schema changes that Timescale needs to make under the hood are mostly a function of your hypertable chunk sizing and what your incoming writes look like, so this could be planned around / timed well, but we felt it added too much potential complexity & risk compared to simply opting for a small maintenance window while pg_upgrade completed.)

londons_explore
1 replies
19h0m

There are various ways to 'pause' incoming postgres queries, for example using pgbouncer: i.e. don't fail them, simply delay them until the replication has caught up and then let them continue on the new database.

If anything goes wrong and replication doesn't catch up, you can unpause and let those queries happen on the old database.

Therefore, your 11 seconds of downtime becomes 0 to 11 seconds of added page load time. But more importantly, for the thousands of users of the database who have never seen a query fail before and might have buggy error-handling codepaths, or might have a single failed query ruin a whole batch job, this approach leads to a lot less collateral damage.
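With pgbouncer's admin console the pause/resume dance is roughly the sketch below (host, port, credentials and the pooled database name are placeholders):

  import psycopg2

  # The admin console is the special "pgbouncer" database on the pooler itself.
  admin = psycopg2.connect("host=127.0.0.1 port=6432 dbname=pgbouncer user=pgbouncer")
  admin.autocommit = True  # the console doesn't support transactions
  cur = admin.cursor()

  cur.execute("PAUSE app_db")   # new queries queue, in-flight ones are allowed to finish
  # ... wait for the replica to catch up, repoint app_db at the new primary, RELOAD ...
  cur.execute("RESUME app_db")  # queued queries continue against the new database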

whartung
0 replies
3h53m

It’s one thing to pause queries but can you also pause transactions that are in flight? How does that work?

gbraad
1 replies
18h4m

They don't just send the notifications but store them? Sounds like it might contain PPI as it records passport extension data, etc. Might be minimal, though over 1 billion rows (400GB) sounds like a massive trove to keep around.

M2Ys4U
0 replies
1h2m

I'd highly doubt their database contains payment protection insurance.

efxhoy
1 replies
17h39m

Lovely! We just migrated from postgres 14 to 16 for 3 postgres clusters (servers) on RDS containing about 2TB of data across 8 databases. We were down from 00:00 to 04:00. Steps we took:

  * enabled our fallback "maintenance mode" site. It's a super lightweight version of our site running on CF workers.
  * scaled down all apps using the db to 0 in terraform
  * hit the upgrade button in the aws web ui, which runs pg_upgrade. 14->15
  * waited for it to finish
  * hit the upgrade button again. 15->16
  * waited for the dbs to start accepting connections (they do before they're marked as ready, I think aws does more stuff than pg_upgrade)
  * Started `VACUUM ANALYZE; REINDEX DATABASE CONCURRENTLY`. The idea is to avoid performance issues between versions and make use of performance improvements from new versions.
  * Started spinning up the apps again
  * Waited until all apps had a handful of containers running
  * Started accepting traffic (disabled maintenance site)
  * Went to bed

The REINDEX CONCURRENTLY happily chugged along for the biggest db for another 18 hours without blocking anything. Next time we're doing AWS blue/green deploys to avoid downtime. We didn't this time since we weren't on 14.9 yet (the minimum minor version of 14 supported by blue/green).

If I was doing this myself I wouldn't pay the AWS tax, instead do blue/green myself with logical replication and a load balancer.

sharadov
0 replies
1h50m

I would just use pg_upgrade with --link (hard links) for an in-place upgrade.

Have done 2 TB dbs in less than a minute.

We were running our own Postgres instances on-prem.

aidos
1 replies
18h29m

I love that where the rubber meets the road it doesn't matter if you're a government organisation or a web agency: it's still just plain old Postgres, Python, SQLAlchemy and DNS with a low TTL.

NegativeK
0 replies
18h9m

To be fair, there's plenty of government orgs at the bottom and web agencies at the top.

zeroclicks
0 replies
13m

Let's ignore the technicalities completely. Here's a government providing a detailed look into their engineering practices, complete with graphs and configuration snippets. How many governments manage to publish such a thing?

zImPatrick
0 replies
11h44m

I really love how verbose and up to date the UK is with tech.

vfclists
0 replies
17h16m

This doesn't seem to bode well for Oracle, but lobbying trumps the day.

speculator
0 replies
10h53m

nice writeup. easy to read, easy to understand

saisrirampur
0 replies
16h42m

Very clear and concise blog! Loved reading it. I'd be very curious to see how Amazon DMS performs at scale. Scale includes any of these situations: a) larger datasets - 2+TB; b) higher throughputs - WAL spikes (at least 2-3K TPS); c) reliably and efficiently replicating TOAST columns (e.g. large JSONBs); d) advanced data types - BLOB, CLOB, HSTORE, ARRAYs, etc.

In my previous role with the Azure Postgres team, we observed that Azure DMS worked pretty seamlessly in migrations that did not involve the above situations. Once any of the above came in, the replication experience was hairy. Interestingly, in my current gig at PeerDB, many of the use cases we are seeing involve one of the above situations. We have been doing quite a few surgical optimizations to handle those situations and they keep coming! That is the beauty of Postgres, enabling a myriad of workloads, each unique in its own way! :)

ryandv
0 replies
11h51m

I've also used DMS to migrate smaller datasets (~200GB) from AWS RDS MySQL 5.7 to RDS PostgreSQL 11. 10 seconds of downtime here was actually incurred not due to the migration itself, but to enable MySQL binary logging on the RDS instance, which requires a restart, and which AWS DMS uses to replicate changes from the replication source (MySQL) to the target (PostgreSQL).

Traffic was steered to the new PostgreSQL instance not with DNS/Route 53 weighted records, but an application-level percentage enroller (based on user ID). Our particular set-up did in fact have apps talking to both databases simultaneously, each with a percentage share of traffic, and so we did not need to incur additional downtime to cut off traffic to the original source database - though now you do have to worry about eventual consistency.

I wouldn't recommend using their Schema Conversion Tool. We instead migrated the data as 1-to-1 as possible into the PostgreSQL target table, and then used plpgsql triggers on the replication target table to clean/validate the data and write it to another table with our desired (and more strongly typed - no more storing JSON as TEXT) schema.

There were also some issues with the replication task seeming to stall and stop during the change data capture (CDC) phase. As mentioned support is kind of spotty in this area, though we learned it may have to do with sizing the DMS replication instance correctly.

oulu2006
0 replies
15h14m

We didn't even use blue/green deployment; we just had a multi-AZ deployment and migrated from 11.x -> 15.2 with about 30 seconds of downtime.

No dramas really, didn't even need logical replication.

karol
0 replies
18h31m

Hope the picture is illustrative only.

jm2dev
0 replies
5h32m

In a similar case, AWS DMS did not work for my team and we proceeded with pglogical.

Although it has some limitations, it worked pretty well, and personally I like that it is cloud agnostic. I would definitely recommend considering pglogical first.

denysvitali
0 replies
10h57m

They could probably have used pgpool [1] to avoid relying on DNS (you never know who in the chain isn't respecting the TTL).

With this, you can choose which queries end up where, and migrating the DB endpoint can be done once, centrally.

[1]: https://www.pgpool.net/docs/latest/en/html/