Post Mortem on Cloudflare Control Plane and Analytics Outage

stemlord

I'm ultimately glad this happened because it very effectively helps illustrate how we are assigning a centralized gatekeeper to the internet at the infrastructure level and why it's a bad thing.

burroisolator

While debatably unprofessional to blame your vendor, I found this read fascinating. I'm sure there are blog posts that detail how data centers work and fail, but it's rare to get that crossover from a software engineering context. It puts into perspective what it takes for an average data center of this class to fail: power outage, generator failure, and then battery loss.

weird-eye-issue

Wow they REALLY buried this important part didn't they! This took a ton of scrolling:

"Unfortunately, we discovered that a subset of services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04."

Bingo, there we have it.

indigomm

In my experience, power is the most common data center failure there is. Often it's the redundant systems that cause the failure.

weird-eye-issue

And that's completely unrelated to my comment but thanks for the insight

dmix

What does PDX-04 mean here? Not familiar with how data centers work.

edaemon

PDX is the airport code for Portland, Oregon, USA. It's the fourth Portland data center.

weird-eye-issue

Read the damn article! It's explained at the top.

rickstanley

I think this comment falls into this:

> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".

https://news.ycombinator.com/newsguidelines.html

worksonmine

Nah, if only the data center would've stayed up this wouldn't have been a problem. It's clearly on the data center. /s

dclowd9901

Yep, past the part where they spent a long time blaming the data center and power company.

sackfield

It's weird that upon reading this post, I have less confidence in Cloudflare. They basically browbeat Flexential for behaving unprofessionally, which, yes, they probably did. However, the fact that this caused entire systems that people rely on to go down is a massive redundancy failure on Cloudflare's part; you should be able to nuke one of these datacentres and still maintain services.

Very worrying is they start by stating their intended design:

> Cloudflare's control plane and analytics systems run primarily on servers in three data centers around Hillsboro, Oregon

You need way more geographic dispersion than that; this control plane is used by people across the world. And that's just the intended design, not the flawed implementation, by the way, which is wild to me.

> This is a system design that we began implementing four years ago. While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.

I don't understand why this would ever be done in this way. If Cloudflare is making a new product for consumers, shouldn't redundant design be at the forefront here? I am surprised that it was even an option. For the record, I do use Cloudflare for certain systems, and I use it because I assume it has great failovers when events like this occur, so that I don't have to worry about these eventualities. Now I will be reconsidering this: how do I actually know my Cloudflare Workers are safe from these design decisions?

> When services were turned up there, we experienced a thundering herd problem where the API calls that had been failing overwhelmed our services.

Yeah, I'll bet; it's because Cloudflare's core design is not redundant.

Really disappointed in this blog post trying to shift the blame to Flexential when this slapdash architecture should be the main problem on show. As a customer I don't care if Flexential disappears in an earthquake tomorrow, I expect Cloudflare to handle it gracefully.

onionisafruit

Is the Hillsboro thing about latency?

DylanSp

That may well be part of it, some people were talking about the impact of latency in the outage thread [1].

[1] https://news.ycombinator.com/item?id=38113952

alberth

Taking the positive outlook …

Ensuring all of their services are fully distributed is now top of mind at CF.

Ultimately, customers win if CF executes.

whoknowsidont

>While there were periods where customers were unable to make changes to those services, traffic through our network was not impacted.

They're just going to straight up lie like that? We definitely weren't able to get "traffic through [their] network" during the outage at many different random points.

So if the CF team is under the impression traffic was not impacted, dig deeper.

someonehere

I asked a sales rep once about services going out and how that would affect CF For Teams. They said it would be virtually impossible for CF to go down because of all their data centers around the world. Paraphrasing, “if there’s an outage, there’s definitely something going wrong with the internet.”

And here we are. My trust in them has hit zero.

belter

Trust? Random sample from the last 60 days...

Cloudflare outage – 24 hours now - https://news.ycombinator.com/item?id=38112515

Cloudflare Dashboard Logins Failing - https://news.ycombinator.com/item?id=38112230

Ask HN: Cloudflare Workers are down? - https://news.ycombinator.com/item?id=38074906

Cloudflare API, dashboard, tunnels down - https://news.ycombinator.com/item?id=38014582

Cloudflare Intermittent API Failures for Cloudflare Pages, Workers and Images - https://news.ycombinator.com/item?id=37819045

Cloudflare Issues with 1.1.1.1 public resolver and WARP - https://news.ycombinator.com/item?id=37762731

Cloudflare – Network Performance Issues - https://news.ycombinator.com/item?id=37604609

Cloudflare Issues Passing Challenge Pages - https://news.ycombinator.com/item?id=37336743

cj

FWIW, I'm a Cloudflare Enterprise customer and we had zero downtime. Only thing that was temporarily unavailable was the cloudflare dashboard.

I feel like a lot of people in this thread are commenting under the impression that all of Cloudflare was down for 24 hours when in reality I wouldn't be surprised if a lot of customers were unaffected and unaware of the incident.

I wouldn't even have known of the outage had it not been for HN..

ianhawes

2nd this. We had zero downtime on anything in production. The only reason we knew is because we are actively standing up a transition to R2 and ran into errors configuring buckets.

usr1106

Why would you trust a sales rep?

Even honest engineers cannot foresee the exact cascading consequences of such outages. Sales reps are paid neither to be competent on such issues nor to be honest.

marcinzm

> Our team was all-hands-on-deck and had worked all day on the emergency, so I made the call that most of us should get some rest and start the move back to PDX-04 in the morning

A minor point, but this feels like not the most efficient way to manage an emergency; some form of staggered shifts or another approach would beat just having everyone pile on. And if a lot of knowledge resides in specific individuals, so that they are vital to an effort like this and cannot be substituted, then that seems like a risk in its own right.

nabla9

They don't always know it, but all large systems are gradually moving towards a dependency management system with logic rules that cover "everything": physical, logical, human, and administrative dependencies. Every time something new that isn't covered is discovered, new rules and conditions are added. You can do it with manual checklists, multiple rule checkers, or put everything together.

I suspect that in the end it's just easier to put everything into a single declarative formal verification system and check whether a new change to the system passes, whether a transition between configurations passes, etc.
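
To make that concrete, here is a minimal sketch (Python, with made-up service and datacenter names) of the kind of declarative rule checking I mean; a real system would obviously cover far more dependency types:

    from dataclasses import dataclass, field

    @dataclass
    class Service:
        name: str
        datacenters: set            # where the service actually runs
        depends_on: set = field(default_factory=set)
        ha_required: bool = False   # e.g. anything behind a GA product

    SERVICES = {s.name: s for s in [
        Service("control-api", {"dc-1", "dc-2", "dc-3"}, {"logging"}, ha_required=True),
        Service("logging",     {"dc-3"}),   # quietly lives in a single DC
        Service("dns-config",  {"dc-1", "dc-2", "dc-3"}, ha_required=True),
    ]}

    def transitive_deps(name, seen=None):
        seen = set() if seen is None else seen
        for dep in SERVICES[name].depends_on - seen:
            seen.add(dep)
            transitive_deps(dep, seen)
        return seen

    def violations():
        """Rule: an HA-required service, and everything it transitively
        depends on, must run in at least two datacenters."""
        for svc in SERVICES.values():
            if not svc.ha_required:
                continue
            for dep in {svc.name} | transitive_deps(svc.name):
                if len(SERVICES[dep].datacenters) < 2:
                    yield f"{svc.name} -> {dep} runs only in {SERVICES[dep].datacenters}"

    for v in violations():
        print("VIOLATION:", v)   # flags control-api -> logging (dc-3 only)

Each newly discovered failure mode just becomes another rule over the same graph.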

lytedev

This is such an interesting way of putting it. I think this has been the subconscious reason I've been gravitating towards defining _everything_ I manage personally (and not yet at work) with Nix. It's not quite to the extent you're talking about here, of course, but in a similar vein at least.

thomasdeml

Poor doc: you had a high-availability three-data-center setup that utterly failed. Why spend the first third of the document blaming your data center operator? The management of the data center facility is outside of your control. You gambled that not appropriately testing your high-availability setup (which is under your control) would not have consequences. You should absolutely discuss the DC management with your operator, but that's between you and them and doesn't belong in this post mortem.

technotarek

True, not snippy: I found it interesting that their automated billing emails seemed to arrive right on time.

nijave

HA, maximum redundancy

kurok

These are well-known failure modes. The same thing happened with one of our DCs three years ago, when they were supplying power to the city and the whole DC failed.

nickdothutton

Consider modes of failure.

ShadowRegent

Interesting choice to spend the bulk of the article publicly shifting blame to a vendor by name and speculating on their root cause. Also an interesting choice to publicly call out that you're a whale in the facility and include an electrical diagram clearly marked Confidential by your vendor in the postmortem.

Honestly, this is rather unprofessional. I understand and support explaining what triggered the event and giving a bit of context, but the focus of your postmortem needs to be on your incident, not your vendor's.

Clearly, a lot went wrong and Flexential needs to do their own postmortem, but Cloudflare doesn't need to make guesses and do it for them, much less publicly.

jmbwell

If Flexential and PGE aren't sharing information or otherwise cooperating as much as Cloudflare might like, then going public with some speculation might be an attempt at applying some pressure to get to the bottom of what happened.

It might also be an effort to get out in front of the story before someone else does the speculating.

In any case, with at least three parties involved, with multiple interconnected systems… if Cloudflare is going to effectively anticipate this cluster of failure modes in future design decisions, it's reasonable for them to want to know what happened all the way down.

Edit to add: I for one am grateful for the information Cloudflare is sharing.

arrakeenrevived

>If Flexential and PGE aren't sharing information or otherwise cooperating as much as Cloudflare might like, then going public with some speculation might be an attempt at applying some pressure to get to the bottom of what happened.

It's been 2 days. I doubt PGE or Flexential have even root-caused it yet, and even if they have, good communication takes time.

You don't throw someone under the bus and smear their name publicly just because they haven't replied for two days, and you certainly don't start speculating on their behalf. That's bad partnership.

You also don't publicly share what "Flexential employees shared with us unofficially" (quote from the article) - what a great way to burn trust with people who probably told you stuff in confidence.

>if Cloudflare is going to effectively anticipate this cluster of failure modes in future design decisions, it's reasonable for them to want to know what happened all the way down.

They can do all of that without smearing people on their company blog. In fact, they can do all of that without even knowing what happened to PGE/Flexential, because per their own admission they were already supposed to be anticipating this, but failed at it. Power outages and data center issues are a known thing, and they are exactly why HA exists - HA which Cloudflare failed at. This post-mortem should be almost entirely about that failure rather than speculation about a power outage.

worksonmine

Yeah I agree. The data center should be able to blow up without causing any problems. That's what Cloudflare sells and I'm surprised a data center failure can cause such problems.

Going into such depth on the 3rd party just shows how embarrassing this is for them.

creshal

Especially since it shouldn't matter why the DC failed — Cloudflare's entire business model is selling services allegedly designed to survive that. 99% of the fault lies with Cloudflare for not being able to do their core job.

corobo

In all fairness the rest of the article is about that

andyjohnson0

> We are a relatively large customer of the facility, consuming approximately 10 percent of its total capacity.

I'm surprised that CF are renting space in colocation facilities. I would have expected a business of their size to have their own DCs. Is this common practice for cloud providers?

jsnell

The part you're missing is that their business is not actually that large (<$1 billion of revenue in 2022 - still deep in the red - and 3k employees).

vitus

> I'm surprised that CF are renting space in colocation facilities. I would have expected a business of their size to have their own DCs. Is this common practice for cloud providers?

Google for one has both. Some GCP regions [0] are in colos, while others are in places where we already had datacenters [1]. We also use colo facilities for peering (and bandwidth offload + connection termination).

I'm under the impression that most AWS Cloudfront locations are also in colo facilities.

[0] https://cloud.google.com/about/locations

[1] https://www.google.com/about/datacenters/locations/

DylanSp

I'm a little surprised too; I figured they would have their own DCs for their core control plane servers. Colos for their 300+ PoPs makes sense, though.

NicoJuicy

You thought that they build > 300 DC's?

Colo is much more flexible, cheaper, and quicker to start, especially since they sit close to the end-user on the data plane.

andyjohnson0

> You thought that they build > 300 DC's?

I have no idea how many DCs they have or operate in. Where does "300" come from?

> Colo is much more flexible, cheaper and quicker to start. Definitely since they sit close to the end-user on the data plane.

I understand that, but it has the disadvantage of reduced control and observability - particularly in the event of an outage such as that described in the blog post.

I kind of assumed that top-tier cloud platforms like AWS/Azure/GCP operate out of dedicated DCs, and that CF is similar because of their well-known scale of operations. Since my original comment has been downvoted†, someone presumably thinks it was a naive or trivial question, although I don't understand why.

(† I don't much care about downvotes, but I do take them to be a signal.)

NicoJuicy

Probably most of us follow Cloudflare a bit more closely.

They want DC's close to every big city. I think most of us knew that they can't launch > 300 DC's in such a short amount of time.

The sheer number of DCs is mentioned a lot (social networks, blogs, here).

There is a distinction between e.g. AWS/Azure/..., which work with a couple of big DCs, and Cloudflare, which operates spread across many more locations.

Your comment did make me realize it may not be that clear from an outsider's viewpoint though (fyi, I'm an outsider too).

yowai

As someone who was slightly affected by this outage, I personally also find this post-mortem to be lacking.

75% of the post-mortem talks about the power outage at PDX-04 and blames Flexential. Okay, fair - it was a bit of a disaster what was happening there judging from the text.

But by the end of November 2 (UTC), power was fully restored. It still took ~30 hours, according to the post-mortem, for Cloudflare to fully recover service. That was longer than the outage itself, and the text just states that too many services were dependent on each other. I wish they'd gone into more detail about why the recovery as a whole took that long. Are there any takeaways from the recovery process, too? Or was it really just syncing data from the edges back to the "brain" that took this long?

Also, one aspect I am missing here is the lack of communication - especially to Enterprise customers. Cloudflare support was basically radio silent during this outage except for the status page. Realistically, they couldn't do much anyway, but at least an attempt at communication would have been appreciated - especially for Enterprise customers, and even more so given that the post-mortem blames Flexential for a lack of communication.

While I like Cloudflare, since it's a great product, I think there are still a few more conclusions CF should take away from this incident.

That being said, glad you managed to recover, and thanks for the post-mortem.

ecs78

I think they just wanted a quick post-mortem. I'm sure they will add more to the blog later in the year when they implement mitigations.

DylanSp

I'm not that surprised at the relative lack of detail, given how quickly they released this; I'm surprised they published this much info so quickly. Calling it a postmortem is a bit of a misnomer, though. I'd expect a full postmortem to have the kind of detail you mention.

MatthiasPortzel

> In particular, two critical services that process logs and power our analytics — Kafka and ClickHouse — were only available in PDX-04 but had services that depended on them that were running in the high availability cluster. Those dependencies shouldn’t have been so tight, should have failed more gracefully, and we should have caught them.

This paragraph similarly leaves out juicy details. Exactly what services fail if logging is down? Were they built that way inadvertently? Why did no one notice?

vb-8448

> Also one aspect I am missing here is the lack of communication - especially to Enterprise customers.

They blame Flexential for a lack of communication, but they were the first ones not saying anything.

iAMkenough

Even "we don't know why our data center is failing, but we're sending a team over to physically investigate now" would have been A+ communication in the moment.

NicoJuicy

Everything was on the status page since the start?

DC related updates:

> Update - Power to Cloudflare’s core North America data center has been partially restored. Cloudflare has failed over some core services to a backup data center, which has partially remediated impact. Cloudflare is currently working to restore the remaining affected services and bring the core North America data center back online. Nov 02, 2023 - 17:08 UTC

> Identified - Cloudflare is assessing a loss of power impacting data centres while simultaneously failing over services.

> We will keep providing regular updates until the issue is resolved, thank you for your patience as we work on mitigating the problem. Nov 02, 2023 - 13:40 UTC

yowai

As an enterprise customer, I would expect a CSM to reach out to us informing us about the impact, going into more detail about any restoration plans, and potentially even giving ETAs or a rough prioritization for resolution.

In reality, Cloudflare's support team was essentially completely unavailable on Nov 2, leaving only the status page. And for most of the day, the updates on the status page were very sparse except "we are working on it", and "We are still seeing gradual improvements and working to restore full functionality.".

Yet clearer status updates were only given starting on Nov 3. However, I still don't think I heard anything from support or a CSM during that time.

minimaul

> Our team was all-hands-on-deck and had worked all day on the emergency, so I made the call that most of us should get some rest and start the move back to PDX-04 in the morning. That decision delayed our full recovery, but I believe made it less likely that we’d compound this situation with additional mistakes.

I liked this - the human element is often underemphasised in these kinds of reports, and trying to fix a major outage while overly tired is only going to add avoidable mistakes.

I don’t know how it would work for an org of Cloudflare’s size, but I know we have plans for a significant outage for staff to work/sleep in shifts, to try to avoid that problem as well.

Issue there is that you need a way to hand over the current state of the outage to new staff as they wake up/come online.

kkielhofner

I’m curious, have these plans ever been tested in a real incident?

Like Mike Tyson says, everyone has a plan until they get punched in the face.

darkwater

Well, this is definitely NOT a blameless post-mortem!

Obviously I'm joking because they are blaming an external company (Flexential) to which they are surely paying big money for the DC space.

xyst

I wonder if CF execs are aiming to use this to get out of their long-term contract with them?

RockRobotRock

hot take: HN is way too biased and sympathetic towards the provider whenever an outage like this happens.

kkielhofner

Not a lot of measured takes here. It seems to be either “eh we get it, comms could have been better” or “they’re idiots”.

The first group of people have been to war. The second have not.

Tocra

I am really upset about this situation on behalf of CF. However, why don't they think about generating their own electricity with renewable energy sources?

duskwuff

> why don't they think about generating their own electricity with renewable energy sources?

How exactly do you imagine that working while inside a data center operated by a third party?

It's not like they let you stick some solar panels on the roof and run an extension cord to your rack.

usr1106

And why not run their own DC?

It doesn't change anything fundamentally. A complex product is only as good as the weakest link. I have worked with various employers, some world leaders at the time. All of them had seriously weak links.

logronoide

Classic distraction maneuver. This postmortem is a prime example of tech porn that diverts attention from the main issue: many at Cloudflare didn't do their job properly.

lbriner

A lot of mud-slinging on here about HA setup and CF's dealing with the problem but I can only assume people are armchair experts with no real experience of HA at the scale of CF.

"So the root cause for the outage was that they relied on a single data center.". No. Root cause was that data centre operator didn't manage the outage properly and didn't have systems in place in which case they could have avoided it + some systems knowingly and unknowingly had dependencies on the centre that went down because CF did have systems in place to allow that centre to fail.

"Cloudflare has a shit reputation in my eyes, because their terrible captchas". You don't like one product so they have a shit reputation? Enough said.

"but unless you are physically powering off the dc or at least disconnecting the network from the outside world you are not testing a real disaster." If you have ever had to do this, you know that it is never a good feeling. On-paper, yes, you should try your DR but in reality, even if it works, you lose data, you get service blips, you get a tonne of support calls and if it doesn't work, it might not even rollback again. On top of that, it isn't a case of just disconnecting something, most problems are more complicated. System A is available but not system B. Routers get a bad update but are still online, and on top of all of that, you would need some way to know that everything is still working and some problems don't surface for hours or until traffic volume is at a certain level etc. If you trust that a data centre can stay online for long periods of time and that you would then be able to migrate things at a reasonable rate if it doesn't, then you have to trust that to an extend.

All in all, CF are not attempting to blame someone; even though a lot is down to Flexential, the last paragraph of the first section says, "To start, this never should have happened...I am sorry and embarrassed for this incident and the pain that it caused our customers and our team."

Well done CF

kkielhofner

Couldn’t agree more.

Many of these comments sound like they’re coming from some mythical alternate universe where bugs don’t exist and people and orgs have 100% flawless execution every time.

It reminds me a little of someone sitting at a sports bar yelling about a “stupid” play or otherwise criticizing a 0.0001% athlete who is playing at a level they can’t possibly fathom.

Monday Morning quarterbacking.

menaerus

> some systems knowingly and unknowingly had dependencies on the centre that went down because CF did have systems in place to allow that centre to fail.

I mean, you're contradicting yourself in the same sentence. Had CloudFlare had such a system in place that would allow that particular center to fail, there would have been no outages in the service. The truth is that they didn't account for it, and because they missed it, that center became a single point of failure, which is what brought the whole CloudFlare service down. The power outage was just a trigger that exposed a weakness in their system design, not a root cause.

awesomebing1

Somewhat amazed at the structure of this article: the third party is discussed for the first 75% of the blog post, while the first-party recovery efforts are detailed in considerably fewer paragraphs. It's promising to see a path forward mentioned, but I can't help but wonder why this was published instead of acknowledging their failure/circumstances now and publishing a complete post-mortem later, after the dust fully settles (i.e. without speculation).

xyst

To make sure their stonk doesn't drop at market open next week. Investors will read this (or get the sound bites) and shrug it off as some vendor issue rather than a deep issue that will require months of rework (millions of dollars, and thus an impact on earnings).

ahoka

It’s called “shifting the blame”.

sidcool

They really threw the electricity power provider under the bus there.

tux3

The electricity provider is fine, it's Flexential that looks incredibly opaque and non-communicative in a stressful situation.

While Cloudflare should have been better prepared for this, it seems to be amateur hour in that particular Portland data-center. Other customers (Dreamhost, etc) were impacted too, and I can't imagine they don't also have some very pointed questions.

SentinelRosko

Sure, but DreamHost recovered fully within 12 hours [1], while Cloudflare took almost 2 days [2]

[1] https://www.dreamhoststatus.com/pages/incident/575f0f6068263... [2] https://www.cloudflarestatus.com/incidents/hm7491k53ppg

__turbobrew__

> However, we had never tested fully taking the entire PDX-04 facility offline.

That is a painful lesson, but unless you are physically powering off the dc or at least disconnecting the network from the outside world you are not testing a real disaster.

You can point fingers at the facility operators, but at the end of the day you have to be able to recover from a dc going completely offline and maybe never coming back. Mother Nature may wipe it off the face of the earth.

martinald

This is a fair point. Imagine there had been a serious fire like OVH suffered or flooding that destroyed the data center. Would Cloudflare have been able to recover?

NicoJuicy

That's not what happened here. Their edge worked fine.

Business was mostly running as usual.

The OVH outage was immediate downtime.

creshal

Most likely, yes. They have enough customer lock-in that enough customers would stick with them even if it took them a week to rebuild everything from in other DCs.

pseudocoder_sp

Someone should make a web series about this incident. It would be a nice story to tell. Name: Modern Day Disaster. Directed by: Matthew Prince. Releasing on: 25th December on Netflix. Based on a true story.

iot_devs

Why wasn't the very first step to fail over to Europe?

ahoka

My question too, although possibly failing over seemed like a greater risk at first. BTW, is there any unexpected GDPR implication of that? Assuming that failing over means restoring US backups in the EU.

MRtecno98

iirc the GDPR prohibits storing EU data in non-EU servers, not vice-versa

pests

They did, after two hours. For the first of those they assumed the generators would be back, but then they ran into the breaker issue, which caused the full-day delay.

Dunedan

> While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.

> The other two data centers running in the area would take over responsibility for the high availability cluster and keep critical services online. Generally that worked as planned. Unfortunately, we discovered that a subset of services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04.

> A handful of products did not properly get stood up on our disaster recovery sites. These tended to be newer products where we had not fully implemented and tested a disaster recovery procedure.

So the root cause for the outage was that they relied on a single data center. I find that pretty embarrassing for a company like Cloudflare, which powers such relevant parts of the internet.

cyberax

> While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.

It's amazing that they don't have standards that mandate all new systems to use HA from the beginning.

troyvit

> I am sorry and embarrassed for this incident and the pain that it caused our customers and our team.

So do they.

SushiHippie

And the top comment on the other HN post called it: https://news.ycombinator.com/item?id=38113503

davedx

And that this was unironically written in the same post mortem: “We are good at distributed systems.”

There’s a lack of awareness there.

brookst

Good != infallible

steve1977

Well, they did distribute their systems. Some were in the running DC, some were not ;)

belter

They distributed the faults across all their customers....

emadda

Their uptime was eventually consistent

ecs78

haha. The control plane was eventually consistent after 3 days

ZiiS

They are good at systems that are distributed; they are very bad at ensuring the systems they sell their customers are distributed.

creshal

> I find that pretty embarrassing for a company like Cloudflare, which powers such relevant parts of the internet.

Bah, who cares about such unimportant details, what's important is that ~dev velocity~ was reaaally high right until that moment!

> We were also far too lax about requiring new products and their associated databases to integrate with the high availability cluster. Cloudflare allows multiple teams to innovate quickly. As such, products often take different paths toward their initial alpha. While, over time, our practice is to migrate the backend for these services to our best practices, we did not formally require that before products were declared generally available (GA). That was a mistake as it meant that the redundancy protections we had in place worked inconsistently depending on the product.

Complete and utter management failure. And customers apparently are sold what Cloudflare internally considers to be alpha quality software?

organsnyder

Having worked at companies with varying degrees of autonomy, in my experience a more flexible structure allows for building systems that are ultimately more resilient. Of course, there are ways to do it poorly, but that doesn’t mean it’s a “complete and utter management failure”.

brookst

> Complete and utter management failure

Too strong. A failure certainly, but painting this as the worst possible management failure is kind of silly.

davewritescode

I’m going to leave out some details but there was a period of time where you could bypass cloudflare’s IP whitelisting by using Apple’s iCloud relay service. This was fixed but to my knowledge never disclosed.

byteknight

There still exist many bypasses that work in a lot of cases. There's even services for it now. Wouldn't be surprised if that or similar was a technique employed.

belter

There was a time when they were dumping encryption keys into search engine caches for weeks, and had the audacity to claim here, the issue was "mostly" solved. Until they were called out on it by Google Project Zero team...

"Cloudflare Reverse Proxies Are Dumping Uninitialized Memory" - https://news.ycombinator.com/item?id=13718752

marcinzm

> Complete and utter management failure. And customers apparently are sold what Cloudflare internally considers to be alpha quality software?

This has been my experience with AWS and GCP as well. Assume anything that's under 3 years old is not really GA quality no matter what they say publicly.

arrakeenrevived

I've been involved with some new service launches at AWS, and it's a strict requirement that everything goes through some rigorous operational and security reviews that cover exactly these issues before the service can be launched as GA. Feature-wise people might consider them "alpha", but when it comes to the resilience and security of the launched features, they are held to much higher standards than what is being described in this post-mortem.

simion314

>I find that pretty embarrassing for a company like Cloudflare,

Cloudflare has a shit reputation in my eyes, because their terrible captchas are just broken, yet they ship them for millions to suffer. The latest problem for me is ChatGPT: I get a captcha on every query in Firefox; switching to Chromium, I get zero. And the captcha itself is broken: I use the audio version, I enter the correct answer, and they claim it's wrong. I try again and they have the exact same question; they always ask about a cat sound. Probably the system got broken in an update and they haven't noticed yet.

So IMO do not expect quality from them or OpenAI dev ops.

sophacles

Sounds like ChatGPT doesn't want your business and tuned their Cloudflare settings accordingly. Conveniently, Cloudflare is getting the blame, which is presumably part of what they're paying for.

thelastparadise

> I find that pretty embarrassing for a company like Cloudflare, which powers such relevant parts of the internet.

Absolute lack of faith in cloudflare rn.

This is amateur hour stuff.

It's especially egregious that these are new services that were rolled out without HA.

NicoJuicy

?

Tbh. As far as I can see, their data plane worked at the edge.

Cloudflare released a lot of new products, and the ones affected were: Stream, new image upload, and Logpush.

Their control plane was bad though. But since most products kept working, that's more redundancy than most products have.

The proposed solution is simple:

- GA requires being in the high availability cluster (sketched below)

- test entire DC outages
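
To make the first point concrete, a rough sketch of what such a release gate could check (hypothetical product metadata, not Cloudflare's actual tooling):

    HA_CLUSTER_DCS = {"dc-1", "dc-2", "dc-3"}

    def can_go_ga(product):
        """Refuse GA unless the product runs across the HA cluster and has a
        full-datacenter failover test on record."""
        deployed = set(product["datacenters"])
        missing = HA_CLUSTER_DCS - deployed
        if missing:
            return False, f"{product['name']}: not deployed to {sorted(missing)}"
        if not product.get("dc_failover_tested", False):
            return False, f"{product['name']}: no full-DC failover test on record"
        return True, "ok"

    # A product still living in a single datacenter fails the gate:
    print(can_go_ga({"name": "new-product", "datacenters": ["dc-3"]}))
    # (False, "new-product: not deployed to ['dc-1', 'dc-2']")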

jorams

The combination of "newer products" and then having "our Stream service" as the only named service in the post-mortem is very odd, since Stream is hardly a "newer product". It was launched in 2017 and went GA in 2018[2]. If after 5 years it still didn't have a disaster recovery procedure I find it hard to believe they even considered it.

[1]: https://blog.cloudflare.com/introducing-cloudflare-stream/ [2]: https://www.cloudflare.com/press-releases/2018/cloudflare-st...

yowai

From what I was reading on the status page & customers here on HN, WARP + Zero Trust were also majorly affected, which would be quite impactful for a company using these products for their internal authentication.

It's not just streams, image upload & Logpush.

kurok

This was a short downtime. But big companies must build their own gateways, while small ones just wait and rely on CF.

NicoJuicy

Those customers were impacted on the config plane until the DC was back up (1-2 hours?).

The data plane ( which I mentioned) had no issues.

It's literally in the title what was affected: "Post Mortem on Cloudflare Control Plane and Analytics Outage"

E.g. the status page mentioned the healthchecks not working, while everything was actually fine with them; there were just no analytics at that time to confirm that.

Source: I watched it all happen in the cloudflare discord channel.

If you know anyone that is claiming to be affected on the data plane for the services you mentioned, that would be an interesting one.

Note: I remember emails were also more affected though.

yowai

> Those customers were impacted until the DC was back up ( 1-2 hours?) On the config plane.

Which was still like ~12+ hours, if we check the status page.

>Eg. The status page mentioned the healthchecks not working, while everything was fine with it. There were just no analytics at that time to confirm that.

What good is a status page that's lying to you? Especially since CF manually updates it, anyway?

>Source: I watched it all happen in the cloudflare discord channel.

Wow, as a business customer I definitely like watching some Discord channel for status updates.

NicoJuicy

?

This wasn't about status updates going only to Discord.

There is literally a discussion section on the Discord, named #general-discussions.

Not everything was clear in the Discord either (e.g. the healthchecks were discussed there); that's not something you want to copy-paste into the status updates...

Cloudflare's priority seemed to be getting everything back up, and what they thought was down was always mentioned in the status updates.

yowai

Oh, I just looked it up; I thought you meant that CF engineers were giving real-time updates there. That's not the case.

However, I still fail to see your argument regarding Zero Trust and not being impacted. The status page literally mentioned that the service was recovered on Nov 3, so I don't understand what you mean by:

>The data plane ( which I mentioned) had no issues.

There's literally a "Data plane impact" section all over the status page, and ZT is definitely in the earlier ones. And that's despite the fact that status updates on Nov 2 were very sparse until power was restored.

throwaway6920

> Tbh. As far as I can see, their data plane worked at the edge.

Arguably, it's best to think of the edge as a buffering point in addition to processing. Aggregation has to happen somewhere, and that's where shit hit the fan.

NicoJuicy

? That would mean their data is at the core cluster. That's not true, or at least I haven't seen any evidence to support that statement.

Cloudflare's data lives in the edge and is constantly moving.

The only things on the data plane not living in the edge (as was noticed) are Stream, Logpush, and new image resize requests (existing ones worked fine).

throwaway6920

>That would mean their data is at the core cluster. That's not true or I haven't seen any evidence to support that statement.

You're being loose in your usage of 'data'. No one is talking about cached copies of an upstream, but you probably are.

Read the post mortem a bit more closely. They explicitly state that the control plane's source of truth lives in core, and that logs aggregate back to core for analytics and service ingestion. Think through the implications of that one.

e1g

That’s my interpretation as well. There is one central brain, and “the edge” is like the nervous system that collects signals, sends it to the brain, and is _eventually consistent_ with instructions/config generated by the brain.

pests

> It is not unusual for utilities to ask data centers to drop off the grid when power demands are high and run exclusively on generators.

Are the data centers compensated or anything for this? I'd imagine generator-only might cost more in terms of fuel and wear-and-tear/maintenance/inspections.

edit:

> DSG allows the local utility to run a data center's generators to help supply additional power to the grid. In exchange, the power company helps maintain the generators and supplies fuel

Interesting.

nijave

I'm not very well versed in this space but I've been told Progressive Insurance in Cleveland, OH has a similar (sounding) agreement. According to PGE's website, they basically pay for everything https://portlandgeneral.com/save-money/save-money-business/d...

solatic

Not criticism, just remarks:

> While, over time, our practice is to migrate the backend for these services to our best practices, we did not formally require that before products were declared generally available (GA).

I really like the model where a single team in a company, with Product + Dev, can quickly ship, iterate on a new product, and prove market demand without going through layers and layers of internal bureaucracy (Ops/Infra, Security, Privacy/Legal, Finance approval for production-scale), with the main stipulation being that such work is marked as alpha/beta/preview, and only going through the layers of internal bureaucracy once it's ready to go GA. But most companies really struggle with this, especially with ensuring that customers are never exposed to a/b/p software by default, requiring opt-in from the customer, allowing the customer to easily opt-out, and ensuring that using a/b/p software never endangers GA features they depend on. Building that out, if it's even on a company's internal Platform/DevX backlog, is usually super far down as a "wishlist" item. So I'm super interested to see what Cloudflare can build here and whether that can ever get exposed as part of their public Product portfolio as well.
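
A trivial sketch of the customer-facing side of that model (hypothetical product names): pre-GA products are blocked by default and require an explicit per-customer opt-in, so they can never silently endanger the GA features a customer depends on.

    from enum import Enum

    class Maturity(Enum):
        ALPHA = "alpha"
        BETA = "beta"
        PREVIEW = "preview"
        GA = "ga"

    PRODUCTS = {"workers": Maturity.GA, "new-widget": Maturity.BETA}

    def can_use(product, customer_opt_ins):
        if PRODUCTS[product] is Maturity.GA:
            return True                        # GA: always available
        return product in customer_opt_ins     # pre-GA: explicit opt-in only

    assert can_use("workers", set())
    assert not can_use("new-widget", set())
    assert can_use("new-widget", {"new-widget"})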

> We need to use the distributed systems products that we make available to all our customers for all our services so they continue to function mostly as normal even if our core facilities are disrupted.

Super excited to see this. Cloudflare Workers is still too much of an "edge" platform and not a "main datacenter" platform, at least because D1 is still in beta and even if it wasn't, Postgres is far more feature-ful, and that pulls more software into a traditional single-datacenter model. So if Cloudflare can really succeed at this, then it'll be a much stronger statement in favor of building out software in an edge-only model.

Between the Pages outage and the API outage happening in one week, I was considering selling my NET stock, but reading a postmortem like this reminds me why I invested in NET in the first place. Thanks Matt.

dopylitty

> We need to use the distributed systems products that we make available to all our customers for all our services so they continue to function mostly as normal even if our core facilities are disrupted.

>> Super excited to see this. Cloudflare Workers is still too much of an "edge" platform and not a "main datacenter" platform, at least because D1 is still in beta and even if it wasn't, Postgres is far more feature-ful, and that pulls more software into a traditional single-datacenter model. So if Cloudflare can really succeed at this, then it'll be a much stronger statement in favor of building out software in an edge-only model.

On the other hand, when a company dogfoods its own products, you end up in a dependency hell like the one AWS is apparently in, where a single Lambda cell hitting full capacity in us-east-1 breaks many services in all regions.

I'm sure there is a right way to manage end-to-end dependencies for 100% of your services past, present, and future, but increasingly I'm of the opinion that it's not possible in our economic system to dedicate enough resources to maintain such a dependency mapping system, since that takes away developer time from customer-facing products that show up in the bottom line. You just limp along and hope that nothing happens that takes out your whole product.

Maybe companies whose core business is a money printing machine (ads) can dedicate people to it but companies whose core business is tech probably don't have the spare cash.

marcinzm

> Security

Security is what keeps a single service getting breached from causing the whole company to get breached.

> Privacy/Legal

Cloudflare doesn't get indemnification from the law just because a customer agrees to mutually break the law.

throwaway6920

> I really like the model where a single team in a company, with Product + Dev, can quickly ship, iterate on a new product, and prove market demand without going through layers and layers of internal bureaucracy (Ops/Infra, Security, Privacy/Legal, Finance approval for production-scale), with the main stipulation being that such work is marked as alpha/beta/preview, and only going through the layers of internal bureaucracy once it's ready to go GA.

Speaking from personal experience, what you're claiming as 'good' meant, for CF, that SRE (usually core, but edge also suffered) got stuck trying to fix a fundamentally broken design that was known to be faulty (and called faulty repeatedly) but forced through.

Nothing about this is desirable or will end well.

This reckoning was known and raised by multiple SREs nearly a decade before this occurred, and there were multiple near misses in the last few years that were ignored.

The part that's probably funny (and painful) for ex-CF SREs is that the company will do a hard pivot and try to rectify this mess. It's always harder to fix after the fact than to build for it in the first place, and they've ignored this for a long while.

solatic

I'm not sure if you understood my argument? I'm arguing that it's fine to ship a "fundamentally broken design" as long as the company makes abundantly clear that such software is shipped as-is, without warranty of any kind, MIT-license-style. Ramming that kind of software through to GA without unanimous sign-off from all stakeholders (infra/ops, sec, privacy/legal, etc.) is fundamentally unacceptable under such a model. Maybe there's an argument to be made that such a model is naïve, that in practice the gatekeepers for GA will always be ignored or overruled, but I would at least prefer to think that such cases are examples of organizational dysfunction rather than a problem with the model itself, which tries to balance between giving Product the agility it needs to iterate on the product, Infra/Sec/Legal concerns that really only apply in GA, and Ops (SRE) understanding that you can't truly test anything until it's in production; the same production where GA is.

TekMol

Contrary to others here, I find the postmortem a bit lacking.

The TLDR is that CF runs in multiple data centers, one went down, and the services that depend on it went down with it.

The interesting question would be why those services did depend on a single data center.

They are pretty vague about it:

    Cloudflare allows multiple teams to innovate quickly. As such,
    products often take different paths toward their initial alpha.
If I were the CEO, I would look into the specific decisions of the engineers and why they decided to make services depend on just one data center. That would make an interesting blog post to me.

Designing a highly available system and building a company fast leads to interesting tradeoffs. The details would be interesting.

nijave

>I would look into the specific decisions of the engineers and why they decided to make services depend on just one data center

And the product team defining requirements

And IT/governance/architecture teams for not properly cataloging dependencies

And the sales and marketing team not clearly articulating what they're selling (a beta/early access product that's not HA)

kccqzy

> why they decided to make services depend on just one data center

In my experience, no engineers really decided to make services depend on just one data center. It happened because the dependency was overlooked. Or it happened because the dependency was thought to be a "soft dependency" with graceful degradation in case of unavailability but the graceful degradation path had a bug. Or it happened because the engineers thought it had a dependency on one of multiple data centers, but then the failover process had a bug.
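
A minimal sketch (hypothetical names) of that "soft dependency" pattern; the degradation path is the part that rarely gets exercised and therefore hides bugs:

    import logging

    def record_analytics(event):
        """Stand-in for a call into a service that only lives in one datacenter."""
        raise ConnectionError("analytics cluster is down")

    def handle_request(event):
        try:
            record_analytics(event)      # intended to be a best-effort dependency
        except ConnectionError:
            # Graceful degradation. A bug here -- re-raising, blocking on an
            # unbounded retry queue, or logging to the same dead cluster --
            # silently turns the soft dependency into a hard one, and it only
            # shows up during a real outage.
            logging.warning("analytics unavailable, dropping event")
        return "ok"                      # the core request path still succeeds

    print(handle_request({"type": "page_view"}))   # "ok" despite the outage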

Reminds me of that time when a single data center in Paris for GCP brought down the entire Google Cloud Console albeit briefly. Really the same thing.

throwaway6920

> In my experience, no engineers really decided to make services depend on just one data center.

Partially true in this case; I can't speak to modern CF (or won't, more so), but a large number of internal services were built around SQL DBs and weren't built with any sense of eventual consistency. Usage of read replicas was basically unheard of. Knowing that, and that this was normal, it's a cultural issue rather than an "oops" issue.

Flipping the whole DC's data sources is a sign of what I'm describing; a FAANG would instead be running services in multiple DCs rather than relying on a primary/secondary architecture.

disgruntledphd2

Dunno about that, I've read similar internal postmortems at the FAANG I worked at.

nanankcornering

I experienced it myself within the last 24 hours. A new D1 & Hyperdrive deployment was not working; it would spew out internal errors & timeouts.

Both are non-GA products, and the point is that non-GA products are not part of the HA cluster (yet).

pests

Isn't this sentence a bit further down more clear?

> This is a system design that we began implementing four years ago. While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.

and

> It [PDX-04] is also the default location for services that have not yet been onboarded onto our high availability cluster.

austinkhale

I love how thorough Cloudflare post mortems are. Reading the frank, transparent explanations is like a breath of fresh air compared to the obfuscation of nearly every other company's comms strategy.

We were affected but it’s blog posts like these that make me never want to move away. Everyone makes mistakes. Everyone has bad days. It’s how you react afterwards that makes the difference.

Voloskaya

> Everyone makes mistakes. Everyone has bad days.

The issue is when you start having bad days every other day though. We use and depend on CloudFlare Images heavily, it has now been down more than 67 hours over the last 30 days (22h on October 9th, 42h Nov 2 - Nov 4 and a sprinkle of ~hour long outages in between). That's 90.6% availability over the last month.

Transparency is a great differentiator between providers that are fighting in the 99.9% availability range, but when you are hanging on for dear life to stay above the one 9 availability, it doesn't matter.

ecs78

They are a younger company than these other providers. Microsoft, Google, and AWS had their own growing pains and disasters. Remember when Microsoft accidentally deleted all the data (contacts, photos, etc.) off all their customers' Danger phones and had no backup? Talk about naming their product a self-fulfilling prophecy.

SentinelRosko

I would generally agree with you, but this post mortem was 75% blaming Flexential, even though it took them almost two days to recover after power was restored. The power outage should have been a single paragraph before pivoting: DC failures happen, it's part of life. Failing to properly account for and recover from one is where the real learnings for Cloudflare are.

ecs78

It was more of an incident report. The efforts to get back online mostly revolved around Flexential, so it makes sense to dive into their failings. That said, it is clear there were major lapses of judgement around the control plane design, since it should be able to withstand an earthquake. That they don't have regular disaster recovery testing of the control plane and its dependencies seems crazy. I wonder if it is more that they hoped to eliminate some of those dependencies and replace them with in-house technology, and hedged their bets on the risk.

w10-1

I agree, but I also think that for security purposes they should leave out extraneous detail. Also, I know they want to hold their suppliers accountable, but I would hold off pointing fingers. It doesn't really improve behavior, and it makes incentives worse.

I really appreciate that they're going to fix the process errors here. But as they suggested, there's a tension between moving fast and being sure. This is typically managed like the weather, buying rain jackets afterwards (not optimal). I'd be curious to see how they can make reliability part of the culture without tying development up in process.

Perhaps they can model the system in software, then use traffic analytics to validate their models. If they can lower the cost of reliability experiments by doing virtual experiments, they might be able to catch more before roll-out.

NicoJuicy

> know they want to hold their suppliers accountable

They do both. They stated what their problem was and they stated their due diligence in picking a DC

> While the PDX-04’s design was certified Tier III before construction and is expected to provide high availability SLAs

They said the core issue: innovating fast, which led to not requiring new products to be in the high availability cluster.

Which is also one of the fixes.

From Cloudflare's POV, part of what originally made it worse is the lack of communication by the DC.

Which is an issue, if you want to inform clients.

devwastaken

What "security purposes"? Good security isn't based on ignorance of a system, it is on the system being good. We create a self fulfilling prophecy when we hide security practices because what happens is then very few will properly implement their security. Openness is necessary for learning.

logifail

> I also think that for security purposes they should leave out extraneous detail

Disagree completely, it's the frank detail that makes me trust their story.

ecs78

Cloudflare vowed to be extremely transparent since the start of their existence. I'm very happy with the fact they have managed to keep this a core company value under extreme growth. I hope it continues after they reach a stable market cap. It isn't like Google that vowed not to be evil until they got big enough to be susceptible to antitrust regulation and negative incentives related to ad revenue.

ShadowRegent

Maybe, but I think that their "Informed Speculation" section was probably unnecessary. They may or may not be correct, but give Flexential an opportunity to share what actually happened rather than openly guessing on what might have happened. Instead, state the facts you know and move onto your response and lessons learned.

DylanSp

Yeah, that part really rubbed me the wrong way. If this was a full postmortem published a couple of weeks after the fact and Flexential still wasn't providing details, I could maybe see including it, but this post is the wrong place and time.

ImPostingOnHN

I prefer to have their informed speculation here.

Has Flexential provided a similarly detailed, public root cause analysis? If so, maybe we can refer to it. If not, how do you expect us to read it?

ShadowRegent

It’s only been a couple of business days, and it’s likely that they themselves will need root cause from equipment vendors (and perhaps information from the utility) to fully explain what happened. Perhaps they won’t publish anything, but at least give them an opportunity before trying to do it for them.

tebbers

I think overall Cloudflare did a decent job on this. Clearly the DC provider cocked up big time here, but Cloudflare kept running fine for the vast majority of customers globally. No system is perfect and it’s only apocalyptic scenarios like this where the vulnerabilities are exposed - and they will now be fixed. Hope the SRE guys got some rest after all that stress.

vb-8448

> Clearly the DC provider cocked up big time here

Actually, this is the CF version; maybe Flexential will come out with a different one.

BTW, if you design a system to survive a DC failure, you cannot blame the DC failure.

divbzero

Cloudflare's control plane and analytics systems run primarily on servers in three data centers around Hillsboro, Oregon. The three data centers are independent of one another, each have multiple utility power feeds, and each have multiple redundant and independent network connections. The facilities were intentionally chosen to be at a distance apart that would minimize the chances that a natural disaster would cause all three to be impacted, while still close enough that they could all run active-active redundant data clusters.

If the three data centers are all around Hillsboro, Oregon, an earthquake could probably take out all three simultaneously.

ahoka

They would then go into a very detailed description of how tectonic activity caused the outage.

pests

> Hillsboro, Oregon, an earthquake could probably take out all three simultaneously.

Is it west of I5?

(yes)

Oh yeah, they all gone.

Cascadia Subduction Zone - https://pnsn.org/outreach/earthquakesources/csz

ZiiS

And most of their SREs. Spending 30 hours to recover from the worst natural disaster in recorded history is slightly different than recovering from a ground fault on a single transformer.

bell-cot

Wikipedia's entry for Hillsboro: "Elevation 194 ft (60 m)"

Between that, and being ~50 miles inland - I'd say there's ~zero threat of Cascadia quakes or tsunamis directly knocking out those DC's. (Yeah, larger-scale infrastructure and social order could still be killers.)

OTOH - Mt. St. Helens is about 60 miles NNE of Hillsboro. If that really went boom, and the wind was right...how many cm's of dry volcanic ash can the roofs of those DC's bear? What if rain wets that ash? How about their HVAC systems' filters?

pests

I was never worried about the tsunami. Okay, maybe not gone, but I wouldn't say it would be operational.

https://www.oregon.gov/oem/Documents/Cascadia_Rising_Exercis...

50% of roads and nearly 75% of bridges damaged on the west coast and the I5 corridor.

Refer to PDF page #93 where over 70% of power generation is highly damaged on the I5 corridor and 60% in the coastal areas with 0% undamaged.

Highly damaged - "Extensive damage to generation plants, substations, and buildings. Repairs are needed to regain functionality. Restoring power to meet 90% of demand may take months to one year."

"In the immediate aftermath of the earthquake, cities within 100 miles of the Pacific coastline may experience partial or complete blackout. Seventy percent of the electric facilities in the I-5 corridor may suffer considerable damage to generation plants, and many distribution circuits and substations may fail, resulting in a loss of over half of the systems load capacity (see Table 22). Most electrical power assets on the coast may suffer damage severe enough as to render the equipment and structures irreparable"

bell-cot

Good backup generators at their colos could handle the lack of utility power for days to weeks. More and better generators could be hauled in and connected.*

The two big problems I'd see would be (1) Social Order and (2) Internet Connectivity. DC's are not fortresses, and internet backbone fibers/routers/etc. are distributed & kinda fragile.

*After all the large-scale power outages & near-outages of recent decades, Cloudflare has no excuse if they lack really-good backup generators at critical facilities. And with their size, Cloudflare must support enough "critical during major disaster" internet services to actually get such generators.

nanankcornering

That's why DRC exists, right?

nanankcornering

At what time were you notified, Matt?

eastdakota

Of the incident? Someone on my team called me about 30 minutes after it started. It was challenging for me to stay on top of because it was also the same day as our Q3 earnings call. But team kept me informed throughout the day. I helped where I could. And they handled a very difficult situation very well. That said, lots we can learn from and improve.

iAMkenough

Coincidental timing?

mercurialuser

Did you rebuild all the servers from scratch?

chaz6

What I find bizarre is that the Cloudflare share price jumped when the outage happened!

Having read the post mortem, I do not think it could have been handled any better. I think the decision to extend the outage in order to provide rest was absolutely correct.

I always enjoy reading these reports from Cloudflare as they are the best in the business.

dboreham

There's a class of investor (and their trade bots presumably) that sees outrage over a service outage as proof the provider is now mission critical, hence able to "extract value" from the market.

eastdakota

I was surprised we didn't get a single question about it from an analyst or investor, either formally on the Q3 call or on any callbacks we did after. One weird phenomenon we've seen — though not so much in this case because the impact wasn't as publicly exposed — is that investors after we've had a really bad outage say: "Oh, wow, I didn't fully appreciate how important you were until you took down most of the Internet." So… ¯\_(ツ)_/¯

matdehaast

Some days things just go badly. The only thing you can change is how you respond. Well done to you and the team for getting through this.

I for one am, and will always be, a Cloudflare customer.