
Cloudflare API Down

buro9
55 replies
4d3h

When I worked there (3+ years ago), if PDX were out then "the brain" was out... things like DDoS protection were already being done within each PoP (so that will be just fine, even for L3 and L7 floods, even for new and novel attacks), but nearly everything else was done with the compute in PDX and then shipped to each PoP as configuration data.

The lifecycle is: PoPs generate/gather data > send to PDX > compute in PDX > ship updates / data to PoPs.

If you take out PDX, then as so much runs on fresh data, it starts getting stale.

I doubt everything has changed since then, so this is unlikely to be just "API down" and more likely that a lot of things are now in a degraded state as they're running on stale information (no update from PDX)... this includes things like load balancing, the tiered caching (Argo Smart Routing), Warp / Zero Trust, etc.
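
Purely to illustrate that failure mode (this is not Cloudflare's code, just a sketch of "keep serving the last-known-good config and watch it age"):

    import time

    class ControlPlaneConfig:
        """Last-known-good configuration shipped from the central compute (PDX here)."""

        def __init__(self, max_age_seconds: float):
            self.max_age_seconds = max_age_seconds
            self.config = {}          # e.g. load balancing weights, routing tables, ...
            self.last_update = None   # when the control plane last shipped anything

        def apply_update(self, new_config: dict) -> None:
            # Called whenever the control plane successfully pushes fresh data.
            self.config = new_config
            self.last_update = time.monotonic()

        def is_degraded(self) -> bool:
            # No update yet, or the last one is older than we are comfortable with.
            if self.last_update is None:
                return True
            return time.monotonic() - self.last_update > self.max_age_seconds

        def get(self, key, default=None):
            # The PoP never stops answering; it just answers from increasingly stale data.
            return self.config.get(key, default)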

Even if it were only "API down", then bear in mind that a lot of the automation that customers have will block attacks by calling the API... "API down" is a hell of a window of opportunity for attackers.
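
For a sense of what that customer automation tends to look like — a minimal sketch, assuming the zone-level IP Access Rules endpoint and a token with firewall-edit permission (check the current API docs for the exact paths and fields):

    import requests

    API_TOKEN = "..."   # scoped API token (assumed to have firewall-edit permission)
    ZONE_ID = "..."     # the zone under attack

    def block_ip(ip: str, note: str = "blocked by attack-mitigation script") -> dict:
        # Create an IP Access Rule that blocks a single address for this zone.
        resp = requests.post(
            f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/firewall/access_rules/rules",
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            json={
                "mode": "block",
                "configuration": {"target": "ip", "value": ip},
                "notes": note,
            },
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()

    # If api.cloudflare.com itself is unreachable, none of this runs -- which is
    # exactly the window of opportunity described above.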

Note that just before I'd left they'd been investing in standing up AMS (I think) but had never successfully tested a significant failover, and the majority of services that needed fresh state did not know how to do this.

PS: :scream: most of the observability was also based in PDX, so hugs to all the teams and SREs currently running blind.

flutas
43 replies
4d3h

Someone else posted about PDX02 going down entirely[0], so it sounds like this is the root cause, especially with the latest status update.

> Cloudflare is assessing a loss of power impacting data centres while simultaneously failing over services.

> [0]: Looks like they lost utility, switched to generator, and then generator failed (not clear on scope of Gen failure yet). Some utility power is back, so recovery is in progress for some portion of the site.

[0]: https://puck.nether.net/pipermail/outages/2023-November/0149...

gemstones
28 replies
4d2h

I think every datacenter I've ever worked with, across ~4 jobs, has had an incident report like "generator failed right as we had an outage."

Am I unlucky, or is there something I'm missing about datacenter administration that makes it really hard to maintain a generator? I guess you don't hear about times the generator worked, but it feels like a high rate of failure to me.

nik736
9 replies
4d2h

Even the high profile datacenters I had to deal with in Frankfurt had the same issues. There were regular maintenance tests where they made sure the generators were working properly... I can imagine this is more of a pray and sweat task than anything that's in your hands. I have no clue why this is the status quo though.

jrockway
5 replies
3d21h

I wonder why we don't put battery backups in each server/switch/etc. Basically, make each 1U of rack space a laptop instead of a desktop.

Sure, you can't have much runtime, but if you got like 15 minutes for each device and it always worked, you could smooth over a lot of generator problems when something chews through the building's main grid connection.

williamstein
1 replies
3d21h

One challenge is that the power usage of a server is order(s) of magnitude greater than that of a laptop. This means the cost to do what you describe is significant, hence that has to be taken into account when trying to build a cluster that is competitive...
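
Rough numbers make the gap concrete (the wattages below are assumptions for illustration, not measurements):

    # Back-of-the-envelope: how much battery would "15 minutes per 1U server" need?
    laptop_watts = 25        # a typical light laptop load (assumed)
    server_watts = 500       # a busy 1U server, give or take (assumed)
    runtime_hours = 15 / 60  # the 15 minutes proposed above

    laptop_wh = laptop_watts * runtime_hours   # ~6 Wh
    server_wh = server_watts * runtime_hours   # ~125 Wh
    print(f"laptop: ~{laptop_wh:.0f} Wh, server: ~{server_wh:.0f} Wh")

    # A typical laptop battery is roughly 50-100 Wh, so one server-sized "internal
    # UPS" is one or two whole laptop batteries -- per rack unit, times dozens of
    # units per rack, plus chargers, monitoring and the extra heat to remove.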

jrockway
0 replies
3d21h

Yeah, I agree with that. I think that power savings are a big priority for datacenters these days, so perhaps as more efficient chips go into production, the feasibility of "self-contained" servers increases. I could serve a lot of websites from my phone, and I've never had to fire up a diesel generator to have 100% uptime on it. (But, the network infrastructure uses more power than my phone itself. ONT + router is > 20W! The efficiency has to be everywhere for this to work.)

hnlmorg
1 replies
3d4h

It’s pretty common to have a rack of batteries that might serve an aisle. The idea of these is that you’d have enough juice for the generator to kick in. You couldn’t run these for longer periods, and even if you could, you’d still have the AC unpowered, which would quickly lead to machines overheating and crashing. Plus the building access controls need powering too. As does lighting, and a whole host of other critical systems. But the AC alone is a far more significant problem than powering the racks. (I’ve worked in places where the AC has failed; it’s not fun. You’d be amazed how much heat those systems can kick out).

ta1243
0 replies
2d21h

In my experience, you have building UPS on one MDU and General supply on the other. Building UPS will power everything until generators spin up, and if the UPS itself dies then you're still powered from general supply

Did lose one building about 20 years ago when the generator didn't start

But then I assume that any services I have which are marked as three-nines or more have to be provided from multiple buildings to avoid that type of single point of failure. The services that need five-nines also take into account loss of a major city, beyond that there's significant disruption though -- especially with internet provision, as it's unclear what internet would be left in a more widespread loss of infrastructure.

AtNightWeCode
0 replies
3d19h

They most likely lie about the power outage. Azure does this all the time. I am so fking tired of these services. Even minor data centers in Europe have backup plans for power.

thedaly
1 replies
4d2h

The phone utility where I live has diesel generators that kick on whenever the power goes out in order to keep the copper phone lines operational. These generators always work, or at least one of the four they have in each office does.

bombcar
0 replies
4d1h

The datacenter I was in for a while had the big gens and a similar "phone utility" setup - they would cut to the backup gens once a month and run for longer than the UPS could hold the facility (if they detected an issue, they'd switch back to utility power).

They also had redundant gensets (2x the whole facility, 4x 'important stuff' - you could get a bit of a discount by being willing to be shut off in a huge emergency where gens were dying/running out of fuel).

milkglass
0 replies
4d2h

Cost of that likely is ginormous compared to their SLA obligations.

linsomniac
5 replies
3d3h

These experiences of power outages are weird to me. What I consider "typical" data center design should make it really hard to lose power.

"Typical" design would be: Each cabinet fed by 2 ATS (transfer switch). Each ATS fed by two UPS (battery bank). Each UPS fed by utility with generator backup. The two ATS can share one UPS/generator, so each cabinet would be fed by 3 UPS+generator. A generator failing to start shouldn't be a huge deal, your cabinet should still have 2 others.

The data center I'm currently in did have a power event ~3 years ago, I forget the exact details but Mistakes Were Made (tm). There were several mistakes that led to it, including that one of the ATS had been in "maintenance mode", because they were having problems getting replacement parts, but then something else happened as well. In short, they had gotten behind on maintenance and no longer had N+1 redundancy.

On top of that, their communication was bad. It was snowing cats and dogs, we suddenly lose all services at that facility (an hour away), and I call and their NOC will only tell me "We will investigate it." Not a "We are investigating multiple service outages", just a "we will get back to you." I'm trying to decide if I need to drive multiple hours in heavy snow to be on site, and they're playing coy...
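
For what it's worth, the arithmetic behind that "typical" design is only as good as the independence assumption. A quick sketch with a made-up per-feed failure probability:

    # Probability a cabinet goes dark, assuming three independent UPS+generator
    # feeds as described above. The per-feed failure probability is invented.
    p_feed_fails = 0.02              # chance a single feed fails during a utility event

    p_cabinet_dark = p_feed_fails ** 3
    print(f"{p_cabinet_dark:.6f}")   # 0.000008 -> 8 in a million utility events

    # The catch, as the rest of the thread illustrates, is that real failures are
    # rarely independent: shared switchgear, an ATS stuck in maintenance mode, or a
    # flooded fuel pump takes out "independent" feeds together, and the neat cube
    # stops applying.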

mdip
2 replies
3d2h

You summed up quite well how these things happen.

All of these parts make for an increasingly complex system with a large number of failure points.

Our DC was a very living entity -- servers were being changed out/rack configuration altered very regularly. Large operations were carefully planned. You wouldn't overlook the power requirements of a few racks being added -- there'd -- literally[0] -- be no place to plug them in without an electrician being brought in. However, every once in a while, over a 3-month period, two racks would have old devices replaced one at a time, either due to failure or refresh.

Since they weren't plugged directly into rack batteries (we had two battery rooms providing DC-wide battery backup), the overload wouldn't trip. Since we were still below the capacity of the circuit, the breaker(s) wouldn't trip. And maybe we're still under capacity for our backup system, but a few of the batteries are under-performing.

I think the lesson we learned when this happened was: you need to "actually test" the thing. My understanding is that our tests were of the individual components in isolation. We'd load test the batteries and the generator and then the relays between. At the end of the day, though, if you don't cut the power and see what happens you don't truly know. And my understanding is that having that final step in place resulted in a large number of additional tests being devised "of the individual components" that ensured they never had an outage like that, again.

[0] Guessing it's common practice to make "finding a f!cking power outlet" nearly impossible in DC. Every rack had exactly the number of leads it needed for the hardware plugged into a completely full receptacle. They rolled around a cart with a monitor, printer, label printer, keyboard, mouse and a huge UPS on it so staff could do daily maintenance work.

danudey
1 replies
3d

We had an outage a few years ago on Black Friday; we had been getting our DC to purchase and rack servers for us for years, and the data centre had IIRC four separate circuits that our servers were on depending on which rack they were in. Unfortunately, we hadn't provided them input into which servers were for which purpose, and we occasionally repurposed hardware from one service to another.

This resulted in one of our microservices existing entirely on one circuit, along with significant swaths of other services. We also overprovisioned our hardware so that we could cope with huge increases in traffic, e.g. Black Friday/Cyber Monday weekend. Generally a good idea, but since our DC obviously didn't have any visibility into our utilization, they didn't realize that, if our servers suddenly spiked to 100% CPU use, it could triple our power usage.

Easy to see where this is going, I'm sure.

The microservice which existed entirely on one circuit was one of the most important, and was hit constantly - we were a mobile game company, and this service kept track of players' inventories, etc. Not something you want to hit the database for, so we had layers of caching in Redis and memcached, all of which lived on the application servers themselves, all of them clustered so that we could withstand several of our servers going offline. This meant that when we got a massive influx of players all logging in to take advantage of those Black Friday deals, the service hit probably the hardest was this service, and its associated redis and memcached clusters, as well as (to a lesser extent) the primary database and the replication nodes - some of which were also on the same circuit.

So as we're all trying to tune the systems live to optimize for this large influx of traffic, it trips the breaker on that circuit and something like 1/3 of our servers go offline. We call the CEO of the DC company, he has to call around to figure out what the heck just happened, and it takes a while to work out why. Someone has to go into the DC to flip the breaker (once they know that it's not just going to fly again), which is a several-hour drive from Vancouver to Seattle.

Meantime, we all have to frantically promote replication databases, re-deploy services to other application servers, and basically try to keep our entire system up and online for the largest amount of traffic we've ever had on 60% of the server capacity we'd planned on.

I was working on that problem (not just awake, but specifically working on that issue) for 23 hours straight after the power went out. Our CEO made a list of every server we had and how we wanted to balance them across the circuits, and then the DC CEO and I spent all night powering off servers one by one, physically moving them to different cabs, bringing them online, rinse repeat.

TL;DR electricity is complicated.

mdip
0 replies
13h49m

Thanks for sharing, that was a very entertaining read and gave me fond memories of the decade or so I worked in a phone switch.

My office was in a space that was shared by a large legacy phone switch, SONET node, and part of our regional data center but I worked in infrastructure doing software development. My proximity[0] meant I ended up being used to support larger infrastructure efforts at times, but it usually just meant I got a lot of good ... stories.

I wonder if there's a collection of Data Center centric "Daily WTF" stories or something similar.

For me, I think my favorite is when we had multiple unexplained power failures very late at night in our test/management DC[1]. It turned out "the big red kill switch" button behind the plexiglass thing designed to make sure someone doesn't accidentally "lean into it and shut everything off" was mistaken for the "master light switch" by the late night cleaning crew. Nobody thought about "the cleaning crew" because none of the other DCs allowed cleaning crew anywhere near them but this was a test switch (someone forgot about the other little detail). If memory serves, it took a few outages before they figured it out. The facilities manager actually hung around one night trying to witness it only to have the problem not happen (because the cleaning lady didn't turn the lights off when people were there, duh!). I'd like to say that it was almost a "maybe bugs/animals/ghosts are doing it" impulse that caused them to check the cameras but it was probably also the pattern being recognized as "days coinciding with times that the late night cleaning crew does their work."

Outside of that, there was the guy who made off with something like 4 of these legacy switch cards because some fool put a door stop on the door while moving some equipment in. He was probably really excited when he found out they were valued at "more than a car", but really disappointed when, after he put them on eBay for something like $20,000 (which was, I wanna say, at least a 50% markdown), he was quickly noticed by the supplier[2] and arrested, and we were left awaiting the return of our hardware.

[0] Among many other things due to a diverse 17-year career there, but mostly just because I was cooperative/generally A-OK with doing things that "were far from my job" when I could help out my broader organization.

[1] When that went down, we couldn't connect to the management interfaces of any of the devices "in the network". It's bad.

[2] Alarms went off somewhere -- these guys know if you are using their crap, you're stuck with their crap and they really want you stuck paying them for their crap. I'm fairly certain the devices we used wouldn't even function outside of our network but I don't remember the specifics. AFAIK, there's no "pre-owned/liquidation-related market" except for stripping for parts/metals. When these things show up in unofficial channels, they're almost certainly hot.

datadeft
0 replies
3d

>> These experiences of power outages is weird to me. What I consider "typical" data center design should make it really hard to lose power.

At least 30% of datacenter outages that we had with a large company were due to some power related issues.

Just a simple small-scale one: the technician accidentally plugged the redundant circuits into the same source power link. When we lost a phase it took down 2/3 of the capacity instead of 1/3. Whoops.

AdamJacobMuller
0 replies
3d

What you're describing is definitely possible, but datacenter architecture is becoming less and less bulletproof-reliable in service of efficiency (both cost as well as PUE).

thedaly
2 replies
4d2h

Lack of preventive maintenance if I were to guess. Also, these generators would need a supply of diesel fuel, and typically have a storage tank on site. If the diesel isn't used and replaced, it can gum up the generator.

pard68
1 replies
3d3h

I've gotten 60 year old tractors to run on 60 year old diesel. Gumming up is much more common in gas applications. I guess modern diesel might not be so robust, I know almost nothing about modern engines.

ethbr1
0 replies
3d3h

There is nothing so satisfying as when an old engine with bad gas finally catches and starts running continuously.

spelunker
1 replies
4d1h

Test your backups! Obviously easier said than done of course.

touisteur
0 replies
3d4h

Experience in 'small' high availability safety-critical systems says:

1- 'failover often, failover safely'. Things that run once a month or 'just in case' are the most likely to fail.

2- people (customers) often aren't ready to pay for the cost of designing and operating systems with the availability levels they want.

mdip
1 replies
3d2h

Had something similar happen at a telecom I worked at for years. We had a diesel generator and a couple of (bathroom sized) rooms full of (what looked like) car batteries. My understanding is that the two rooms were for redundancy. The batteries could power the DC for hours but were used only until the generator was ready.

The area our DC was located in was impressively reliable power-wise and -- in fact -- the backup systems had managed through the multi-state power outage in the early 2000s without a hitch (short of nearly running out of fuel due to our fuel supplier being ... just a little overwhelmed).

A few years later a two minute power outage caused the DC to go dark for a full day. Upon the power failing, the batteries kicked in and a few minutes after that the generator fired up and the DC went into holy terror.

About a minute after the generator kicked in, power to the DC blinked and ended. The emergency lights kicked in, the evacuate alarm sounded[0] and panic ensued.

My very pedestrian understanding of the problem was that a few things failed -- when the generator kicked in, something didn't switch power correctly, then something else didn't trip in response to that, a set of 4 batteries caught fire (and destroyed several nearby). They were extinguished by our facilities manager with a nearby fire extinguisher. He, incidentally, was the one who pulled the alarm (which wouldn't, on its own, trigger the Halon system, I think). The remainder of the day was spent dealing with the aftermath.

We were a global multi-national telecom with a mess of procedures in place for this sort of thing. Everything was installed by electricians, to very exacting standards[1] but -- as with most things "backup" -- the way it was tested and the frequency of those tests was inadequate.

From that point forward (going on over a decade) they thoroughly tested the battery/generator backup once a quarter.

[0] We were warned to GTFO if that alarm goes off due to the flooding of chemicals that would follow a few minutes later. That didn't happen.

[1] I remember the DC manager taking over in Cleveland making his staff work weeks of overtime replacing zip ties with wax lace (and it was done NASA style). We're talking thousands and thousands of runs stretching two complete floors of a skyscraper.

uberduper
0 replies
3d1h

I lost track of how many datacenter outages we caused testing the power backup/failover back at eBay in the mid-2000s.

There's no winning when it comes to power redundancy systems.

barkingcat
1 replies
3d19h

Datacentre administrators don't know how to run utilities.

Imagine replacing the word "power" with "sewage" and try to see if you would entrust the functionality of your toilet to your local friendly sysadmin.

No. You'd never ask a system administrator to administer your plumbing. Neither should you ask your system administrator to maintain a diesel power generator. Diesel generators have more in common with automobile internal combustion engines and, in the high-power segment, airplane jet turbines. In fact, many turbine cores are used both as airplane jet engines and as terrestrial power generation units.

You're basically asking the wrong people to maintain the infrastructure.

pard68
0 replies
3d3h

When I worked in a DC the HVAC guys did the cooling. The electricians did the power and genset. We also had a local GE guy who did the engine part of the genset. These aren't sysadmins running generators. They are specialists hired for the job.

uberduper
0 replies
3d1h

It's not usually the battery backup or the generator that fails. It's usually the switching equipment that has to go from mains to battery to generator to battery to mains. And doing it without causing a voltage sag on the generator.

toomuchtodo
0 replies
4d1h

Running a generator yard is just hard. You are acting as your own power utility with equipment that only runs during tests or outages. Running successfully at commissioning or during tests increases the likelihood of service when needed, but is not a guarantee.

https://aws.amazon.com/message/67457/ (AWS: Summary of the AWS Service Event in the US East Region, July 2, 2012)

> On Friday night, as the storm progressed, several US East-1 datacenters in Availability Zones which would remain unaffected by events that evening saw utility power fluctuations. Backup systems in those datacenters responded as designed, resulting in no loss of power or customer impact. At 7:24pm PDT, a large voltage spike was experienced by the electrical switching equipment in two of the US East-1 datacenters supporting a single Availability Zone. All utility electrical switches in both datacenters initiated transfer to generator power. In one of the datacenters, the transfer completed without incident. In the other, the generators started successfully, but each generator independently failed to provide stable voltage as they were brought into service. As a result, the generators did not pick up the load and servers operated without interruption during this period on the Uninterruptable Power Supply (“UPS”) units. Shortly thereafter, utility power was restored and our datacenter personnel transferred the datacenter back to utility power. The utility power in the Region failed a second time at 7:57pm PDT. Again, all rooms of this one facility failed to successfully transfer to generator power while all of our other datacenters in the Region continued to operate without customer impact.

> The generators and electrical switching equipment in the datacenter that experienced the failure were all the same brand and all installed in late 2010 and early 2011. Prior to installation in this facility, the generators were rigorously tested by the manufacturer. At datacenter commissioning time, they again passed all load tests (approximately 8 hours of testing) without issue. On May 12th of this year, we conducted a full load test where the entire datacenter switched to and ran successfully on these same generators, and all systems operated correctly. The generators and electrical equipment in this datacenter are less than two years old, maintained by manufacturer representatives to manufacturer standards, and tested weekly. In addition, these generators operated flawlessly, once brought online Friday night, for just over 30 hours until utility power was restored to this datacenter. The equipment will be repaired, recertified by the manufacturer, and retested at full load onsite or it will be replaced entirely. In the interim, because the generators ran successfully for 30 hours after being manually brought online, we are confident they will perform properly if the load is transferred to them. Therefore, prior to completing the engineering work mentioned above, we will lengthen the amount of time the electrical switching equipment gives the generators to reach stable power before the switch board assesses whether the generators are ready to accept the full power load. Additionally, we will expand the power quality tolerances allowed when evaluating whether to switch the load to generator power. We will expand the size of the onsite 24x7 engineering staff to ensure that if there is a repeat event, the switch to generator will be completed manually (if necessary) before UPSs discharge and there is any customer impact.

nickdothutton
0 replies
3d3h

The generator needs to go from 0% to almost 100% output within a period of a few seconds; the UPS battery is often only a few minutes, long enough for the generator to stand up. There’s a reason why, when you put your hand on the cylinder heads of that big diesel, they are warm. Much like the theatre: “You are only as good as your last rehearsal”.

supriyo-biswas
13 replies
4d2h

The more concerning issue here is that their control plane is based out of a single datacenter.

A multi-datacenter setup, which, based on their stack, could just be jobs running on top of a distributed key-value store (and for the uninitiated, this is effectively what Kubernetes is), could greatly alleviate such concerns.

dharmab
12 replies
4d2h

Kubernetes' default datastore, etcd, is not tolerant of latencies between multiple regions. Generally, vanilla k8s clusters have a single-region control plane.
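
A rough feel for why, using the usual etcd tuning rule of thumb (heartbeat interval on the order of the peer round-trip time, election timeout around 10x the heartbeat; the defaults are 100 ms and 1000 ms). The distances and overhead below are illustrative:

    # Why widely separated etcd members hurt: a back-of-the-envelope sketch.
    # Light in fibre covers roughly 200 km per millisecond, one way.

    def rtt_ms(distance_km: float, overhead_ms: float = 2.0) -> float:
        # Round-trip propagation plus a small fudge factor for equipment/routing.
        return 2 * distance_km / 200 + overhead_ms

    for label, km in [("same metro, AZ-style (~100 km)", 100),
                      ("US west <-> US east (~4000 km)", 4000),
                      ("US <-> Singapore (~15000 km)", 15000)]:
        rtt = rtt_ms(km)
        heartbeat = max(100.0, rtt)   # etcd's default heartbeat is 100 ms
        election = 10 * heartbeat     # common guidance: at least 10x the heartbeat
        print(f"{label}: RTT ~{rtt:.0f} ms, heartbeat ~{heartbeat:.0f} ms, "
              f"election timeout ~{election:.0f} ms")

    # Every write also needs a quorum round trip, so distant members don't just
    # need retuned timeouts -- they slow every control-plane write down.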

supriyo-biswas
9 replies
4d2h

This can just be multiple datacenters located close together (~100 km apart), similar to AWS AZs.

dharmab
8 replies
4d1h

Fun fact, on certain (major) cloud providers, in certain regions, AZs are sometimes different floors of the same building :)

pests
3 replies
3d17h

I like how clear Azure is on this:

> Availability zones are unique physical locations within an Azure region. Each zone is made up of one or more datacenters with independent power, cooling, and networking. The physical separation of availability zones within a region limits the impact to applications and data from zone failures, such as power and cooling failures, large-scale flooding, major storms and superstorms, and other events that could disrupt site access, safe passage, extended utilities uptime, and the availability of resources.

https://learn.microsoft.com/en-us/azure/architecture/high-av...

dharmab
2 replies
3d16h

I think they expanded Tokyo but previously that was a single-building "region"

And it's virtually impossible to make, say, a Singapore region resilient to natural disasters

pests
1 replies
3d15h

Did they claim Tokyo was more than one availability zone? If the 'tokyo' region was only ever claimed to be '1 availability zone' I think being in a single building technically still satisfies my quote above.

But yes, agreed.

dharmab
0 replies
3d15h

Yes, in the API you got multiple AZs.

mh-
2 replies
4d

You may be obligated not to name them, but I'm not: Google.

matja
0 replies
3d3h

AZ is a term used by AWS and Azure. GCP documentation makes it clear to "Distribute your resources across multiple zones and regions", where regions are physically different data centers.

dharmab
0 replies
3d16h

That actually wasn't the one I was thinking of!

slig
0 replies
3d20h

The latency is exceptional.

__turbobrew__
1 replies
3d

We run a k8s control plane across datacenters in west, central, and east US and it works fine.

dharmab
0 replies
2d23h

I assume your site-to-site latency is under 100ms? If so, that's fine.

ilaksh
4 replies
3d4h

What is PDX?

pard68
0 replies
3d3h

Portland Data Center I believe

nosequel
0 replies
3d3h

Cloudflare's Portland datacenter. Most clouds/CDNs name their DCs after the airport code of the city they are in or are close to. Internally they would be listed as PDX, ORD, LAX, etc.

kbknapp
0 replies
3d3h

It's the IATA code for Portland International Airport. Many datacenters use IATA codes for the nearest airport to give a rough approximation of the location. So the PDX datacenter is the one closest to the PDX airport in or around Portland.

jeffwask
0 replies
3d3h

I assume shorthand for a datacenter in Portland. Labeling data centers by their city's airport code is something I see a lot.

marcc
1 replies
4d3h

Yikes. If still true, this feels like a significant single point of failure in their architecture.

pornel
0 replies
4d2h

IIRC there is supposed to be a failover to AMS.

simple10
0 replies
1d23h

DNS updates were also down during a good chunk of the outage. Most likely due to degraded API access, since I'm guessing the dashboard UI relies on the API to make updates. But even when DNS updates did go through in the UI, they still weren't propagating. I had to delete entire domains and recreate them in the Cloudflare dashboard after the outage in order for SSL and DNS propagation to work again.

roamerz
0 replies
4d2h

Generators, AC units, UPSs and auto cutover switches are all mechanical and susceptible to failure. You can do your best with maintenance and pre-planning, but in the end there is a non-zero chance that a condition you didn’t anticipate or couldn’t afford to account for will happen and your critical infrastructure will go down.

anonymoushn
0 replies
3d2h

As of a few years ago, tiered cache and Argo are different things. I only wrote one of them, but in the marketing they are always referred to together. I think, other than for ENT users who can customize their tiered cache topology using the API, it should be totally fine for the tiered cache topology to be days or weeks out of date. A lot of other stuff will have trouble, as you've correctly noted.

015a
0 replies
4d2h

> Note that just before I'd left they'd been investing in standing up AMS (I think) but had never successfully tested a significant failover

IIRC, AMS was/is just a new DC to replace or augment LUX, which has existed for some time as, among other things, a failover secondary for PDX. But yeah, intention and reality drifted, as they always do; I had heard a variety of reasons why automated, or even break-glass, failover to a DC on the opposite side of the planet wasn't a tenable solution for many workloads.

They're big on chaos testing, and definitely (very regularly) test whole-DC network partitions. I'd love to be a fly on the wall in the meeting discussing why this event was so different.

ryandvm
48 replies
4d3h

I dunno. Cloudflare gives me the creeps. I have no idea why so many folks think large swaths of the Internet should be reliant on a single company.

remram
23 replies
4d3h

No free (or even cheap) alternatives exist. If you have a little site that might be a DoS target, you have to use it.

capableweb
20 replies
4d3h

How many little sites do you run that get hit by DDoS? I personally run about 10 tiny websites myself, some of them with around 1-2K daily active users, but none of them have suffered from DDoS frequently, nor do they use CloudFlare at all. One has been hit once by a DDoS that kept trying for ~2 days to bring the site down, but a simple "ban IPs based on hitting rate limits" did the trick to avoid going down, so it wasn't a very sophisticated attack.
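
A minimal sketch of that kind of "ban IPs based on hitting rate limits" defence (a sliding-window counter feeding iptables; assumes Linux, root, and that banning at the host itself is enough for the attack at hand):

    import subprocess
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10
    MAX_REQUESTS = 100          # per IP per window; tune for your real traffic
    banned = set()
    hits = defaultdict(deque)   # ip -> timestamps of recent requests

    def record_request(ip: str) -> None:
        # Call this from your access-log tail or middleware for every request.
        now = time.monotonic()
        q = hits[ip]
        q.append(now)
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()
        if len(q) > MAX_REQUESTS and ip not in banned:
            ban(ip)

    def ban(ip: str) -> None:
        # Drop all further traffic from this address at the host firewall.
        subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)
        banned.add(ip)

This only holds while the box can still see and process the packets, which is the limit a genuinely large flood blows straight past.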

It seems to be a common misconception that people default to: that you have to use CloudFlare or some alternative, otherwise you'll get hacked/DDoS'd for sure.

hombre_fatal
10 replies
4d2h

You could do this with anything. X hasn't happened to me, so I bet it doesn't happen to other people, so people who take measures against X are misinformed/cargo-culting (unlike me who is conveniently the smart one in my narrative).

Most services I've built that achieved any sort of traction have dealt with some sort of DoS including large fees when I've used CDNs like Cloudfront that are susceptible to a wget loop. I default to Cloudflare because it's the only one that actually covers my ass.

Cloudflare is so successful because the internet was built naively as if abusers would never exist. Just consider how IP address spoofing is still possible today and you'll begin to realize how broken the internet has always been long before you even get into dirt cheap residential smart toaster botnets.

grayhatter
4 replies
4d1h

> Just consider how IP address spoofing is still possible today and you'll begin to realize how broken the internet has always been long before you even get into dirt cheap residential smart toaster botnets.

I know I'm going to regret asking but, ok, I'll bite... why does IP address spoofing prove the Internet is broken? Especially considering that a) the point of internet routing is to route packets whenever possible, especially around damage, and b) by volume, the internet is TCP and you can't complete a handshake with a spoofed IP.

sophacles
3 replies
4d1h

With spoofed src addr:

* you can do all sorts of udp amplification attacks (e.g. dns - send a zone transfer request in a single packet with a spoofed IP, and the IP you spoofed to gets a lot of traffic in response.)

* you can do tcp syn or ack floods with a spoofed IP, these eat resources on the target machine. syn floods cause the os to allocate a new connection and timers waiting for the third ack.

* you can send lots of bad packets from a spoofed IP that cause automated systems to lock out those IPs as a response to attack traffic; if those lockouts block IPs that should be allowed, that's a type of denial of service.

And plenty more.

grayhatter
2 replies
2d11h

I'm not terribly impressed... this reads like a response from an LLM. Yeah I know what kind of packets you can send with a spoofed source IP... But the question was, given these are all decades old, how does that prove the Internet is broken?

sophacles
1 replies
2d1h

The point of the internet is to provide a robust communications platform. If fundamental infrastructure of that communications platform can be abused to deny communications, and the abuse can continue with the root cause unaddressed for decades, then the platform is broken.

The fact that routing is designed to go around damage is orthogonal to this, and has no bearing on the fact that the communications platform can be used against itself to prevent communications (via spoofed IPs).

For literal decades partial solutions to the spoofing problem have been known - rp filtering would eliminate a lot of problems yet still isn't close to universal.

BGP has been vulnerable to all sorts of simple human mistakes for decades and decade old solutions like IRR are only slowly being adopted because many of the people that run the internet are too busy pretending they are important and good at building systems to actually make the systems good. When those same simple mistakes are intentional, all sorts of IP traffic can be spoofed including full TCP connections.

The fact that there isn't a widely supported way for the consequences of spoofing to be mitigated without paying through the nose for a 3rd party service is pretty broken too. Allowing destinations to be overwhelmed without any sort of backpressure or load shedding is a fundamental flaw in "get packets to destination no matter what". An AS should be able to say "I no longer want packets from this subnet", and have it honored along the entire path. This should be a core feature, not an add-on from some providers.

The internet does work as designed, however it's folly to think that the first attempt at building something so different to anything that came before it is the best way to do it, and refusing to address those design decisions is fundamentally broken.

grayhatter
0 replies
23h35m

First, I want to say thanks for the interesting reply! it's refreshing to read good arguments on HN again :)

> The fact that routing is designed to go around damage is orthogonal to this, and has no bearing on the fact that the communications platform can be used against itself to prevent communications (via spoofed IPs).

> For literal decades partial solutions to the spoofing problem have been known - rp filtering would eliminate a lot of problems yet still isn't close to universal.

It's orthogonal, and yet in the places where it would actually matter or have the strongest effect, it's not used? I wonder why... 'cept not really. The Internet seems to still be functioning pretty well for something fundamentally broken. For the vast majority of internet routers it's entirely reasonable for them to accept any source IP from any peer, because it is impossible to prove that peer can't reach somebody else. The exception is a huge number of endpoint ISPs who shouldn't be sending these packets, and it's on them to filter them. I would love a way to identify and punish these actors for their malfeasance, but I'm not willing to add a bunch of complexity to do so.

> because many of the people that run the internet are too busy pretending they are important and good at building systems to actually make the systems good.

wow, that's a super toxic comment... and I'm an asshole saying that.

> Allowing destinations be overwhelmed without any sort of backpressure or load shedding is a fundamental flaw in "get packets to destination no matter what".

One man's fundamental flaw is another's design trade-off... Every single system that has ever seen widespread adoption has defaulted to open and permissive. Every. Single. One. It's only after seeing widespread adoption that anything ever adds in restrictions and rules, and most often when it does, it's seen as the enshittification of something (most often because exerting control allows you to vampirically extract more value). But dropping packets when a system is overloaded is exactly what the internet does do; what you're describing sounds more like TCP working around it (poorly, admittedly).

> An AS should be able to say "I no longer want packets from this subnet", and have it honored along the entire path. This should be a core feature, not an add-on from some providers.

I could not agree more. But this is a missing feature, not a fundamental flaw. The internet still works for the vast, vast majority of users, and as I've said in a different thread, the use of or dependency on cloudflare is often a skill issue, not a requirement.

You're 100% correct, core internet routing has many fixable defects, and many ISPs are moving slower than could be reasonably considered ethical or competent. But for infrastructure this core I would actually prefer slow and careful over breaking everything on a whim because of the "move fast" mind virus that has overtaken CS.

> The internet does work as designed, however it's folly to think that the first attempt at building something so different to anything that came before it is the best way to do it and reusing to address design decisions is fundamentally broken.

It's also needlessly absolutist to say every defect is a fundamental design decision. The Internet was built to support trusted peering relationships, where if someone was being abusive, you'd call your buddy and say "fix your broken script". The core need the internet is now supporting is wildly different, and this "fundamental design flaw" is actually just user error. If you strap a rocket engine on a budget sedan, it's not a design flaw when the whole thing explodes. If you're going to add untrustworthy peers to your network you also have to add a way to deal with them. That's a missing feature, not a design flaw.

ttymck
2 replies
3d12h

> IP address spoofing is still possible today and you'll begin to realize how broken the internet has always been

not that you need to have the answer to make your point, but now I am curious: what is the alternative architecture that prevents IP address spoofing? Wouldn't proving you are the IP you purport to be require some sort of authentication, which requires some centralized authority to implement? Which would require a fundamentally centralized internet?

greyface-
1 replies
3d11h

> Which would require a fundamentally centralized internet?

Yes, that fundamentally central authority overseeing the IP address space exists today as IANA, which delegates to RIRs such as ARIN and RIPE, who allow ISPs to assert authority over address space cryptographically (RPKI) and/or in a central registry (IRR). This is the basis on which BGP announcements are typically filtered.

> what is the alternative architecture that prevents IP address spoofing?

It is possible to extend the same filtering approaches used with BGP to actual traffic forwarding without making fundamental architectural changes. See BCP38 (access) and BCP84 (peering). Widespread adoption of these would eliminate IP spoofing.
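
The core of BCP38-style ingress filtering is just a membership check at the network edge: only forward packets whose source address belongs to the prefixes assigned to that customer. A toy sketch (the prefixes are documentation ranges, purely illustrative):

    import ipaddress

    # Prefixes actually assigned to a given customer port (made up for illustration).
    CUSTOMER_PREFIXES = [
        ipaddress.ip_network("203.0.113.0/24"),
        ipaddress.ip_network("198.51.100.0/25"),
    ]

    def permit_source(src_ip: str) -> bool:
        # BCP38 in one line: is the claimed source address one the customer could
        # legitimately originate?
        addr = ipaddress.ip_address(src_ip)
        return any(addr in net for net in CUSTOMER_PREFIXES)

    print(permit_source("203.0.113.7"))   # True  -> forward
    print(permit_source("192.0.2.55"))    # False -> drop: spoofed or misrouted source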

cryptonector
0 replies
3d

There's two parts to the answer to GP's question. One is egress filtering, which is widely deployed, and the other is BGP security, which as you know is being deployed.

capableweb
1 replies
4d2h

Please do take note that I'm not saying anything like what you claim I said. I'm asking if it's something people commonly get hit by, as I myself haven't had severe issues with it.

I'm not saying others are misinformed or cargo-culting anything, just that I'm seeing lots of people who probably never get hit by DDoS in the first place (a couple of visitors per day) adding CloudFlare by default because that's what they see everyone else doing.

Of course if you do frequently get hit by DDoS attacks, there is nothing wrong with trying to protect yourself against it...

hombre_fatal
0 replies
4d2h

FWIW Cloudflare offers lots of useful services beyond DDoS protection—that's just one of them. Once you use Cloudflare for one service, it's nice to have all of your domains going through their DNS at the very least even if you were to bypass their stack.

Aside from ideological preferences or a preference for some other service, I don't see what you gain by avoiding them.

grayhatter
2 replies
4d1h

good luck with all those strawman replies...

It only takes one trivial event to make you a cloudflare zealot; most ISPs (web hosts) don't provide attack mitigation nor prevention. If you didn't know that you could ban IPs programmatically you'd have been screwed* and would have loved how quickly cloudflare would protect your site.

It's not a difficulty thing, it's an ignorance thing. cloudflare isn't better, or magic; it's easy and popular. Just like social media, and aws.

For some people, i.e. most 'web developers', you will get hacked if you don't use it... because the alternative would be literally nothing. Not everyone wants to do things correctly; they only want it to work, but working correctly is required.

duckmysick
1 replies
4d

What's the correct way to do things if I don't want to use Cloudflare for attack mitigation?

grayhatter
0 replies
2d11h

I'm not an expert in network security, so I can only give you an idea. The correct way to do things depends on so many factors, but the TLDR is probably: use multiple ways to identify malicious traffic, and then block it as early as possible. Dropping all traffic from abusive IP addresses with netfilter is where you start because it's easy/simple. Then you can move on to grouping by ASN, and dropping closer to a load balancer.

The problem (as far as I know) that cloudflare actually solves is being able to identify abusive networks and making them prove they're not bots before they reach your server. If you can't even identify the abusive traffic, you don't have many options other than cloudflare.

solatic
1 replies
4d1h

> How many little sites do you run that get hit by DDoS?

Do you live in a country that has enemies? DDoS attacks are one of the primary weapons of cyberwarfare.

You may not have suffered an attack yet, but thinking the world is a peaceful place in which small players have no worries is naïve.

capableweb
0 replies
4d

> Do you live in a country that has enemies?

Who doesn't?

> You may not have suffered an attack yet

But I literally shared in my previous comment that I have?

> thinking the world is a peaceful place in which small players have no worries is naïve

I agree, and I guess I'm lucky for not holding such opinion then.

sophacles
0 replies
4d2h

I've never had any of my 3 houses burn down. I don't know why everyone says to buy insurance, and that having a fire extinguisher is a good idea.

It seems to me like a common misconception that you "have to have a fire extinguisher" (or other suppression system), otherwise you'll have your house burn down.

remram
0 replies
4d

I personally don't usually use it, but if I ran a Mastodon instance or blog or anything with real traffic and limited bandwidth and the slightest potential to piss someone off, I would. I am definitely not surprised by the number of people using it.

drcongo
0 replies
4d2h

Depends what you're counting as "little", but maybe your experience of 10 tiny sites has blinded you to the fact that sites for activist organisations, whistleblowing, investigative journalism, non-profits and so on, are very regularly targeted.

anonzzzies
0 replies
3d3h

I've written this here before, but my site (a small B2B SaaS with a few hundred avid users from small-to-medium sized companies) gets hit by massive DDoSes a few times a year. The only way I can protect against that is CF bot fight. Everything else will just immediately kill the service until it's over. The last one lasted 24 hours; there were millions of requests from 100,000s of unique IPs over that time, many of them from Azure, GCP and AWS. Why? I don't have a clue, but with CF you simply notice nothing at all.

I cannot rate limit on the machines themselves as they die immediately, so then I need to get more advanced firewalls etc., which are vastly more expensive than CF.

r1ch
0 replies
3d23h

OVH's DDoS protection works great and is included by default on all servers. It's blocked hundreds of attacks for us and the time to mitigate is only a few seconds.

jojobas
0 replies
3d3h

Joshua Moon is laughing at you.

ebbp
6 replies
4d3h

In principle I agree with this, but I do feel this is said more readily about Cloudflare than other companies it could be said about - such as Amazon (via AWS), Google and Microsoft.

Perhaps my own mental model is wrong, but I see them as a credible challenger to those very oligopolistic companies, and wish there were more Cloudflares.

msmith
3 replies
3d12h

I feel the same way. What about Akamai, Fastly, or Okta? Maybe Cloudflare gets more attention because their low end plans are accessible to anyone.

risyachka
2 replies
3d4h

It's not just the low end plans. Their pricing is basically the only one that feels fair. They don't charge you for bandwidth, unlike others that try to make as much on it as possible while also pricing their other services significantly higher.

viblo
0 replies
3d2h

They charge for bandwidth if you use enough of it on the enterprise tier.

rvnx
0 replies
3d3h

+ (Global) Cloudflare Workers are amazing compared to Google Cloud Functions or other services that are regional, expensive and slow to start.

johnklos
0 replies
3d4h

The difference is that nobody complains and most people agree when you talk smack about Amazon, Google and Microsoft. The general consensus is that they're big, dumb and knowingly evil, and most of the time their actions can be explained by that.

When we talk smack about Cloudflare, such as about their hosting of phishing, their underhanded DoH stuff, their complete lack of abuse handling, et cetera, lots of people come to their defense and make excuses for them.

You can like a company's product and still think the company is big and desires to be evil, but there's an emotional component for some that makes "us versus them" knee-jerk reactions more compelling than, "hmmm... is this correct?" evaluations.

I don't think any of these Cloudflare apologists would try to argue on facts that Cloudflare isn't trying to be a monopoly, isn't trying to recentralize the Internet, isn't marginalizing the rest of the non-western world, isn't trying to establish dependencies that people and companies can't easily escape, but if they did, that'd make for some interesting discussion.

acejam
0 replies
4d2h

To each their own, but I think this is said more frequently about Cloudflare because they are often playing the middleman, via their CDN service. In comparison, AWS and others are the actual origin.

jjav
5 replies
4d

> I have no idea why so many folks think large swaths of the Internet should be reliant on a single company.

It's not just the reliance, but the fact that cloudflare is a MITM attacker (by design) on vast amounts of TLS traffic. TLS if used properly gives you end to end security, but if you use cloudflare they have access to all your cleartext traffic.

NicoJuicy
3 replies
3d22h

Which can be said of any cloud provider/hoster.

michaelmior
2 replies
3d4h

How so? If I'm hosting a server somewhere and clients directly connect to my server to establish a TLS connection, failing any vulnerabilities in the implementation, there's no MITM happening and the provider can't see the plaintext traffic. (Of course, since the server needs the certificate, the provider could in theory extract that certificate and establish a MITM proxy, but this isn't by design.)

jodrellblank
1 replies
3d4h

Any VPS or virtual server cloud provider can potentially see the plaintext traffic - it's in plain text in the memory of their hardware and they could be looking at it. They technically could be scraping your SSL keys from memory, or scraping your SSL private key from disk (if unencrypted storage) and then decrypting a mirror of the network traffic elsewhere. That wouldn't be MITM but you are only protected from it if you are hosting your own physical server somewhere.

"End to end security" mentioned above is limited security when "your" endpoint is owned by and controlled by someone else.

quinncom
0 replies
2d20h

Here’s an example of MITM by interception of automated certificate renewal downstream of a VPS hosted at Hetzner. The presumption is that it was a lawful intercept installed within Hetzner or one of their internet providers. https://news.ycombinator.com/item?id=37955264

michaelmior
0 replies
3d4h

It's not an "attacker" by design, but certainly MITM.

jodrellblank
3 replies
3d4h

> "I have no idea why so many folks think large swaths of the Internet should be reliant on a single company."

Who thinks that? Can you link to anyone who has said that?

Downvoted for "I am superior to <strawman>" comment.

richbell
2 replies
3d3h

> Downvoted for "I am superior to <strawman>" comment.

I didn't interpret their comment this way. To me, it read "this thing gives me bad vibes and I don't understand why so many people like it."

jodrellblank
1 replies
3d1h

"I don't understand why people like it" is very different from "so many folks think large swaths of the Internet should be reliant on a single company". Take Chrome browser; it's fine to use FireFox because you think it's better, it's fine to use FireFox even though you think it isn't as good but you'll take the mild inconvenience on the principle that the internet shouldn't be dependent on a single company. It's also fine to use Chrome because you think it's better, but nobody - absolutely nobody, anywhere, ever[1] - who chooses Chrome does so because they think the internet should be reliant on a single company for a web browser and they want to support making that happen.

"I don't use Chrome, I don't know why everyone thinks the internet should depend on Google" is a strawman because nobody does think that; many people use Chrome despite thinking the exact opposite of that, even. Same with CloudFlare, it's free, it's convenient, it's very good at what it does (current outage excepted), it's widely known, easy to work with, has good support. Nobody who chooses it does so because they want to hand internet control to a single company. And "I don't know why people use (popular, well known, well made solution)" is a very common internet comment which communicates a certain message.

[1] people who work for Google are paid to think that, so their decision doesn't count.

richbell
0 replies
3d1h

You're right in the literal sense, but I don't think they meant that literally (maybe they did, who knows).

I personally haven't seen anyone praise the Internet being reliant on a single company. However, I have seen lots of praise for Cloudflare over the years and "naysayers" (such as people raising concerns about them MITMing half the Internet) being aggressively downvoted. In that context, I see it as many people tacitly endorsing Cloudflare and not caring about the control it holds, rather than people explicitly saying "Cloudflare _should_ control the majority of the Internet."

michaelmior
1 replies
3d4h

> I have no idea why so many folks think large swaths of the Internet should be reliant on a single company.

I don't think it's really that so many people think Cloudflare should be relied on. It's that Cloudflare generally has a good track record and their basic services are available for free. Actually, I don't know of any service similar to Cloudflare with such a generous free tier.

ActionHank
0 replies
3d3h

I think you're underestimating just how many devops and infra people dip into the mainstream tech news and adopt that as their new religion.

I have had multiple conversations with high level individuals about why we should be using CloudFlare so widely and what the fallback is if there is an outage. Usually it boils down to "because reasons".

Today it's #notmyproblem and life is great.

is_true
1 replies
3d22h

I feel they have one of the most fair offerings, not trying to squeeze every last penny from you like other cloud providers. Just my opinion of them, probably wrong.

NicoJuicy
0 replies
3d21h

I don't think you're wrong.

Reference: the CEO takes a less-than-average salary... 600k/yr, after more than 14 years.

Note: as far as I know (which is one of the many reasons I do "believe" in cloudflare), he is already wealthy and didn't start cloudflare for the money.

The facts seem to be supportive of that statement (cf. compensation).

The previous time I was a fan of a company was AMD in 2014 (or somewhere around that time), because of Lisa Su + the product lineup.

oliwarner
0 replies
2d18h

No idea?

They came along out of nowhere and started offering their infrastructure, their global distribution network, for free as a reverse proxy. This allowed people to scale their single-server services out for nothing, and it insulated servers from DDoS and single-government action. For the people that needed it, it took 10 minutes and revolutionised a lot of slow websites.

But then they started offering actual services. A formal CDN, largely for free, and after that, pennies on the dollar of what major players were asking. And 6 years ago they started building your stuff for you, allowing you to host it near your users. They sell domains at cost, host DNS for nothing, and handle inbound email for you.

As a webdev, they're making my life very simple. Things that took me a day to bash out and bootstrap for a new client, I've done with CloudFlare while the client is on the phone.

If they vanished tomorrow, it'd be a wrench. But that's true of so much online infrastructure. Where would you be without Github, NPM, PyPI, dockerhub, etc? Enjoy it while we can.

cryptonector
0 replies
3d

What about them gives you the creeps?

NicoJuicy
0 replies
3d22h

Tbh. There are multiple objective reasons for me why they don't.

2 outages in the last week is 1 objective reason why they do, though.

The first one did get fixed in 30 minutes, which is probably some sort of record. I can't remember when any other cloud provider updated their status page even within 30 minutes (or they hid it within their authenticated environment).

khaledh
18 replies
4d4h

We use Cloudflare WARP (which is down) to access GitHub. Time to slack off.

capableweb
13 replies
4d4h

If only there was a protocol of some sorts that allowed you to send/receive code and patches even if the centralized hub everyone uses for synchronization was down...

afavour
5 replies
4d4h

Repeat after me: GitHub is more than just Git. GitHub is more than just Git.

londons_explore
2 replies
4d3h

I would like to see Git be extended with a decentralized approach to bug tracking, code reviews, wikis for documentation, etc.

tazjin
0 replies
3d1h

What do you need for wikis? Just put text files in a folder?

Bug tracking in git: https://github.com/MichaelMure/git-bug

Code review in git: https://github.com/google/git-appraise

Kalium
0 replies
4d3h

Isn't that Fossil?

https://fossil-scm.org/home SCM, bug tracker, wiki...

tazjin
0 replies
3d1h

Yes, it's also a good variety of different systems to store and address Markdown.

Which you could, you know, put in git.

Git is more than just GitHub though, that much is true.

capableweb
0 replies
4d3h

Repeat after me: Git is more than just files. Git is more than just files.

handsclean
2 replies
4d2h

I know git claims to be decentralized, but has anybody ever actually managed to use it in a decentralized manner? Not even the git or Linux projects themselves are without a centralized sync point.

supriyo-biswas
1 replies
4d2h

Grandparent can still work off their local git repository, create local branches and commits, unlike the traditional VCS model which required branching and commits to be immediately synced to a centralized node.

The centralization issue that you raise is a different one; most projects intend to take contributions from folks and merge them into a single product.
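
(Concretely: while the hub is down you can keep committing locally and even sync directly with a teammate. The branch and host names below are hypothetical.)

    # keep working locally, no network needed
    git switch -c hotfix/cache-headers
    git commit -am "Fix cache headers"

    # pull from / push to a colleague's clone over SSH, no GitHub involved
    git remote add alice alice@alice-laptop:project.git
    git fetch alice
    git push alice hotfix/cache-headers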

handsclean
0 replies
4d2h

Noted on the definition of the word “decentralized”, but the comment I was replying to was specifically about the ability to “send/receive code and patches”, that’s what I haven’t seen done without a central sync point.

khaledh
1 replies
4d4h

Most of the time it's not about the code. Project management, issue tracking and prioritization, discussions, PR reviews, searching across the organization repos, etc.

Terretta
0 replies
4d3h

There are (all too rare) tools that back those objects with git as well.

And there's always fossil ...

https://fossil-scm.org/home/doc/trunk/www/index.wiki

But it's not git. :-(

bluejekyll
0 replies
4d4h

I wonder, have there been any case studies on phishing/trojan type attacks by code (git changes) via email?

AdamN
0 replies
4d4h

Something distributed??

cinnor
2 replies
4d3h

Why not just access it normally over the proper internet, instead of chancing it with Cloudflare's janky wiring system?

r1ch
1 replies
3d23h

WARP is much more than a VPN under the surface; it does device attestation and all kinds of integration with their Access product.

Nextgrid
0 replies
2d20h

How does that help with GitHub though? I understand using this service as a proxy to access internal resources (which wouldn't be available on the public internet), but what good does it do to access a public service like GitHub? Does GH understand/rely on the (presumed) HTTP headers that this product sets on the proxied traffic?

mrcwinn
0 replies
4d3h

Congrats!

greyface-
8 replies
4d3h

Flexential PDX02 reportedly lost power around the same time this started.

jabart
5 replies
4d3h

Yikes, I'm curious how that happened. Our Flexential has 5 generators and a basement full of battery backups along with isolated zones.

greyface-
4 replies
4d3h

https://puck.nether.net/pipermail/outages/2023-November/0149...

> Looks like they lost utility, switched to generator, and then generator failed (not clear on scope of Gen failure yet).

jabart
3 replies
4d1h

Our local facility tests those like once a month. Looks like our account manager will get a quick check-in after this settles, to make sure we won't have this issue.

acdha
2 replies
3d21h

Testing those is SOP but it’s easy to do wrong: I’ve heard of multiple times where the tests had worked but the system failed in a true emergency because of some other factor.

The best one was probably the time Internap made a cheap shot at AWS about outages before hurricane Sandy hit, only to have their NYC data center go offline once their generators burned through the small amount of fuel they had near the generator because the main fuel tank pump flooded (AWS was completely unaffected). You just don’t tempt the DR fates like that…

jabart
1 replies
3d21h

Same. A local company had a NOC and power went out in the area. They had moved into the building a long time ago and already had two backup generators installed at the back end of the parking lot. Power went out, the generator didn't start. The 2nd generator started but it was now undersized and immediately tripped the breaker. The first generator tried to start and failed again.

Another local company had a backup generator in the basement. They were doing maintenance work on the power and messed up so the generator kicked in. Since it was in the basement, they had a belt to drive an exhaust fan. Generator kicked on, fan belt snapped, exhaust filled the entire basement, rolled into the service elevator, starts coming out the 3rd floor (top of the service elevator) and the building was then evacuated.

Last generator story: a small IT shop with a small colo in the 2000s moved into a new building and got a natural gas generator installed, with it running a self-diagnostic once a month. Power goes out in the first month, lights go out, generator kicks on, lights come back, lights go back off. Turns out the electrician didn't have a 60-amp breaker that day and put in a 20-amp breaker for a 60+ amp generator. They meant to come back out and fix it and never did.

acdha
0 replies
3d20h

> Generator kicked on, fan belt snapped, exhaust filled the entire basement, rolled into the service elevator, starts coming out the 3rd floor (top of the service elevator) and the building was then evacuated.

I feel like a ton of these stories come down to infrequent use. I remember one where they’d tested the generator but only for like a couple of minutes each month and so the fuel in the tank was really old by the time there was a real outage and the generator clogged expensively about an hour in, right as people were relaxing and thinking it’d be smooth sailing until the main power line was restored.

Rygian
0 replies
4d3h

Which would you guess, cause or effect?

LeonM
0 replies
4d3h

Seems related, Cloudflare just released this:

> Identified - Cloudflare is assessing a loss of power impacting data centres while simultaneously failing over services.

supermatti
6 replies
4d4h

For example, Cloudflare tunnels will not work if restarted; our production is running on a single tunnel now.

ddorian43
4 replies
4d3h

What's the reason for using tunnels and not just IP addresses?

wongarsu
2 replies
4d3h

You don't have to expose any ports to the internet, preventing people from finding and directly attacking your origin servers.
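
(Rough sketch of the setup, assuming a reasonably current cloudflared; the tunnel name and hostname are placeholders:)

    cloudflared tunnel login                     # authenticate against your account
    cloudflared tunnel create my-origin          # creates the tunnel + credentials file
    cloudflared tunnel route dns my-origin app.example.com
    # cloudflared only makes outbound connections to Cloudflare's edge,
    # so the origin never has to listen on a public IP/port
    cloudflared tunnel run --url http://localhost:8080 my-origin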

bigjoes
1 replies
4d2h

The only downside seems to be the performance of tunnels in containers. I use them for my personal website, did a bit of load testing, and was able to get significantly more RPS without the CF Tunnel. Might be something on my end though, not sure.

esdf
0 replies
3d9h

That's interesting. Cloudflare tunnels do a few things that I expected to make them perform better in general: obviously TLS termination on CF's side where they likely have faster hardware doing that (at least faster than many customers), then the keep-alive sockets for tunnel<->CF, and I think they use UDP/QUIC for the tunnel<->CF connection[0], which I figure could remove some latency.

[0]: `lsof -i | grep cloudfl` shows me 4 UDP connections & 1 TCP

542458
0 replies
4d3h

Makes firewall/ACL administration much simpler for one. Also makes it easier to hide and/or rotate origin IPs.

hk1337
0 replies
4d3h

I hate that the Shopify app has Cloudflare tunnels ingrained by default. You can use other tunnels, e.g. ngrok, but the setup is a lot more manual.

tppiotrowski
5 replies
4d3h

Bad timing: quarterly earnings call is this afternoon.

londons_explore
4 replies
4d3h

Perhaps someone was rushing to release something that was going to be announced on the earnings call...

tppiotrowski
3 replies
4d2h

That was yesterday. They announced pricing changes for Cloudflare workers. I think it won't affect me but heavy workloads will have to pay more. They used to charge a flat fee but now are moving to $/CPU time.

Overall I still love Cloudflare and run all my backend stuff there. It just feels simpler and cleaner than AWS but it's slowly starting to get cluttered as more features are released.

NicoJuicy
1 replies
4d2h

?

They announced that on 28/09, FYI. It's not new, and the flat fee hasn't changed.

They still have the €5/month fee.

https://blog.cloudflare.com/workers-pricing-scale-to-zero/

tppiotrowski
0 replies
4d

You're right. The email yesterday was that the voluntary opt in period to the new pricing was starting. Everyone will be switched over on March 1st.

zozskuh
0 replies
4d2h

"Cloudflare, Inc. Class A Common Stock is expected* to report earnings on 11/02/2023 after market close."

https://www.nasdaq.com/market-activity/stocks/net/earnings

pc_edwin
4 replies
4d3h

This is not looking good. I really hope they don't move away from dogfooding because of this.

Also vercel deployments (and edge functions in general) aren't working. Their status page says: "We identified the root cause as an issue with one of our upstream providers and working with them towards mitigation."

I wonder if edge functions use workers under the hood lol.

LINK: https://www.vercel-status.com/

pc_edwin
0 replies
4d1h

Digital Ocean too: https://status.digitalocean.com/incidents/bw3r3j9b5ph5

I don't know if I should be proud or scared that a lot of these products are just wrappers around Cloudflare offerings.

In this case I don't think it's a wrapper; DO's App Platform is a bit too sophisticated to be on Workers. I'm guessing a bunch of key workloads are using Workers.

jadbox
0 replies
4d3h

I'm pretty sure edge functions are just Workers with a little extra code for metrics and stuff.

flutas
0 replies
4d3h

Updated on the CF side:

> Cloudflare is assessing a loss of power impacting data centres while simultaneously failing over services.

Someone else in the thread shared this thread [0] which is about Flexential PDX02 losing power at the same time.

> Looks like they lost utility, switched to generator, and then generator failed (not clear on scope of Gen failure yet). Some utility power is back, so recovery is in progress for some portion of the site.

Sounds like the entire DC went down and their failover didn't handle it as gracefully as it should have.

[0]: https://puck.nether.net/pipermail/outages/2023-November/0149...

brycewray
0 replies
4d3h

> I wonder if edge functions use workers under the hood lol.

Yep: https://news.ycombinator.com/item?id=29003514

lionkor
4 replies
4d4h

someone will have their coffee break cut short at cloudflare, oops!

blackoil
2 replies
4d4h

Many more will get an extended coffee break.

amelius
1 replies
4d3h

Some will get an eternal coffee break.

NicoJuicy
0 replies
3d18h

Tbh, Cloudflare is transparent with these kinds of cases; they don't do scapegoating. When everyone in tech was firing, they didn't.

When an employee made a mistake, Cloudflare refused to name the person and said it was a mistake of the organisation (I was looking for the article, but couldn't find it).

During the last years, lots of companies fired lots of employees because of rising interest rates, overhiring during Covid, ...

In contrast: the only time they fired employees in larger quantity was in Q2 2023, in the sales department, regarding underperformers (performing 1/5th of the average). This was described here, based on the earnings call: https://softwarestackinvesting.com/cloudflare-net-q2-2023-ea... and it's about 100 employees (on > 3300), where they tried to re-hire others (who got fired elsewhere because of the Fed's rate hikes).

francisofascii
0 replies
4d4h

and extend others' coffee breaks, since their site is down and they can't work.

j16sdiz
4 replies
4d2h

The most impressive thing is: The status page actually works

josh_carterPDX
1 replies
4d

Isn't it ironic that of all the services failing the status page never seems to go down? Of course it's likely hosted somewhere else, but I always chuckled and thought the same thing when there was a major outage somewhere.

xp84
0 replies
3d3h

I think that's less of a coincidence than due to the advice many people give that you separate your status page from the infrastructure it's reporting on in every single way possible. One example of the approach is the separate top-level domain usually used. If I were setting a status page up, I would ensure it was with a different cloud provider, and ideally a separate side of the country or separate continent from my principal DC.

latraveler
0 replies
3d22h

why don't they just make the plane out of the status page?

carlhjerpe
0 replies
3d23h

"Powered by Atlassian Statuspage"

zackmorris
3 replies
2d23h

IMHO edge servers are a waste of time that expose conceptual flaws in the internet's design. We should have had free public proxy servers that relayed any publicly-accessible (public domain) static files automagically like Coral CDN, which is difficult to even find information about anymore:

https://en-academic.com/dic.nsf/enwiki/495553

And that caching should have happened in the ISP's backbone or on users' devices so that stuff like municipal wifi "just works". Why are we all downloading the same Netflix stream over and over instead of using a protocol more like BitTorrent, with a hash tree or content-addressable memory using Subresource Integrity (SRI) for stuff like scripts and fonts, instead of compiling it all into huge app.js files? Ridiculous.
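
(SRI, at least, already exists and is cheap to adopt — the hash for an integrity attribute is generated with a one-liner; app.js here is just a placeholder file:)

    # value for <script src="app.js" integrity="sha384-..." crossorigin="anonymous">
    openssl dgst -sha384 -binary app.js | openssl base64 -A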

Short of that, basically all servers should be using something like Varnish cache, which drops in with little configuration, leaving the distance between countries as the main source of lag:

http://varnish-cache.org/intro/index.html#intro
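
(A minimal sketch of that "drops in with little configuration" claim — the listen port, backend address and cache size below are placeholders:)

    # cache responses from an origin on localhost:8080, serving on :6081,
    # with a 256 MB in-memory store
    varnishd -a :6081 -b localhost:8080 -s malloc,256m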

One Node.js instance can serve like a million users per month. And it only takes 65 ms for light to get anywhere in the world (12,000 miles divided by 186,000 miles per second). Maybe 200 ms if the speed of light is 3 times slower in wires and fiber optic lines. Yet tons of cloud apps run slower than 1990s cgi-bin servers.

Imagine how we could improve the internet's core technologies with even the tiniest fraction of a fraction of the tech sector's profits and any leisure time at all. This is perhaps the greatest disappointment of my life, watching all of the innovations that I thought would happen come about in a private/corporate fashion which weakens our freedom and privacy.

gtirloni
2 replies
2d22h

I think your point boils down to "vendors' edge caches are too far away and should be at the city level", right?

If so, major cities already have local Netflix caches in many ISPs, for example. Same is done for Google, Facebook and others. It takes peering contracts, colocation, etc, but it exists extensively.

Maybe you're arguing for government-sponsored caching infrastructure?

zackmorris
0 replies
1d23h

Well, sort of. I feel that something like Interplanetary File System (IPFS) should have been the original web. Or at least, after BitTorrent arrived, we should have all moved to content-addressable memories because they automatically handle caching and scaling. That would have required solving circles of trust with potential improvements to SSL certificates so that they don't require a central authority, like a distributed version of letsencrypt.org. It also would have needed improvements to DNS for making domain-data pairs instead of domain-IP. And at the very least, IPv6 should have improved UDP so that it "just works" everywhere and doesn't have to deal with NAT hole punching. That stuff represents the real work of rolling out a true P2P internet.

Since we did none of that, we cemented the centralized web we have today, full of walled gardens funding the copyright nanny state. So that people are so aghast at seeing the real internet running on stuff like TikTok that their first instinct is to ban it.

To me, it feels like we're living under a bizarro version of the internet, so far removed from its academic/declarative/data-driven roots that it's almost unrecognizable. Javascript is just distributed async desktop programming now, reminiscent of the C++ days of the late 1980s, complete with all of the nondeterminism that we worked so hard to get away from. Paid services like Cloudflare and 5G are a symptom of that inside-the-box thinking. If we had real distributed P2P through our routers and cell phones, we probably wouldn't need ISPs today, because we'd be accustomed to gigabit download speeds and would only perceive long distance as slow and worth paying for. In which case, maybe the government should pay to maintain the backbone and perhaps satellite internet like Starlink. Which would be a good thing IMHO, because at least there would be oversight and we wouldn't have to worry about billionaire narcissists making the satellites highly reflective and visible to the naked eye, like carving a McDonalds sign on the moon.

ahrs
0 replies
2d19h

It annoys me to no end that the idea of metro ISPs never took off. If somebody in the same city as me wants to connect to any services I host at home, they first have to round-trip to my ISP's core routers and then back to me.

Now this all happens very quickly, within 5-6 ms, but it could be 1 ms or less in an ideal world. I understand why we cannot have nice things though (cost, mainly).

hk1337
2 replies
4d3h

Cloudflare has been down a lot this week. Technically, what I'm thinking of was HubSpot being down, but apparently that was related to Cloudflare issues.

willio58
0 replies
4d3h

Kinsta pages went down because of Cloudflare, taking our company's marketing site with it for a good 5-10 minutes.

lawgimenez
0 replies
4d3h

Supabase also went down I guess a few days ago because of Cloudflare.

dmix
2 replies
3d15h

It's now been 12 hours and they are still having issues.

I'm taking an online video course in music production and they host their videos on cloudflare stream and none of them work.

https://www.cloudflare.com/products/cloudflare-stream/

The single point of failure of half the internet continues to rear its head.

sanat
0 replies
3d5h

It's now 24 hours and Stream is still not up.

alberth
0 replies
3d1h

> "These issues do not affect the serving of cached files via the Cloudflare CDN or other security features at the Cloudflare Edge."

https://www.cloudflarestatus.com/incidents/hm7491k53ppg

I thought this only affected the Dashboard (and functionality inside the dashboard).

Not anything else.

Is that not the case, since you're saying Cloudflare video streaming isn't working?

corobo
2 replies
4d3h

Did someone re-commit this bug from the other day?

https://blog.cloudflare.com/cloudflare-incident-on-october-3...

E: nah, looks like a power outage in their brain DC. See other comments for details :)

noahjk
1 replies
4d3h

Every time CF goes down, we get rewarded with an interesting article. At this point I’ve been pavloved into enjoying outages!

corobo
0 replies
3d1h

Hah, right! I do love an in-depth postmortem

alams
2 replies
4d1h

Looks like Salesforce and Cloudflare share the same datacenter. Salesforce had the same issue today and mentioned a similar power failure as the root cause.

It is a known fact that Salesforce runs its "Hyperforce" on AWS.

Does that mean Cloudflare is also using AWS?

Has anyone faced any issues with AWS resources today?

nacholibrev
0 replies
4d1h

Judging from the AWS status page it seems that all services are up and running without any issues.

https://health.aws.amazon.com/health/status

NicoJuicy
0 replies
3d18h

Cloudflare, as far as I know, doesn't use any AWS products.

willtemperley
1 replies
3d1h

And yet their share price is up over 5% today

https://uk.finance.yahoo.com/quote/NET?.tsrc=applewf&guccoun...

rewq4321
1 replies
4d3h

Not just the API - I'm now suddenly having DNS troubles with one of my domains. I think maybe only domains/sites that use their "configuration rules", or some specific feature like that.

acheong08
0 replies
4d1h

Happened to me yesterday

dogweather
1 replies
4d

Cloudflare really embarrassed me:

I had attempted to do an ad hoc demo for my client in a morning meeting. Whoops.

ulti552
0 replies
3d2h

oof sorry to hear that

valgaze
0 replies
4d2h

Edge— “These products are impacted at the control plane / core level, meaning that only the changes to the existing configuration are affected, but the product is functioning at the edge …”

t3rabytes
0 replies
4d2h

Same PDX outage has been affecting Workday's WD5 dc, too.

spicykraken
0 replies
4d2h

> Cloudflare is assessing a loss of power impacting data centres while simultaneously failing over services.

Oh, no! The lava lamps are off, there's not enough entropy!

https://www.cloudflare.com/learning/ssl/lava-lamp-encryption...

sackfield
0 replies
3d22h

I've been trying to bring up a Github Enterprise instance and this is giving me a lot of grief (wanted to proxy it through Cloudflare DNS). It's been going for quite a while... I'm really quite shocked at the length of recovery time needed. Currently can't view analytics to find out why my instance isn't working.

pyuser583
0 replies
4d1h

And their quarterly earnings reports are due today. Ouch.

nodesocket
0 replies
3d1h

Meanwhile CloudFlare stock is up 10% today. lol

nicky0
0 replies
4d3h

Looks like it's taken out Digitalocean App Platform deployments too: https://status.digitalocean.com/incidents/bw3r3j9b5ph5

lbriner
0 replies
3d7h

I think most of us can forgive pretty much any mistake but for a service company, there are two golden rules:

1) Know that you have a problem quickly. We reported an issue yesterday and it took them 3 hours to acknowledge that there was a problem, causing us to waste hours trying to make sure nothing on our system was causing it.

2) Always have a quick way to roll back changes or mitigate a problem. I think they were having some power supply issue but it is still a business risk that should have been solvable quickly.

They are certainly not the only ones, but when you keep blogging about your company's technical prowess, this makes you look stupid.

kklisura
0 replies
4d3h

On the day of their earnings call...

junon
0 replies
4d4h

Two outages in the last week now.

EDIT: To the dead comment response, this has nothing to do with programming language choice.

dogweather
0 replies
4d

It's 60 degrees in PDX; the power outage doesn't seem weather related.

datavirtue
0 replies
3d3h

I managed to log in last night but there wasn't much use to doing so. Once I was in I could barely run a domain search in the registration app. Errors were being thrown on every dashboard.

allyant
0 replies
4d4h

Cannot seem to access the web UI either.

JohnMakin
0 replies
3d23h

"Sure, why not use cloudflare provider rather than simply hardcode values into our IAC, what could go wrong?"

Hands completely tied.. crazy how long this has been ongoing.

8ig8
0 replies
4d3h

DreamHost having issues at PDX since 4:54AM PDT...

https://www.dreamhoststatus.com