
Microsoft technical breakdown of CrowdStrike incident

rdtsc
207 replies
21h58m

We plan to work with the anti-malware ecosystem to take advantage of these integrated features to modernize their approach, helping to support and even increase security along with reliability.

Providing safe rollout guidance, best practices, and technologies to make it safer to perform updates to security products.

Reducing the need for kernel drivers to access important security data.

They are being as diplomatic as they can, but it's definitely a slap to CS. Read as "they don't know how to roll things out, they need guidance on basic QA practices, we'll happily teach them...". Then, they list a set of facilities running in user-mode to avoid needing to run as many things in kernel mode.
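For what it's worth, the "safe rollout" item boils down to deploying in rings with a health gate between them. Here is a rough sketch of that idea, not anything Microsoft or CrowdStrike has published: the function name, the ring layout, the `crash_rate` callback, and the 1% threshold are all made up for illustration.

```python
from typing import Callable, Iterable


def staged_rollout(
    rings: Iterable[tuple[str, list[str]]],
    deploy: Callable[[str], None],
    crash_rate: Callable[[list[str]], float],
    max_crash_rate: float = 0.01,  # hypothetical threshold, not a real-world value
) -> list[str]:
    """Deploy ring by ring and stop as soon as a ring looks unhealthy.

    Returns the names of the rings that were deployed and passed the gate.
    """
    completed = []
    for name, hosts in rings:
        for host in hosts:
            deploy(host)
        # Health gate: never move on to the next (larger) ring
        # while the current one is crashing.
        if crash_rate(hosts) > max_crash_rate:
            break
        completed.append(name)
    return completed
```

With rings ordered from a small canary group to the whole fleet, a bad update only reaches the ring where the crashes first show up; everything after it is never touched.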

I would be interested to hear what the water cooler discussion about CS was like inside Microsoft. Especially in teams that needed to respond to customers about "Your windows OS is broken, our hospital patients are suffering...".

f001
68 replies
20h26m

I can tell you they’re quite unhappy about it. I have a friend working there who frustratedly says it wasn’t their fault every time it comes up, which is quite often, at every social occasion since.

fishywang
65 replies
19h48m

but it's kind of their fault? they designed the api that way, they decided what can be done in userland and what must be done via kernel. they at least _allowed_ it to happen every time.

freeopinion
19 replies
15h35m

When a parking valet takes a car on a joy ride and crashes into a tree, we could blame the tree. We could blame the car owner for handing over the key. We could blame the auto manufacturer that didn't provide a "valet mode". We could blame the police for not detecting the joy ride before the crash.

All of these parties could do better (stupid tree!). But the real problem is the valet.

We can say that it is obvious that the electronics-heavy cars of today should anticipate rogue valets and build in protections. But we shouldn't let rogue valets off the hook for damages.

As a consumer, you could choose to only purchase cars that have "valet mode". So should we blame consumers who don't? If so, we should blame the airlines, hospitals, etc.--not Microsoft.

How about we prosecute valets unless they refuse to park cars that don't have "valet mode"?

Proziam
14 replies
15h1m

You could also prosecute the establishment that keeps a valet with an abominable record on staff.

Microsoft took no steps to force-eject them from their ecosystem, despite their long history of issues.

freeopinion
7 replies
13h34m

Just to be clear within the analogy: are you expecting the auto manufacturers to "force-eject" any hotel on Park Ave that has a record of valet mishaps? Or did you mean individual cars should force-eject the valet?

If a Caesars Entertainment property in Macao has enough incidents, should GM update the firmware on their automobiles to force-eject valets at Caesars Entertainment properties in Las Vegas?

Now imagine that GM actually operates valet services in Macao and Las Vegas. Should they be allowed to force-eject valets from competing services?

I am not a Microsoft apologist. I think they should do better. I think Linux and FreeBSD should do better. I personally avoid Microsoft products. But I place more blame on people who use MS products than I do on MS. After all, I never intend to hand my beat up old Corolla over to a valet so why should I have to pay for a "valet mode" feature that Toyota is forced to build into all their cars? Isn't it reasonable that motorcycles, 18-passenger vans, and scooters don't need "valet mode"?

In my book, the auto manufacturer is lower on the list of culprits than the valet, "the establishment that keeps a valet with an abominable record on staff", and the vehicle owner. But some place like Car and Driver could definitely prioritize encouraging GM or Toyota to develop valet modes over berating owners; so I don't mind a place like HN shooting a few arrows at MS. Unless the general public follows their lead and lets bad guys off the hook by shifting too much focus to somebody lower on the list.

mejutoco
5 replies
11h11m

Just to be clear within the analogy: are you expecting the auto manufacturers to "force-eject" any hotel on Park Ave that has a record of valet mishaps? Or did you mean individual cars should force-eject the valet?

Not OP, but I think the analogy here is the hotel "force-ejecting" (firing) the valet with a history of doing joy rides. That seems very reasonable.

lucianbr
3 replies
10h39m

In the analogy, it seems Microsoft is a car manufacturer. The hotel is the company that bought software from CrowdStrike. The problem is that Microsoft should not control who has access to which APIs, that is a huge can of worms, and actually called anticompetitive by the EU from what I understand. At MS level, either they publish APIs or not. If published, anyone should be able to write software for them. This is especially bad if MS themselves also sell security software that uses the same APIs. It would literally mean MS deciding who is allowed to compete with their security software.

mejutoco
2 replies
10h7m

I think it works better (please allow me to change it) if Microsoft is the hotel. Crowdstrike is the restaurant inside the hotel. The restaurant is serving poisoned food to the guests, who assume it is a decent restaurant because it is in their hotel.

Also the restaurant has their own entrance without security and questionable people are entering regularly, and they are sneaking into the hotel rooms and stealing some items, breaking the elevator.

At the same time, the hotel is in a litigation process with the restaurants association, because in the past they did not allow any restaurant on their premises. The guests, naturally, do not care about this, since their valuables have been stolen, and they have food poisoning. The reputation of the hotel is tarnished.

PretzelPirate
1 replies
5h19m

if Microsoft is the hotel

I don't think this works since Microsoft isn't the hotel. The hotel in your example chooses which restaurants are inside, but Microsoft doesn't. In this example, Microsoft is the builder who built the hotel building for a 3rd party. That 3rd party decides which restaurants it wants to partner with, as well as any other rules about what goes on in the building.

If the builder came around and made changes to ban the 3rd party's restaurant partner, that would cause a ton of issues and maybe get the builder sued.

Microsoft can't decide what can and can't run on their platform - the most they can do is offer certification which can't catch everything, as we just saw with Crowdstrike since they decided to take a shortcut with how they ship updates. Microsoft also had to allow for equal API access so they don't get sued by the EU.

mejutoco
0 replies
1h14m

Operating system (hotel) decides which programs run in kernel mode (Crowdstrike) but ok. Let me address the other point.

Again the reasoning of allowing equal API access to avoid getting sued is a false dichotomy: Microsoft could choose to make an OS that would not need such mechanisms to be simply usable.

They could also remove their own crowdstrike-alike offering, so that it would not be considered anti-competitive. They could also choose not to operate in EU. Of course, that would lower their profits, which is the real motive here.

Once you sum it up the reasoning goes: hospitals/flights can stop working because a company cannot lower its profits, and said company is not to blame at all. It is clearly false, the rest is sophism, and back-bending arguments IMO.

Proziam
0 replies
3h38m

This is the correct interpretation. I am surprised that people took it in different directions.

Proziam
0 replies
3h23m

I'm expecting restaurant owners to fire bad valets.

Or in Microsoft's case, via regulatory, social, or software means, prevent Crowdstrike from causing harm to their customers.

I'm aware it's a sticky regulatory situation, but CS has a history of these failings and the potential damage could be severe. Despite this, no effort (that I am aware of) was made by Microsoft to inform customers that Crowdstrike introduced potential risks, nor to inform regulators, nor to remove the APIs CS depends on.

I don't believe Microsoft is solely responsible, but I do believe that throwing all of the blame for the very real harm that was caused onto CS alone is missing a piece of the puzzle.

Last aside, every large corp has team(s) focused on risk. There's approximately zero chance they didn't discuss CS at some point. The only way this would not have happened is negligence.

rk06
1 replies
13h48m

Can Microsoft legally ban a competitor for perceived incompetence? I doubt it, particularly seeing how much competence is shown with Windows and MS Teams software.

sim7c00
0 replies
10h40m

Microsoft assigns driver levels to these guys, and allows them to load kernel-mode components as protected, etc. If they did not allow that, CS could not cause such damage. Of course, as you pointed out, this would then turn into some lawsuit blaming MS for killing competitors, even if they did it to try to protect their customers.

wonderful world.

Dylan16807
0 replies
10h14m

Microsoft was required to let them have the same access their own software used. Which seems fair to me. Microsoft can remove those APIs entirely, they just can't restrict them.

seanmcdirmid
0 replies
11h31m

Microsoft took no steps to force-eject them from their ecosystem, despite their long history of issues.

I’m pretty sure anti trust law doesn’t allow Microsoft to go anywhere near that kind of action, even if they wanted to be more Apple like.

Ekaros
0 replies
9h58m

The problem is that the establishment here is, well, the establishment. That is, the state itself, or at least one of them. Somehow MS is in a position where they will be prosecuted for any slight antitrust thing. Our system is set up to allow these actors in...

naasking
2 replies
8h48m

All of these parties could do better (stupid tree!). But the real problem is the valet.

No, the operating system is supposed to provide secure access to hardware and isolate independent subsystems so they can't interfere with each other. That's its whole purpose for existing. The fact that people feel they need to deploy CS is a Microsoft failure. Windows is just not a secure OS.

mynameisvlad
0 replies
2h13m

You’re shifting practically the entirety of the blame to a company that at best was an accomplice to the issue.

I get that you hate Microsoft, but not everything is their fault and it’s disingenuous to pretend otherwise.

The fact that people feel they need to deploy CS is a Microsoft failure.

CS is also available and widely deployed on Mac and Linux. Is that a failure of Apple and all the distros? It literally took down Debian and Red Hat systems earlier this year, is that also not CS’s fault?

kasabali
0 replies
5h30m

The fact that people feel they need to deploy CS is a Microsoft failure

They don't need to deploy shit. The only reason it's deployed is because it's a whole racket.

goosejuice
0 replies
2h49m

You could also choose to park the car yourself or plan for a secondary mode of transportation if something happened to your car.

Not the best analogy. The organization who deploys said software is responsible for the uptime of their systems. They didn't have to use CrowdStrike and if they do they should have a plan in the event of failure.

skissane
12 replies
19h21m

they designed the api that way, they decided what can be done in userland and what must be done via kernel

They didn’t have much of a choice - it is very hard to get adequate performance with real-time filesystem filtering without doing it in kernel mode. Not aware of any other mainstream OS which succeeds at that.

And they kind of had to provide this feature, since they’ve supported it since forever (antivirus vendors were already doing it back in the days of MS-DOS and Windows 3.x/9x/Me), and there is a lot of market demand for it. It is easy for Linux to say “no” when it never has had support for it (in official kernels)

But, as the blog post points out, it sounds like CrowdStrike is doing a lot of stuff in kernel mode that could be done in user mode instead - whether due to laziness or lack of investment or lack of sophistication of their product architects

they at least _allowed_ it to happen every time

Microsoft, in allowing third party code to be loaded into their kernel, is no different from other major OS kernels, such as Linux or Apple XNU.

Apple is (increasingly) the most restrictive about this, and a lot of people criticise them for it.

Even Linux imposes some restrictions (which kernel symbols to export, at all or as GPL-only), although of course, being open source, you can circumvent all restrictions by changing the code and recompiling.

fsociety
11 replies
18h52m

Mac and Linux run EDRs in userspace without an issue. No one here has an excuse or no choice.

dralley
9 replies
18h49m

Linux these days tends to use eBPF, which isn't really in userspace per se.

djbusby
8 replies
18h43m

eBPF is like the Twilight Zone. I'm in kernel space but, I'm not.

speed_spread
3 replies
17h43m

eBPF is Linux denying the fact that it's turning into a microkernel and that Linus was wrong.

markmark
2 replies
11h59m

If you're right for 30 years in tech you're right, even if things eventually change.

skissane
1 replies
11h20m

The famous Tanenbaum-Torvalds debate happened all the way back in 1992. At the time, the most common microkernel was Mach, which had significant performance problems. NeXT/Apple solved them by transforming Mach into a monolithic kernel, making Mach (as XNU) one of the most popular kernels in the world today (powering iPhones, iPads, Macs, etc). But that doesn’t help Tanenbaum’s side of the argument. And I don’t believe his own Minix did much better than Mach did.

Whereas, from what I hear, L4 and its derivatives have solved this problem in a way that Mach/Minix/etc could not. Yet still, it makes me wonder, if L4 has really solved it, why aren’t we all running L4? L4 has had some success in embedded applications (such as mobile basebands, Apple Secure Enclave); but as a general purpose operating system has never really taken off.

sidewndr46
0 replies
4h46m

from what I understand a huge number of computers run Minix, but only in the Intel Management Engine

LtWorf
3 replies
17h19m

Well, CrowdStrike crashed a kernel with it.

skissane
0 replies
17h11m

Apparently that wasn't (entirely) CrowdStrike's fault: https://news.ycombinator.com/item?id=41030352

Whereas this Windows outage rather obviously was.

eBPF being able to crash the kernel is usually sign of a kernel bug. And it sounds like in this case it was even a bug specific to Red Hat kernels, introduced by a Red Hat patch.

That said, even if they are triggering a Red Hat kernel bug, CrowdStrike should be testing their software adequately enough to pick up that issue before customers do – and it sounds like they haven't been

pclmulqdq
0 replies
16h55m

That was more of a kernel bug than a crowdstrike bug. However, it's clear that they are pushing what you can do in kernel space to the limits, which is not a great sign.

IsTom
0 replies
4h36m

Isn't being able to crash anything with eBPF a bug in either the kernel or eBPF? As I understand it, it's supposed to prevent exactly that.

feyman_r
0 replies
16h48m

Can you re-read the list (source: Wikipedia) in one of the comments in the tree? It had Debian and Red Hat issues listed on different dates.

lozenge
11 replies
19h43m

You can't just let people do anything from userland, the performance would tank. As for restricting kernelland, EU competition regulators would not be happy if MS was the only one able to write anti virus software that runs in kernelland.

ahepp
5 replies
19h9m

You can't just let people do anything from userland, the performance would tank

Isn't the point of userland that you can (try to) do anything from there?

It seems like MacOS and Linux provide substantially safer alternatives that are still performant?

As for restricting kernelland, EU competition regulators would not be happy

I keep seeing people say this. Is there a basis for that assertion, or is that mere speculation? Again, hasn't MacOS already deprecated kexts?

philistine
1 replies
16h25m

Well Microsoft did not publicly commit to using the same APIs, and no privileged access, for its own antivirus products. That's why the EU said no way; not because kernel access was revoked.

guiriduro
0 replies
10h12m

Yes, but then of course Microsoft is being obligated to open part of kernelspace to competitors, which is arguably "OK" from a competitive regulation perspective, but that then places a special burden on competitors to maintain code hygiene given the potential for crashes. It makes CrowdStrike's negligence all the more unacceptable.

pjmlp
0 replies
10h31m

MacOS still keeps the kexts support around, even if the long term roadmap is to move everything into userspace.

112233
0 replies
16h19m

What are the Linux alternatives you are talking about?

justinclift
4 replies
19h31m

[flagged]

throwaway237289
1 replies
19h11m

[flagged]

dang
0 replies
50m

Please don't respond to a bad comment by breaking the site guidelines yourself. That only makes things worse.

(Your comment would be fine without that first bit.)

https://news.ycombinator.com/newsguidelines.html

hilbert42
0 replies
14h36m

There are ways around this that I've discussed elsewhere so I won't repeat them here.

However, think of it this way: Windows restarts, tries to load with new patch and crashes.

Question: why can't Windows be designed so that on crash it automatically restarts and loads the previous state sans patch?

Answer: Windows could be designed that way, but it would require Microsoft to do many things it doesn't want to do. Some of them would require Microsoft to go back to the beginning and reengineer code that is a quarter-century old or more from scratch, which means redesigning APIs and the underlying architecture from first principles.

Why doesn't Microsoft want to do this? It's obvious so I won't bother to spell it out.

Nevertheless, when the dust fully settles and someone outlines these alternative design strategies in great detail then it'll be obvious to everyone what a fragile stack of cards Windows has been constructed on.

nilamo
8 replies
19h8m

Your car _allows_ you to drive off a cliff. If you do so, it is your fault, not the fault of the car manufacturer.

Kind of weird that anyone is blaming Microsoft for any part of this, imo

wokwokwok
3 replies
18h56m

Mmm… meaningless analogies are kind of meaningless?

More like:

If you install a security product that then prevents your car from starting; are they entirely blameless for letting you install it?

If you pull the hood up, tear off the “voids warranty” seal, ignore the “don’t open this” labels, crack the seals open and shove something into the engine… sure.

…but if you just slap a widget with the “vendor approved” sticker on your dash and it bricks your car; that’s a bit sucky right?

I do feel Microsoft is not entirely blameless in this.

It should be easier to recover from this kind of thing.

They should have been paying attention and made a fuss that one of the biggest security vendors has been doing this literally since they started.

I would bet money that until two weeks ago Microsoft was high-5ing them for best security practices.

It’s not “their fault” but they can’t just go “wasn’t us!”.

It was them.

It wasn’t macOS. It wasn’t *nix.

Suck it up. They should’ve done better.

krige
1 replies
11h10m

Except Crowdstrike had 3 separate Linux incidents, including kernel panics, directly before this happened.

happymellon
0 replies
3h38m

And at least one of them was actually a Red Hat kernel bug, where eBPF caused a kernel panic when it shouldn't be able to?

prmoustache
0 replies
10h35m

That is the problem: you feel.

Before Microsoft comes into the picture, the issue is Crowdstrike pushing updates without proper testing and selling a product whose update schedule customers cannot control, and customers being so naive as to not check what the product they install on critical systems does.

fishywang
3 replies
19h0m

The big difference is that CS is not the user. In your analogy it's like your car allows you to drive off a cliff, and an (almost) essential part of your car (for example, the pedal) drives the car off a cliff.

vel0city
0 replies
17h44m

CS is not the user

It got there because a user or administrator approved and installed it. It didn't just appear there, Microsoft didn't install it there. The user ran it.

nilamo
0 replies
18h10m

Right, so a slightly better analogy would be if you wanted to install a remote starter, but then you find out that they can only be installed into Fords, because other auto manufacturers (Apple, Linux in this case) believe that tampering with the critical path (the engine, kernel) is unsafe. It isn't Ford who's at fault for allowing you to run some random engine modification, it's that mod that is at fault.

jayd16
0 replies
18h47m

If it's a custom after market part, how can you blame the car manufacturer and not the part maker?

a-dub
7 replies
19h12m

i would have thought that in 2024 a bad driver update is something that windows would automatically roll back.

or at least provided some level of protection against crashes in third party kernel code.

sashank_1509
3 replies
17h13m

No you can’t roll back bad driver updates in any OS, if you could then by definition they do not sit in the kernel space. You just want the security code to not run in kernel space, which is a decision MS could maybe make and become like Apple, though most security software would in that case rebel.

fragmede
0 replies
16h25m

it depends on how bad. in Linux you can rmmod to get rid of the bad one if you haven't wedged it and fix your code, compile, and try again. I can't imagine that's actually different on windows if you know what you're doing. how do you think driver development happens?

a-dub
0 replies
16h56m

No you can’t roll back bad driver updates in any OS, if you could then by definition they do not sit in the kernel space.

drivers and kernel binaries are typically installed and maintained by user space programs that run with some sort of elevated privileges.

"kernel space" is just a runtime context, what gets loaded into there typically comes from ordinary (protected) files on the disk.

Dylan16807
0 replies
10h45m

That doesn't make any sense.

The OS loads file A into the kernel. It crashes. It reboots. It decides not to load file A this time.

Wow, it's a rollback of kernel-space code.

Unless your argument is that you can't guarantee a rollback of every possible kernel driver, because it might have installed a rootkit while it had full control? Okay, cool, but this isn't a malware removal idea. It's an idea for normal drivers.
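To spell the load/crash/reboot/skip loop out, here is a toy sketch. It is not how Windows (or any real boot loader) does it, and every name in it is made up: `boot`, the `load` callback, and the JSON state file are assumptions for illustration, and a "crash" is simulated by `load` never returning. It also assumes the crash happens while the file is being loaded, not some time later.

```python
import json
from pathlib import Path
from typing import Callable


def boot(files: list[str], load: Callable[[str], None], state_file: Path) -> list[str]:
    """Load each file unless a previous boot died while loading it."""
    if state_file.exists():
        state = json.loads(state_file.read_text())
    else:
        state = {"in_progress": None, "skip": []}

    # A leftover marker means the last boot never finished loading that file.
    if state["in_progress"] is not None:
        if state["in_progress"] not in state["skip"]:
            state["skip"].append(state["in_progress"])
        state["in_progress"] = None
    state_file.write_text(json.dumps(state))

    loaded = []
    for name in files:
        if name in state["skip"]:
            continue  # it took the machine down last time: leave it out
        # Persist the marker before loading, so it survives a crash.
        # (A real implementation would also need a durable, flushed write.)
        state["in_progress"] = name
        state_file.write_text(json.dumps(state))
        load(name)  # if this kills the machine, the marker stays on disk
        state["in_progress"] = None
        state_file.write_text(json.dumps(state))
        loaded.append(name)
    return loaded
```

The only state that has to survive the crash is one marker written before each load, which is why this doesn't need the code in question to live outside kernel space.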

VohuMana
0 replies
17h44m

If I understand the systems right, Windows can roll back a bad driver update, but the CS update wasn’t an update to the driver. Instead, it updated a configuration file, which CS shipped outside of Windows Update. So from the Windows Update perspective, the system started failing to boot with no changes to the system. Again, though, I don’t know if I totally understand what CS did and what capabilities Windows Update has.

TiredOfLife
0 replies
9h58m

It was not a driver update.

scarface_74
0 replies
16h44m

Microsoft tried to lock down kernel access in the Windows Vista era. Antivirus vendors went crying to the EU and they forced Microsoft to allow access to the kernel to third parties.

Iwan-Zotow
0 replies
19h31m

it's like a userland video driver - thousands of context switches per second, performance will dive...

999900000999
0 replies
18h41m

An OS flexible enough where you can do something stupid enough to completely break it.

Basically iOS, which is so locked you can't even run apps not expressly approved by Apple.

Pick one. If I build a bike and you remove the brakes to save weight don't get mad at me when you crash.

thejournalizer
0 replies
4h54m

Honestly most of the conversations were about getting everyone back online.

mns
0 replies
11h0m

I noticed this at work and in some other contexts last week. We weren't affected by this, but most of the people that brought this up, even technical people (other fields, not security or OS or anything like that), think that this was a Microsoft and Windows issue. They all seem surprised to hear that Microsoft wasn't the root cause of this, because no one knows or understands what Crowdstrike is or does.

holsta
52 replies
21h29m

they need guidance on basic QA practices

Microsoft has a loooong history of botched (security) updates, so I'm not hopeful they can teach Crowdstrike much.

Rinzler89
44 replies
21h21m

Do you happen to have a list of that "loooong history" of botched (security) updates?

I can only find a couple of examples after googling, which is a bit smaller than the "loooong history" you're talking about, so unless Microsoft is paying Google to delete results, maybe you're mistaken.

SoftTalker
28 replies
21h16m

This is a company whose OS could not even be installed on a live network without getting rooted within a few minutes. Anybody who was paying attention knew that you didn't use any new Windows release until at least the first service pack had come out.

Granted that was a while back but painful memories die hard.

Rinzler89
20 replies
21h14m

>This is a company whose OS could not even be installed on a live network without getting rooted within a few minutes.

That was Windows XP, 20 years ago. Please bring arguments about modern Windows 11 security, which is the current, up-to-date product they're selling and supporting, not scenarios that haven't happened in 20 years.

Eduard
12 replies
20h33m

for a loooong history, you have to look in the past

Rinzler89
11 replies
20h26m

Ah, well, if only things of the past were useful today, I'd still have hair, and probably millions made from the right investments, but unfortunately, it's what's happening today that actually matters.

echoangle
6 replies
19h53m

So you asked for proof of a long history and are now surprised that the examples are all from the past?

Rinzler89
5 replies
19h24m

How does that impact the present? If it's no longer as vulnerable today, why would I care about the past? The point is learning from mistakes and fixing them so that doesn't happen again.

echoangle
3 replies
19h21m

If it doesn’t matter to you, why did you ask? Are you just trying to win an argument or are you being intellectually honest? Because you asked for proof of the long history someone claimed. You could have just said “the long history doesn’t matter because I only care about the current state”. That’s fine and valid, but don’t ask questions and then shift the goalposts if you don’t like the answers.

Dylan16807
2 replies
10h35m

A "loooong history" needs to have a timespan of many years.

So yes it would start in the past, but it then has to continue for a long time.

Pointing out that a company was bad 20 years ago isn't enough. You need to show they were also bad 15 years ago, and 10 years ago, and 5 and/or 25 years ago.

So complaining that the only evidence was so far in the past is valid. The original goalposts were not reached. (Well, someone in another part of the thread eventually listed every google result for a windows update making anything crash, but that doesn't really establish that microsoft is "botching" updates at a level significantly above background noise, which I think was the original intent.)

echoangle
1 replies
10h2m

Well, someone posted examples from XP and someone else posted 4 botched updates in 2023. Do you need a list for every year in between?

Dylan16807
0 replies
9h33m

Was my implication of "every 5 years" not clear? But I already mentioned those links, they're pretty weak. I'm not calling an update that for a few people makes a handful of games crash "botched", when the original implication was quite juicy botching.

Also, if we're actually getting into this, the XP gripe had nothing to do with updates. That's moving the goalposts half a mile in the other direction.

albedoa
0 replies
17h41m

why would I care about the past?

??? You specifically asked for it! What are you doing.

squigz
3 replies
19h38m

GP is absolutely correct. You can't ask for examples of a long history of something, then dismiss examples from, you know, history.

Rinzler89
2 replies
19h21m

Fair enough, but if those examples are irrelevant to modern times, what's the point of bringing them up? If we want to keep the discussion relevant to modern context then let's discuss modern history, not obsolete news from 20 years ago.

squigz
1 replies
19h8m

What is "modern history"?

lucianbr
0 replies
10h34m

A period of time where Microsoft has no mishaps, of course.

TeMPOraL
2 replies
20h0m

Recall actually is a brilliant idea, and I dreamed of something like it for a long time, and so did plenty people here. It's just not something you can trust a third-party business with, whether it's a fly-by-night startup or an international megacorporation known to be openly promiscuous with advertisers.

This is basically "take a screenshot every 30 seconds and compile it into a timelapse", but on steroids, and the same appeal, and arguments wrt. who gets to run it on whose machines, all apply.

dahdum
0 replies
19h27m

If you keep your business and personal computing separate, Recall looks amazing.

clwg
0 replies
19h46m

The functionality does seem intriguing, but that doesn't change its security profile, which was poorly thought out and implemented.

feyman_r
0 replies
19h47m

Ignoring Windows Insider reports is bad. However, how many endpoints having issues (out of a billion+) is ‘acceptable’ after an update? We live in a news hype cycle, so clearly even a single failure will show up somewhere.

However, without metrics that show BSoDs from patches (which MS will likely never share), it’s hard to see if things have improved or regressed. If they regressed, someone up in their leadership chain is hopefully following the constructive discussion here.

tacticus
0 replies
20h32m

The company that let every db server have global admin creds and 0 logging on their cloud platform?

That didn't run their own enhanced visibility on their own cloud platform.

TeMPOraL
5 replies
20h3m

That's a bit disingenuous, though. That was, as 'Rinzler89 points out, some 20 years ago. Back then, any Linux distro would've definitely been a much safer option, because after installing you couldn't even connect it to the network: it had no support for your cable modem or wireless card, and that's assuming you didn't fuck up your MBR with LILO for the 20th time. Ask me how I know.

Both OS families have changed much since that time.

commercialnix
2 replies
18h25m

In 2002 I wasn't yet even out of middle school when I had a Linux distro running all key hardware components "just working". At that time at my school we were taught how to search the web, so I searched the web and looked up what hardware worked. Very simple. All I had to pitch to my parents was, "this system shares its code and encourages me to study it and learn code", which made clear to them what I was asking for wasn't just another video game console. Soon after I had a refurb laptop (fortunately not x86) and a curated WiFi card that ran Linux (and soon after, BSD) with everything "just working".

When I see someone complain about unsupported/unsupportable chips in comments on online forums, especially one dubbed "Hacker News", I am puzzled how I in my middle school years acted out a pattern that is objectively smarter* than what I read in such comments. I also happen to first-hand know I am for sure not the only one with this vantage point. Those who comment about unsupported/unsupportable chips as if it is somehow an open source kernel's fault might want to take a moment to consider how others, and how many others, are viewing such drivel. For every one of us who take the time to point this out, there are 10,000 of us experiencing utter contempt, like as if we just got an unexpected whiff of some hot garbage.

[*]And, I honestly don't think I'm even that smart.

fragmede
1 replies
18h12m

you got lucky with the hardware. there was a bunch of wifi cards that wouldn't work in Linux because there were no drivers. and then ndiswrapper came along and let you use windows drivers in Linux. now that was a user unfriendly procedure of getting it working. some chipsets eventually got native drivers like ralink or b53 but getting things working was not easy!

commercialnix
0 replies
15h7m

There was absolutely zero luck involved. As I already wrote in the previous comment, I did something very simple. I sought out a WiFi card that already had Linux drivers and then purchased that WiFi card. I didn't have to "do anything" to get the WiFi card working.

rvnx
0 replies
19h51m

Oh sweet, this laptop has a PCMCIA Wi-Fi card!

That'd be cool if one day I can get the laptop running on battery and not just on mains power.

Let me just set it up.

Wait a second, how do I wake up the screen again and get out of this hibernation state?

Why are all the fans stuck at 100% now?

Errr, first let's see if I can get the trackpad working.

lupusreal
0 replies
19h19m

Oh please, if it were that tough then teenage me never would have managed it. 20 years ago, e.g. 2004 (I first installed it in 2001), installing Linux and getting networked was already user friendly. The only hitch I ever had was figuring out ndiswrapper, but my ethernet cards all worked "out of the box" and installers handled the bootloader without users even having to know what a bootloader was. It's not like 20 years ago was the 90s or something, and the dark days of Windows lasted well into the 00s.

feyman_r
0 replies
20h2m

Agree. I also remember those days when it was so hard to get Linux to just boot up and get your display working correctly; it was almost like a rite of passage. It was just proving grounds for how much of an expert you were and the number of hours you spent in front of the PC, just to get things working.

My point is, good and bad memories will always stand out.

GordonS
8 replies
21h11m

There's only been a few really bad ones, but Microsoft botch Windows updates quite regularly.

Rinzler89
7 replies
21h10m

>but Microsoft botch Windows updates quite regularly

OK, please show us the proof then. If it's indeed as regular as you claim, then it must be documented somewhere as a greppable list.

Tech blogs would have a field day getting traffic on their site by keeping track and documenting on such regular mistakes if they exist.

feyman_r
2 replies
19h58m

Where can I find a list for all OSes? I’d assume such a list would have known issues with X11 etc. I want to ensure it’s not a case of survivorship bias.

oxygen_crisis
1 replies
14h47m

I don't think there is one... macOS doesn't have enough functionality-breaking updates to make a significant list, and Linux/BSD-based distros generally do cleanly segmented updates to individual apps and services rather than Microsoft's great big monolithic all-or-nothing OS update bundles that touch on dozens of services at the same time.

feyman_r
0 replies
13h44m

Here’s a quick 2 minute search on Google for each.

- https://www.macworld.com/article/671831/macos-wont-install-f...

- https://askubuntu.com/questions/1231849/how-to-fix-update-pr...

My own anecdote: When I got my M3 Pro in April and had to start afresh, it was stuck in a restart loop and had to take it to the Genius Bar; they asked me to answer ‘no’ to some question that I was answering differently. That was it. I have no idea on the root cause or why it was fixed this way. I don’t remember the exact screen where the answer was supposed to be different.

Brybry
2 replies
20h33m

It's frequent enough that people pay money for AskWoody[1] to tell them when it's safe to patch or what patches to skip.

[1] https://www.askwoody.com/ms-defcon-system/

Rinzler89
1 replies
20h23m

Quote, from the website:

"In general, I apply Windows Defender updates as soon as they’re available. Why? Microsoft hasn’t screwed up any of them too badly. You’re better off applying those updates than letting them slide for a week or two."

Brybry
0 replies
20h1m

Yep, Microsoft does a good job with Windows Defender (antivirus) updates.

It's the other Windows Updates that they botch frequently enough to make people wary of patching immediately.

system2
4 replies
21h16m

Anyone who worked in IT knows this, it is not something rare. Literally every month, for example one from last month:

https://www.techradar.com/computing/windows/windows-11-updat...

This is the main reason every IT professional I know disables auto updates of Windows and manually triggers updates after testing (hopefully) on multiple dummy machines on the network.

I personally remember booting to safe mode to remove Windows updates to rescue the computers more times than I can count.

Rinzler89
3 replies
21h12m

Examples like that one I also found, but that's not really a "looooong list". If people can only show one single example as an argument it's kind of a moot point.

system2
2 replies
20h26m

You'd experience at least 3-5 per year if you work in IT. There really is a long list but since it is not my argument, I won't list them after searching for an hour. The list starts early 2000s, not recent.

EDIT: Whatever, I will do the search for you since you cannot use google:

https://www.pcgamer.com/an-odd-bug-in-this-months-windows-10...

https://www.windowslatest.com/2023/10/22/windows-11-october-...

https://www.bleepingcomputer.com/news/microsoft/windows-10-e...

https://www.windowslatest.com/2023/02/09/microsoft-confirms-...

https://www.windowslatest.com/2023/07/16/windows-11-kb502818...

These are just the last quarter of 2023. There are over 2000 news items, but I won't link them. Use keywords: Windows Update, Crash, and use the date option on Google to go before 2023.

Rinzler89
1 replies
19h28m

All you could find were 4 examples in 2023? Hardly a long list, wouldn't you say?

I think my Android updates caused way more issues in one year and that's running an immutable HW that's well known and understood by the manufacturer, so 4 issues per year for Windows doesn't sound too bad, even though I had zero in 2023.

drdec
2 replies
20h14m

> they need guidance on basic QA practices

Microsoft has a loooong history of botched (security) updates, so I'm not hopeful they can teach Crowdstrike much.

Experience is the best teacher

psychoslave
0 replies
7h6m

Attention to the teacher is not equal among learners; trying to thoroughly assimilate the lesson is not everyone's move; challenging oneself with actual tests to ensure skill acquisition is rare; and going down the whole rabbit hole to figure out what unstated assumptions the teacher relies on, and understanding the limits of these suggestions, is a path only a few exceptional beings will follow.

justinclift
0 replies
19h30m

Is MS doing it properly these days though?

If they are, then you could be right. :)

cogman10
2 replies
19h56m

And they've learned a lot from it. For example, MS no longer universally deploys updates across the world, they have a slower rollout to avoid just such an incident.

sunaookami
1 replies
13h49m

Yeah now one million users lose access to their computer instead of 100 million!

fragmede
0 replies
13h46m

yes? that's 100x better! at the end of the day, internal testing just isn't going to catch every single permutation of customer configuration, so there's always a risk that something bad goes out. if you're that big, you'd start with 0.1% of the fleet instead of 1% of the fleet, so it's 100_000 before you get to 1_000_000, before going to 100%, but neither Apple nor Google has figured out a better way than that. It's industry standard at this point.
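
To make the stage arithmetic concrete, here is a minimal sketch of a staged rollout, assuming a fleet of 100 million machines (the figure used upthread). The stage fractions, the crash-rate threshold, and the `crash_rate_at` callback are all made up for illustration; this is not how any particular vendor does it.

```python
def rollout_stages(fleet_size, fractions=(0.001, 0.01, 0.1, 1.0)):
    """Return the cumulative number of machines updated at each stage."""
    return [max(1, round(fleet_size * f)) for f in fractions]


def staged_rollout(fleet_size, crash_rate_at, max_crash_rate=0.001,
                   fractions=(0.001, 0.01, 0.1, 1.0)):
    """Walk through the stages, halting as soon as a stage looks unhealthy.

    crash_rate_at is a hypothetical callback: given the number of machines
    updated so far, it returns the observed crash rate (0.0 to 1.0).
    Returns (machines_updated, completed).
    """
    updated = 0
    for target in rollout_stages(fleet_size, fractions):
        updated = target
        if crash_rate_at(updated) > max_crash_rate:
            # Stop here; the later, larger stages never receive the update.
            return updated, False
    return updated, True


# An update that crashes every machine is caught at the first stage.
print(staged_rollout(100_000_000, lambda n: 1.0))
```

With these assumed fractions, an update that crashes everything stops at 100,000 machines (0.1% of the fleet) rather than reaching all 100 million.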

SoftTalker
0 replies
21h22m

Yes, quite the epitome of throwing stones from a glass house.

notepad0x90
33 replies
21h43m

I must disagree with that take. Your last quoted sentence is in response to all the supposed self-proclaimed experts asking "why does it need kernel access"; the ones before that are to limit their own liability.

What I've heard from people in the industry is not this silly "oh no, crowdstrike is so incompetent" b.s. that is being spread on sites like HN and reddit but more of an empathic "it could have been us" sentiment. In this write up as well, Microsoft knows they have caused their share of outages, it is a technical write-up but in part, it is to cover their bases for government investigations and lawsuits that will arise from this incident.

And in part, they are also responsible for recovering from third-party driver errors and repeated boot failures caused by faulty drivers.

retrochameleon
29 replies
21h36m

CrowdStrike blamed their test software, but in the same breath revealed that they haven't been using any canary deployments. The bug that caused all this was present in their kernel driver for a long time.

For being such a large cybersecurity player and deploying updates to 8.5 million devices, their quality control practices are embarrassingly lacking.

mort96
13 replies
21h32m

Every company I've ever been at rolls out updates slowly. Rolling out a change to 8.5 million computers at the same time seems ridiculous. Even the most cash strapped start-ups with every incentive to cut corners tend to get staged roll-outs more or less right. It's crazy.

geon
6 replies
20h56m

I had a fleet of only maybe 200 computers I updated remotely. I did canary staged roll outs.

notepad0x90
4 replies
16h2m

not a software update!

mort96
3 replies
8h30m

Not relevant!

notepad0x90
2 replies
6h34m

details are always relevant in a technical discussion. look at my other comments where i pointed out microsoft performing similar immediate av signature updates and causing chaos.

mort96
1 replies
2h23m

Some details are relevant, some are not.

I'm more than comfortable labelling parts of Microsoft as incompetent as well.

notepad0x90
0 replies
46m

We can agree on that, but it is relevant because this isn't an unusual practice. Crowdstrike didn't ignore some pre-existing best practice. Lots of things need improving but facts and details matter when you talk about RCA. it isn't about blame but fixing the root cause.

doubled112
0 replies
20h22m

When I managed ~ 15 developer’s Arch Linux workstations, I found it very beneficial to be the canary, and then rollout to a couple of the more capable of troubleshooting devs, and then the rest. I can always fix my own box.

8.5M all at once feels insane.

notepad0x90
4 replies
16h3m

again, this is why I was snarky in my earlier post, this was not a software update. they should have used canary deployments still but in many cases prior to this incident, it was not acceptable to wait even a few hours because it can make the difference between companies getting ransomwared/hacked, so they focused on making the actual code/driver that interprets the channel file updates robust enough to handle real-time updates. Even if other players were doing canary deployments with behavioral detection updates, they're not the market leader, crowdstrike is for a reason.

Everyone that worked in an operational incident response role has blocked some indicator like an ip address or a domain. you don't do gradual roll outs for those either, and i've seen people cause outages by skipping a check or making a mistake. this is similar in many ways to that except it was for a named pipe. This could probably have waited for a canary deployment, but in general the class of content that is being deployed would be deployed right away, I'd be surprised if their practice is considered "bad" by any measure. I've seen Microsoft also deploy email quarantine signatures and defender updates that caused large scale impacts.

Here is a link of what Microsoft did earlier this year:

https://www.techradar.com/news/google-chrome-not-working-mic...

If they had canary deployments, that wouldn't have happened. I had rules that were causing chaos because of that. Now imagine if defender had a bug that caused it to crash because of a signature update. The impact would be magnitudes greater than what you saw with Crowdstrike. It's really frustrating to see the lack of technical critical thinking and arm-chair experts acting like they know what they're talking about.

mort96
2 replies
8h30m

Let's say the driver was "robust enough" to handle a broken channel file. How would that look exactly? Say you're responsible for writing the code which loads a new channel file. These channel files are critical; without them, your security critical product doesn't know how to do its job. The channel file parser returns a parse error. How should the driver respond? Surely you're not going to just silently disable your security critical product if someone puts a bad channel file in there?

PleasureBot
1 replies
4h15m

Delete the file or mark it as corrupt so that the parser doesn't keep trying to read it, and send some telemetry back to CS to indicate there is a problem with the one of the channel files. It doesn't seem very complicated at all. There are plenty of options in between "catastrophically crash the OS" and "silently disable the entire product".
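
A rough sketch of that middle ground, in user-space Python rather than driver code. Everything here is an assumption for illustration: the JSON format, the `.corrupt` suffix, the `"rules"` key, and the `report` callback are hypothetical stand-ins, not CrowdStrike's actual file format or telemetry. Whether falling back to the previous rules is acceptable is a policy decision the sketch does not settle.

```python
import json
import os


def load_channel_file(path, last_known_good, report):
    """Try to parse a content file; on failure quarantine it and fall back.

    path:            location of the newly delivered content file
    last_known_good: previously validated content to keep using on failure
    report:          hypothetical telemetry callback taking one dict
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            content = json.load(f)
        # Minimal validation: reject anything without the expected shape.
        if not isinstance(content, dict) or "rules" not in content:
            raise ValueError("missing 'rules' key")
        return content
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError is a subclass of ValueError, so parse
        # failures land here too.
        if os.path.exists(path):
            # Rename the file so it is not parsed again on the next boot.
            os.replace(path, path + ".corrupt")
        report({"file": path, "error": str(exc)})
        return last_known_good
```

The point is only that a parse failure can produce a quarantined file, a telemetry event, and continued operation on the old content, instead of a crash loop.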

mort96
0 replies
2h24m

That seems pretty dangerous if that channel file included security critical configuration, which it presumably did

Dylan16807
0 replies
10h19m

it was not acceptable to wait even a few hours

Hours... Wouldn't a 15 minute canary have found this problem about 14 minutes before it hit wider deployment?

binkHN
0 replies
21h22m

Beyond crazy. I even have a small app that never makes it to production before being rolled out to internal and open testing first. And, even then, it's slowly rolled out to a percentage at each stage before being fully deployed. One would think a major company with kernel level access would do this at minimum.

rvnx
9 replies
21h34m

Clearly incompetence to deploy from 0 to 8 million devices without any gradual rollout.

That goes even further, because apparently they were fully blind and didn't have crash metrics.

"Ok we push the update, and pray".

galangalalgol
7 replies
21h29m

I think it is past incompetence, and on into negligence. Given the stories we have heard here about emergency service failures it is likely that people died. When people die due to negligence isn't that usually criminal?

SoftTalker
3 replies
21h19m

Who is negligent though? Crowdstrike, or the emergency services that are using an OS that requires third party endpoint security right out of the box in order to be safely used, or the company that makes and sells that OS?

crazygringo
1 replies
21h8m

Why not both?

Crowdstrike, for negligently not rolling out updates gradually.

And emergency services, if they don't have robust fallback procedures/systems for when their IT system goes down. I mean it's totally fine if regular doctor's visits get postponed, but 911 should never go down just because their computers are down. Just like aircraft have redundant systems, so too should 911.

(The company that makes and sells the OS -- I don't see any negligence there, in this case. If security software fundamentally requires running at the kernel level and Microsoft allows that, I don't see how Microsoft can be at fault.)

jmb99
0 replies
20h43m

Yeah, I don’t see how one can blame Microsoft in this scenario. If you choose to run buggy kernel-level code, that’s on you, not the publisher of the kernel/OS. Especially when the code you’re running is a replacement for functionality already provided by the OS. It’s hard to argue that MS could be negligent for “not having a good enough AV/endpoint protection solution” or “allowing customers to run kernel-level code.”

Aeolun
0 replies
18h55m

It’s hard for people to understand that these massive ‘security’ enterprises are often connected by a large amount of bodies instead of competence.

rvnx
0 replies
21h29m

Can't agree more, you found the right words.

notepad0x90
0 replies
15h58m

https://www.techradar.com/news/google-chrome-not-working-mic... Not an unusual practice, and they were not the first AV company to cause outages. And again, it was not a software update; the buggy software was deployed after testing back in March. Details matter!

How about we let the lawyers figure out who had what liability, just like with the av/edr industry, we should know when the subject matter is outside our area of knowledge and expertise.

binkHN
0 replies
21h20m

And this is how the lawsuits will start.

notepad0x90
0 replies
16h1m

I shared with a sibling commenter:

https://www.techradar.com/news/google-chrome-not-working-mic...

Did Microsoft do a staged or canary roll out with that? This is not a software update, if you're making such comments then you're speaking about something outside of your field of expertise.

duskwuff
2 replies
21h31m

CrowdStrike blamed their test software, but in the same breath revealed that they haven't been using any canary deployments.

Their post-incident report [1] also stated that they intend to improve testing by "using testing types such as: local developer testing". One has to wonder what, if any, testing they were doing beforehand.

[1]: https://www.crowdstrike.com/blog/falcon-content-update-preli...

MBCook
1 replies
16h0m

Well we know what the testing is, don’t we?

The update literally crashed the system it was used on.

There’s no way they couldn’t know that unless they never ran it. Right?

Is this one of those things that only happened to 10% of users? Because I haven’t seen that reported anywhere.

duskwuff
0 replies
12h35m

Is this one of those things that only happened to 10% of users? Because I haven’t seen that reported anywhere.

As far as I'm aware, it affected all systems using Crowdstrike.

lupusreal
1 replies
19h11m

Unless their developers had room temperature IQs or were actual psychopaths, I really wonder how they even managed to find developers who had the nerves to deploy to the whole world all at once like that. If it were me I'd be scared shitless, covered in sweat and probably shaking too hard to even type. Were CrowdStrike developers too stupid to even realize the magnitude of what they were doing? Or did they have cooler nerves than an open-heart surgeon? It's shocking to me that they could have done this so casually.

Aeolun
0 replies
18h54m

Were CrowdStrike developers too stupid to even realize the magnitude of what they were doing?

More likely they were following a playbook to the letter, and were therefore 100% sure of success.

michaelt
0 replies
21h27m

Anyone in the industry could have a bug get through testing.

Some companies could have a severe and readily reproducible bug get through testing.

A few of those companies have a hand-rolled update mechanism, and can accidentally break their ability to roll back a bad release.

A few of those companies are in a position to push a release that breaks not only their own software, but the entire OS.

Very few companies in that position would roll out to 100% of client machines in a single worldwide deployment.

gjsman-1000
0 replies
21h32m

Microsoft should be sued, for literally having blood on their hands. There was an easily mitigated design flaw in Windows that would have greatly blunted the impact.

https://news.ycombinator.com/item?id=41095788

freehorse
0 replies
21h27m

If "it could have been them", then I would like to read such professionals write exactly about how to avoid having a global outage like this again, rather than "showing empathy" with a corporation. Or do we just leave it up to luck, and if "it happens to them too" in a month or year, oopsies? What about which practices could be improved?

nimbius
29 replies
18h5m

this isnt even the first time its happened. Crowdstrike has killed an OS every month for the past four months.

At this point they are a threat actor. if you havent kicked their amateur-hour software out of your infrastructure by now, chances are good senior management and engineering have at least considered it formally.

https://en.wikipedia.org/wiki/CrowdStrike#Severe_outage_inci...

metadat
20 replies
17h53m

That incident list is damning. Is senior leadership asleep at the wheel, or how can this many incidents possibly happen every 30 days for months on end? If leadership really cared, they'd make sure post-mortems and other best practices are in place to reduce the frequency.

Unfortunately, the executive disconnect isn't new. It's actually uncommon that they care about the reality for end users and customers (which is antithetical to my entire ethos, hence why I get paid the medium bucks). Why bother waking up and going to work every day unless you are contributing in some way to sustaining a better future for everyone? It's actually great for marketing and it's already going to be a tough 100+ years from today for our children, even with our collective care.

P.s. People can be so selfish, it kind of breaks my brain but not really. Have you seen the CO2 emissions visualization from NASA this week? It was a wakeup call for me.

'Tremendous' NASA Video Shows CO2 Spewing from US into Earth's Atmosphere https://www.newsweek.com/nasa-video-carbon-dioxide-co2-emiss...

It's concerning.. and caught no traction.. http://news.ycombinator.com/item?id=41064029

swasheck
17 replies
14h46m

here’s a fun connection: https://x.com/anshelsag/status/1814426186933776846

“ For those who don't remember, in 2010, McAfee had a colossal glitch with Windows XP that took down a good part of the internet. The man who was McAfee's CTO at that time is now the CEO of Crowdstrike. The McAfee incident cost the company so much they ended up selling to Intel.”

so yeah, “leadership” (and that’s a loose term) doesn’t seem supremely concerned about much more than earnings

hinkley
8 replies
13h50m

The fish rots from the head.

Also what the fuck is a sales-facing CTO??

cratermoon
3 replies
13h18m

I'm suspicious of CrowdStrike now. If we rip the cover off would we find that it's little more than a reskin of McAfee?

hinkley
0 replies
3h3m

Sometimes it’s good to take a little break after working for a company that ended up not representing your values.

I’m on #2 now and it’s been great. It’s like a breakup. “What was I thinking?”

Of course if it is representing your values and your values are purely mercenary, it’s really not going to change anything.

graycat
0 replies
10h50m

The Internet is able to transmit odors of rotting flesh????

Recently ordered an HP laptop for some light work (not my startup), and when placing the order said don't include McAfee, that "I don't trust them", all just from some odor!

CrowdStrike runs in kernel mode? No wonder there are problems; kernel mode sounds like more of a threat than a protection.

Sooooo, for my Web server(s), McAfee and CrowdStrike are issues I get to ignore. Problems avoided and time, money, energy saved!! Simple.

kermatt
0 replies
3h21m

Also what the fuck is a sales-facing CTO??

Perhaps a symptom of the "Everyone is in sales" brain damage so pervasive in companies now.

joshstr
0 replies
2h27m

Have seen region-specific Field CTO roles partner with GTM teams to co-sell with customers. Product and role domain expertise without the organizational technology responsibility.

Wytwwww
0 replies
6h56m

I assume T stands for [Sales and Marketing]Technology. Which makes perfect sense because these are their core departments that the whole company is dependent on.

The product itself is a secondary cost-center, probably less important than even accounting.

Terr_
0 replies
11h40m

a sales-facing CTO

Is that what happens when a company has so many Sales Engineers that they become a parallel department from regular Engineering?

shmeeed
5 replies
11h19m

Now that's interesting. I wonder why neither here nor there anybody mentions GK's name. Fear of litigation?

IMO somebody who managed to collapse the most important infrastructure on earth twice in as many decades - not a small feat, I have to admit - should be known by name to the general public, lest he'll get another chance at it.

prmoustache
3 replies
10h41m

I haven't seen any important infrastructure on earth collapse, neither in 2010 nor in 2024.

LadyCailin
1 replies
9h11m

Tell that to the people whose surgeries were cancelled because of computer issues.

prmoustache
0 replies
8h54m

That was still not one of the most important pieces of infrastructure on earth.

And outages were not as global as news outlets made them out to be. Crowdstrike may have been ubiquitous in some countries, but almost absent in others. And still, crowdstrike or windows aren't global pieces of infrastructure.

shmeeed
0 replies
7h54m

I admit that was a bit of hyperbole. My point stands regardless.

rbanffy
0 replies
9h8m

Tech needs something like the FTC that can ban someone from working in that area after multiple demonstrations of glaring incompetence. Or evil misdirection of competence.

Wytwwww
1 replies
6h59m

Is senior leadership asleep at the wheel, or how can this many incidents possibly happen every 30 days for months on end?

Presumably it doesn't matter that much and isn't worth spending money/manpower on?

If the usefulness/quality of their software has no influence on their potential customers' decision-making process, why bother?

It would make much more sense to allocate any excess resources to the departments that do actually matter like sales and marketing.

maerF0x0
0 replies
4h57m

Presumably it doesn't matter that much and isn't worth spending money/manpower on?

Well, if they think any of the $20B of shareholder value lost recently has to do with the quality issues... Then perhaps they should reconsider. (keep in mind marketcap also represents their ability to raise capital in the future with more/less dilution)

wannacboatmovie
1 replies
11h17m

From your linked article:

A Hacker News user claimed that

Nice to see Wikipedia has devolved even further into a dumpster fire in that they are now citing random HN posts as authoritative sources of facts.

majewsky
0 replies
5h48m

Wikipedia is not an individual actor or a hivemind, so there is no capital-T "They". It's a system of multiple people each acting on their own accord. For a developing news story like this, I find this type of sourcing acceptable, especially because it is cited as "some person on the internet claims", not as "it is true that".

If you disagree with this choice of source, you can flag this part as needing better sources. The simplest way to do so is to just leave a comment on the talk page.

surfingdino
1 replies
10h34m

Never assume malice where incompetence will suffice. I have worked on teams where we could not get the basics like test or integration environments signed off for months, yet the managers expected us to go to production. Suffice to say production was also not signed off for half a year and we had to improvise. I wonder if something similar was at play at CS?

gregw2
0 replies
7h44m

Never assume incompetence when greed will suffice.

hinkley
1 replies
13h52m

Staffing problems?

Management often sees, “I have a dozen people on this.” When in fact the bus number was three, you laid one off, another quit and the third is sick or having life struggles.

kermatt
0 replies
3h19m

"I have a dozen people on a dozen different things."

whiplash451
0 replies
1h12m

Or maybe crowdstrike is dealing with the hardest threats and hence ends up having to rollout stuff rapidly against zero-days?

Not a CS fanboy, but just wanted to suggest an alternative to sheer incompetence

jgalt212
0 replies
8h15m

this isnt even the first time its happened. Crowdstrike has killed an OS every month for the past four months.

Yeah, but doesn't MS have to sign every kernel mode driver? They've allowed Crowdstrike's foot gun to continue to live in the kernel.

gnfargbl
7 replies
21h30m

It didn't read as particularly diplomatic to me. In particular, this paragraph..

> It is possible today for security tools to balance security and reliability. For example, security vendors can use minimal sensors that run in kernel mode for data collection and enforcement limiting exposure to availability issues. The remainder of the key product functionality includes managing updates, parsing content, and other operations can occur isolated within user mode where recoverability is possible.

...was about as close to tetchy as a post like this would ever get. Basically they are saying "there was no good reason at all why CrowdStrike had to put so much code inside the actual kernel." And with the benefit of hindsight, it's a strong point.
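
The isolation argument can be illustrated with an ordinary process boundary, as a loose analogy only (real kernel/user-mode separation on Windows involves driver APIs not shown here). In this sketch the parser is a hypothetical toy that sums integers and dies with an unhandled exception on malformed input; because it runs in a child process, the caller survives and can recover.

```python
import subprocess
import sys

# Hypothetical parser, run in a separate process. Malformed input raises an
# unhandled ValueError, which stands in for a parser bug.
PARSER_SOURCE = """
import sys
values = [int(token) for token in sys.stdin.read().split()]
print(sum(values))
"""


def parse_isolated(content, timeout=10):
    """Parse content in a child process; return None if the child fails."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", PARSER_SOURCE],
            input=content,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        # The parser died, but only the child process was lost.
        return None
    return int(result.stdout.strip())


print(parse_isolated("1 2 3"))    # well-formed content is parsed normally
print(parse_isolated("1 two 3"))  # malformed content: the caller keeps running
```

A bad input costs one child process here; the same bug inside kernel code takes the whole machine down, which is the trade-off the quoted paragraph is pointing at.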

ffhhj
6 replies
21h13m

there was no good reason at all why CrowdStrike

Their business is corporate spyware to surveil employees, of course they'll use any tactic to make it work, that's the why. And their EULA states there is no liability for the company:

https://www.crowdstrike.com/terms-conditions/

Dirty policies on top of dirty practices.

Rinzler89
4 replies
20h53m

>Their business is corporate spyware to surveil employees

What?! Anything you do on your corporate provided laptop is always gonna be logged by IT for security in every large company everywhere, that's news to nobody, but your company doesn't care that you use your corpo laptop to book your vacation, IT has better things to do than narc on you for that.

If your boss wants to actually spy on you they don't need Crowdstrike, there's other SW dedicated for that depending on the laws in your jurisdiction but that's not what Crowdstrike is for.

If you want complete privacy from your employer, just use your personal machine for your private activities instead of your work laptop, why is this so hard?

userbinator
2 replies
20h36m

Speak for yourself. There are still companies who don't treat their employees like idiots and actually trust them. Let's not normalise pervasive surveillance.

Rinzler89
0 replies
20h35m

>There are still companies who don't treat their employees like idiots and actually trust them.

Yeah sure, but how many of those are large non-tech companies?

You massively overestimate the tech competency of the average PC user if you think it's normal in most companies to not have security monitoring solutions in place for internet activity. In the latest phishing test our IT did, several users fell for the trap, despite it being a tech company. There's always gonna be someone careless one day and companies want insurance policies against that.

Having such solutions in place doesn't mean the company doesn't trust you, it's more like that old Russian proverb, "trust but verify", and for ticking security compliance boxes as an insurance policy.

Everyone makes mistakes, it's only human. So more like, speak for yourself, if you think your internet activity at work isn't logged anywhere.

Aeolun
0 replies
19h21m

I think there’s an inflection point where the company has grown so big it becomes impossible to trust every individual employee.

It won’t be about distrusting anyone in specific either, but something will go wrong for which you need to be monitoring every PC to find out what is going wrong.

heraldgeezer
0 replies
19h29m

Yep, there are better tools for spying, like Teramind and Aktivtrak.

heraldgeezer
0 replies
19h29m

There are better tools for spying like Teramind and Aktivtrak. Crowdstrike would make a bad spying tool. I guess there is remote CMD? And you can like, see all installed programs.

But so can SCCM/Intune from MS or another RMM like Datto that IT uses to manage PCs...

blackoil
5 replies
16h56m

MS should have something like Project Zero for Windows applications and drivers. Any app on more than 1-5% of PCs should be tested and fuzzed and ... And the vendor then pressured into fixing the issues. Even if it is not technically their fault, it is definitely an optics problem for MS; half of the world refers to it as the Windows blue screen issue.

MBCook
2 replies
16h4m

And the vendor then pressured into fixing the issues

How would Microsoft apply pressure? Short of publicly shaming them what power do they have?

blackoil
1 replies
14h17m

Umm. Give an x-day deadline and go public after it, the way Project Zero works, threaten to take away the "Verified by MS" badge, or create a WhatsApp group of Fortune 500 CIOs and badmouth them in it.

9dev
0 replies
7h59m

Both of these have legal repercussions: Microsoft could very well be called a competitor of CS, so they cannot force them to do something without getting accused of abusing their market position; and a publicly traded company badmouthing another publicly traded company with an awfully complex web of mutual investments is a very bad idea in general.

It’s not that easy.

naasking
1 replies
8h52m

People wouldn't need CS if Windows was better designed to begin with...

rty32
0 replies
2h49m

Care to elaborate?

How would a better designed Windows eliminate the business & compliance need for installing software like CS? And why hasn't that already happened?

I would think Microsoft and CS' customers have an incentive to not have such third party software on their system if possible.

lupusreal
1 replies
19h29m

Why are they being diplomatic, instead of plainly stating their contempt and revoking CS's driver/etc signing keys? Doing so would help to repair the reputational harm that CrowdStrike inflicted on Windows.

Are their lawyers telling them they can't impede CrowdStrike even though CrowdStrike is breaking Microsoft's product? They should do it anyway and dare CS to take it to court so they can publicly humiliate CS by dragging all the dirty details of their incompetence out.

Aeolun
0 replies
19h15m

People are free to install kernel modules. It shouldn’t be up to microsoft to stop them from doing so.

oneeyedpigeon
0 replies
10h56m

From the latter:

However, nothing in that undertaking would have prevented Microsoft from creating an out-of-kernel API for it and other security vendors to use. Instead, CrowdStrike and its ilk run at a low enough level in the kernel to maximize visibility for anti-malware purposes. The flip side is this can cause mayhem should something go wrong.

The Register asked Microsoft if the position reported by the Wall Street Journal was still the IT titan's stance on why a CrowdStrike update for Windows could cause the chaos it did. Redmond has yet to respond.

thebytefairy
0 replies
3h42m

It's a little ironic they are taking the high ground on safe rollout practices when they had an Azure/365 outage caused by a bad config at the same time as the CS incident. Though to be fair, it only affected US central.

gjsman-1000
21 replies
21h53m

Reminder that Microsoft could have programmed Windows to notice if a driver has caused a blue screen three times in a row, and prompt if you want to disable the driver on boot. After all, Windows already collects how many times a driver causes a crash. This would have made recovery one click instead of heading into Safe Mode and needing BitLocker keys.

But they didn’t.

And Microsoft, I argue, also has blood on their hands for every hospital this hit. Giving users a prompt to disable the driver, after three successive failed boots, would have saved lives.

t-writescode
8 replies
21h51m

How would that have helped the server farms that were experiencing the issue?

gjsman-1000
4 replies
21h48m

Oh I don’t know, the server's down, you go and look as a technician, and you simply see a screen saying:

“CSAgent.sys has caused a failure to boot three times in a row. Do you want to disable this driver? <Yes> <No>.”

You click “Yes.” Server reboots with CrowdStrike driver disabled. The day is saved in 5 minutes instead of building a custom ISO image or going on a BitLocker key recovery spree.
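To make the proposal concrete, here is a minimal sketch of the bookkeeping such a prompt would need. It is purely illustrative: `BootLog`, `record_boot` and the threshold constant are hypothetical names, and nothing here reflects how Windows actually tracks faulting drivers.

```python
from dataclasses import dataclass, field

FAILURE_THRESHOLD = 3  # the "three times in a row" from the proposal


@dataclass
class BootLog:
    """Tracks consecutive failed boots per driver (hypothetical structure)."""
    consecutive_failures: dict = field(default_factory=dict)

    def record_boot(self, faulting_driver=None):
        """Record one boot attempt; faulting_driver is None on a clean boot."""
        if faulting_driver is None:
            # A successful boot resets every streak.
            self.consecutive_failures.clear()
            return
        # A crash blamed on one driver breaks the streak of all others.
        streak = self.consecutive_failures.get(faulting_driver, 0) + 1
        self.consecutive_failures = {faulting_driver: streak}

    def drivers_to_offer_disabling(self):
        """Drivers whose streak reached the threshold and warrant a prompt."""
        return [d for d, n in self.consecutive_failures.items()
                if n >= FAILURE_THRESHOLD]


log = BootLog()
for _ in range(3):
    log.record_boot("CSAgent.sys")
print(log.drivers_to_offer_disabling())  # ['CSAgent.sys']
```

The hard part is not the counter but the policy around it: whether a security driver should ever be skipped, and who gets to decide.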

politelemon
3 replies
21h45m

It would still have required on-site presence and interaction, during which there is still downtime, so this accomplishes only marginal gains.

gjsman-1000
2 replies
21h42m

At the same time though, imagine you woke up and CrowdStrike hit your organization.

For most users, they’ll try clicking “Yes.” And then it’s back to work. After all, “No” just causes a blue screen again, might as well try the other path.

This would have been the difference between the IT department handling 10,000+ calls or a few hundred (plus sending out a bulletin) in many, many organizations. It also could have saved billions at this point.

Heck, it would have saved lives in hospitals.

jonathantf2
0 replies
20h28m

But then you have millions of endpoints booting without malware protection

echoangle
0 replies
19h39m

Can you cite some reports of deaths caused by the outage?

morkalork
2 replies
21h37m

Instead of prompting on the screen, disable the driver and boot directly into a recovery state that has networking enabled so sysadmins can push scripts and fixes? As long as it's not a network driver you'd be okay.

t-writescode
1 replies
20h7m

Disable the driver that is explicitly there to protect from malware and attacks?

Wouldn’t malware just use that as an attack vector?

morkalork
0 replies
19h22m

Nooo you don't understaaaand kernel code is special :'( actually BSOD was a desired feature because CrowdStrike is a Security (TM) application

crazygringo
3 replies
20h57m

Do I like your idea for that?

Yes, absolutely. It's a clever idea.

But do I think Microsoft was negligent in not building that?

No, I think that's going too far. Windows already has Safe Mode -- as you note -- to allow for manual recovery, which is what people are using.

I don't think it makes sense for it to be Microsoft's legal responsibility to protect its users from software with a critical bug that wasn't written by Microsoft. Otherwise, where would it end? If a third-party program tries to delete all your user data, is it Microsoft's legal responsibility to check whenever a process is deleting a lot of data, and intervene with a confirmation dialog? Is it Microsoft's responsibility to protect you from all malware and ransomware, no matter how cleverly written? Is it Microsoft's responsibility to constantly cache program state on disk so that when a third-party program crashes, you don't lose your data since your last save?

I think that's going too far, in terms of legal obligation.

grumpyprole
2 replies
20h22m

Microsoft may be negligent in selling a product unsuitable for these applications. Windows is unsuitable precisely because it can be brought down by third party updates, such that it cannot recover without manual intervention by technical experts. Third party vendors are forced into writing unsafe kernel drivers because Microsoft does not provide sufficient user mode APIs.

Windows has a dated design and a security model no longer fit for purpose. As for your other example, it could be protecting users from malicious programs that may delete data, simply by having a better security model, like Android and iOS.

crazygringo
1 replies
19h57m

I don't think Microsoft can be negligent here, because Windows isn't being brought down by Microsoft updates.

Somebody bought Windows, and bought CrowdStrike. CrowdStrike is negligent, and possibly also the person/org who chose to rely on Windows+CrowdStrike without a backup plan if that resulted in further damages to others.

Third party vendors are absolutely not "forced into writing unsafe kernel drivers". They can properly test things to write safer code (which CrowdStrike infamously didn't). And kernel mode is fundamentally required for security software like this, as far as I understand.

And using app-based mobile OS's is not necessarily a useful comparison point. They are limited in all sorts of ways that desktop OS's are not -- and don't you hear people here on HN constantly complaining about that? A better comparison point is macOS and Linux. CrowdStrike also crashed Linux, and macOS still lets you bypass SIP if you want to.

grumpyprole
0 replies
11h42m

Third party vendors are absolutely not "forced into writing unsafe kernel drivers".

And kernel mode is fundamentally required for security software like this, as far as I understand.

These are conflicting points. They cannot both be true.

Uvix
1 replies
21h28m

Those hospitals chose to deploy software that didn't support testing. The blood is on their own hands.

goosejuice
0 replies
2h56m

This is how I feel.

If you're blindly installing software system wide, that has kernel access no less, and not accounting for failure of that software in your risk analysis then you are to blame more so than the vendor.

Certainly I expect some SLA is in place but that's only of monetary benefit and irrelevant to keeping critical infra online.

ziml77
0 replies
20h15m

Imagine I've installed CrowdStrike under the assumption that it makes my system more secure. Why would I want the OS to allow the system to boot up in a less secure state by providing a prompt for that? Most users will just click whichever option gets them back up and running and IT will have no control over that.

sudosysgen
0 replies
19h25m

Windows does do that by default. If it fails to boot, it will start an "Automatic Repair" screen and it will offer to disable drivers (i.e., Safe Mode), or sometimes just disable the driver itself.

The problem is that CrowdStrike doesn't want to let you start the computer without it running. It's the reason why it's an ELAM driver - it's marked as required for boot, so Windows won't try to boot without it, much like it won't if you remove crucial hardware drivers. I guess what they are trying to avoid is malware crashing the kernel driver which then gets disabled letting the malware roam free, without realizing that the cure is worse than the disease.

phendrenad2
0 replies
12h55m

So attackers just have to find a way to trick the drivers into crashing 3 times and they have full access to the system without the pesky security systems in place? Nice!

nerdjon
0 replies
20h7m

This is very much an “easier said than done” situation, and I would think Hacker News of all places would know better when it comes to “just” doing something in code.

First, Windows already does something similar. After 3 failed boots it is supposed to boot into WindowsRE, which gives you options to revert to a previous version, uninstall updates, and I believe also reverts configurations like recent driver installations.

The problem here though, CrowdStrike itself didn’t update. It updated a definition file (last I saw at least) and that likely would not have been caught by Windows as a new version.

Also frankly, not super thrilled at the idea of Windows just deciding to disable/uninstall something except for rolling back (so a previously working config) due to how things could interact. This situation could have been far worse and harder to recover from.

In this case maybe Windows could have noticed that the configuration update is what was causing it and rolled that back, but it’s possible it would have just re-downloaded the file when it started back up anyways.

Regarding saved lives, do we actually know that anyone’s lives were lost due to this? My local hospitals were still performing emergency surgery.

galangalalgol
0 replies
21h21m

I think suing MS for the behavior that ensued when people installed a rootkit directly into the kernel and opened all the ports on their network to let that rootkit get used, is... excessive. Both MS and CS should have had a fail to previous good kernel ability, but the negligence here is clearly with CS for not even trying a blank data file in the automated tests for a piece of safety critical software, and then not using canary deployments before pushing to millions of devices.

Khaine
0 replies
20h30m

AFAIK Windows does do that, except for drivers that are marked as required for boot. CrowdStrike's drivers are marked as required for boot.

janice1999
17 replies
21h59m

At least they're not blaming the European Union in this breakdown (as they did earlier).

strombofulous
6 replies
21h40m

Would this still have happened if the EU had not ruled against Microsoft?

PlutoIsAPlanet
3 replies
21h35m

Microsoft can kick security vendors out of the kernel, but they can't sell a product that uses APIs not accessible to other vendors.

strombofulous
2 replies
21h34m

Sure, but my question still stands - would this have happened if the EU had not made that ruling?

mort96
0 replies
21h32m

Probably

Tuna-Fish
0 replies
21h11m

Yes. There were kernel mode drivers before that ruling, it is essentially entirely irrelevant to this outage.

holsta
0 replies
21h33m

It's not about kernel access, it's about equal access to avoid yet another monopoly.

Microsoft could have come up with a kernel API that their own anti-malware product (and everyone else's) could make use of. They did not.

extraduder_ire
0 replies
20h47m

Probably not, but in more of a butterfly-effect or this product not existing way.

ziml77
4 replies
21h8m

But the blame wasn't misplaced before. People keep saying that macOS does things better by forcing third parties out of the kernel and instead offering APIs to do the same work in userspace. Microsoft tried to do exactly this for security software in Windows, but the EU didn't like that this change meant that any Microsoft-developed solutions would have an advantage over third party ones.

ronsor
1 replies
20h56m

I really, really wish Microsoft would force third parties out of the kernel.

tacticus
0 replies
16h14m

They can. They just have to have the same rules for their products in that space.

tacticus
0 replies
20h2m

Microsoft tried to do exactly this for security software in Windows

Using a monopoly in one industry to capture the market in another industry is what anti monopoly laws are meant to prevent.

Microsoft was prevented because they wanted to retain a commercial business in their security products having special access while locking out everyone else.

Khaine
0 replies
20h29m

No, the EU didn't like MS having their malware protection in kernel while kicking out third parties.

If Defender was also kicked out, it would have been fine, but it wasn't.

whimsicalism
3 replies
21h49m

they’re right though…

DarkNova6
2 replies
21h35m

Yes. Only Microsoft should be allowed to crash their operating system. Like back in the good old days when only MS could use their secret high-performance APIs.

graeme
1 replies
21h19m

Why exactly should security vendors have the ability to crash the operating system?

dmattia
0 replies
21h2m

They shouldn't. Microsoft should have APIs that enable security vendors to work in userspace.

The EU didn't say that Microsoft couldn't kick vendors out of the kernel, just that they couldn't do so without having the APIs available that would let security vendors operate outside the kernel.

Mac and Linux have such APIs, so CrowdStrike operates in user-mode on those platforms, so those platforms do not give security vendors the ability to crash the operating system.

zh3
0 replies
21h56m

Even this is written after multiple reviews by corporate lawyers.

dmattia
13 replies
22h0m

I suppose I was expecting something more authoritative here. They confirm that there was an attempted read-out-of-bounds, as CrowdStrike said, but that's not really new information at this point. I suppose we'll need to wait for more detailed analysis from CrowdStrike at some point.

This post explains why security software has historically run in kernel-mode, and really seems to be pushing new technology that Microsoft has that would push security vendors into user-mode (with APIs that attempt to assist with many of the reasons why they have historically used kernel-mode).

Crowdstrike already runs in user-mode on both Mac and Linux (from what I can tell), and it seems like running in user-mode on Windows would significantly lessen the risk of catastrophic failures like a blue-screen-of-death. I know the bulk of the failures here belong to CrowdStrike, but I can't help but think about the fact that Apple kicked security vendors out of kernel-mode a ways back, and that if Windows had done similarly, an issue like this probably wouldn't have been possible. By even offering kernel-mode options to external vendors, I believe Microsoft is creating risk for themselves.

TillE
3 replies
21h40m

pushing new technology that Microsoft has that would push security vendors into user-mode

This doesn't exist. It's briefly hinted at in their conclusion, but right now it's simply not there.

There is no userspace equivalent of filesystem minifilters, ObRegisterCallbacks, etc.

dmattia
2 replies
21h13m

This is fascinating, thank you for the info! If I am understanding, it would have then been difficult/impossible for CrowdStrike to create a user-mode only sensor without these equivalent APIs.

So I guess I'm not sure I see validity in the claims of those blaming the EU here. It seems as though the EU would have allowed Microsoft to kick users out of kernel-space if they had APIs that allowed making security products in user-space. Like Linux/Mac already appear to have.

extraduder_ire
1 replies
20h52m

I don't think they would have had to provide those APIs in the EU, so long as their own security products were "kicked out" as well. That's kind of complicated to achieve in a permanent and provable way. Though, windows has had support for eBPF for about two years now.

TillE
0 replies
19h45m

Windows eBPF support is experimental and currently provides hooks for packet filtering stuff and nothing else.

I would be delighted if their long-term solution is eBPF which provides full anti-malware hooks, but again it's unfortunately not there yet.

GordonS
2 replies
21h8m

For one thing, being difficult to kill is a huge selling point for EDR - move it to user space and it's a lot easier to kill.

pas
1 replies
19h44m

A kernel-space watchdog (that checks integrity of the image) would be much easier than a filter that updates from the internet.

Sure, the whole thing is definitely a hard problem, but CS fucking up even the most basic QA **and** error handling ... it just shows how ridiculous their whole claim to having super fancy technology is.

__MatrixMan__
0 replies
5h0m

Agreed, but focusing on their QA practices is sort of like criticizing your burglar for not wiping their feet at the window.

whimsicalism
1 replies
21h49m

The EU requires MS to provide kernel-level access to security vendors due to their crazy anti-compete provisions

dmattia
0 replies
20h28m

This seems to be only partially true when I read into it. The EU said that Microsoft would need to move their security tools into user-space (or at least to use the same APIs as are available in user-space). If they did that (like Apple has done), they could kick everyone out of kernel-space if they wanted.

Rinzler89
1 replies
21h54m

> I can't help but think about the fact that Apple kicked security vendors out of kernel-mode a ways back, and that if Windows had done similarly, an issue like this probably wouldn't have been possible

Like others already said, Microsoft already tried to do that with PatchGuard in 2006 with the launch of Windows Vista, and the likes of Symantec and McAfee complained to the EU that this would harm the sales of their products, so the EU told Microsoft to not do it in 2009[1].

Apple has the luxury of a small market share on the desktop PC space to not attract the attention of the regulators, plus a user base that's used to Apple constantly rewriting the OS, deprecating APIs, switching CPU architectures, etc. without giving a fuck about breaking backwards compatibility or cutting off developers access to OS features their products use and getting away with it, luxuries that Microsoft doesn't have.

IMHO, sticking with Windows' default security and not using third-party anti-malware has made Windows vastly more secure and reliable than it was in the days when you'd be looking at installing the likes of Symantec or McAfee for your "protection", which ended up acting like malware after a while, throwing dark patterns at you to milk more subscription fees. So as much as it hurts their sales, it's important for the regulators to understand that security is far more important than the regulations they put on Windows for Internet Explorer and Media Player, and, just like with Apple's App Store, it's sometimes better to let the original product maker handle security and not leave the product open at all points just so some of these bandits can make a living selling security for it. It's like foxes complaining to regulators how chicken wire is a threat to their existence.

[1] https://stratechery.com/2024/crashes-and-competition/

nopcode
0 replies
6h53m

Microsoft sells endpoint security products and it would be unfair if third party solutions couldn't leverage the same APIs, it makes a lot of sense that a regulator steps in. I'm not aware of Apple selling security products or competing with third party security products.

michaelt
0 replies
21h21m

> Crowdstrike already runs in user-mode on both Mac and Linux (from what I can tell),

Crowdstrike provides a Linux kernel module, and expects users to manually install an extra Secure Boot key for it, as part of their corporate laptop setup procedure.

This has always seemed inadvisable to me, but checkbox checkers gotta check checkboxes I guess.

__MatrixMan__
0 replies
21h39m

I agree. Microsoft's core competency has traditionally been backwards compatibility, but if each security vendor can tamper with Windows at the deepest level and is allowed to continue exploring all of the ways that they can leverage that... What you end up with is a fleet of different Windowses, each diverging further with time. It dilutes the benefits brought by investment into the stability of the system because whatever fights are won in one fragment must be refought in others before you can have confidence in the stability of all fragments.

It seems like madness to me.

tonymet
9 replies
20h21m

Did either release from MS or CrowdStrike explain how this crash bypassed QC? I'm still baffled that a 100% repro crash even made it anywhere near the later stages of QC. This is something easily caught by the earliest CI phases, at the developer and at least first build automation phase, let alone human QC.

magicalhippo
4 replies
20h4m

From what I read in the previous thread, their test environment didn't actually test what was deployed.

That is, there was a post-test pre-distribution packaging stage, and that's where the distributed file(s) got f'ed up.

If true that would explain how it got past their testing, but would also be an incredible lack of competence IMHO.

But yeah, curious if there's been some more concrete details there.

tonymet
3 replies
19h47m

I heard something similar: that they deploy content separately from code, but they don't test all of the combinations of code + content. This crash was from "stable" code in the driver mixed with a corrupt or incomplete content file (config, etc.), triggering the null-ptr exception.

Sounds like one of those companies where you get hired and are shocked by the sausage factory you just stepped into

rvnx
1 replies
19h40m

In February they added new code that allows spying on and blocking named pipes.

Named pipes are pipes of communication that processes can use to talk to each other, as an alternative to sockets.

For example Chrome uses them between the user interface and the actual page renderer.

In March they tested it in staging, said it was fine, pushed to prod with few rules in April, still looked fine.

In July they added a new rule, which was deployed to 100% immediately, since from their perspective a new entry in a database definition needs neither testing nor a canary deploy

(which is still irresponsible, because bad rules could cause damage as well like any security/antivirus software, even if the parser didn't crash, but it could have blocked legitimate actions or files)
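For contrast, a canary-style rollout for content updates could look roughly like the sketch below. This is an illustration under assumptions, not CrowdStrike's pipeline: `staged_rollout`, the stage percentages, and the `deploy` / `is_healthy` callbacks are all hypothetical.

```python
def staged_rollout(hosts, deploy, is_healthy,
                   stages_percent=(1, 10, 50, 100), max_failure_rate=0.01):
    """Deploy to growing fractions of hosts; stop if a stage looks unhealthy.

    Returns (completed, hosts_updated).
    """
    updated = 0
    for percent in stages_percent:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, len(hosts) * percent // 100)
        batch = hosts[updated:target]
        for host in batch:
            deploy(host)
        updated = max(updated, target)
        failures = sum(1 for host in batch if not is_healthy(host))
        if batch and failures / len(batch) > max_failure_rate:
            return False, updated  # halt: the blast radius stays at this stage
    return True, updated


hosts = [f"host-{i}" for i in range(1000)]

# A bad update that crashes every host it touches is stopped after the canary.
print(staged_rollout(hosts, deploy=lambda h: None, is_healthy=lambda h: False))
# (False, 10)
```

With a 100%-reproducible crash, even the smallest first stage would have halted the rollout at a handful of machines.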

tonymet
0 replies
15h49m

great summary thanks for the details. I hope more companies see this and consider adding more test diagnostics

magicalhippo
0 replies
14h8m

Right, so it seems two egregious errors: no (or highly lacking) fuzzing of kernel modules accepting arbitrary input, and no testing of configuration changes given they're ingested by a kernel module.

pas
2 replies
19h41m

lack of fuzzing for their "parser + updater"

MBCook
1 replies
15h54m

Plus parsing unknown files (as in not validated to be properly formatted) in kernel space is just asking for a crash.
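As a rough illustration of both points (validate before you index, then fuzz the validator), here is a small sketch. The file layout, the `CHNL` magic and every name in it are made up for the example; it is not CrowdStrike's real format.

```python
import random
import struct

MAGIC = b"CHNL"                 # hypothetical magic, not the real file format
HEADER = struct.Struct("<4sH")  # magic, then number of 32-bit entries


class InvalidContent(ValueError):
    """Raised when a content file fails validation."""


def parse_content(blob: bytes) -> list:
    """Parse a content blob, checking every length before reading."""
    if len(blob) < HEADER.size:
        raise InvalidContent("truncated header")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise InvalidContent("bad magic")
    expected = HEADER.size + 4 * count
    if len(blob) != expected:
        raise InvalidContent(f"expected {expected} bytes, got {len(blob)}")
    return list(struct.unpack_from(f"<{count}I", blob, HEADER.size))


def fuzz(iterations=10_000, seed=0):
    """Mutate a valid file; anything other than InvalidContent is a bug."""
    rng = random.Random(seed)
    valid = HEADER.pack(MAGIC, 3) + struct.pack("<3I", 1, 2, 3)
    for _ in range(iterations):
        blob = bytearray(valid)
        for _ in range(rng.randint(1, 4)):
            blob[rng.randrange(len(blob))] = rng.randrange(256)
        mutated = bytes(blob[: rng.randint(0, len(blob))])  # maybe truncate
        try:
            parse_content(mutated)
        except InvalidContent:
            pass  # rejected cleanly, which is the desired outcome


fuzz()
```

In user space a missed check shows up as an exception; in kernel space the same mistake is an out-of-bounds read and a bugcheck, which is why the validation has to come first.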

duped
0 replies
5h45m

My understanding is that they ship precompiled, templated scripts. The "content" updates fill in these templates with configured values. They test the templated scripts, and they validate the content, but they don't validate the content bound to the script. The garbage content was apparently valid but its behavior when used was not.

Their language for describing this design is obtuse and confusing.
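A toy version of that gap, as I understand it, is sketched below. Everything here is hypothetical (the template text, the parameter names, the validation rule); it only shows how content can pass a content-only check and still fail once bound to its template.

```python
from string import Template

# Hypothetical template: it expects exactly these three parameters.
RULE_TEMPLATE = Template("match pipe=$pipe_name action=$action severity=$severity")


def validate_content(content: dict) -> bool:
    """Content-only check: every value is a non-empty string."""
    return all(isinstance(v, str) and v for v in content.values())


def bind(template: Template, content: dict) -> str:
    """Bind content to the template; raises KeyError if a parameter is missing."""
    return template.substitute(content)


content = {"pipe_name": "example", "action": "block"}  # 'severity' is missing

print(validate_content(content))  # True: the content looks fine on its own
try:
    bind(RULE_TEMPLATE, content)
except KeyError as missing:
    print("binding failed, missing parameter:", missing)
```

The test that matters is the one that exercises `bind` with the exact content being shipped, not the template and the content separately.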

jacobgorm
8 replies
22h5m

I used to work on Control Flow Integrity (CFI/XFI) research at places like MSR Silicon Valley and VMware, as far back as 2006. Back then, sandboxing a kernel module like ramdisk.sys was doable with a lot of binary rewriting magic, and later with custom LLVM passes, but nowadays it should be a simple matter of compiling the code with clang and the appropriate flags, to completely rule out this type of memory safety error, turning a BSOD into a polite log message and disabling the faulty driver.

torginus
3 replies
21h55m

from what I understand, CrowdStrike has essentially put a Turing-complete interpreter for their scripting language into the kernel. I doubt you can do much when something is that general purpose.

capitainenemo
0 replies
21h45m

Do you have more information on that? Hadn't read anything about the CS kernel module running arbitrary code. Was it a factor in the crash?

'course, Microsoft also put turing complete scripting in ring 0 years ago for performance reasons (TTFs - XML/HTML parsing and GUI rendering too - to beat other OSes apparently) and that certainly did lead to exploited vulnerabilities...

https://googleprojectzero.blogspot.com/2016/07/a-year-of-win... https://gist.github.com/Nevor/ed3719dad0cf66893e42a9ba024c91... https://learn.microsoft.com/en-us/security-updates/securityb... https://www.fortinet.com/blog/threat-research/one-bit-to-rul... https://learn.microsoft.com/en-us/security-updates/SecurityA... https://news.ycombinator.com/item?id=9769099 (this comment in particular https://news.ycombinator.com/item?id=9783863)

pcwalton
3 replies
21h59m

I mean, this is basically what eBPF accomplishes in Linux.

capitainenemo
0 replies
18h27m

... oh. and the article by Brendan Gregg in the HN link above had the telling phrase: "Once Microsoft's eBPF support for Windows becomes production-ready"

akira2501
7 replies
21h59m

where security and availability are non-negotiable.

Yep. You just have to pretend that everyone who deployed Windows had an actual competitive choice available to them.

A second benefit of loading into kernel mode is tamper resistance.

I guess availability is negotiable after all.

qsdf38100
6 replies
21h47m

Yep. You just have to pretend that everyone who deployed Windows had an actual competitive choice available to them.

Could you elaborate? How is that related to security and availability being non negotiable?

akira2501
5 replies
21h27m

Microsoft's statement implies that people choose Windows because of its security and availability. Whereas most people end up with Windows because the software they want to run only operates on that single platform.

The security and availability, to the extent they even exist, are clearly not part of the market's decision making process.

jojobas
2 replies
17h28m

The critical infrastructure that people actually cared about (ATC for example) had all the choice in the world. So did people designing bespoke POS systems. To rephrase the old IBM trope, "nobody got fired for choosing Windows".

akira2501
1 replies
13h49m

critical infrastructure that people actually cared about

Did hospitals have a choice in which OS their MRI machine runs? Are those not "critical?" Or should we just not "actually care?"

ATC for example

Were they impacted by this outage? Isn't the reason flights were canceled that the carriers' systems were the ones that had issues?

So did people designing bespoke POS systems.

Really? I would assume most of that is down to the hardware, like cash drawers, credit card readers and order printers, which is likely third party and proprietary, and is only available and supported on Windows. Do you have evidence otherwise?

"nobody got fired for choosing Windows".

You've recognized the same outcome I have but have gone to great lengths in an attempt to obscure the reasons for it happening. Why?

jojobas
0 replies
13h33m

There are Unix-based MRI machines. When buying one, I would imagine the OS is rarely a consideration.

There were reports of airports disallowing landings of incoming planes due to controllers being unable to provide separation.

There are card readers available for Linux, it has been long standardized. Cash drawer is a single solenoid.

Major supermarket chains order and supervise delivery of custom POS solutions integrated the way they want from companies like NCR.

NCR offers Windows, Android and Linux POS bases, but supermarkets tend to choose Windows.

manquer
1 replies
13h11m

only operates on that single platform

Nope, they aren't ready to pay for another platform; that is their choice. If customers paid for Linux or Mac support there is no shortage of developers ready to cater to that.

Unwilling to pay for multiple platforms is still a choice

Dylan16807
0 replies
10h5m

If customers paid for linux or mac support there is no shortage of developers ready to cater to that.

If enough of them do. A single customer can't convince companies to add that support by paying a reasonable amount.

Since there's no collective bargaining happening, this does not show a lack of willingness to pay.

Also, you're ignoring all the software that is already paid for and often doesn't have developers any more.

waterTanuki
5 replies
15h43m

I am still to this day gobsmacked how a company the size of Microsoft doesn't do all of its security in-house like Apple, which locked down kernel access to macOS some time ago. The blame is mostly on CrowdStrike, but Microsoft does share responsibility in allowing third parties to pepper the kernel with whatever code they want to.

yawaramin
2 replies
13h5m

The EU didn't make you click 'Accept cookies' on every website, websites decided to interpret EU regulations as 'Let's make it super annoying for users to not accept, so they accept out of laziness'.

breadwinner
1 replies
5h9m

websites decided to interpret EU regulations..

Does it matter? Ultimately, the impact to the citizenry is what matters.

ParetoOptimal
0 replies
3h9m

It does matter because you spin things to make it sound like the EU is anti-consumer when they are frequently the ones improving consumer protections and privacy.

motohagiography
0 replies
15h6m

the trade off is either you make your own hardware, or make exceptions to integrate with other OEMs, which causes the fragmentation problem that security vendors exist to provide solutions for. Apple doesn't have that fragmentation problem, but they also don't have as rich an enterprise ecosystem. Android is the definition of a fragmentation problem, where instead of resolving it, Google manages it and calls it an ecosystem, and they're right.

Fragmentation makes consistent governance/security impossible, but the heterogeneity also limits the scope of incidents. Apple balances its greater monoculture risk with deep control of the underlying hardware, to where as a user or developer you're only ever interacting with a very, very high level abstraction.

superposeur
5 replies
20h30m

I’m surprised no one has yet noted that Microsoft itself is a chief CrowdStrike competitor.

tonymet
4 replies
20h22m

i thought crowdstrike provided features that go beyond windows defender. is there another MS product that competes?

ryukoposting
1 replies
16h16m

Interesting that they bill Defender as "requiring frequent OS updates," alleging that their solution is somehow better in that regard. Are they suggesting that by installing CrowdStrike, you don't need to update Windows anymore? It really reads like that.

aaronmdjones
0 replies
9h36m

That's hilarious considering that it was a frequent CrowdStrike update that resulted in this chaos.

abhinavk
0 replies
14h59m

There is a paid version called Microsoft Defender for Endpoint.

Animats
3 replies
17h54m

So how did this kernel level driver get through WHQL verification? The Static Driver Verifier should have caught this.[1] Do some security vendors get to bypass that? Microsoft is very quiet about that.

That's the sort of thing a negligence lawyer focuses on. Partner at Brown Rudnick: "The most likely legal theory will be one of negligence. [Congress] will drag the guy over the coals, they'll maybe implicate him and his company and put in place a negligence action. There'll maybe be a couple of plaintiffs lawyers who dig up some exceptional theory on negligence, and get some class action lawsuits going. Again, we still don't know all the facts in this case, and there are other dimensions which have not yet been fully explored, including how CrowdStrike had access to kernel level updates on the Microsoft operating system? How come Microsoft didn't have any control over these updates being pushed on their kernel?"[2]

The first two class actions are already starting.

[1] https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

[2] https://www.channele2e.com/analysis/crowdstrike-legal-and-li...

whyever
0 replies
1h24m

Not all potential null dereferences are covered by the verifier, they even give an example where the rule is not triggered, but null may be dereferenced by the code.

ldjkfkdsjnv
2 replies
22h8m

The true story, I bet, is that some major divisions of CrowdStrike are run by non-technical people who got there through non-meritocratic means. There have generally been no repercussions for their underperformance, much like Boeing. CrowdStrike's business is built on relationships, not technical supremacy. And bada bing bada boom, we have a complete failure of basic technical competency (no rigorous rollout process).

Paianni
1 replies
21h56m

All businesses are built on relationships; technical competency can be, but doesn't have to be, a means to that end.

Wytwwww
0 replies
21h53m

technical competency

In a more fair world (that also valued economic productivity/growth more) companies which completely ignore that wouldn't survive, though.

WalterBright
2 replies
12h35m

What I heard is that CrowdStrike normally rate limits pushing a fix. This is so that if the fix is bad, the damage is limited. But for some reason, the rate limiter was turned off and the update went out to everyone.

self_awareness
1 replies
11h14m

What I heard is that CrowdStrike normally rate limits pushing a fix. This is so that if the fix is bad, the damage is limited. But for some reason, the rate limiter was turned off and the update went out to everyone.

Is this true though? They've released a Post Incident article:

https://www.crowdstrike.com/blog/falcon-content-update-preli...

in which they state:

How Do We Prevent This From Happening Again? Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.

So it seems, if I understand this correctly, they've just implemented the rate limiter as a response to this incident.
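For illustration only, a staggered deployment with a canary stage could look roughly like the sketch below. This is not CrowdStrike's actual mechanism: the function name `staged_rollout`, the stage fractions, and the failure threshold are all hypothetical, and `deploy` stands in for whatever pushes the update to one host and reports whether it came back healthy.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # assumed values
    max_failure_rate: float = 0.02,  # assumed threshold
) -> List[str]:
    """Deploy to growing slices of hosts; stop if a stage fails too often.

    Returns the hosts that were updated successfully.
    """
    updated: List[str] = []
    done = 0  # number of hosts already attempted
    for fraction in stage_fractions:
        # Cumulative target for this stage; always at least one new host.
        target = min(len(hosts), max(done + 1, int(len(hosts) * fraction)))
        batch = hosts[done:target]
        if not batch:
            break
        failures = 0
        for host in batch:
            if deploy(host):
                updated.append(host)
            else:
                failures += 1
        done = target
        if failures / len(batch) > max_failure_rate:
            # Halt: the remaining hosts never receive the bad update.
            break
    return updated
```

With 1000 hosts and the default fractions, the first (canary) stage covers 10 hosts. If every one of them fails, the rollout stops there and the other 990 are never touched, which is the damage-limiting behavior both versions of the story describe.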

WalterBright
0 replies
2h33m

Either version might be true!

userbinator
1 replies
20h40m

I'm going to be the controversial one here and say that, as bad as CrowdStrike was, the alternative of having only Microsoft be able to decide what people can do is far worse. I've already seen many others trying to use this incident to advocate for digital totalitarianism.

scarface_74
0 replies
16h32m

Microsoft as the OS vendor will always be a potential source of updates that crash computers. Now with a third party, you’re adding another level of risk.

someonehere
1 replies
17h52m

Unless actually required by your org, choose the N-1 policy in CS to avoid snafus like this in the future. It’s in the console, so use it.

zh3
0 replies
21h46m

I do have to wonder how many agonising layers of review this went through with the marketing and legal departments as part of shifting the blame.

If you want to decide which OS/distros to avoid for critical stuff, look to see who's learning from the incident (even if not bitten by it) compared to those saying "it wasn't our fault" (and that's not just MS).

squirrel
0 replies
19h30m

Telling that there’s no mention of eBPF, which is standard on Linux and available on Windows, but hasn’t been brought into the main Windows OS. Static analysis might or might not have caught the Blue Friday bug, but it certainly increases the protection level over the current do-as-you-wish model for kernel modules.

sammyteee
0 replies
19h56m

I stopped reading after "Windows is an open and flexible platform"

rldjbpin
0 replies
9h47m

one thing from this whole fiasco that i wish had come up in the conversation is the fact that (crucial/market-dominant) digital/IT services don't have the same level of liability as mundane, physical goods.

a simple plastic covering of your new dyson has more legal scrutiny and action (see the "children may choke" warnings they all need to come with) than software that we otherwise block in the name of "national security".

given how overvalued tech companies are in this region, i believe it is high time to start legally recognizing the real-life impact of digital tech. to hell with the "but muh innovation" argument.

eqvinox
0 replies
19h7m

Move tool-tip APIs from kernel to user mode

?!?!

aurelien
0 replies
11h11m

You use a sloppily made distribution for secretaries and gamers, and you blindly try to explain where the problem is.

You are the clowns of the world, that's all ... xD

EasyMark
0 replies
20h50m

Oh I like this breakdown a lot. Fairly technical, links to resources used, flow of debug process, didn’t get lost in the weeds of details and how clever they were. I wish more debug retrospectives were like this. It seems like you usually end up with either 100 pages of analysis or a couple of vague paragraphs.

DeathMetal3000
0 replies
20h0m

“Windows has announced a commitment around the Rust programming language as part of Microsoft’s Secure Future Initiative (SFI) and has recently expanded the Windows kernel to support Rust.”