HN comments for: Microsoft technical breakdown of CrowdStrike incident

rdtsc

207 replies

21h58m

2024-07-28 20:31:58 UTC

We plan to work with the anti-malware ecosystem to take advantage of these integrated features to modernize their approach, helping to support and even increase security along with reliability.

Providing safe rollout guidance, best practices, and technologies to make it safer to perform updates to security products.

Reducing the need for kernel drivers to access important security data.

They are being as diplomatic as they can, but it's definitely a slap to CS. Read as "they don't know how to roll things out, they need guidance on basic QA practices, we'll happily teach them...". Then, they list a set of facilities running in user-mode to avoid needing to run as many things in kernel mode.

I would be interested to know what the water cooler discussion about CS was like inside Microsoft. Especially in teams that needed to respond to customers saying "Your windows OS is broken, our hospital patients are suffering...".
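To make the "safe rollout guidance" quoted above concrete, here is a minimal sketch of a staged (ring-based) rollout gate. It is an illustration only, not a description of Microsoft's or CrowdStrike's actual process: the `Ring` type, the ring names, the thresholds, and the `crash_rate_for` callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Ring:
    name: str              # hypothetical ring label, e.g. "internal" or "canary"
    max_crash_rate: float  # halt the rollout if the observed rate exceeds this


def staged_rollout(
    rings: List[Ring],
    crash_rate_for: Callable[[Ring], float],
) -> Tuple[List[str], Optional[str]]:
    """Deploy ring by ring and stop at the first unhealthy ring.

    crash_rate_for is an assumed telemetry hook: it deploys to the ring,
    waits, and returns the fraction of hosts that crashed.
    Returns (rings completed, ring that halted the rollout or None).
    """
    completed: List[str] = []
    for ring in rings:
        observed = crash_rate_for(ring)
        if observed > ring.max_crash_rate:
            # Later (larger) rings never receive the update.
            return completed, ring.name
        completed.append(ring.name)
    return completed, None


# Hypothetical rings, smallest blast radius first.
RINGS = [Ring("internal", 0.0), Ring("canary", 0.001), Ring("broad", 0.001)]
```

With an update that crashes every host, `staged_rollout(RINGS, lambda ring: 1.0)` halts at the first ring, so only the vendor's own machines are affected.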

f001

68 replies

20h26m

2024-07-28 22:04:00 UTC

I can tell you they’re quite unhappy about it. Have a friend working there who frustratedly says it wasn’t their fault every time it comes up. Which is quite often and at every social occasion since.

fishywang

65 replies

19h48m

2024-07-28 22:42:53 UTC

but it's kind of their fault? they designed the api that way, they decided what can be done in userland and what must be done via kernel. they at least _allowed_ it to happen every time.

freeopinion

19 replies

15h35m

2024-07-29 02:55:45 UTC

When a parking valet takes a car on a joy ride and crashes into a tree, we could blame the tree. We could blame the car owner for handing over the key. We could blame the auto manufacturer that didn't provide a "valet mode". We could blame the police for not detecting the joy ride before the crash.

All of these parties could do better (stupid tree!). But the real problem is the valet.

We can say that it is obvious that the electronics-heavy cars of today should anticipate rogue valets and build in protections. But we shouldn't let rogue valets off the hook for damages.

As a consumer, you could choose to only purchase cars that have "valet mode". So should we blame consumers who don't? If so, we should blame the airlines, hospitals, etc.--not Microsoft.

How about we prosecute valets unless they refuse to park cars that don't have "valet mode"?

Proziam

14 replies

15h1m

2024-07-29 03:29:17 UTC

You could also prosecute the establishment that keeps a valet with an abominable record on staff.

Microsoft took no steps to force-eject them from their ecosystem, despite their long history of issues.

freeopinion

7 replies

13h34m

2024-07-29 04:56:25 UTC

Just to be clear within the analogy: are you expecting the auto manufacturers to "force-eject" any hotel on Park Ave that has a record of valet mishaps? Or did you mean individual cars should force-eject the valet?

If a Caesars Entertainment property in Macao has enough incidents, should GM update the firmware on their automobiles to force-eject valets at Caesars Entertainment properties in Las Vegas?

Now imagine that GM actually operates valet services in Macao and Las Vegas. Should they be allowed to force-eject valets from competing services?

I am not a Microsoft apologist. I think they should do better. I think Linux and FreeBSD should do better. I personally avoid Microsoft products. But I place more blame on people who use MS products than I do on MS. After all, I never intend to hand my beat up old Corolla over to a valet so why should I have to pay for a "valet mode" feature that Toyota is forced to build into all their cars? Isn't it reasonable that motorcycles, 18-passenger vans, and scooters don't need "valet mode"?

In my book, the auto manufacturer is lower on the list of culprits than the valet, "the establishment that keeps a valet with an abominable record on staff", and the vehicle owner. But some place like Car and Driver could definitely prioritize encouraging GM or Toyota to develop valet modes over berating owners; so I don't mind a place like HN shooting a few arrows at MS. Unless the general public follows their lead and lets bad guys off the hook by shifting too much focus to somebody lower on the list.

mejutoco

5 replies

11h11m

2024-07-29 07:19:07 UTC

Just to be clear within the analogy: are you expecting the auto manufacturers to "force-eject" any hotel on Park Ave that has a record of valet mishaps? Or did you mean individual cars should force-eject the valet?

Not OP, but I think the analogy here is the hotel "force-ejecting" (firing) the valet with a history of doing joy rides. That seems very reasonable.

lucianbr

3 replies

10h39m

2024-07-29 07:51:29 UTC

In the analogy, it seems Microsoft is a car manufacturer. The hotel is the company that bought software from CrowdStrike. The problem is that Microsoft should not control who has access to which APIs, that is a huge can of worms, and actually called anticompetitive by the EU from what I understand. At MS level, either they publish APIs or not. If published, anyone should be able to write software for them. This is especially bad if MS themselves also sell security software that uses the same APIs. It would literally mean MS deciding who is allowed to compete with their security software.

mejutoco

2 replies

10h7m

2024-07-29 08:23:33 UTC

I think it works better (please allow me to change it) if Microsoft is the hotel. Crowdstrike is the restaurant inside the hotel. The restaurant is serving poisoned food to the guests, who assume it is a decent restaurant because it is in their hotel.

Also the restaurant has their own entrance without security and questionable people are entering regularly, and they are sneaking into the hotel rooms and stealing some items, breaking the elevator.

At the same time, the hotel is in a litigation process with the restaurants association, because in the past they did not allow any restaurant on their premises. The guests, naturally, do not care about this, since their valuables have been stolen, and they have food poisoning. The reputation of the hotel is tarnished.

PretzelPirate

1 replies

5h19m

2024-07-29 13:10:56 UTC

if Microsoft is the hotel

I don't think this works since Microsoft isn't the hotel. The hotel in your example chooses which restaurants are inside, but Microsoft doesn't. In this example, Microsoft is the builder who built the hotel building for a 3rd party. That 3rd party decides which restaurants it wants to partner with, as well as any other rules about what goes on in the building.

If the builder came around and made changes to ban the 3rd party's restaurant partner, that would cause a ton of issues and maybe get the builder sued.

Microsoft can't decide what can and can't run on their platform - the most they can do is offer certification which can't catch everything, as we just saw with Crowdstrike since they decided to take a shortcut with how they ship updates. Microsoft also had to allow for equal API access so they don't get sued by the EU.

mejutoco

0 replies

1h14m

2024-07-29 17:16:23 UTC

Operating system (hotel) decides which programs run in kernel mode (Crowdstrike) but ok. Let me address the other point.

Again the reasoning of allowing equal API access to avoid getting sued is a false dichotomy: Microsoft could choose to make an OS that would not need such mechanisms to be simply usable.

They could also remove their own crowdstrike-alike offering, so that it would not be considered anti-competitive. They could also choose not to operate in EU. Of course, that would lower their profits, which is the real motive here.

Once you sum it up the reasoning goes: hospitals/flights can stop working because a company cannot lower its profits, and said company is not to blame at all. It is clearly false; the rest is sophistry and back-bending arguments, IMO.

Proziam

0 replies

3h38m

2024-07-29 14:52:51 UTC

This is the correct interpretation. I am surprised that people took it in different directions.

Proziam

0 replies

3h23m

2024-07-29 15:07:38 UTC

I'm expecting restaurant owners to fire bad valets.

Or in Microsoft's case, to use regulatory, social, or software means to prevent Crowdstrike from causing harm to their customers.

I'm aware it's a sticky regulatory situation, but CS has a history of these failings and the potential damage could be severe. Despite this, no effort (that I am aware of) was made by Microsoft to inform customers that Crowdstrike introduced potential risks, nor to inform regulators, nor to remove the APIs CS depends on.

I don't believe Microsoft is solely responsible, but I do believe that throwing all of the blame for the very real harm that was caused onto CS alone is missing a piece of the puzzle.

Last aside, every large corp has team(s) focused on risk. There's approximately zero chance they didn't discuss CS at some point. The only way this would not have happened is negligence.

rk06

1 replies

13h48m

2024-07-29 04:41:57 UTC

Can Microsoft legally ban a competitor for perceived incompetence? I doubt it, particularly seeing how much competence is shown with Windows and MS Teams software.

sim7c00

0 replies

10h40m

2024-07-29 07:49:59 UTC

Microsoft assigns driver levels to these vendors and allows them to load kernel-mode components as protected. If it did not allow that, CS could not cause such damage. Of course, as you pointed out, this would then turn into some lawsuit blaming MS for killing competitors, even if they did it to try and protect their customers.

wonderful world.

cratermoon

1 replies

13h11m

2024-07-29 05:18:56 UTC

Back in 2006 Microsoft tried to keep 3rd party vendors out of their ecosystem. <https://arstechnica.com/information-technology/2006/10/7998/> As a result of a complaint to the EU Microsoft was required to let them have kernel access. <https://www.theregister.com/2024/07/22/windows_crowdstrike_k...>

Dylan16807

0 replies

10h14m

2024-07-29 08:16:39 UTC

Microsoft was required to let them have the same access their own software used. Which seems fair to me. Microsoft can remove those APIs entirely, they just can't restrict them.

seanmcdirmid

0 replies

11h31m

2024-07-29 06:59:29 UTC

Microsoft took no steps to force-eject them from their ecosystem, despite their long history of issues.

I’m pretty sure anti trust law doesn’t allow Microsoft to go anywhere near that kind of action, even if they wanted to be more Apple like.

Ekaros

0 replies

9h58m

2024-07-29 08:32:28 UTC

The problem is that the establishment here is, well, the establishment: the state itself, or at least one of them. Somehow MS is in a position where it will be prosecuted for any slight antitrust issue. Our system is set up to allow these actors in...

naasking

2 replies

8h48m

2024-07-29 09:42:05 UTC

All of these parties could do better (stupid tree!). But the real problem is the valet.

No, the operating system is supposed to provide secure access to hardware and isolate independent subsystems so they can't interfere with each other. That's its whole purpose for existing. The fact that people feel they need to deploy CS is a Microsoft failure. Windows is just not a secure OS.

mynameisvlad

0 replies

2h13m

2024-07-29 16:17:22 UTC

You’re shifting practically the entirety of the blame to a company that at best was an accomplice to the issue.

I get that you hate Microsoft, but not everything is their fault and it’s disingenuous to pretend otherwise.

The fact that people feel they need to deploy CS is a Microsoft failure.

CS is also available and widely deployed on Mac and Linux. Is that a failure of Apple and all the distros? It literally took down Debian and Red Hat systems earlier this year, is that also not CS’s fault?

kasabali

0 replies

5h30m

2024-07-29 13:00:47 UTC

The fact that people feel they need to deploy CS is a Microsoft failure

They don't need to deploy shit. The only reason it's deployed is because it's a whole racket.

goosejuice

0 replies

2h49m

2024-07-29 15:41:51 UTC

You could also choose to park the car yourself or plan for a secondary mode of transportation if something happened to your car.

Not the best analogy. The organization who deploys said software is responsible for the uptime of their systems. They didn't have to use CrowdStrike and if they do they should have a plan in the event of failure.

skissane

12 replies

19h21m

2024-07-28 23:09:50 UTC

they designed the api that way, they decided what can be done in userland and what must be done via kernel

They didn’t have much of a choice - it is very hard to get adequate performance with real-time filesystem filtering without doing it in kernel mode. Not aware of any other mainstream OS which succeeds at that.

And they kind of had to provide this feature, since they’ve supported it since forever (antivirus vendors were already doing it back in the days of MS-DOS and Windows 3.x/9x/Me), and there is a lot of market demand for it. It is easy for Linux to say “no” when it never has had support for it (in official kernels)

But, as the blog post points out, it sounds like CrowdStrike is doing a lot of stuff in kernel mode that could be done in user mode instead - whether due to laziness or lack of investment or lack of sophistication of their product architects

they at least _allowed_ it to happen every time

Microsoft, in allowing third party code to be loaded into their kernel, is no different from other major OS kernels, such as Linux or Apple XNU.

Apple is (increasingly) the most restrictive about this, and a lot of people criticise them for it.

Even Linux imposes some restrictions (which kernel symbols to export, at all or as GPL-only), although of course being open source, you can circumvent all restrictions by changing the code and recompiling

fsociety

11 replies

18h52m

2024-07-28 23:38:05 UTC

Mac and Linux run EDRs in userspace without an issue. No one here has an excuse or no choice.

dralley

9 replies

18h49m

2024-07-28 23:41:51 UTC

Linux these days tends to use eBPF which isn't really in userspace per-se.

djbusby

8 replies

18h43m

2024-07-28 23:46:56 UTC

eBPF is like the Twilight Zone. I'm in kernel space but, I'm not.

speed_spread

3 replies

17h43m

2024-07-29 00:47:08 UTC

eBPF is Linux denying the fact that it's turning into a microkernel and that Linus was wrong.

markmark

2 replies

11h59m

2024-07-29 06:31:53 UTC

If you're right for 30 years in tech you're right, even if things eventually change.

skissane

1 replies

11h20m

2024-07-29 07:10:00 UTC

The famous Tanenbaum-Torvalds debate happened all the way back in 1992. At the time, the most common microkernel was Mach, which had significant performance problems. NeXT/Apple solved them by transforming Mach into a monolithic kernel, making Mach (as XNU) one of the most popular kernels in the world today (powering iPhones, iPads, Macs, etc). But that doesn’t help Tanenbaum’s side of the argument. And I don’t believe his own Minix did much better than Mach did.

Whereas, from what I hear, L4 and its derivatives have solved this problem in a way that Mach/Minix/etc could not. Yet still, it makes me wonder, if L4 has really solved it, why aren’t we all running L4? L4 has had some success in embedded applications (such as mobile basebands, Apple Secure Enclave); but as a general purpose operating system has never really taken off.

sidewndr46

0 replies

4h46m

2024-07-29 13:44:36 UTC

from what I understand a huge number of computers run Minix, but only in the Intel Management Engine

LtWorf

3 replies

17h19m

2024-07-29 01:11:31 UTC

Well, CrowdStrike crashed a kernel with it

skissane

0 replies

17h11m

2024-07-29 01:19:13 UTC

Apparently that wasn't (entirely) CrowdStrike's fault: https://news.ycombinator.com/item?id=41030352

Whereas this Windows outage rather obviously was.

eBPF being able to crash the kernel is usually sign of a kernel bug. And it sounds like in this case it was even a bug specific to Red Hat kernels, introduced by a Red Hat patch.

That said, even if they are triggering a Red Hat kernel bug, CrowdStrike should be testing their software adequately enough to pick up that issue before customers do – and it sounds like they haven't been

pclmulqdq

0 replies

16h55m

2024-07-29 01:35:31 UTC

That was more of a kernel bug than a crowdstrike bug. However, it's clear that they are pushing what you can do in kernel space to the limits, which is not a great sign.

IsTom

0 replies

4h36m

2024-07-29 13:54:41 UTC

Isn't being able to crash anything with eBPF a bug in either the kernel or eBPF? As I understand it, it's supposed to prevent exactly that.

feyman_r

0 replies

16h48m

2024-07-29 01:42:12 UTC

Can you re-read the list (source: Wikipedia) in one of the comments in the tree? It had Debian and Red Hat issues listed on different dates.

lozenge

11 replies

19h43m

2024-07-28 22:47:49 UTC

You can't just let people do anything from userland, the performance would tank. As for restricting kernelland, EU competition regulators would not be happy if MS was the only one able to write anti virus software that runs in kernelland.

ahepp

5 replies

19h9m

2024-07-28 23:21:16 UTC

You can't just let people do anything from userland, the performance would tank

Isn't the point of userland that you can (try to) do anything from there?

It seems like MacOS and Linux provide substantially safer alternatives that are still performant?

As for restricting kernelland, EU competition regulators would not be happy

I keep seeing people say this. Is there a basis for that assertion, or is that mere speculation? Again, hasn't MacOS already deprecated kexts?

intern4tional

2 replies

18h11m

2024-07-29 00:19:29 UTC

There is basis for that assertion.

Via Google: https://www.techtarget.com/searchsecurity/news/450420491/Mic...

(Also via myself, as I was at MS when we wanted to make this change and the EU said no.)

philistine

1 replies

16h25m

2024-07-29 02:05:01 UTC

Well Microsoft did not publicly commit to using the same APIs, and no privileged access, for its own antivirus products. That's why the EU said no way; not because kernel access was revoked.

guiriduro

0 replies

10h12m

2024-07-29 08:18:07 UTC

Yes, but then of course Microsoft is being obligated to open part of kernelspace to competitors, which is arguably "OK" from a competitive regulation perspective, but that then places a special burden on competitors to maintain code hygiene given the potential for crashes. It makes CrowdStrike's negligence all the more unacceptable.

pjmlp

0 replies

10h31m

2024-07-29 07:59:20 UTC

MacOS still keeps the kexts support around, even if the long term roadmap is to move everything into userspace.

112233

0 replies

16h19m

2024-07-29 02:11:06 UTC

What are the Linux alternatives you are talking about?

justinclift

4 replies

19h31m

2024-07-28 22:59:00 UTC

[flagged]

throwaway237289

1 replies

19h11m

2024-07-28 23:19:33 UTC

[flagged]

dang

0 replies

50m

2024-07-29 17:40:01 UTC

Please don't respond to a bad comment by breaking the site guidelines yourself. That only makes things worse.

(Your comment would be fine without that first bit.)

https://news.ycombinator.com/newsguidelines.html

hilbert42

0 replies

14h36m

2024-07-29 03:54:12 UTC

There are ways around this that I've discussed elsewhere so I won't repeat them here.

However, think of it this way: Windows restarts, tries to load with new patch and crashes.

Question: why can't Windows be designed so that on crash it automatically restarts and loads the previous state sans patch?

Answer: Windows could be designed that way but it would require Microsoft to do many things it doesn't want to do. Some of which would require Microsoft to go back to the beginning and reengineer quarter-century or more old code from scratch, that means redesigning APIs and the underlying architecture from first principles.

Why doesn't Microsoft want to do this? It's obvious so I won't bother to spell it out.

Nevertheless, when the dust fully settles and someone outlines these alternative design strategies in great detail then it'll be obvious to everyone what a fragile stack of cards Windows has been constructed on.

dang

0 replies

51m

2024-07-29 17:39:29 UTC

Please don't post in the flamewar style to HN, such as you did here and downthread (https://news.ycombinator.com/item?id=41096774). It's not what this site is for, and destroys what it is for.

If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.

nilamo

8 replies

19h8m

2024-07-28 23:22:24 UTC

Your car _allows_ you to drive off a cliff. If you do so, it is your fault, not the fault of the car manufacturer.

Kind of weird that anyone is blaming Microsoft for any part of this, imo

wokwokwok

3 replies

18h56m

2024-07-28 23:34:44 UTC

Mmm… meaningless analogies are kind of meaningless?

More like:

If you install a security product that then prevents your car from starting; are they entirely blameless for letting you install it?

If you pull the hood up, tear off the “voids warranty” seal, ignore the “don’t open this” labels, crack the seals open and shove something into the engine… sure.

…but if you just slap a widget with the “vendor approved” sticker on your dash and it bricks your car; that’s a bit sucky right?

I do feel Microsoft is not entirely blameless in this.

It should be easier to recover from this kind of thing.

They should have been paying attention and made a fuss that one of the biggest security vendors has been doing this literally since they started.

I would bet money that until two weeks ago Microsoft was high-5ing them for best security practices.

It’s not “their fault” but they can’t just go “wasn’t us!”.

It was them.

It wasn’t macOS. It wasn’t *nix.

Suck it up. They should’ve done better.

krige

1 replies

11h10m

2024-07-29 07:20:20 UTC

Except Crowdstrike had 3 separate Linux incidents, including kernel panics, directly before this happened.

happymellon

0 replies

3h38m

2024-07-29 14:52:03 UTC

And at least one of them was actually a Red Hat kernel bug, where eBPF caused a kernel panic when it shouldn't be able to?

prmoustache

0 replies

10h35m

2024-07-29 07:55:45 UTC

That is the problem: you feel.

Before Microsoft comes into the picture, the issue is CrowdStrike pushing updates without proper testing and selling a product whose update schedule customers cannot control, and customers being so naive and not checking what the product they install on critical systems does.

fishywang

3 replies

19h0m

2024-07-28 23:30:44 UTC

The big difference is that CS is not the user. In your analogy it's like your car allows you to drive off a cliff, and an (almost) essential part of your car (for example, the pedal) drives the car off a cliff.

vel0city

0 replies

17h44m

2024-07-29 00:46:20 UTC

CS is not the user

It got there because a user or administrator approved and installed it. It didn't just appear there, Microsoft didn't install it there. The user ran it.

nilamo

0 replies

18h10m

2024-07-29 00:20:22 UTC

Right, so a slightly better analogy would be if you wanted to install a remote starter, but then you find out that they can only be installed into Fords, because other auto manufacturers (Apple, Linux in this case) believe that tampering with the critical path (the engine, kernel) is unsafe. It isn't Ford who's at fault for allowing you to run some random engine modification, it's that mod that is at fault.

jayd16

0 replies

18h47m

2024-07-28 23:43:02 UTC

If it's a custom after market part, how can you blame the car manufacturer and not the part maker?

a-dub

7 replies

19h12m

2024-07-28 23:18:51 UTC

i would have thought that in 2024 a bad driver update is something that windows would automatically roll back.

or at least provided some level of protection against crashes in third party kernel code.

sashank_1509

3 replies

17h13m

2024-07-29 01:17:33 UTC

No you can’t roll back bad driver updates in any OS, if you could then by definition they do not sit in the kernel space. You just want the security code to not run in kernel space, which is a decision MS could maybe make and become like Apple, though most security software would in that case rebel.

fragmede

0 replies

16h25m

2024-07-29 02:05:47 UTC

it depends on how bad. in Linux you can rmmod to get rid of the bad one if you haven't wedged it and fix your code, compile, and try again. I can't imagine that's actually different on windows if you know what you're doing. how do you think driver development happens?

a-dub

0 replies

16h56m

2024-07-29 01:34:18 UTC

No you can’t roll back bad driver updates in any OS, if you could then by definition they do not sit in the kernel space.

drivers and kernel binaries are typically installed and maintained by user space programs that run with some sort of elevated privileges.

"kernel space" is just a runtime context, what gets loaded into there typically comes from ordinary (protected) files on the disk.

Dylan16807

0 replies

10h45m

2024-07-29 07:45:52 UTC

That doesn't make any sense.

The OS loads file A into the kernel. It crashes. It reboots. It decides not to load file A this time.

Wow, it's a rollback of kernel-space code.

Unless your argument is that you can't guarantee a rollback of every possible kernel driver, because it might have installed a rootkit while it had full control? Okay, cool, but this isn't a malware removal idea. It's an idea for normal drivers.
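The load, crash, reboot, skip sequence described above can be sketched as a small boot-time state machine. This is a toy simulation under stated assumptions, not how Windows actually behaves: `BootState`, `MAX_FAILED_BOOTS`, and both functions are hypothetical, and the state is assumed to be stored somewhere that survives a crash.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

MAX_FAILED_BOOTS = 2  # hypothetical threshold before giving up on a new version


@dataclass
class BootState:
    """Hypothetical state persisted across reboots (e.g. a small file on disk)."""
    last_good: str = "v1"
    pending: Optional[str] = None  # newly installed version not yet proven
    failed_boots: int = 0
    blocked: Set[str] = field(default_factory=set)


def choose_version(state: BootState) -> str:
    """Run early in boot: decide which version of 'file A' to load."""
    if state.pending and state.failed_boots >= MAX_FAILED_BOOTS:
        # The pending version never reached a successful boot: stop loading it.
        state.blocked.add(state.pending)
        state.pending = None
        state.failed_boots = 0
    if state.pending:
        # Count this attempt as failed until boot_succeeded() says otherwise.
        state.failed_boots += 1
        return state.pending
    return state.last_good


def boot_succeeded(state: BootState) -> None:
    """Run once the system is up: promote the pending version to last-known-good."""
    if state.pending:
        state.last_good = state.pending
        state.pending = None
    state.failed_boots = 0
```

If the machine crashes before `boot_succeeded` runs, the failure counter stays incremented, so after `MAX_FAILED_BOOTS` attempts the next boot falls back to `last_good`. As the comment notes, this only covers ordinary buggy drivers, not code that deliberately tampers with the persisted state.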

wierdstuff

0 replies

13h51m

2024-07-29 04:39:37 UTC

Good explanation about this point at 11:15 over at https://youtu.be/wAzEJxOo1ts?si=wGXDJZtUczcIui9F

VohuMana

0 replies

17h44m

2024-07-29 00:46:02 UTC

I think if I understand the systems right Windows can roll back a bad driver update but the CS update wasn’t an update to the driver but instead updated a configuration file which CS updated outside of Windows Update. So from the Windows Update perspective the system started failing to boot with no changes to the system. Again though I don’t know if I totally understand what CS did and what capabilities Windows Update has.

TiredOfLife

0 replies

9h58m

2024-07-29 08:32:24 UTC

It was not a driver update.

scarface_74

0 replies

16h44m

2024-07-29 01:46:54 UTC

Microsoft tried to lock down kernel access in the Windows Vista era. Antivirus vendors went crying to the EU and they forced Microsoft to allow access to the kernel to third parties.

Iwan-Zotow

0 replies

19h31m

2024-07-28 22:59:51 UTC

it's like a userland video driver: thousands of context switches per second, performance will dive...

999900000999

0 replies

18h41m

2024-07-28 23:49:42 UTC

An OS flexible enough where you can do something stupid enough to completely break it.

Basically iOS, which is so locked you can't even run apps not expressly approved by Apple.

Pick one. If I build a bike and you remove the brakes to save weight, don't get mad at me when you crash.

thejournalizer

0 replies

4h54m

2024-07-29 13:35:56 UTC

Honestly most of the conversations were about getting everyone back online.

mns

0 replies

11h0m

2024-07-29 07:30:42 UTC

I noticed this at work and in some other contexts last week. We weren't affected by this, but most of the people who brought it up, even technical people (in other fields, not security or OS or anything like that), think that this was a Microsoft and Windows issue. They all seem surprised to hear that Microsoft wasn't the root cause of this, because no one knows or understands what Crowdstrike is or does.

holsta

52 replies

21h29m

2024-07-28 21:01:44 UTC

they need guidance on basic QA practices

Microsoft has a loooong history of botched (security) updates, so I'm not hopeful they can teach Crowdstrike much.

Rinzler89

44 replies

21h21m

2024-07-28 21:09:28 UTC

Do you happen to have a list of that "loooong history" of botched (security) updates?

I can only find a couple of examples after googling, which is a bit smaller than the "loooong history" you're talking about, so unless Microsoft is paying Google to delete results, maybe you're mistaken.

SoftTalker

28 replies

21h16m

2024-07-28 21:14:03 UTC

This is a company whose OS could not even be installed on a live network without getting rooted within a few minutes. Anybody who was paying attention knew that you didn't use any new Windows release until at least the first service pack had come out.

Granted that was a while back but painful memories die hard.

Rinzler89

20 replies

21h14m

2024-07-28 21:16:21 UTC

>This is a company whose OS could not even be installed on a live network without getting rooted within a few minutes.

That was Windows XP 20 years ago. Please bring arguments about modern Windows 11 security, which is the current, up-to-date product they're selling and supporting, not scenarios that haven't happened in 20 years.

Eduard

12 replies

20h33m

2024-07-28 21:57:32 UTC

for a loooong history, you have to look in the past

Rinzler89

11 replies

20h26m

2024-07-28 22:04:15 UTC

Ah, well, if only things of the past were useful today, I'd still have hair, and probably millions made from the right investments, but unfortunately, it's what's happening today that actually matters.

echoangle

6 replies

19h53m

2024-07-28 22:37:08 UTC

So you asked for proof of a long history and are now surprised that the examples are all from the past?

Rinzler89

5 replies

19h24m

2024-07-28 23:06:29 UTC

How does that impact the present? If it's no longer as vulnerable today, why would I care about the past? The point is learning from mistakes and fixing them so that doesn't happen again.

echoangle

3 replies

19h21m

2024-07-28 23:09:40 UTC

If it doesn’t matter to you, why did you ask? Are you just trying to win an argument or are you being intellectually honest? Because you asked for proof of the long history someone claimed. You could have just said “the long history doesn’t matter because I only care about the current state”. That’s fine and valid, but don’t ask questions and then shift the goalposts if you don’t like the answers.

Dylan16807

2 replies

10h35m

2024-07-29 07:55:36 UTC

A "loooong history" needs to have a timespan of many years.

So yes it would start in the past, but it then has to continue for a long time.

Pointing out that a company was bad 20 years ago isn't enough. You need to show they were also bad 15 years ago, and 10 years ago, and 5 and/or 25 years ago.

So complaining that the only evidence was so far in the past is valid. The original goalposts were not reached. (Well, someone in another part of the thread eventually listed every google result for a windows update making anything crash, but that doesn't really establish that microsoft is "botching" updates at a level significantly above background noise, which I think was the original intent.)

echoangle

1 replies

10h2m

2024-07-29 08:28:43 UTC

Well someone posted examples from XP and someone else posted 4 botched updates in 2023, do you need a list for every year inbetween?

Dylan16807

0 replies

9h33m

2024-07-29 08:57:14 UTC

Was my implication of "every 5 years" not clear? But I already mentioned those links, they're pretty weak. I'm not calling an update that for a few people makes a handful of games crash "botched", when the original implication was quite juicy botching.

Also, if we're actually getting into this, the XP gripe had nothing to do with updates. That's moving the goalposts half a mile in the other direction.

albedoa

0 replies

17h41m

2024-07-29 00:49:16 UTC

why would I care about the past?

??? You specifically asked for it! What are you doing.

squigz

3 replies

19h38m

2024-07-28 22:52:40 UTC

GP is absolutely correct. You can't ask for examples of a long history of something, then dismiss examples from, you know, history.

Rinzler89

2 replies

19h21m

2024-07-28 23:09:02 UTC

Fair enough, but if those examples are irrelevant to modern times, what's the point of bringing them up? If we want to keep the discussion relevant to modern context then let's discuss modern history, not obsolete news from 20 years ago.

squigz

1 replies

19h8m

2024-07-28 23:22:40 UTC

What is "modern history"?

lucianbr

0 replies

10h34m

2024-07-29 07:56:20 UTC

A period of time where Microsoft has no mishaps, of course.

clwg

4 replies

20h34m

2024-07-28 21:56:29 UTC

First thing that comes to mind is that Recall stuff from a month ago. They also release updates[0] that crash machines.

[0] https://www.tomsguide.com/news/windows-11-update-causing-blu...

TeMPOraL

2 replies

20h0m

2024-07-28 22:30:48 UTC

Recall actually is a brilliant idea, and I dreamed of something like it for a long time, and so did plenty of people here. It's just not something you can trust a third-party business with, whether it's a fly-by-night startup or an international megacorporation known to be openly promiscuous with advertisers.

This is basically "take a screenshot every 30 seconds and compile it into a timelapse", but on steroids, and the same appeal, and arguments wrt. who gets to run it on whose machines, all apply.

dahdum

0 replies

19h27m

2024-07-28 23:03:42 UTC

If you keep your business and personal computing separate, Recall looks amazing.

clwg

0 replies

19h46m

2024-07-28 22:43:56 UTC

The functionality does seem intriguing, but that doesn't change its security profile, which was poorly thought out and implemented.

feyman_r

0 replies

19h47m

2024-07-28 22:43:53 UTC

Ignoring Windows Insider reports is bad. However, how many endpoints having issues (out of a billion+) is ‘acceptable’ after an update? We live in a news hype cycle, so clearly even a single failure will make the news somewhere.

However, without metrics that show BSoDs from patches (which MS will likely never share), it’s hard to see if things have improved or regressed. If they regressed, someone up in their leadership chain is hopefully following the constructive discussion here.

tacticus

0 replies

20h32m

2024-07-28 21:58:12 UTC

The company that let every db server have global admin creds and 0 logging on their cloud platform?

That didn't run their own enhanced visibility on their own cloud platform.

lightedman

0 replies

19h40m

2024-07-28 22:49:56 UTC

Vulnerabilities present in 2000 are showing up still in modern Windows versions.

https://www.csoonline.com/article/564499/3-leaked-nsa-exploi...

You have no idea the cruft and technical debt Windows has in order to maintain its backwards compatibility.

TeMPOraL

5 replies

20h3m

2024-07-28 22:27:06 UTC

That's a bit disingenuous, though. That was, as 'Rinzler89 points out, some 20 years ago. Back then, any Linux distro would've definitely been a much safer option, because after installing you couldn't even connect it to the network, because it had no support for your cable modem or wireless card, and that's assuming you didn't fuck up your MBR with LiLo for the 20th time. Ask me how I know.

Both OS families have changed much since that time.

commercialnix

2 replies

18h25m

2024-07-29 00:05:13 UTC

In 2002 I wasn't yet even out of middle school when I had a Linux distro running all key hardware components "just working". At that time at my school we were taught how to search the web, so I searched the web and looked up what hardware worked. Very simple. All I had to pitch to my parents was, "this system shares its code and encourages me to study it and learn code", which made clear to them what I was asking for wasn't just another video game console. Soon after I had a refurb laptop (fortunately not x86) and a curated WiFi card that ran Linux (and soon after, BSD) with everything "just working".

When I see someone complain about unsupported/unsupportable chips in comments on online forums, especially one dubbed "Hacker News", I am puzzled how I in my middle school years acted out a pattern that is objectively smarter* than what I read in such comments. I also happen to know first-hand that I am for sure not the only one with this vantage point. Those who comment about unsupported/unsupportable chips as if it is somehow an open source kernel's fault might want to take a moment to consider how others, and how many others, are viewing such drivel. For every one of us who takes the time to point this out, there are 10,000 of us experiencing utter contempt, as if we just got an unexpected whiff of some hot garbage.

[*]And, I honestly don't think I'm even that smart.

fragmede

1 replies

18h12m

2024-07-29 00:18:40 UTC

you got lucky with the hardware. there was a bunch of wifi cards that wouldn't work in Linux because there were no drivers. and then ndiswrapper came along and let you use windows drivers in Linux. now that was a user unfriendly procedure of getting it working. some chipsets eventually got native drivers like ralink or b53 but getting things working was not easy!

commercialnix

0 replies

15h7m

2024-07-29 03:23:15 UTC

There was absolutely zero luck involved. As I already wrote in the previous comment, I did something very simple. I sought out a WiFi card that already had Linux drivers and then purchased that WiFi card. I didn't have to "do anything" to get the WiFi card working.

rvnx

0 replies

19h51m

2024-07-28 22:39:16 UTC

Oh sweet, this laptop has a PCMCIA Wi-Fi card!

That'd be cool if one day I can get the laptop running on battery and not just on mains power.

Let me just set it up.

Wait a second, how do I wake up the screen again and get out of this hibernation state?

Why are all the fans stuck at 100% now?

Errr, first let's see if I can get the trackpad working.

lupusreal

0 replies

19h19m

2024-07-28 23:11:01 UTC

Oh please, if it were that tough then teenage me never would have managed it. 20 years ago, e.g. 2004 (I first installed it in 2001), installing Linux and getting networked was already user friendly. The only hitch I ever had was figuring out ndiswrapper, but my ethernet cards all worked "out of the box" and installers handled the bootloader without users even having to know what a bootloader was. It's not like 20 years ago was the 90s or something, and the dark days of Windows lasted well into the 00s.

feyman_r

0 replies

20h2m

2024-07-28 22:28:25 UTC

Agree. I also remember those days when it was so hard to get Linux to just boot up and get your display working correctly; it was almost like a rite of passage. It was just a proving ground for how much of an expert you were and the number of hours you spent in front of the PC, just to get things working.

My point is, good and bad memories will always stand out.

GordonS

8 replies

21h11m

2024-07-28 21:19:05 UTC

There have only been a few really bad ones, but Microsoft botch Windows updates quite regularly.

Rinzler89

7 replies

21h10m

2024-07-28 21:20:42 UTC

>but Microsoft botch Windows updates quite regularly

OK, please show us the proof then. If it's indeed as regular as you claim, then it must be documented somewhere as a greppable list.

Tech blogs would have a field day getting traffic on their site by keeping track of and documenting such regular mistakes if they exist.

oxygen_crisis

3 replies

20h9m

2024-07-28 22:21:36 UTC

Here's >100 of them in the past ~8 months:

https://www.manageengine.com/patch-management/resources/micr...

feyman_r

2 replies

19h58m

2024-07-28 22:32:00 UTC

Where can I find a list for all OSes? I’d assume such a list would have known issues with X11 etc. I want to ensure it’s not a case of survivorship bias.

oxygen_crisis

1 replies

14h47m

2024-07-29 03:43:50 UTC

I don't think there is one... macOS doesn't have enough functionality-breaking updates to make a significant list, and Linux/BSD-based distros generally do cleanly segmented updates to individual apps and services rather than Microsoft's great big monolithic all-or-nothing OS update bundles that touch on dozens of services at the same time.

feyman_r

0 replies

13h44m

2024-07-29 04:46:42 UTC

Here’s a quick 2 minute search on Google for each.

- https://www.macworld.com/article/671831/macos-wont-install-f...

- https://askubuntu.com/questions/1231849/how-to-fix-update-pr...

My own anecdote: When I got my M3 Pro in April and had to start afresh, it was stuck in a restart loop and had to take it to the Genius Bar; they asked me to answer ‘no’ to some question that I was answering differently. That was it. I have no idea on the root cause or why it was fixed this way. I don’t remember the exact screen where the answer was supposed to be different.

Brybry

2 replies

20h33m

2024-07-28 21:57:51 UTC

It's frequent enough that people pay money for AskWoody[1] to tell them when it's safe to patch or what patches to skip.

[1] https://www.askwoody.com/ms-defcon-system/

Rinzler89

1 replies

20h23m

2024-07-28 22:07:49 UTC

Quote, from the website:

"In general, I apply Windows Defender updates as soon as they’re available. Why? Microsoft hasn’t screwed up any of them too badly. You’re better off applying those updates than letting them slide for a week or two."

Brybry

0 replies

20h1m

2024-07-28 22:29:43 UTC

Yep, Microsoft does a good job with Windows Defender (antivirus) updates.

It's the other Windows Updates that they botch frequently enough to make people wary of patching immediately.

system2

4 replies

21h16m

2024-07-28 21:14:42 UTC

Anyone who worked in IT knows this, it is not something rare. Literally every month, for example one from last month:

https://www.techradar.com/computing/windows/windows-11-updat...

This is the main reason every IT professional I know disables auto updates of windows and manually triggers updates after testing (hopefully) on multiple dummy machines on the network.

I personally remember booting to safe mode to remove Windows updates to rescue the computers more times than I can count.

Rinzler89

3 replies

21h12m

2024-07-28 21:18:07 UTC

Examples like that one I also found, but that's not really a "looooong list". If people can only show one single example as an argument it's kind of a moot point.

system2

2 replies

20h26m

2024-07-28 22:04:10 UTC

You'd experience at least 3-5 per year if you work in IT. There really is a long list but since it is not my argument, I won't list them after searching for an hour. The list starts in the early 2000s, not recently.

EDIT: Whatever, I will do the search for you since you cannot use google:

https://www.pcgamer.com/an-odd-bug-in-this-months-windows-10...

https://www.windowslatest.com/2023/10/22/windows-11-october-...

https://www.bleepingcomputer.com/news/microsoft/windows-10-e...

https://www.windowslatest.com/2023/02/09/microsoft-confirms-...

https://www.windowslatest.com/2023/07/16/windows-11-kb502818...

These are just the last quarter of 2023. There are over 2000 news articles, but I won't link them. Use keywords: Windows Update, Crash, and use the date option on Google to go before 2023.

Rinzler89

1 replies

19h28m

2024-07-28 23:02:46 UTC

All you could find were 4 examples in 2023? Hardly a long list, wouldn't you say?

I think my Android updates caused way more issues in one year and that's running on immutable HW that's well known and understood by the manufacturer, so 4 issues per year for Windows doesn't sound too bad, even though I had zero in 2023.

sunaookami

0 replies

13h47m

2024-07-29 04:43:05 UTC

https://en.wikipedia.org/wiki/Moving_the_goalposts

mrj

0 replies

20h55m

2024-07-28 21:35:21 UTC

Well, from the news this morning:

https://www.forbes.com/sites/daveywinder/2024/07/27/microsof...

drdec

2 replies

20h14m

2024-07-28 22:16:05 UTC

> they need guidance on basic QA practices

Microsoft has a loooong history of botched (security) updates, so I'm not hopeful they can teach Crowdstrike much.

Experience is the best teacher

psychoslave

0 replies

7h6m

2024-07-29 11:24:12 UTC

Learners don't all pay equal attention to the teacher; not everyone tries to thoroughly assimilate the lesson; challenging oneself with actual tests to ensure skill acquisition is rare; and going down the whole rabbit hole to figure out what unstated assumptions the teacher relies on, and to understand the limits of these suggestions, is a path only a few exceptional beings will follow.

justinclift

0 replies

19h30m

2024-07-28 22:59:59 UTC

Is MS doing it properly these days though?

If they are, then you could be right. :)

cogman10

2 replies

19h56m

2024-07-28 22:34:41 UTC

And they've learned a lot from it. For example, MS no longer universally deploys updates across the world; they have a slower rollout to avoid just such an incident.

sunaookami

1 replies

13h49m

2024-07-29 04:41:36 UTC

Yeah now one million users lose access to their computer instead of 100 million!

fragmede

0 replies

13h46m

2024-07-29 04:44:34 UTC

yes? that's 100x better! at the end of the day, internal testing just isn't going to catch every single permutation of customer configuration, so there's always a risk that something bad goes out. if you're that big, you'd start with .01% of the fleet instead of 1% of the fleet, so it's 100_000 before you get to 1_000_000, before going to 100% but neither Apple nor Google have figured out a better way than that. It's industry standard at this point.
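
A rough sketch of the staged rollout described above, in Python. Everything in it is an assumption made for illustration: the fleet size of one billion, the stage fractions (0.01%, 0.1%, 1%, 100%), and the names `stage_sizes`, `staged_rollout` and `stage_is_healthy`. It is not how any particular vendor does it.

```python
from typing import Callable, List

PPM = 1_000_000  # parts per million, so stage sizes stay exact integers


def stage_sizes(fleet_size: int, stages_ppm: List[int]) -> List[int]:
    """Cumulative number of devices that have the update after each stage."""
    return [fleet_size * ppm // PPM for ppm in stages_ppm]


def staged_rollout(
    fleet_size: int,
    stages_ppm: List[int],
    stage_is_healthy: Callable[[int], bool],
) -> int:
    """Widen the rollout stage by stage and halt at the first unhealthy stage.

    Returns how many devices had received the update when the rollout
    stopped (the whole fleet if every stage looked healthy).
    """
    exposed = 0
    for target in stage_sizes(fleet_size, stages_ppm):
        exposed = target  # push the update to this many devices in total
        if not stage_is_healthy(exposed):
            break  # stop here; wider stages never receive the update
    return exposed


# Hypothetical numbers: a fleet of one billion, stages of 0.01%, 0.1%, 1%, 100%.
FLEET = 1_000_000_000
STAGES = [100, 1_000, 10_000, PPM]
```

With a health check that fails right away, `staged_rollout(FLEET, STAGES, lambda n: False)` stops at 100_000 devices rather than the full billion, which is the whole point of the first small stage.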

SoftTalker

0 replies

21h22m

2024-07-28 21:08:12 UTC

Yes, quite the epitome of throwing stones from a glass house.

notepad0x90

33 replies

21h43m

2024-07-28 20:47:37 UTC

I must disagree with that take. Your last quoted sentence is in response to all the self-proclaimed experts asking "why does it need kernel access"; the ones before that are to limit their own liability.

What I've heard from people in the industry is not this silly "oh no, crowdstrike is so incompetent" b.s. that is being spread on sites like HN and reddit, but more of an empathetic "it could have been us" sentiment. In this write-up as well, Microsoft knows they have caused their share of outages. It is a technical write-up, but in part it is to cover their bases for government investigations and lawsuits that will arise from this incident.

And in part, they are also responsible for recovering from third-party driver errors and repeated boot failures caused by faulty drivers.

retrochameleon

29 replies

21h36m

2024-07-28 20:54:27 UTC

CrowdStrike blamed their test software, but in the same breath revealed that they haven't been using any canary deployments. The bug that caused all this was present in their kernel driver for a long time.

For being such a large cybersecurity player and deploying updates to 8.5 million devices, their quality control practices are embarrassingly lacking.

mort96

13 replies

21h32m

2024-07-28 20:58:20 UTC

Every company I've ever been at rolls out updates slowly. Rolling out a change to 8.5 million computers at the same time seems ridiculous. Even the most cash-strapped start-ups with every incentive to cut corners tend to get staged roll-outs more or less right. It's crazy.

geon

6 replies

20h56m

2024-07-28 21:34:13 UTC

I had a fleet of only maybe 200 computers I updated remotely. I did canary staged roll outs.

notepad0x90

4 replies

16h2m

2024-07-29 02:27:58 UTC

not a software update!

mort96

3 replies

8h30m

2024-07-29 10:00:20 UTC

Not relevant!

notepad0x90

2 replies

6h34m

2024-07-29 11:56:02 UTC

details are always relevant in a technical discussion. look at my other comments where i pointed out microsoft performing similar immediate av signature updates and causing chaos.

mort96

1 replies

2h23m

2024-07-29 16:07:44 UTC

Some details are relevant, some are not.

I'm more than comfortable labelling parts of Microsoft as incompetent as well.

notepad0x90

0 replies

46m

2024-07-29 17:44:11 UTC

We can agree on that, but it is relevant because this isn't an unusual practice. Crowdstrike didn't ignore some pre-existing best practice. Lots of things need improving but facts and details matter when you talk about RCA. It isn't about blame but fixing the root cause.

doubled112

0 replies

20h22m

2024-07-28 22:07:55 UTC

When I managed ~15 developers’ Arch Linux workstations, I found it very beneficial to be the canary, and then roll out to a couple of the devs more capable of troubleshooting, and then the rest. I can always fix my own box.

8.5M all at once feels insane.

notepad0x90

4 replies

16h3m

2024-07-29 02:27:29 UTC

again, this is why I was snarky in my earlier post, this was not a software update. they should have used canary deployments still but in many cases prior to this incident, it was not acceptable to wait even a few hours because it can make the difference between companies getting ransomwared/hacked, so they focused on making the actual code/driver that interprets the channel file updates robust enough to handle real-time updates. Even if other players were doing canary deployments with behavioral detection updates, they're not the market leader, crowdstrike is for a reason.

Everyone that worked in an operational incident response role has blocked some indicator like an ip address or a domain. you don't do gradual roll outs for those either, and i've seen people cause outages by skipping a check or making a mistake. this is similar in many ways to that except it was for a named pipe. This could probably have waited for a canary deployment, but in general the class of content that is being deployed would be deployed right away, I'd be surprised if their practice is considered "bad" by any measure. I've seen Microsoft also deploy email quarantine signatures and defender updates that caused large scale impacts.

Here is a link of what Microsoft did earlier this year:

https://www.techradar.com/news/google-chrome-not-working-mic...

If they had canary deployments, that wouldn't have happened. I had rules that were causing chaos because of that. Now imagine if defender had a bug that caused it to crash because of a signature update. The impact would be magnitudes greater than what you saw with Crowdstrike. It's really frustrating to see the lack of technical critical thinking and arm-chair experts acting like they know what they're talking about.

mort96

2 replies

8h30m

2024-07-29 10:00:06 UTC

Let's say the driver was "robust enough" to handle a broken channel file. How would that look exactly? Say you're responsible for writing the code which loads a new channel file. These channel files are critical; without them, your security critical product doesn't know how to do its job. The channel file parser returns a parse error. How should the driver respond? Surely you're not going to just silently disable your security critical product if someone puts a bad channel file in there?

PleasureBot

1 replies

4h15m

2024-07-29 14:15:30 UTC

Delete the file or mark it as corrupt so that the parser doesn't keep trying to read it, and send some telemetry back to CS to indicate there is a problem with one of the channel files. It doesn't seem very complicated at all. There are plenty of options in between "catastrophically crash the OS" and "silently disable the entire product".
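
Roughly what that could look like, as a user-space Python sketch rather than real kernel driver code. The JSON format, the `.corrupt` suffix, the `report` callback and the idea of falling back to the last known-good rules are all assumptions for illustration; nothing here reflects CrowdStrike's actual file format or driver.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def load_channel_file(
    path: Path,
    last_good: Optional[dict],
    report: Callable[[str], None],
) -> Optional[dict]:
    """Parse a (hypothetical, JSON-encoded) channel file without ever crashing.

    On any read or parse failure the file is renamed so it is not retried,
    a telemetry message is emitted, and the last known-good rules are kept.
    """
    try:
        rules = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(rules, dict):
            raise ValueError("top-level value must be an object")
        return rules
    except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
        quarantined = path.with_name(path.name + ".corrupt")
        try:
            path.replace(quarantined)  # mark as corrupt; skipped on next load
        except OSError:
            pass  # file missing or unmovable; still report and fall back
        report(f"channel file {path.name} rejected: {exc}")
        return last_good
```

Keeping the previous rules means the product stays active with slightly stale content instead of either crashing the host or switching itself off.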

mort96

0 replies

2h24m

2024-07-29 16:06:40 UTC

That seems pretty dangerous if that channel file included security critical configuration, which it presumably did

Dylan16807

0 replies

10h19m

2024-07-29 08:11:35 UTC

> it was not acceptable to wait even a few hours

Hours... Wouldn't a 15 minute canary have found this problem about 14 minutes before it hit wider deployment?

binkHN

0 replies

21h22m

2024-07-28 21:08:47 UTC

Beyond crazy. I even have a small app that never makes it to production before being rolled out to internal and open testing first. And, even then, it's slowly rolled out to a percentage at each stage before being fully deployed. One would think a major company with kernel level access would do this at minimum.

rvnx

9 replies

21h34m

2024-07-28 20:56:35 UTC

Clearly incompetence to deploy from 0 to 8 million devices without any gradual rollout.

That goes even further, because apparently they were fully blind and didn't have crash metrics.

"Ok we push the update, and pray".

galangalalgol

7 replies

21h29m

2024-07-28 21:01:13 UTC

I think it is past incompetence, and on into negligence. Given the stories we have heard here about emergency service failures it is likely that people died. When people die due to negligence isn't that usually criminal?

SoftTalker

3 replies

21h19m

2024-07-28 21:11:20 UTC

Who is negligent though? Crowdstrike, or the emergency services that are using an OS that requires third party endpoint security right out of the box in order to be safely used, or the company that makes and sells that OS?

crazygringo

1 replies

21h8m

2024-07-28 21:22:36 UTC

Why not both?

Crowdstrike, for negligently not rolling out updates gradually.

And emergency services, if they don't have robust fallback procedures/systems for when their IT system goes down. I mean it's totally fine if regular doctor's visits get postponed, but 911 should never go down just because their computers are down. Just like aircraft have redundant systems, so too should 911.

(The company that makes and sells the OS -- I don't see any negligence there, in this case. If security software fundamentally requires running at the kernel level and Microsoft allows that, I don't see how Microsoft can be at fault.)

jmb99

0 replies

20h43m

2024-07-28 21:47:34 UTC

Yeah, I don’t see how one can blame Microsoft in this scenario. If you choose to run buggy kernel-level code, that’s on you, not the publisher of the kernel/OS. Especially when the code you’re running is a replacement for functionality already provided by the OS. It’s hard to argue that MS could be negligent for “not having a good enough AV/endpoint protection solution” or “allowing customers to run kernel-level code.”

Aeolun

0 replies

18h55m

2024-07-28 23:34:58 UTC

It’s hard for people to understand that these massive ‘security’ enterprises are often held together by a large number of bodies rather than by competence.

rvnx

0 replies

21h29m

2024-07-28 21:01:43 UTC

Can't agree more, you found the right words.

notepad0x90

0 replies

15h58m

2024-07-29 02:32:38 UTC

https://www.techradar.com/news/google-chrome-not-working-mic... Not an unusual practice, and they were not the first av company to cause outages. And again, it was not a software update; the buggy software was deployed after testing back in March. Details matter!

How about we let the lawyers figure out who had what liability, just like with the av/edr industry, we should know when the subject matter is outside our area of knowledge and expertise.

binkHN

0 replies

21h20m

2024-07-28 21:10:37 UTC

And this is how the lawsuits will start.

notepad0x90

0 replies

16h1m

2024-07-29 02:29:03 UTC

I shared with a sibling commenter:

https://www.techradar.com/news/google-chrome-not-working-mic...

Did Microsoft do a staged or canary roll out with that? This is not a software update, if you're making such comments then you're speaking about something outside of your field of expertise.

duskwuff

2 replies

21h31m

2024-07-28 20:59:20 UTC

> CrowdStrike blamed their test software, but in the same breath revealed that they haven't been using any canary deployments.

Their post-incident report [1] also stated that they intend to improve testing by "using testing types such as: local developer testing". One has to wonder what, if any, testing they were doing beforehand.

[1]: https://www.crowdstrike.com/blog/falcon-content-update-preli...

MBCook

1 replies

16h0m

2024-07-29 02:30:47 UTC

Well we know what the testing is, don’t we?

The update literally crashed the system it was used on.

There’s no way they couldn’t know that unless they never ran it. Right?

Is this one of those things that only happened to 10% of users? Because I haven’t seen that reported anywhere.

duskwuff

0 replies

12h35m

2024-07-29 05:55:43 UTC

> Is this one of those things that only happened to 10% of users? Because I haven’t seen that reported anywhere.

As far as I'm aware, it affected all systems using Crowdstrike.

lupusreal

1 replies

19h11m

2024-07-28 23:19:35 UTC

Unless their developers had room temperature IQs or were actual psychopaths, I really wonder how they even managed to find developers who had the nerves to deploy to the whole world all at once like that. If it were me I'd be scared shitless, covered in sweat and probably shaking too hard to even type. Were CrowdStrike developers too stupid to even realize the magnitude of what they were doing? Or did they have cooler nerves than an open-heart surgeon? It's shocking to me that they could have done this so casually.

Aeolun

0 replies

18h54m

2024-07-28 23:36:43 UTC

> Were CrowdStrike developers too stupid to even realize the magnitude of what they were doing?

More likely they were following a playbook to the letter, and were therefore 100% sure of success.

michaelt

0 replies

21h27m

2024-07-28 21:03:26 UTC

Anyone in the industry could have a bug get through testing.

Some companies could have a severe and readily reproducible bug get through testing.

A few of those companies have a hand-rolled update mechanism, and can accidentally break their ability to roll back a bad release.

A few of those companies are in a position to push a release that breaks not only their own software, but the entire OS.

Very few companies in that position would roll out to 100% of client machines in a single worldwide deployment.

gjsman-1000

0 replies

21h32m

2024-07-28 20:58:42 UTC

Microsoft should be sued, for literally having blood on their hands. There was an easily mitigated design flaw in Windows that would have greatly blunted the impact.

https://news.ycombinator.com/item?id=41095788

freehorse

0 replies

21h27m

2024-07-28 21:03:09 UTC

If "it could have been them", then I would like to read such professionals write exactly about how to avoid having a global outage like this again, rather than "showing empathy" with a corporation. Or do we just leave it up to luck, and if "it happens to them too" in a month or year, oopsies? What about which practices could be improved?

nimbius

29 replies

18h5m

2024-07-29 00:25:19 UTC

this isn't even the first time it's happened. Crowdstrike has killed an OS every month for the past four months.

At this point they are a threat actor. if you haven't kicked their amateur-hour software out of your infrastructure by now, chances are good senior management and engineering have at least considered it formally.

https://en.wikipedia.org/wiki/CrowdStrike#Severe_outage_inci...

metadat

20 replies

17h53m

2024-07-29 00:37:09 UTC

That incident list is damning. Is senior leadership asleep at the wheel, or how can this many incidents possibly happen every 30 days for months on end? If leadership really cared, they'd make sure post-mortems and other best practices are in place to reduce the frequency.

Unfortunately, the executive disconnect isn't new. It's actually uncommon that they care about the reality for end users and customers (which is antithetical to my entire ethos, hence why I get paid the medium bucks). Why bother waking up and going to work every day unless you are contributing in some way to sustaining a better future for everyone? It's actually great for marketing and it's already going to be a tough 100+ years from today for our children, even with our collective care.

P.s. People can be so selfish, it kind of breaks my brain but not really. Have you seen the CO2 emissions visualization from NASA this week? It was a wakeup call for me.

'Tremendous' NASA Video Shows CO2 Spewing from US into Earth's Atmosphere https://www.newsweek.com/nasa-video-carbon-dioxide-co2-emiss...

It's concerning.. and caught no traction.. http://news.ycombinator.com/item?id=41064029

swasheck

17 replies

14h46m

2024-07-29 03:44:23 UTC

here’s a fun connection: https://x.com/anshelsag/status/1814426186933776846

“For those who don't remember, in 2010, McAfee had a colossal glitch with Windows XP that took down a good part of the internet. The man who was McAfee's CTO at that time is now the CEO of Crowdstrike. The McAfee incident cost the company so much they ended up selling to Intel.”

so yeah, “leadership” (and that’s a loose term) doesn’t seem supremely concerned about much more than earnings

valicord

9 replies

14h22m

2024-07-29 04:08:07 UTC

Not to worry, McAfee CTO was not actually in charge of technology

https://archive.is/20240724213623/https://www.barrons.com/ar...

hinkley

8 replies

13h50m

2024-07-29 04:40:11 UTC

The fish rots from the head.

Also what the fuck is a sales-facing CTO??

cratermoon

3 replies

13h18m

2024-07-29 05:12:32 UTC

I'm suspicious of CrowdStrike now. If we rip the cover off would we find that it's little more than a reskin of McAfee?

hinkley

0 replies

3h3m

2024-07-29 15:27:14 UTC

Sometimes it’s good to take a little break after working for a company that ended up not representing your values.

I’m on #2 now and it’s been great. It’s like a breakup. “What was I thinking?”

Of course if it is representing your values and your values are purely mercenary, it’s really not going to change anything.

graycat

0 replies

10h50m

2024-07-29 07:40:19 UTC

The Internet is able to transmit odors of rotting flesh????

Recently ordered an HP laptop for some light work (not my startup), and when placing the order said don't include McAfee, that "I don't trust them", all just from some odor!

CrowdStrike runs in kernel mode? No wonder there are problems; kernel mode sounds like more of a threat than a protection.

Sooooo, for my Web server(s), McAfee and CrowdStrike are issues I get to ignore. Problems avoided and time, money, energy saved!! Simple.

Natsu

0 replies

11h31m

2024-07-29 06:59:25 UTC

McAfee the company or the person? Because John McAfee was pretty out there...

https://www.businessinsider.com/john-mcafee-tweet-said-his-s...

kermatt

0 replies

3h21m

2024-07-29 15:09:53 UTC

> Also what the fuck is a sales-facing CTO??

Perhaps a symptom of the "Everyone is in sales" brain damage so pervasive in companies now.

joshstr

0 replies

2h27m

2024-07-29 16:03:40 UTC

Have seen region-specific Field CTO roles partner with GTM teams to co-sell with customers. Product and role domain expertise without the organizational technology responsibility.

Wytwwww

0 replies

6h56m

2024-07-29 11:34:34 UTC

I assume T stands for [Sales and Marketing] Technology. Which makes perfect sense because these are their core departments that the whole company is dependent on.

The product itself is a secondary cost-center, probably less important than even accounting.

Terr_

0 replies

11h40m

2024-07-29 06:50:28 UTC

> a sales-facing CTO

Is that what happens when a company has so many Sales Engineers that they become a parallel department from regular Engineering?

shmeeed

5 replies

11h19m

2024-07-29 07:11:25 UTC

Now that's interesting. I wonder why neither here nor there anybody mentions GK's name. Fear of litigation?

IMO somebody who managed to collapse the most important infrastructure on earth twice in as many decades - not a small feat, I have to admit - should be known by name to the general public, lest he get another chance at it.

prmoustache

3 replies

10h41m

2024-07-29 07:49:13 UTC

I haven't seen any important infrastructure on earth collapse, neither in 2010 nor in 2024.

LadyCailin

1 replies

9h11m

2024-07-29 09:19:08 UTC

Tell that to the people whose surgeries were cancelled because of computer issues.

prmoustache

0 replies

8h54m

2024-07-29 09:36:36 UTC

That was still not one of the most important pieces of infrastructure on earth.

And outages were not as global as news outlets made them look. Crowdstrike may have been ubiquitous in some countries, but almost absent in others. And still, crowdstrike or windows aren't global pieces of infrastructure.

shmeeed

0 replies

7h54m

2024-07-29 10:36:12 UTC

I admit that was a bit of hyperbole. My point stands regardless.

rbanffy

0 replies

9h8m

2024-07-29 09:22:27 UTC

Tech needs something like the FTC that can ban someone from working in that area after multiple demonstrations of glaring incompetence. Or evil misdirection of competence.

cratermoon

0 replies

13h19m

2024-07-29 05:11:07 UTC

McAfee incident 2010 https://www.zdnet.com/article/defective-mcafee-update-causes...

Wytwwww

1 replies

6h59m

2024-07-29 11:31:50 UTC

Is senior leadership asleep at the wheel, or how can this many incidents possibly happen every 30 days for months on end?

Presumably it doesn't matter that much and isn't worth spending money/manpower on?

If the usefulness/quality of their software has no influence on their potential customers' decision-making process, why bother?

It would make much more sense to allocate any excess resources to the departments that do actually matter like sales and marketing.

maerF0x0

0 replies

4h57m

2024-07-29 13:33:13 UTC

Presumably it doesn't matter that much and isn't worth spending money/manpower on?

Well, if they think any of the $20B of shareholder value lost recently has to do with the quality issues... Then perhaps they should reconsider. (Keep in mind market cap also represents their ability to raise capital in the future with more/less dilution.)

wannacboatmovie

1 replies

11h17m

2024-07-29 07:13:13 UTC

From your linked article:

A Hacker News user claimed that

Nice to see Wikipedia has devolved even further into a dumpster fire in that they are now citing random HN posts as authoritative sources of facts.

majewsky

0 replies

5h48m

2024-07-29 12:42:02 UTC

Wikipedia is not an individual actor or a hivemind, so there is no capital-T "They". It's a system of multiple people each acting on their own accord. For a developing news story like this, I find this type of sourcing acceptable, especially because it is cited as "some person on the internet claims", not as "it is true that".

If you disagree with this choice of source, you can flag this part as needing better sources. The simplest way to do so is to just leave a comment on the talk page.

surfingdino

1 replies

10h34m

2024-07-29 07:56:16 UTC

Never assume malice where incompetence will suffice. I have worked on teams where we could not get the basics like test or integration environments signed off for months, yet the managers expected us to go to production. Suffice to say production was also not signed off for half a year and we had to improvise. I wonder if something similar was at play at CS?

gregw2

0 replies

7h44m

2024-07-29 10:46:45 UTC

Never assume incompetence when greed will suffice.

hinkley

1 replies

13h52m

2024-07-29 04:38:23 UTC

Staffing problems?

Management often sees, “I have a dozen people on this.” When in fact the bus number was three, you laid one off, another quit and the third is sick or having life struggles.

kermatt

0 replies

3h19m

2024-07-29 15:11:43 UTC

"I have a dozen people on a dozen different things."

whiplash451

0 replies

1h12m

2024-07-29 17:18:30 UTC

Or maybe CrowdStrike is dealing with the hardest threats and hence ends up having to roll out stuff rapidly against zero-days?

Not a CS fanboy, but just wanted to suggest an alternative to sheer incompetence

jgalt212

0 replies

8h15m

2024-07-29 10:14:58 UTC

this isnt even the first time its happened. Crowdstrike has killed an OS every month for the past four months.

Yeah, but doesn't MS have to sign every kernel mode driver? They've allowed Crowdstrike's foot gun to continue to live in the kernel.

gnfargbl

7 replies

21h30m

2024-07-28 21:00:31 UTC

It didn't read as particularly diplomatic to me. In particular, this paragraph..

> It is possible today for security tools to balance security and reliability. For example, security vendors can use minimal sensors that run in kernel mode for data collection and enforcement limiting exposure to availability issues. The remainder of the key product functionality includes managing updates, parsing content, and other operations can occur isolated within user mode where recoverability is possible.

...was about as close to tetchy as a post like this would ever get. Basically they are saying "there was no good reason at all why CrowdStrike had to put so much code inside the actual kernel." And with the benefit of hindsight, it's a strong point.

ffhhj

6 replies

21h13m

2024-07-28 21:17:45 UTC

there was no good reason at all why CrowdStrike

Their business is corporate spyware to surveil employees, of course they'll use any tactic to make it work, that's the why. And their EULA states there is no liability for the company:

https://www.crowdstrike.com/terms-conditions/

Dirty policies on top of dirty practices.

Rinzler89

4 replies

20h53m

2024-07-28 21:37:01 UTC

>Their business is corporate spyware to surveil employees

What?! Anything you do on your corporate provided laptop is always gonna be logged by IT for security in every large company everywhere, that's news to nobody, but your company doesn't care that you use your corpo laptop to book your vacation, IT has better things to do than narc on you for that.

If your boss wants to actually spy on you they don't need Crowdstrike, there's other SW dedicated for that depending on the laws in your jurisdiction, but that's not what Crowdstrike is for.

If you want complete privacy from your employer, just use your personal machine for your private activities instead of your work laptop, why is this so hard?

userbinator

2 replies

20h36m

2024-07-28 21:54:14 UTC

Speak for yourself. There are still companies who don't treat their employees like idiots and actually trust them. Let's not normalise pervasive surveillance.

Rinzler89

0 replies

20h35m

2024-07-28 21:55:17 UTC

>There are still companies who don't treat their employees like idiots and actually trust them.

Yeah sure, but how many of those are large non-tech companies?

You massively overestimate the tech competency of the average PC user if you think it's normal in most companies to not have security monitoring solutions in place over internet activity. In the latest phishing test our IT did, several users fell for the trap, despite it being a tech company. There's always gonna be someone careless one day and companies want insurance policies against that.

Having such solutions in place doesn't mean the company doesn't trust you, it's more like that old Russian proverb, "trust but verify", and for ticking security compliance boxes as an insurance policy.

Everyone makes mistakes, it's only human. So more like, speak for yourself, if you think your internet activity at work isn't logged anywhere.

Aeolun

0 replies

19h21m

2024-07-28 23:09:15 UTC

I think there’s an inflection point where the company has grown so big it becomes impossible to trust every individual employee.

It won’t be about distrusting anyone in specific either, but something will go wrong for which you need to be monitoring every PC to find out what is going wrong.

heraldgeezer

0 replies

19h29m

2024-07-28 23:01:35 UTC

Yep, there are better tools for spying, like Teramind and ActivTrak.

heraldgeezer

0 replies

19h29m

2024-07-28 23:01:10 UTC

There are better tools for spying like Teramind and ActivTrak. Crowdstrike would make a bad spying tool. I guess there is remote CMD? And you can like, see all installed programs.

But so can SCCM/Intune from MS or another RMM like Datto that IT uses to manage PCs...

blackoil

5 replies

16h56m

2024-07-29 01:34:14 UTC

MS should have something like Project Zero for Windows applications and drivers. Any app on more than 1-5% of PCs should be tested and fuzzed and ... And the vendor then pressured into fixing the issues. Even if it is not technically their fault, it is definitely an optics problem for MS; half of the world refers to it as a Windows blue screen issue.

MBCook

2 replies

16h4m

2024-07-29 02:26:41 UTC

And the vendor then pressured into fixing the issues

How would Microsoft apply pressure? Short of publicly shaming them what power do they have?

blackoil

1 replies

14h17m

2024-07-29 04:13:44 UTC

umm. Give an x-day deadline and make it public afterwards, like Project Zero does, threaten to take away the "Verified by MS" badge, or create a WhatsApp group of Fortune 500 CIOs and badmouth them in it.

9dev

0 replies

7h59m

2024-07-29 10:31:41 UTC

Both of these have legal repercussions: Microsoft could very well be called a competitor of CS, so they cannot force them to do something without getting accused of abusing their market position; and a publicly traded company badmouthing another publicly traded company with an awfully complex web of mutual investments is a very bad idea in general.

It’s not that easy.

fragmede

1 replies

16h21m

2024-07-29 02:09:08 UTC

Raymond Chen: That Time We Bought EVERYTHING at Egghead.

https://youtu.be/6m_Im7J9Iaw?si=q8jLBefEdgm-PrrZ

pimlottc

0 replies

15h53m

2024-07-29 02:37:52 UTC

Blogpost version: https://devblogs.microsoft.com/oldnewthing/20050824-11/?p=34...

naasking

1 replies

8h52m

2024-07-29 09:38:33 UTC

People wouldn't need CS if Windows was better designed to begin with...

rty32

0 replies

2h49m

2024-07-29 15:41:46 UTC

Care to elaborate?

How would a better designed Windows eliminate the business & compliance need for installing software like CS? And why hasn't that already happened?

I would think Microsoft and CS' customers have an incentive to not have such third party software on their system if possible.

lupusreal

1 replies

19h29m

2024-07-28 23:01:04 UTC

Why are they being diplomatic, instead of plainly stating their contempt and revoking CS's driver/etc signing keys? Doing so would help to repair the reputational harm that CrowdStrike inflicted on Windows.

Are their lawyers telling them they can't impede CrowdStrike even though CrowdStrike is breaking Microsoft's product? They should do it anyway and dare CS to take it to court so they can publicly humiliate CS by dragging all the dirty details of their incompetence out.

Aeolun

0 replies

19h15m

2024-07-28 23:15:20 UTC

People are free to install kernel modules. It shouldn’t be up to microsoft to stop them from doing so.

cratermoon

1 replies

13h14m

2024-07-29 05:16:24 UTC

Microsoft tried to push back on vendors wanting kernel access in 2006 <https://arstechnica.com/information-technology/2006/10/7998/>

Microsoft has (somewhat correctly IMNSHO) pointed at the EU agreement that forced them to open the kernel up to third parties as being a factor in the CrowdStrike catastrophe. <https://www.theregister.com/2024/07/22/windows_crowdstrike_k...>

oneeyedpigeon

0 replies

10h56m

2024-07-29 07:34:15 UTC

From the latter:

However, nothing in that undertaking would have prevented Microsoft from creating an out-of-kernel API for it and other security vendors to use. Instead, CrowdStrike and its ilk run at a low enough level in the kernel to maximize visibility for anti-malware purposes. The flip side is this can cause mayhem should something go wrong.

The Register asked Microsoft if the position reported by the Wall Street Journal was still the IT titan's stance on why a CrowdStrike update for Windows could cause the chaos it did. Redmond has yet to respond.

thebytefairy

0 replies

3h42m

2024-07-29 14:48:19 UTC

It's a little ironic they are taking the high ground on safe rollout practices when they had an Azure/365 outage caused by a bad config at the same time as the CS incident. Though to be fair, it only affected US central.

gjsman-1000

21 replies

21h53m

2024-07-28 20:37:48 UTC

Reminder that Microsoft could have programmed Windows to notice if a driver has caused a blue screen three times in a row, and prompt if you want to disable the driver on boot. After all, Windows already collects how many times a driver causes a crash. This would have made recovery one click instead of heading into Safe Mode and needing BitLocker keys.

But they didn’t.

And Microsoft, I argue, also has blood on their hands for every hospital this hit. Giving users a prompt to disable the driver, after three successive failed boots, would have saved lives.
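To make the proposal above concrete, here is a minimal sketch of the bookkeeping it implies: count consecutive failed boots per faulting driver and surface the ones that reach a threshold. Everything in it is hypothetical (the `BootCrashTracker` class, the `CRASH_THRESHOLD` value, the method names); it is not how Windows tracks crashes, only an illustration of the idea.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: consecutive failed boots before the user is prompted.
CRASH_THRESHOLD = 3


@dataclass
class BootCrashTracker:
    """Tracks consecutive boot failures attributed to each driver (illustrative only)."""

    consecutive_crashes: dict = field(default_factory=dict)

    def record_failed_boot(self, faulting_driver: str) -> None:
        # Increment the streak for the driver blamed for this failed boot.
        self.consecutive_crashes[faulting_driver] = (
            self.consecutive_crashes.get(faulting_driver, 0) + 1
        )

    def record_successful_boot(self) -> None:
        # Any clean boot resets every streak.
        self.consecutive_crashes.clear()

    def drivers_to_prompt_for(self) -> list:
        # Drivers whose streak reached the threshold are candidates for a
        # "disable this driver?" prompt.
        return sorted(
            name
            for name, count in self.consecutive_crashes.items()
            if count >= CRASH_THRESHOLD
        )
```

A real implementation would also have to decide what to do about drivers that are marked as required for boot, which this sketch ignores.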

t-writescode

8 replies

21h51m

2024-07-28 20:39:19 UTC

How would that have helped the server farms that were experiencing the issue?

gjsman-1000

4 replies

21h48m

2024-07-28 20:42:20 UTC

Oh I don’t know, the server’s down, you go and look as a technician, and you simply see a screen saying:

“CSAgent.sys has caused a failure to boot three times in a row. Do you want to disable this driver? <Yes> <No>.”

You click “Yes.” Server reboots with CrowdStrike driver disabled. The day is saved in 5 minutes instead of building a custom ISO image or going on a BitLocker key recovery spree.

politelemon

3 replies

21h45m

2024-07-28 20:45:46 UTC

It would still have required on site presence and interaction during which there is still downtime, so this accomplishes marginally small gains.

gjsman-1000

2 replies

21h42m

2024-07-28 20:48:18 UTC

At the same time though, imagine you woke up and CrowdStrike hit your organization.

For most users, they’ll try clicking “Yes.” And then it’s back to work. After all, “No” just causes a blue screen again, might as well try the other path.

This would have been the difference between the IT department handling 10,000+ calls or a few hundred (plus sending out a bulletin) in many, many organizations. It also could have saved billions at this point.

Heck, it would have saved lives in hospitals.

jonathantf2

0 replies

20h28m

2024-07-28 22:02:50 UTC

But then you have millions of endpoints booting without malware protection

echoangle

0 replies

19h39m

2024-07-28 22:51:48 UTC

Can you cite some reports of deaths caused by the outage?

morkalork

2 replies

21h37m

2024-07-28 20:53:25 UTC

Instead of prompting on the screen, disable the driver and boot directly into a recovery state that has networking enabled so sysadmins can push scripts and fixes? As long as it's not a network driver you'd be okay.

t-writescode

1 replies

20h7m

2024-07-28 22:23:09 UTC

Disable the driver that is explicitly there to protect from malware and attacks?

Wouldn’t malware just use that as an attack vector?

morkalork

0 replies

19h22m

2024-07-28 23:08:45 UTC

Nooo you don't understaaaand kernel code is special :'( actually BSOD was a desired feature because CrowdStrike is a Security (TM) application

crazygringo

3 replies

20h57m

2024-07-28 21:33:26 UTC

Do I like your idea for that?

Yes, absolutely. It's a clever idea.

But do I think Microsoft was negligent in not building that?

No, I think that's going too far. Windows already has Safe Mode -- as you note -- to allow for manual recovery, which is what people are using.

I don't think it makes sense for it to be Microsoft's legal responsibility to protect its users from software with a critical bug that wasn't written by Microsoft. Otherwise, where would it end? If a third-party program tries to delete all your user data, is it Microsoft's legal responsibility to check whenever a process is deleting a lot of data, and intervene with a confirmation dialog? Is it Microsoft's responsibility to protect you from all malware and ransomware, no matter how cleverly written? Is it Microsoft's responsibility to constantly cache program state on disk so that when a third-party program crashes, you don't lose your data since your last save?

I think that's going too far, in terms of legal obligation.

grumpyprole

2 replies

20h22m

2024-07-28 22:08:52 UTC

Microsoft may be negligent in selling a product unsuitable for these applications. Windows is unsuitable precisely because it can be brought down by third party updates, such that it cannot recover without manual intervention by technical experts. Third party vendors are forced into writing unsafe kernel drivers because Microsoft does not provide sufficient user mode APIs.

Windows has a dated design and a security model no longer fit for purpose. As for your other example, it could be protecting users from malicious programs that may delete data, simply by having a better security model, like Android and iOS.

crazygringo

1 replies

19h57m

2024-07-28 22:33:35 UTC

I don't think Microsoft can be negligent here, because Windows isn't being brought down by Microsoft updates.

Somebody bought Windows, and bought CrowdStrike. CrowdStrike is negligent, and possibly also the person/org who chose to rely on Windows+CrowdStrike without a backup plan if that resulted in further damages to others.

Third party vendors are absolutely not "forced into writing unsafe kernel drivers". They can properly test things to write safer code (which CrowdStrike infamously didn't). And kernel mode is fundamentally required for security software like this, as far as I understand.

And using app-based mobile OS's is not necessarily a useful comparison point. They are limited in all sorts of ways that desktop OS's are not -- and don't you hear people here on HN constantly complaining about that? A better comparison point is macOS and Linux. CrowdStrike also crashed Linux, and macOS still lets you bypass SIP if you want to.

grumpyprole

0 replies

11h42m

2024-07-29 06:48:12 UTC

Third party vendors are absolutely not "forced into writing unsafe kernel drivers".

And kernel mode is fundamentally required for security software like this, as far as I understand.

These are conflicting points. They cannot both be true.

Uvix

1 replies

21h28m

2024-07-28 21:02:03 UTC

Those hospitals chose to deploy software that didn't support testing. The blood is on their own hands.

goosejuice

0 replies

2h56m

2024-07-29 15:34:08 UTC

This is how I feel.

If you're blindly installing software system wide, that has kernel access no less, and not accounting for failure of that software in your risk analysis then you are to blame more so than the vendor.

Certainly I expect some SLA is in place but that's only of monetary benefit and irrelevant to keeping critical infra online.

ziml77

0 replies

20h15m

2024-07-28 22:15:01 UTC

Imagine I've installed CrowdStrike under the assumption that it makes my system more secure. Why would I want the OS to allow the system to boot up in a less secure state by providing a prompt for that? Most users will just click whichever option gets them back up and running and IT will have no control over that.

sudosysgen

0 replies

19h25m

2024-07-28 23:05:41 UTC

Windows does do that by default. If it fails to boot, it will start an "Automatic Repair" screen and it will offer to disable drivers (i.e. Safe Mode), or sometimes just disable the driver itself.

The problem is that CrowdStrike doesn't want to let you start the computer without it running. It's the reason why it's an ELAM driver - it's marked as required for boot, so Windows won't try to boot without it, much like it won't if you remove crucial hardware drivers. I guess what they are trying to avoid is malware crashing the kernel driver which then gets disabled letting the malware roam free, without realizing that the cure is worse than the disease.

phendrenad2

0 replies

12h55m

2024-07-29 05:34:58 UTC

So attackers just have to find a way to trick the drivers into crashing 3 times and they have full access to the system without the pesky security systems in place? Nice!

nerdjon

0 replies

20h7m

2024-07-28 22:23:33 UTC

This is very much an “easier said than done” situation that I would think Hacker News of all places would be better about when it comes to “just” doing something in code.

First, Windows already does something similar. After 3 failed boots it is supposed to boot into WindowsRE, which gives you options to revert to a previous version, uninstall updates, and I believe also reverts configurations like recent driver installations.

The problem here though, CrowdStrike itself didn’t update. It updated a definition file (last I saw at least) and that likely would not have been caught by Windows as a new version.

Also frankly, not super thrilled at the idea of Windows just deciding to disable/uninstall something except for rolling back (so a previously working config) due to how things could interact. This situation could have been far worse and harder to recover from.

In this case maybe Windows could have noticed that the configuration update is what was causing it and rolled that back, but it’s possible it would have just re-downloaded the file when it started back up anyways.

Regarding saved lives, do we actually know that anyone’s lives were lost due to this? My local hospitals were still performing emergency surgery.

galangalalgol

0 replies

21h21m

2024-07-28 21:09:15 UTC

I think suing MS for the behavior that ensued when people installed a rootkit directly into the kernel and opened all the ports on their network to let that rootkit get used, is... excessive. Both MS and CS should have had a fail to previous good kernel ability, but the negligence here is clearly with CS for not even trying a blank data file in the automated tests for a piece of safety critical software, and then not using canary deployments before pushing to millions of devices.
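As a rough illustration of the "blank data file" test mentioned above, here is a sketch of a parser that validates a content file's length before reading any entries. The file layout (a `RULE` magic value, a 32-bit entry count, then 32-bit entries) and all names are invented for the example; CrowdStrike's actual channel file format is not described in this thread.

```python
import struct


class InvalidContentFile(ValueError):
    """Raised when a content file fails validation (hypothetical error type)."""


# Hypothetical layout: 4 magic bytes + little-endian 32-bit entry count,
# followed by that many little-endian 32-bit entries.
HEADER = struct.Struct("<4sI")
ENTRY = struct.Struct("<I")
MAGIC = b"RULE"


def parse_content(data: bytes) -> list:
    """Parse a content file, rejecting anything that would need an out-of-bounds read."""
    if len(data) < HEADER.size:
        raise InvalidContentFile("file is shorter than the header")
    magic, count = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise InvalidContentFile("unexpected magic bytes")
    expected_size = HEADER.size + count * ENTRY.size
    if len(data) != expected_size:
        raise InvalidContentFile(
            f"header declares {count} entries ({expected_size} bytes), "
            f"but file has {len(data)} bytes"
        )
    # Every offset below is now known to be inside the buffer.
    return [
        ENTRY.unpack_from(data, HEADER.size + i * ENTRY.size)[0]
        for i in range(count)
    ]
```

An automated test suite for this sketch would feed it an empty file, a file of all zero bytes, and a truncated file, and assert that each is rejected rather than parsed.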

Khaine

0 replies

20h30m

2024-07-28 22:00:14 UTC

AFAIK Windows does do that, except for drivers that are marked as required for boot. CrowdStrike's drivers are marked as required for boot.

janice1999

17 replies

21h59m

2024-07-28 20:31:52 UTC

At least they're not blaming the European Union in this breakdown (as they did earlier).

strombofulous

6 replies

21h40m

2024-07-28 20:50:44 UTC

Would this still have happened if the EU had not ruled against Microsoft?

PlutoIsAPlanet

3 replies

21h35m

2024-07-28 20:54:58 UTC

Microsoft can kick security vendors out the kernel, but they can't sell a product that uses APIs not accessible to other vendors.

strombofulous

2 replies

21h34m

2024-07-28 20:56:30 UTC

Sure, but my question still stands - would this have happened if the EU had not made that ruling?

mort96

0 replies

21h32m

2024-07-28 20:58:54 UTC

Probably

Tuna-Fish

0 replies

21h11m

2024-07-28 21:19:20 UTC

Yes. There were kernel mode drivers before that ruling, it is essentially entirely irrelevant to this outage.

holsta

0 replies

21h33m

2024-07-28 20:57:13 UTC

It's not about kernel access, it's about equal access to avoid yet another monopoly.

Microsoft could have come up with a kernel API that their own malware product (and everyone else's) could make use of. They did not.

extraduder_ire

0 replies

20h47m

2024-07-28 21:43:35 UTC

Probably not, but in more of a butterfly-effect or this product not existing way.

ziml77

4 replies

21h8m

2024-07-28 21:22:16 UTC

But the blame wasn't misplaced before. People keep saying that macOS does things better by forcing third parties out of the kernel and instead offering APIs to do the same work in userspace. Microsoft tried to do exactly this for security software in Windows, but the EU didn't like that this change meant that any Microsoft-developed solutions would have an advantage over third party ones.

ronsor

1 replies

20h56m

2024-07-28 21:34:31 UTC

I really, really wish Microsoft would force third parties out of the kernel.

tacticus

0 replies

16h14m

2024-07-29 02:16:48 UTC

They can. They just have to have the same rules for their products in that space.

tacticus

0 replies

20h2m

2024-07-28 22:28:23 UTC

Microsoft tried to do exactly this for security software in Windows

Using a monopoly in one industry to capture the market in another industry is what anti monopoly laws are meant to prevent.

Microsoft was prevented because they wanted to retain a commercial business in their security products having special access while locking out everyone else.

Khaine

0 replies

20h29m

2024-07-28 22:01:30 UTC

No, the EU didn't like MS having their malware protection in kernel while kicking out third parties.

If Defender was also kicked out, it would have been fine, but it wasn't.

whimsicalism

3 replies

21h49m

2024-07-28 20:41:38 UTC

they’re right though…

DarkNova6

2 replies

21h35m

2024-07-28 20:55:04 UTC

Yes. Only Microsoft should be allowed to crash their operating system. Like back in the good old days when only MS could use their secret high-performance APIs.

graeme

1 replies

21h19m

2024-07-28 21:11:18 UTC

Why exactly should security vendors have the ability to crash the operating system?

dmattia

0 replies

21h2m

2024-07-28 21:28:53 UTC

They shouldn't. Microsoft should have APIs that enable security vendors to work in userspace.

The EU didn't say that Microsoft couldn't kick vendors out of the kernel, just that they couldn't do so without having the APIs available that would let security vendors operate outside the kernel.

Mac and Linux have such APIs, so CrowdStrike operates in user-mode on those platforms, so those platforms do not give security vendors the ability to crash the operating system.

zh3

0 replies

21h56m

2024-07-28 20:34:43 UTC

Even this is written after multiple reviews by corporate lawyers.

dmattia

13 replies

22h0m

2024-07-28 20:30:51 UTC

I suppose I was expecting something more authoritative here. They confirm that there was an attempted read-out-of-bounds, as CrowdStrike said, but that's not really new information at this point. I suppose we'll need to wait for more detailed analysis from CrowdStrike at some point.

This post explains why security software has historically run in kernel-mode, and really seems to be pushing new technology that Microsoft has that would push security vendors into user-mode (with APIs that attempt to assist with many of the reasons why they have historically used kernel-mode).

Crowdstrike already runs in user-mode on both Mac and Linux (from what I can tell), and it seems like running in user-mode on Windows would significantly lessen the risk of catastrophic failures like a blue-screen-of-death. I know the bulk of the failures here belong to CrowdStrike, but I can't help but think about the fact that Apple kicked security vendors out of kernel-mode a ways back, and that if Windows had done similarly, an issue like this probably wouldn't have been possible. By even offering kernel-mode options to external vendors, I believe Microsoft is creating risk for themselves.

TillE

3 replies

21h40m

2024-07-28 20:50:02 UTC

pushing new technology that Microsoft has that would push security vendors into user-mode

This doesn't exist. It's briefly hinted at in their conclusion, but right now it's simply not there.

There is no userspace equivalent of filesystem minifilters, ObRegisterCallbacks, etc.

dmattia

2 replies

21h13m

2024-07-28 21:17:53 UTC

This is fascinating, thank you for the info! If I am understanding, it would have then been difficult/impossible for CrowdStrike to create a user-mode only sensor without these equivalent APIs.

So I guess I'm not sure I see validity in the claims of those blaming the EU here. It seems as though the EU would have allowed Microsoft to kick users out of kernel-space if they had APIs that allowed making security products in user-space. Like Linux/Mac already appear to have.

extraduder_ire

1 replies

20h52m

2024-07-28 21:38:33 UTC

I don't think they would have had to provide those APIs in the EU, so long as their own security products were "kicked out" as well. That's kind of complicated to achieve in a permanent and provable way. Though, windows has had support for eBPF for about two years now.

TillE

0 replies

19h45m

2024-07-28 22:45:52 UTC

Windows eBPF support is experimental and currently provides hooks for packet filtering stuff and nothing else.

I would be delighted if their long-term solution is eBPF which provides full anti-malware hooks, but again it's unfortunately not there yet.

GordonS

2 replies

21h8m

2024-07-28 21:22:25 UTC

For one thing, being difficult to kill is a huge selling point for EDR - move it to user space and it's a lot easier to kill.

pas

1 replies

19h44m

2024-07-28 22:46:31 UTC

A kernel-space watchdog (that checks integrity of the image) would be much easier than a filter that updates from the internet.

Sure, the whole thing is definitely a hard problem, but CS fucking up even the most basic QA **and** error handling ... it just shows how ridiculous their whole claim to having super fancy technology is.

__MatrixMan__

0 replies

5h0m

2024-07-29 13:30:04 UTC

Agreed, but focusing on their QA practices is sort of like criticizing your burglar for not wiping their feet at the window.

whimsicalism

1 replies

21h49m

2024-07-28 20:41:16 UTC

The EU requires MS to provide kernel-level access to security vendors due to their crazy anti-compete provisions

dmattia

0 replies

20h28m

2024-07-28 22:02:21 UTC

This seems to be only partially true when I read into it. The EU said that Microsoft would need to move their security tools into user-space (or at least to use the same APIs as are available in user-space). If they did that (like Apple has done), they could kick everyone out of kernel-space if they wanted.

Rinzler89

1 replies

21h54m

2024-07-28 20:36:54 UTC

> I can't help but think about the fact that Apple kicked security vendors out of kernel-mode a ways back, and that if Windows had done similarly, an issue like this probably wouldn't have been possible

Like others already said, Microsoft already tried to do that with PatchGuard in 2006 with the launch of Windows Vista, and the likes of Symantec and McAfee complained to the EU that this would harm the sales of their products, so the EU told Microsoft not to do it in 2009[1].

Apple has the luxury of a small market share on the desktop PC space to not attract the attention of the regulators, plus a user base that's used to Apple constantly rewriting the OS, deprecating APIs, switching CPU architectures, etc. without giving a fuck about breaking backwards compatibility or cutting off developers access to OS features their products use and getting away with it, luxuries that Microsoft doesn't have.

IMHO, sticking with Windows' default security and not using third-party anti-malware has made Windows vastly more secure and reliable than it was in the days when you'd be looking at installing the likes of Symantec or McAfee for your "protection", which ended up acting like malware after a while, throwing dark patterns at you to milk more subscription fees. So as much as it hurts their sales, it's important for the regulators to understand that security is far more important than the regulations they put on Windows for Internet Explorer and Media Player. Just like with Apple's App Store, it's sometimes better to let the original product maker handle security and not leave the product open at all points just so some of these bandits can make a living selling security for it. It's like foxes complaining to regulators how chicken wire is a threat to their existence.

[1] https://stratechery.com/2024/crashes-and-competition/

nopcode

0 replies

6h53m

2024-07-29 11:37:04 UTC

Microsoft sells endpoint security products and it would be unfair if third party solutions couldn't leverage the same APIs, it makes a lot of sense that a regulator steps in. I'm not aware of Apple selling security products or competing with third party security products.

michaelt

0 replies

21h21m

2024-07-28 21:09:53 UTC

> Crowdstrike already runs in user-mode on both Mac and Linux (from what I can tell),

Crowdstrike provides a Linux kernel module, and expects users to manually install an extra Secure Boot key for it, as part of their corporate laptop setup procedure.

This has always seemed inadvisable to me, but checkbox checkers gotta check checkboxes I guess.

__MatrixMan__

0 replies

21h39m

2024-07-28 20:51:10 UTC

I agree. Microsoft's core competency has traditionally been backwards compatibility, but if each security vendor can tamper with Windows at the deepest level and is allowed to continue to explore all of the ways that they can leverage that... What you end up with is a fleet of different Windowses, each diverging further with time. It dilutes the benefits brought by investment into the stability of the system because whatever fights are won in one fragment must be refought in others before you can have confidence in the stability of all fragments.

It seems like madness to me.

tonymet

9 replies

20h21m

2024-07-28 22:09:28 UTC

Did either release from MS or Crowdstrike explain how this crash bypassed QC? I'm still baffled that a 100% repro crash even made it anywhere near the later stages of QC. This is something easily caught by the earliest CI phases, at the developer level or at least the first build automation phase, let alone human QC.

magicalhippo

4 replies

20h4m

2024-07-28 22:26:10 UTC

From what I read in the previous thread, their test environment didn't actually test what was deployed.

That is, there was a post-test pre-distribution packaging stage, and that's where the distributed file(s) got f'ed up.

If true that would explain how it got past their testing, but would also be an incredible lack of competence IMHO.

But yeah, curious if there's been some more concrete details there.

tonymet

3 replies

19h47m

2024-07-28 22:43:11 UTC

I heard something similar: that they deploy content separately from code, but they don't test all of the combinations of code + content. This crash was from "stable" code in the driver mixed with a corrupt or incomplete content file (config, etc.), triggering the null-pointer exception.

Sounds like one of those companies where you get hired and are shocked by the sausage factory you just stepped into

rvnx

1 replies

19h40m

2024-07-28 22:50:44 UTC

In February they added new code that allows spying on or blocking named pipes.

Named pipes are pipes of communication that processes can use to talk to each other, as an alternative to sockets.

For example Chrome uses them between the user interface and the actual page renderer.

In March they tested it in staging, said it was fine, pushed to prod with few rules in April, still looked fine.

In July they added a new rule, which was deployed to 100% immediately because, from their perspective, a new entry in a database definition doesn't need testing or a canary deploy.

(which is still irresponsible, because bad rules could cause damage as well like any security/antivirus software, even if the parser didn't crash, but it could have blocked legitimate actions or files)

tonymet

0 replies

15h49m

2024-07-29 02:41:07 UTC

great summary thanks for the details. I hope more companies see this and consider adding more test diagnostics

magicalhippo

0 replies

14h8m

2024-07-29 04:22:12 UTC

Right, so it seems there were two egregious errors: no (or highly lacking) fuzzing of kernel modules accepting arbitrary input, and no testing of configuration changes even though they're ingested by a kernel module.

pas

2 replies

19h41m

2024-07-28 22:49:43 UTC

lack of fuzzing for their "parser + updater"
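
For readers unfamiliar with the term: fuzzing means throwing large volumes of random or mutated input at a parser and checking that it only ever fails in controlled ways. A minimal Python sketch follows. The record format, `parse_records`, and `MalformedContent` are made up for illustration and have nothing to do with CrowdStrike's actual channel file format.

```python
import random
from typing import List


class MalformedContent(ValueError):
    """Raised for any input the parser refuses to accept."""


def parse_records(blob: bytes) -> List[bytes]:
    """Parse a made-up format: 1 count byte, then (length byte + payload) per record."""
    if not blob:
        raise MalformedContent("empty input")
    count, pos = blob[0], 1
    records = []
    for _ in range(count):
        if pos >= len(blob):
            raise MalformedContent("truncated before record header")
        length = blob[pos]
        pos += 1
        # Bounds check before slicing: never trust a length field from the input.
        if pos + length > len(blob):
            raise MalformedContent("record runs past end of input")
        records.append(blob[pos:pos + length])
        pos += length
    if pos != len(blob):
        raise MalformedContent("trailing bytes after last record")
    return records


def fuzz(iterations: int = 10_000, seed: int = 0) -> int:
    """Feed random blobs to the parser; return how many were rejected cleanly.

    Any exception other than MalformedContent propagates and fails the run.
    """
    rng = random.Random(seed)
    rejected = 0
    for _ in range(iterations):
        size = rng.randrange(0, 32)
        blob = bytes(rng.randrange(256) for _ in range(size))
        try:
            parse_records(blob)
        except MalformedContent:
            rejected += 1
    return rejected
```

A real fuzzer (coverage-guided and run against the actual driver code) goes much further, but even this shape of test would flag a parser that indexes past the end of its input.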

MBCook

1 replies

15h54m

2024-07-29 02:36:44 UTC

Plus parsing unknown files (as in not validated to be properly formatted) in kernel space is just asking for a crash.

pas

0 replies

9h31m

2024-07-29 08:59:12 UTC

parsing is real validating!

(obligatory "Parse, don't validate" post

https://news.ycombinator.com/item?id=35053118

and some more discussion on the topic

https://news.ycombinator.com/item?id=39322551 )
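
A rough illustration of the idea in Python, using a hypothetical `PipeRule` type: raw input is converted once into a value that is only constructed after the checks pass, and the rest of the code is written against that type. The field names and the allowed actions are assumptions for the example, and Python only enforces the type by convention or with a type checker.

```python
from dataclasses import dataclass

ALLOWED_ACTIONS = ("allow", "block", "log")


@dataclass(frozen=True)
class PipeRule:
    """A rule that has already been checked; holders do not re-validate it."""
    pattern: str
    action: str


def parse_rule(raw: dict) -> PipeRule:
    """Turn untrusted input into a PipeRule, or raise ValueError."""
    pattern = raw.get("pattern")
    action = raw.get("action")
    if not isinstance(pattern, str) or not pattern:
        raise ValueError("pattern must be a non-empty string")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return PipeRule(pattern=pattern, action=action)


def decide(rule: PipeRule, pipe_name: str) -> str:
    """Apply a parsed rule: its action if the name matches the prefix, else 'allow'."""
    return rule.action if pipe_name.startswith(rule.pattern) else "allow"
```

The point is that `decide` never sees a raw dictionary, so there is no code path where a half-checked rule gets used.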

duped

0 replies

5h45m

2024-07-29 12:45:47 UTC

My understanding is that they ship precompiled, templated scripts. The "content" updates fill in these templates with configured values. They test the templated scripts, and they validate the content, but they don't validate the content bound to the script. The garbage content was apparently valid but its behavior when used was not.
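
A small Python sketch of that gap, assuming a hypothetical template with named slots and a content update that is just a list of values. Each piece passes its own check, yet the pair fails once bound together, which is the combination described above as untested. All names here are invented.

```python
from typing import Dict, List


def template_ok(template: Dict[str, List[str]]) -> bool:
    """Check the template on its own: it must declare at least one named slot."""
    slots = template.get("slots", [])
    return len(slots) > 0 and all(isinstance(s, str) and bool(s) for s in slots)


def content_ok(content: List[str]) -> bool:
    """Check the content on its own: every value must be a non-empty string."""
    return len(content) > 0 and all(isinstance(v, str) and bool(v) for v in content)


def bind(template: Dict[str, List[str]], content: List[str]) -> Dict[str, str]:
    """Bind content to the template; this is the step that needs its own test."""
    slots = template["slots"]
    if len(content) != len(slots):
        raise ValueError(
            f"template expects {len(slots)} values but content supplies {len(content)}"
        )
    return dict(zip(slots, content))
```

With `template = {"slots": ["pipe_name", "action", "severity"]}` and `content = ["demo_pipe", "block"]`, both `template_ok` and `content_ok` return `True`, but `bind` raises `ValueError`. Only a test of the bound pair catches it.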

Their language for describing this design is obtuse and confusing.

jacobgorm

8 replies

22h5m

2024-07-28 20:24:57 UTC

I used to work on Control Flow Integrity (CFI/XFI) research at places like MSR Silicon Valley and VMware, as far back as 2006. Back then, sandboxing a kernel module like ramdisk.sys was doable with a lot of binary rewriting magic, and later with custom LLVM passes, but nowadays it should be a simple matter of compiling the code with clang and the appropriate flags, to completely rule out this type of memory safety error, turning a BSOD into a polite log message and disabling the faulty driver.
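
As a loose user-space analogy only (real CFI/XFI works through compiler or binary instrumentation, not exception handling), the behavior described here looks like this in Python: a fault in one component is contained and logged, and that component is disabled while the host keeps running. `PluginHost` and its methods are hypothetical names.

```python
import logging
from typing import Callable, Dict, Set

log = logging.getLogger("host")


class PluginHost:
    """Runs registered handlers and isolates the ones that fault."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Callable[[bytes], None]] = {}
        self.disabled: Set[str] = set()

    def register(self, name: str, handler: Callable[[bytes], None]) -> None:
        self._plugins[name] = handler

    def dispatch(self, event: bytes) -> None:
        """Deliver an event to every enabled plugin, in registration order."""
        for name, handler in self._plugins.items():
            if name in self.disabled:
                continue
            try:
                handler(event)
            except Exception as exc:
                # Contain the fault: log it and stop calling this plugin,
                # instead of taking the whole host down with it.
                log.error("plugin %s faulted (%s); disabling it", name, exc)
                self.disabled.add(name)
```

In a kernel the hard part is making the fault catchable at all, which is what the instrumentation provides. The policy on top (log, disable, carry on) is the easy part shown here.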

torginus

3 replies

21h55m

2024-07-28 20:35:02 UTC

from what I understand, CrowdStrike has essentially put a Turing-complete interpreter for their scripting language into the kernel. I doubt you can do much when something is that general purpose.

magicalhippo

0 replies

20h16m

2024-07-28 22:14:02 UTC

Lua has been used in Linux kernel modules[1][2]. At least for the ZFS case I know they were satisfied with the ability to limit what the Lua scripts could do to avoid issues.

[1]: https://lwn.net/Articles/830154/

[2]: https://openzfs.github.io/openzfs-docs/man/master/8/zfs-prog...
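
One way to limit an embedded script is an instruction budget: the interpreter counts steps and aborts the script once it has spent its allowance. The sketch below is a toy stack machine in Python, not Lua and not the ZFS implementation. The opcodes and `BudgetExceeded` are invented for the example.

```python
from typing import List, Tuple

Instruction = Tuple[str, int]


class BudgetExceeded(RuntimeError):
    """The script used more steps than it was allowed."""


def run(program: List[Instruction], budget: int) -> List[int]:
    """Execute a toy program, stopping it if it exceeds `budget` steps."""
    stack: List[int] = []
    pc = 0
    steps = 0
    while pc < len(program):
        if steps >= budget:
            raise BudgetExceeded(f"stopped after {steps} steps")
        steps += 1
        op, arg = program[pc]
        if op == "push":
            stack.append(arg)
            pc += 1
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == "jmp":
            pc = arg  # unconditional jump; can loop forever without a budget
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack
```

Here `run([("push", 2), ("push", 3), ("add", 0)], budget=10)` finishes normally, while `run([("jmp", 0)], budget=100)` is cut off with `BudgetExceeded` instead of hanging.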

jacobgorm

0 replies

20h38m

2024-07-28 21:52:44 UTC

It doesn’t matter if you are doing full Fault Isolation with XFI. I recommend reading the paper here https://www.usenix.org/legacy/event/osdi06/tech/full_papers/...

capitainenemo

0 replies

21h45m

2024-07-28 20:45:41 UTC

Do you have more information on that? Hadn't read anything about the CS kernel module running arbitrary code. Was it a factor in the crash?

'course, Microsoft also put turing complete scripting in ring 0 years ago for performance reasons (TTFs - XML/HTML parsing and GUI rendering too - to beat other OSes apparently) and that certainly did lead to exploited vulnerabilities...

https://googleprojectzero.blogspot.com/2016/07/a-year-of-win... https://gist.github.com/Nevor/ed3719dad0cf66893e42a9ba024c91... https://learn.microsoft.com/en-us/security-updates/securityb... https://www.fortinet.com/blog/threat-research/one-bit-to-rul... https://learn.microsoft.com/en-us/security-updates/SecurityA... https://news.ycombinator.com/item?id=9769099 (this comment in particular https://news.ycombinator.com/item?id=9783863)

pcwalton

3 replies

21h59m

2024-07-28 20:31:08 UTC

I mean, this is basically what eBPF accomplishes in Linux.
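
The core idea, sketched in Python under heavy simplification: check a program before it is allowed to load, and refuse anything that cannot be shown to be safe. The toy instruction set, `verify`, and `VerificationError` are hypothetical. The real eBPF verifier checks far more than this (memory accesses, register types, helper calls).

```python
from typing import List, Tuple

Instruction = Tuple[str, int]


class VerificationError(Exception):
    """The program was rejected before it ever ran."""


def verify(program: List[Instruction]) -> None:
    """Reject programs that could loop forever or underflow the stack.

    Jumps in this toy language are unconditional, so there is a single
    execution path. Walking it with only forward jumps allowed always
    terminates, because the program counter strictly increases.
    """
    depth = 0
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "push":
            depth += 1
            pc += 1
        elif op == "add":
            if depth < 2:
                raise VerificationError(f"stack underflow at instruction {pc}")
            depth -= 1
            pc += 1
        elif op == "jmp":
            if not pc < arg <= len(program):
                raise VerificationError(f"backward or out-of-range jump at {pc}")
            pc = arg
        else:
            raise VerificationError(f"unknown opcode at {pc}: {op}")
```

A loader would call `verify` first and only then hand the program to the interpreter, so a bad program becomes a rejected load rather than a crash at run time.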

gclawes

2 replies

21h8m

2024-07-28 21:22:21 UTC

There is eBPF for Windows: https://github.com/microsoft/ebpf-for-windows

I'd hope security products in the future leverage this more than custom kernel-mode sensors.

capitainenemo

1 replies

20h59m

2024-07-28 21:30:58 UTC

Was discussed on HN last week. Top comment notes the Windows support is still very limited. https://news.ycombinator.com/item?id=41033579

capitainenemo

0 replies

18h27m

2024-07-29 00:03:12 UTC

... oh. and the article by Brendan Gregg in the HN link above had the telling phrase: "Once Microsoft's eBPF support for Windows becomes production-ready"

akira2501

7 replies

21h59m

2024-07-28 20:31:25 UTC

where security and availability are non-negotiable.

Yep. You just have to pretend that everyone who deployed Windows had an actual competitive choice available to them.

A second benefit of loading into kernel mode is tamper resistance.

I guess availability is negotiable after all.

qsdf38100

6 replies

21h47m

2024-07-28 20:43:38 UTC

Yep. You just have to pretend that everyone who deployed Windows had an actual competitive choice available to them.

Could you elaborate? How is that related to security and availability being non negotiable?

akira2501

5 replies

21h27m

2024-07-28 21:03:38 UTC

Microsoft's statement implies that people choose Windows because of its security and availability. Whereas most people end up with Windows because the software they want to run only operates on that single platform.

The security and availability, to the extent they even exist, are clearly not part of the market's decision making process.

jojobas

2 replies

17h28m

2024-07-29 01:02:54 UTC

The critical infrastructure that people actually cared about (ATC for example) had all the choice in the world. So did people designing bespoke POS systems. To rephrase the old IBM trope, "nobody got fired for choosing Windows".

akira2501

1 replies

13h49m

2024-07-29 04:40:56 UTC

critical infrastructure that people actually cared about

Did hospitals have a choice in which OS their MRI machine runs? Are those not "critical?" Or should we just not "actually care?"

ATC for example

Were they impacted by this outage? Isn't the reason flights were canceled that the _carriers'_ systems are the ones that had issues?

So did people designing bespoke POS systems.

Really? I would assume most of that is down to the hardware, like cash drawers, credit card readers and order printers, which is likely third party and proprietary, and is only available and supported on Windows. Do you have evidence otherwise?

"nobody got fired for choosing Windows".

You've recognized the same outcome I have but have gone to great lengths in an attempt to obscure the reasons for it happening. Why?

jojobas

0 replies

13h33m

2024-07-29 04:57:11 UTC

There are Unix-based MRI machines. When buying one, I would imagine the OS is rarely a consideration.

There were reports of airports disallowing landings of incoming planes due to controllers being unable to provide separation.

There are card readers available for Linux; the interface has long been standardized. A cash drawer is a single solenoid.

Major supermarket chains order and supervise delivery of custom POS solutions integrated the way they want from companies like NCR.

NCR offers Windows, Android and Linux POS bases, but supermarkets tend to choose Windows.

manquer

1 replies

13h11m

2024-07-29 05:19:16 UTC

only operates on that single platform

Nope, they aren't ready to pay for another platform; that is their choice. If customers paid for linux or mac support there is no shortage of developers ready to cater to that.

Being unwilling to pay for multiple platforms is still a choice.

Dylan16807

0 replies

10h5m

2024-07-29 08:25:43 UTC

If customers paid for linux or mac support there is no shortage of developers ready to cater to that.

If enough of them do. A single customer can't convince companies to add that support by paying a reasonable amount.

Since there's no collective bargaining happening, this does not show a lack of willingness to pay.

Also, you're ignoring all the software that is already paid for and often doesn't have developers any more.

waterTanuki

5 replies

15h43m

2024-07-29 02:47:12 UTC

I am still to this day gobsmacked that a company the size of Microsoft doesn't do all of its security in-house like Apple, which locked down kernel access to macOS some time ago. The blame is mostly on CrowdStrike, but Microsoft does share responsibility for allowing third parties to pepper the kernel with whatever code they want.

breadwinner

3 replies

15h40m

2024-07-29 02:50:16 UTC

Agree... but Microsoft is pointing the finger at EU (the same people that made you click "Accept cookies" on every website).

https://www.neowin.net/news/microsoft-points-finger-at-the-e...

yawaramin

2 replies

13h5m

2024-07-29 05:25:42 UTC

The EU didn't make you click 'Accept cookies' on every website, websites decided to interpret EU regulations as 'Let's make it super annoying for users to not accept, so they accept out of laziness'.

breadwinner

1 replies

5h9m

2024-07-29 13:21:24 UTC

websites decided to interpret EU regulations..

Does it matter? Ultimately, the impact to the citizenry is what matters.

ParetoOptimal

0 replies

3h9m

2024-07-29 15:21:53 UTC

It does matter because you spin things to make it sound like the EU is anti-consumer when they are frequently the ones improving consumer protections and privacy.

motohagiography

0 replies

15h6m

2024-07-29 03:24:49 UTC

the trade off is either you make your own hardware, or make exceptions to integrate with other OEMs, which causes the fragmentation problem that security vendors exist to provide solutions for. Apple doesn't have that fragmentation problem, but they also don't have as rich an enterprise ecosystem. Android is the definition of a fragmentation problem, where instead of resolving it, Google manages it and calls it an ecosystem, and they're right.

Fragmentation makes consistent governance/security impossible, but the heterogeneity also limits the scope of incidents. Apple balances its greater monoculture risk with deep control of the underlying hardware, to where as a user or developer you're only ever interacting with a very, very high level abstraction.

superposeur

5 replies

20h30m

2024-07-28 22:00:30 UTC

I’m surprised no one has yet noted that Microsoft itself is a chief CrowdStrike competitor.

tonymet

4 replies

20h22m

2024-07-28 22:08:24 UTC

i thought crowdstrike provided features that go beyond windows defender. is there another MS product that competes?

superposeur

2 replies

20h12m

2024-07-28 22:17:59 UTC

FWIW, here is CrowdStrike’s own comparison of features:

https://www.crowdstrike.com/compare/crowdstrike-vs-microsoft...

ryukoposting

1 replies

16h16m

2024-07-29 02:14:29 UTC

Interesting that they bill Defender as "requiring frequent OS updates," alleging that their solution is somehow better in that regard. Are they suggesting that by installing CrowdStrike, you don't need to update Windows anymore? It really reads like that.

aaronmdjones

0 replies

9h36m

2024-07-29 08:54:18 UTC

That's hilarious considering that it was a frequent CrowdStrike update that resulted in this chaos.

abhinavk

0 replies

14h59m

2024-07-29 03:31:01 UTC

There is a paid version called Microsoft Defender for Endpoint.

Animats

3 replies

17h54m

2024-07-29 00:36:16 UTC

So how did this kernel level driver get through WHQL verification? The Static Driver Verifier should have caught this.[1] Do some security vendors get to bypass that? Microsoft is very quiet about that.

That's the sort of thing a negligence lawyer focuses on. Partner at Brown Rudnick: "The most likely legal theory will be one of negligence. [Congress] will drag the guy over the coals, they'll maybe implicate him and his company and put in place a negligence action. There'll maybe be a couple of plaintiffs lawyers who dig up some exceptional theory on negligence, and get some class action lawsuits going. Again, we still don't know all the facts in this case, and there are other dimensions which have not yet been fully explored, including how CrowdStrike had access to kernel level updates on the Microsoft operating system? How come Microsoft didn't have any control over these updates being pushed on their kernel?"[2]

The first two class actions are already starting.

[1] https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

[2] https://www.channele2e.com/analysis/crowdstrike-legal-and-li...

meowkit

2 replies

16h12m

2024-07-29 02:18:01 UTC

Because it wasn't an updated driver, it was a malformed blob config.

https://www.youtube.com/watch?v=ZHrayP-Y71Q https://www.youtube.com/watch?v=wAzEJxOo1ts

That verification is for interactions with the OS. It's not going to catch driver-specific exceptions.

Animats

1 replies

13h44m

2024-07-29 04:46:20 UTC

If the driver can dereference nil, it shouldn't pass the Static Driver Verifier.[1]

[1] https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

whyever

0 replies

1h24m

2024-07-29 17:06:25 UTC

Not all potential null dereferences are covered by the verifier. They even give an example where the rule is not triggered but null may still be dereferenced by the code.

ldjkfkdsjnv

2 replies

22h8m

2024-07-28 20:22:54 UTC

The true story is that I bet some major divisions of Crowdstrike are run by non-technical people that got there through non-meritocratic means. There have generally been no repercussions for their underperformance, much like Boeing. Crowdstrike's business is built on relationships, not technical supremacy. And bada bing bada boom, we have a complete failure of basic technical competency (no rigorous rollout process).

Paianni

1 replies

21h56m

2024-07-28 20:34:04 UTC

All businesses are built on relationships; technical competency can be, but doesn't have to be, a means to that end.

Wytwwww

0 replies

21h53m

2024-07-28 20:37:06 UTC

technical competency

In a more fair world (that also valued economic productivity/growth more) companies which completely ignore that wouldn't survive, though.

WalterBright

2 replies

12h35m

2024-07-29 05:55:30 UTC

What I heard is that CrowdStrike normally rate limits pushing a fix. This is so that if the fix is bad, the damage is limited. But for some reason, the rate limiter was turned off and the update went out to everyone.

self_awareness

1 replies

11h14m

2024-07-29 07:16:09 UTC

What I heard is that CrowdStrike normally rate limits pushing a fix. This is so that if the fix is bad, the damage is limited. But for some reason, the rate limiter was turned off and the update went out to everyone.

Is this true though? They've released a Post Incident article:

https://www.crowdstrike.com/blog/falcon-content-update-preli...

in which they state:

How Do We Prevent This From Happening Again? Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.

So it seems, if I understand this correctly, they've just implemented the rate limiter as a response to this incident.
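
For illustration, a staggered deployment with a canary can be sketched as below. This is a guess at the general shape, not CrowdStrike's process. `apply_update`, the stage fractions, and the failure threshold are all assumptions.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
    max_failure_rate: float = 0.0,
) -> Tuple[int, bool]:
    """Push an update in growing waves; halt as soon as a wave is unhealthy.

    `apply_update(host)` is a hypothetical callback returning True if the host
    is healthy after the update. Returns (hosts updated, rollout completed).
    """
    done = 0
    for fraction in stages:
        target = max(1, int(len(hosts) * fraction))
        wave = hosts[done:target]
        if not wave:
            continue
        failures = sum(1 for host in wave if not apply_update(host))
        done = target
        if failures / len(wave) > max_failure_rate:
            # Stop here: the damage is limited to the hosts updated so far.
            return done, False
    return done, True
```

With 1,000 hosts and an update that breaks every machine, the default stages stop after the first wave of 10 hosts instead of reaching all 1,000.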

WalterBright

0 replies

2h33m

2024-07-29 15:57:52 UTC

Either version might be true!

userbinator

1 replies

20h40m

2024-07-28 21:50:33 UTC

I'm going to be the controversial one here and say that, as bad as CrowdStrike was, the alternative of having only Microsoft be able to decide what people can do is far worse. I've already seen many others trying to use this incident to advocate for digital totalitarianism.

scarface_74

0 replies

16h32m

2024-07-29 01:58:05 UTC

Microsoft as the OS vendor will always be a potential source of updates that crash computers. Now with a third party, you’re adding another level of risk.

someonehere

1 replies

17h52m

2024-07-29 00:38:22 UTC

Unless actually required by your org, choose the N-1 policy in CS to avoid snafus like this in the future. It's in the console, so use it.

Maxious

0 replies

17h5m

2024-07-29 01:25:15 UTC

The N-1 policy never applied and still does not apply to "Rapid Response Content" as outlined in the Preliminary Post Incident Review https://www.crowdstrike.com/falcon-content-update-remediatio...

zh3

0 replies

21h46m

2024-07-28 20:44:24 UTC

I do have to wonder how many agonising layers of review this went through with the marketing and legal departments as part of shifting the blame.

If you want to decide which OS/distros to avoid for critical stuff, look to see who's learning from the incident (even if not bitten by it) compared to those saying "it wasn't our fault" (and that's not just MS).

squirrel

0 replies

19h30m

2024-07-28 23:00:11 UTC

Telling that there’s no mention of eBPF, which is standard on Linux and available on Windows, but hasn’t been brought into the main Windows OS. Static analysis might or might not have caught the Blue Friday bug, but it certainly increases the protection level over the current do-as-you-wish model for kernel modules.

sammyteee

0 replies

19h56m

2024-07-28 22:34:39 UTC

I stopped reading after "Windows is an open and flexible platform"

rldjbpin

0 replies

9h47m

2024-07-29 08:43:49 UTC

one thing from this whole fiasco that i wish had been brought into the conversation is the fact that (crucial/market-dominant) digital/IT services don't have the same level of liability as mundane, physical goods.

a simple plastic covering of your new dyson has more legal scrutiny and action (see the "children may choke" warnings they all need to come with) than software that we otherwise block in the name of "national security".

given how overvalued tech companies are in this region, i believe it is high time to start legally recognizing the real-life impact of digital tech. to hell with the "but muh innovation" argument.

eqvinox

0 replies

19h7m

2024-07-28 23:23:18 UTC

Move tool-tip APIs from kernel to user mode

?!?!

aurelien

0 replies

11h11m

2024-07-29 07:19:44 UTC

You use a distribution made with foot for secretary and gamers and you blindly try to explain where the problem is.

You are the clowns of the world, that's all ... xD

EasyMark

0 replies

20h50m

2024-07-28 21:40:31 UTC

Oh I like this breakdown a lot. Fairly technical, links to resources used, flow of the debug process, didn't get lost in the weeds of details and how clever they were. I wish more debug retrospectives were like this. It seems like you usually end up with either 100 pages of analysis or a couple of vague paragraphs.

DeathMetal3000

0 replies

20h0m

2024-07-28 22:30:28 UTC

“Windows has announced a commitment around the Rust programming language as part of Microsoft’s Secure Future Initiative (SFI) and has recently expanded the Windows kernel to support Rust.”