
Crowdstrike Update: Windows Bluescreen and Boot Loops

cjbgkagh
29 replies
2h16m

Due to the scale I think it’s reasonable to state that in all likelihood many people have died because of this. Sure it might be hard to attribute single cases but statistically I would expect to see a general increase in probability.

I used to work at MS and didn’t like their 2:1 test to dev ratio or their 0:1 ratio either and wish they spent more work on verification and improved processes instead of relying on testing - especially their current test in production approach. They got sloppy and this was just a matter of time. And god I hate their forced updates, it’s a huge hole in the threat model, basically letting in children who like to play with matches.

My important stuff is basically air-gapped. There is a gateway but it’ll only accept incoming secure sockets with a pinned certificate and only a predefined in-house protocol on that socket. No other traffic allowed. The thing is designed to gracefully degrade with the idea that it’ll keep working unattended for decades, the software should basically work forever so long as equivalent replacement hardware could be found.
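
For illustration, the pinned-certificate check described above might look roughly like this in Python. This is a minimal sketch, not the actual gateway: the function names and the stand-in certificate bytes are hypothetical, and a real gateway would take the DER bytes from the TLS handshake (for example via `ssl.SSLSocket.getpeercert(binary_form=True)`) and still enforce the in-house protocol on top.

```python
import hashlib
import hmac


def fingerprint(der_cert: bytes) -> str:
    """Return the SHA-256 fingerprint of a DER-encoded certificate as hex."""
    return hashlib.sha256(der_cert).hexdigest()


def is_pinned(der_cert: bytes, pinned_fingerprint: str) -> bool:
    """Accept the peer only if its certificate matches the pinned fingerprint."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(fingerprint(der_cert), pinned_fingerprint.lower())


# Hypothetical stand-in bytes (assumption); a real gateway would use the
# certificate presented by the peer during the TLS handshake.
trusted_cert = b"example DER bytes of the in-house certificate"
PINNED = fingerprint(trusted_cert)

print(is_pinned(trusted_cert, PINNED))               # matches the pin
print(is_pinned(b"some other certificate", PINNED))  # any other cert is refused
```

Pinning the exact fingerprint rather than trusting a CA chain means a compromised or mis-issuing CA cannot get traffic through the gateway, at the cost of having to redeploy the pin whenever the certificate is rotated.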

dtech
14 replies
2h10m

I don't see what this has much to do with MS. A bad proprietary kernel module can crash any OS.

cjbgkagh
3 replies
2h0m

I don’t know the specifics of this case, but formal verification of machine code is an option. Sure it’s hard and doesn’t scale well but if it’s required then vendors will learn to make smaller kernel modules.

If something cannot be formally verified at the machine code level there should be a controls level verification where vendors demonstrate they have a process in place for achieving correctness by construction.

Driver devs can be quite sloppy and copy-paste bad code from the internet. In the machine code, Microsoft can detect specific instances of known copied-and-pasted code and knows how to patch it; I know they did this for at least one common error. But if I were in the business of delivering an OS that I wanted people to rely on, formal verification at some level would be table stakes.

Analemma_
2 replies
1h52m

I thought Microsoft did use formal verification for kernel-mode drivers and that this was supposed to be impossible. Is it only for their first-party code?

speuleralert
0 replies
1h30m

No, I believe 3rd party driver developers must pass Hardware Lab Kit testing for their drivers to be properly signed. This testing includes a suite of Driver Verifier passes that are done, but this is not formal verification in the mathematical sense of the term.

cjbgkagh
0 replies
1h47m

I wasn’t privy to the extent it was used, if this was formally verified to be correct and still caused this problem then that really would be something. I’m guessing given the size and scope of an antivirus kernel module that they may have had to make an exception but then didn’t do enough controls checking.

ltadeut
2 replies
1h6m

MS could've leaned more towards user-space kernel drivers though. Apple has been going in that direction for a while and I haven't seen much of that (if anything) coming from MS.

That would have prevented a bad driver from taking down a device.

sgjohnson
0 replies
41m

Apple created their own filesystem to make this possible.

The system volume is signed by Apple. If the signature on boot doesn't match, it won't boot.

When the system is booted, it's in read-only mode, no way to write anything to it.

If you bork it, you can simply reinstall macOS in place, without any data/application loss at all.

Of course, if you're a tinkerer, you can disable both, the SIP, and the signature validation, but that cannot be done from user-space. You'll need to boot into recovery mode to achieve that.

I don't think there's anything in NTFS or ReFS that would allow for this approach. Especially when you account for the wide variety of setups on which an NTFS partition might sit. With MBR, you're just SOL instantly.

Apple hardware on the other hand has been EFI (GPT) only for at least 15 years.

nothercastle
0 replies
46m

Well we all know where Microsoft is in security… even the government acknowledges it’s terrible

falcor84
2 replies
1h57m

No other OS forces an auto-restart.

smsm42
0 replies
1h50m

Well, not the OS, per se, but macOS updating mechanisms have an auto-restart path, and I imagine any Linux update that touches the kernel can be configured that way too. It's more the admin's decision than the OS's, but on all common systems auto-restart is on the menu too.

nonfamous
0 replies
1h26m

No restart was needed to cause this crash. As soon as Falcon downloads the updated .sys file ... BOOM.

SAI_Peregrinus
2 replies
1h32m

An immutable OS can be set up to revert to the previous version if a change causes a boot failure. Or even a COW filesystem with snapshots when changes are applied. Hell, Microsoft's own "System Restore" capability could do this, if MS provided default-on support for creating system restore points automatically when system files are changed & restoring after boot failures.
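
The revert-on-boot-failure idea can be sketched as a small state machine. This is only an illustration of the technique under stated assumptions, not how any particular OS or "System Restore" implements it: `BootState`, `MAX_FAILED_BOOTS`, and the version labels are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class BootState:
    current: str           # version that was most recently installed
    previous: str          # last version known to have booted successfully
    failed_boots: int = 0  # consecutive failed boots of `current`


MAX_FAILED_BOOTS = 3  # assumed threshold before falling back


def choose_boot_target(state: BootState) -> str:
    """Pick which version the bootloader should try next."""
    if state.failed_boots >= MAX_FAILED_BOOTS:
        return state.previous
    return state.current


def record_boot(state: BootState, succeeded: bool) -> BootState:
    """Update the counters after a boot attempt of `current`."""
    if succeeded:
        # A clean boot promotes `current` to the known-good fallback.
        return BootState(state.current, state.current, 0)
    return BootState(state.current, state.previous, state.failed_boots + 1)


state = BootState(current="snapshot-new", previous="snapshot-old")
for _ in range(MAX_FAILED_BOOTS):
    state = record_boot(state, succeeded=False)
print(choose_boot_target(state))  # falls back to the previous snapshot
```

The key design point is that the failure counter has to live somewhere the broken system cannot prevent it from being updated (bootloader variables, firmware, or a separate partition), otherwise a crash early in boot never gets recorded.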

zanellato19
0 replies
1h17m

Right, an OS completely crashing like this is the fault of the OS and the problematic code.

An OS should be really resistant to this kind of thing.

wantsanagent
0 replies
56m

What's funny to me is that in college we had our computer lab set up such that every computer could be quickly reverted to a good working state just by rebooting. Every boot was from a static known good image, and any changes made while the computer was on were just stored as an overlay on a separate disk. People installed all manner of software that crashed the machines, but they always came back up. To make any lasting changes to the machine you had to have a physical key. So with the right kind of paranoia you can build systems that are resilient to any harmful changes.

philistine
0 replies
1h36m

I blame Microsoft in the larger sense; they still allow kernel extensions for use cases that Apple has shown could be moved outside the kernel.

Salgat
5 replies
1h59m

I love their forced updates, because if you know what you're doing you can disable them, and if you don't know what you're doing, well you shouldn't be disabling updates to begin with. I think people forget how virus infested and bug addled Windows used to be before they enforced updates. People wouldn't update for years and then bitch how bad Windows was, when obviously the issue wasn't Windows at that point.

__MatrixMan__
3 replies
1h48m

If the user wants to boot an older, known-insecure, version so that they can continue taking 911 calls or scheduling surgeries... I say let 'em. Whether to exercise this capability should be a decision for each IT department, not imposed by Microsoft on to their whole swarm.

philistine
1 replies
1h40m

Microsoft totally lets them. If you use any Enterprise version of Windows, the company can disable updates, but not the user.

__MatrixMan__
0 replies
1h33m

No, after the fact. Where's the prompt at boot-time which asks you if you want to load yesterday's known-good state, or today's recently-updated state?

It's missing because users are not to be trusted with such things, and that's a philosophy with harmful consequences.

monkmartinez
0 replies
1h32m

We took 911 calls all night, I was up listening to the radio all night for my unit to be called. The problem was the dispatching software didn't work so we used paper and pen. Glory Days!!!!

cjbgkagh
0 replies
1h50m

Ignoring all of the other approaches to that problem I wonder if this update will take the record for most damage done by a single virus/update. At some point the ‘cure’ might be worse than the disease. If it were up to me I would be suggesting different cures.

zzyzxd
2 replies
1h25m

At one company I used to work for, we had boring, airgapped systems that just worked all the time, until one day security team demanded that we must install this endpoint security software. Usually, they would fight tooth and nail to prevent devs from giving any in-house program any network access, but they didn't even blink once to give internet access to those airgapped systems because CrowdStrike agents need to talk to their mothership in AWS. It's all good, it's for better security!

It never caught any legit threat, but constantly flagged our own code. Our devs talked to security every other week to explain why this new line of code is not a threat. It generated a lot of work and security team's headcount just exploded. The software checked a lot of security checkboxes, and our CISO can sleep better at night, so I guess end of day it's all worth it.

seniorThrowaway
0 replies
1h11m

It never caught any legit threat, but constantly flagged our own code

When I worked in large enterprise it got to the point that if a piece of my app infrastructure started acting weird the blackbox security agents on the machines were the first thing I suspected. Can't tell you how many times they've blocked legit traffic or blown up a host by failing to install an update or logging it to death. Best part is when I would reach out to the teams responsible for the agents they would always blame us, saying we didn't update, or weren't managing logs etc. Mind you these agents were not installed or managed by us in any way, were supposed to auto update, and nothing else on the system outran the logrotate utility. Large enterprise IT security is all about checking boxes and generating paperwork and jobs. Most of the people I've interacted with on it have never even logged into a system or cloud console. By the end I took to openly calling them the compliance team instead of the security team.

cjbgkagh
0 replies
1h0m

I know I've lost tenders due to not using a pre-approved antivirus vendor, which really does suck and has impeded the growth of my company, but since I'm responsible for the security it helps me sleep at night. This morning I woke up to a bunch of emails and texts asking me if my systems have been impacted by this and it was nice to be able to confidently write back that we're completely unaffected.

I day-dream about being able to use immutable unikernels running on hypervisors so that even if something was to get past a gateway there would be no way to modify the system to work in a way that was not intended.

Air-gapping with a super locked down gateway was already getting more popular precisely due to the forced updates threat surface area, and after today I expect it to be even more popular. At the very least I’ll be able to point to this instance when explaining the rationale behind the architecture which could help in getting exemptions from the antivirus box ticking exercise.

Supermancho
1 replies
2h11m

And god I hate their forced updates,

My windows machine notified me of the update, asked me to restart. I was busy, so I didn't. Then the news broke, then the update was rolled back.

vel0city
0 replies
1h31m

It wasn't a Windows update. If you got a notification for an update, it wasn't the update that did this.

RajT88
1 replies
1h1m

This is almost definitely on Crowdstrike.

There is a windows release preview channel that exists for finding issues like this ahead of time.

To be fair - it is possible the conflicting OS update did not make it to that channel. It is also possible it is due to an embarrassing bug from MSFT (unknown as yet).

Until I hear that this is the case - I am pinning this on Crowdstrike. This should have been caught before prod.

cjbgkagh
0 replies
29m

Even if this is entirely due to Crowdstrike I see it as Microsoft's failure to properly police their market.

There is the correctness by testing vs correctness by construction dynamic and in my view given the scale of interactions between an OS and the kernel modules trying to achieve correctness by testing is negligent. Even at the market scale Microsoft has there are not enough Windows computers to preview test every combination. Especially when taking into account the people on the preview ring have different behaviors to those on the mainline so many combinations simply won't appear in the preview.

I see Microsoft as owning the Windows kernel module space and as having allowed sloppiness by third parties and themselves. I don't know the specifics, but I could easily believe that this is due to a bug from Microsoft. The problem with allowing such sloppiness is that the sloppy operators outcompete the responsible operators; the bad pushes out the good until only the bad remains. A sloppy developer can push more code and gets promoted while the careful developer gets fired.

satisfice
0 replies
33m

As a tester, I'm frustrated by how little support testing gets in this industry. You can't blame bad testing if it's impossible to get reasonable time and cooperation to do more than a perfunctory job.

JackC
29 replies
5h44m

Crowdstrike did this to our production linux fleet back on April 19th, and I've been dying to rant about it.

The short version was: we're a civic tech lab, so we have a bunch of different production websites made at different times on different infrastructure. We run Crowdstrike provided by our enterprise. Crowdstrike pushed an update on a Friday evening that was incompatible with up-to-date Debian stable. So we patched Debian as usual, everything was fine for a week, and then all of our servers across multiple websites and cloud hosts simultaneously hard crashed and refused to boot.

When we connected one of the disks to a new machine and checked the logs, Crowdstrike looked like the culprit, so we manually deleted it and the machine booted. We tried reinstalling it and the machine immediately crashed again. OK, let's file a support ticket and get an engineer on the line.

Crowdstrike took a day to respond, and then asked for a bunch more proof (beyond the above) that it was their fault. They acknowledged the bug a day later, and weeks later had a root cause analysis saying that they didn't cover our scenario (Debian stable running version n-1, I think, which is a supported configuration) in their test matrix. In our own post mortem there was no real ability to prevent the same thing from happening again -- "we push software to your machines any time we want, whether or not it's urgent, without testing it" seems to be core to the model, particularly if you're a small IT part of a large enterprise. What they're selling to the enterprise is exactly that they'll do that.

JackC
9 replies
3h44m

Oh, if you are also running Crowdstrike on linux, here are some things we identified that you _can_ do:

- Make sure you're running in user mode (eBPF) instead of kernel mode (kernel module), since it has less ability to crash the kernel. This became the default in the latest versions and they say it now offers equivalent protection.

- If your enterprise allows, you can have a test fleet running version n and the main fleet run n-1.

- Make sure you know in advance who to cc on a support ticket so Crowdstrike pays attention.

I know some of this sounds obvious, but it's easy to screw up organizationally when EDR software is used by centralized CISOs to try to manage distributed enterprise risk -- like, how do you detect intrusions early in a big organization with lots of people running servers for lots of reasons? There's real reasons Crowdstrike is appealing in that situation. But if you're the sysadmin getting "make sure to run this thing on your 10 boxes out of our 10,000" or whatever, then you're the one who cares about uptime and you need to advocate a bit.
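
The "test fleet on version n, main fleet on n-1" suggestion from the list above can be sketched as a deterministic ring assignment. This is a generic illustration, not Crowdstrike's policy API: `ring_for`, `target_version`, the host names, and the version strings are all hypothetical.

```python
import hashlib


def ring_for(hostname: str, canary_percent: int = 10) -> str:
    """Deterministically place a host in the 'canary' or 'main' ring."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    return "canary" if bucket < canary_percent else "main"


def target_version(hostname: str, latest: str, previous: str,
                   canary_percent: int = 10) -> str:
    """Canary hosts run version n; everything else stays on n-1."""
    if ring_for(hostname, canary_percent) == "canary":
        return latest
    return previous


# Hypothetical fleet and version labels, for illustration only.
for host in ["web-01", "web-02", "db-01", "batch-01"]:
    print(host, target_version(host, latest="n", previous="n-1"))
```

Hashing the host name keeps the assignment stable across runs without a central database, so the same small set of machines always takes the new version first. In practice the canary ring should be chosen to cover each OS and kernel combination in the fleet rather than being purely random.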

umanwizard
6 replies
3h2m

Just a nit, I don't think it's correct to call eBPF "user mode". It's just a different, much more sandboxed, way of running kernel-mode code.

anotherhue
2 replies
2h45m

We could call it, I don't know, "Protected Mode"?

Kye
1 replies
2h23m

It'll never catch on.

anankaie
0 replies
25m

Hear me out here: Maybe if we split the address space into various use-specific segments...

yencabulator
0 replies
2h43m

If you can crash Linux with an eBPF program, many more asses will have fires lit under them than just this one vendor.

ghostpepper
0 replies
1h24m

I would wager that even most software developers who understand the difference between kernel and user mode aren't going to be aware there is a "third" address space, which is essentially a highly-restricted and verified byte code virtual machine that runs with limited read-only access to kernel memory

boudin
0 replies
2h1m

It's what crowdstrike call it. To run falcon sensor as ebpf, you need to set it up as "user mode" which, I agree with you, is poorly named.

guax
0 replies
3h31m

I'm suspicious that turning it off entirely would also provide protection equivalent to kernel and user space mode. If not more.

MrDrMcCoy
0 replies
1h7m

Depending on what kernel I'm running, CrowdStrike Falcon's eBPF will fail to compile and execute, then fail to fall back to their janky kernel driver, then inform IT that I'm out of compliance. Even LTS kernels in their support matrix sometimes do this to me. I'm thoroughly unimpressed with their code quality.

kachapopopow
7 replies
5h37m

This is gold. My friend and I were joking around that they probably did this to macos and linux before, but nobody gave a shit since it's... macos and linux.

(re: people blaming it on windows and macos/linux people being happy they have macos/linux)

zarzavat
6 replies
4h56m

I don’t think people are saying that causing a boot loop is impossible on Linux, anyone who knows anything about the Linux kernel knows that it’s very possible.

Rather it’s that on Linux using such an invasive antiviral technique in Ring 0 is not necessary.

On Mac I’m fairly sure it is impossible for a third party to cause such a boot loop due to SIP and the deprecation of kexts.

nicce
5 replies
3h53m

I believe Apple prevented this also for this exact reason. Third-parties cannot compromise the stability of the core system, since extensions can run only in user-space.

vbezhenar
4 replies
2h39m

I might be wrong about it, but I feel that malware with root access can wreak quite a havoc. Imagine that this malware decides to forbid launch of every executable and every network connection, because their junior developer messed up with `==` and `===`. It won't cause kernel crash, but probably will render the system equally unusable.

zarzavat
0 replies
58m

Malware can do tons of damage even with only regular user access, e.g. ransomware. That’s a different problem from preventing legitimate software from causing damage accidentally.

To completely neuter malware you need sandboxing, but this tends to annoy users because it prevents too much legitimate software. You can set up Mac OS to only run sandboxed software, but nobody does because it’s a terrible experience. Better to buy an iPad.

neffy
0 replies
2h11m

Root access is a separate issue, but user space access to sys level functions is something Apple has been slowly (or quickly on the iOS platform, where they are trying to stop apps snooping on each other) clamping down on for years.

imtringued
0 replies
39m

It depends on your setup. If you actually put in the effort to get apparmor or selinux set up, then root is meaningless. There have been so many privilege escalation exploits that simply got blocked by selinux that you should worry more about setting selinux up than some hypothetical exploit.

Retr0id
0 replies
1h4m

On both macOS and Linux, there's an increasingly limited set of things you can do from root. (but yeah, malware with root is definitely bad, and the root->kernel attack surface is large)

HTG43
2 replies
4h30m

Interesting that they push updates on a Friday when support profile will be way different across companies and organizations during that time.

joezydeco
1 replies
2h59m

It makes you wonder if there was some critical vulnerability that forced them to deploy to everyone simultaneously at an awkward time.

philipwhiuk
0 replies
2h33m

AI probably thought there was a critical vuln.

Nemo_bis
1 replies
2h12m

Interesting. How was the faulty upgrade distributed? Not from Debian archives I assume.

MrDrMcCoy
0 replies
1h11m

CrowdStrike Falcon may ship as a native package, but after that it completely self-updates to whatever they think you should be running. Often, I have to ask IT to ask CS to revert my version because the "current" one doesn't work on my up-to-date kernel/glibc/etc. The quality of code that they ship is pretty appalling.

MetaWhirledPeas
1 replies
2h59m

we push software to your machines any time we want, whether or not it's urgent, without testing it

Do they allow you to control updates? It sounds like what you want is for a small subset of your machines using the latest, while the rest wait for stability to be proven.

bink
0 replies
2h41m

This is what happened to us. We had a small fraction of the fleet upgraded at the same time and they all crashed. We found the cause and set a flag to not install CS on servers with the latest kernel version until they fixed it.
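
A gate like the one described above could be sketched as follows. This is an assumption-laden illustration, not the actual flag used: `parse_kernel`, `should_install_agent`, and the example release strings are hypothetical, and the parser assumes the release string starts with a dotted numeric version (as in `platform.release()` output on typical Debian kernels).

```python
def parse_kernel(release: str) -> tuple[int, ...]:
    """Turn a release string such as '6.1.0-18-amd64' into (6, 1, 0)."""
    numeric = release.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))


def should_install_agent(release: str, first_bad: str) -> bool:
    """Install only on kernels older than the first known-bad version."""
    return parse_kernel(release) < parse_kernel(first_bad)


# Hypothetical values; on a real host `release` would come from
# platform.release() and `first_bad` from your own incident notes.
print(should_install_agent("5.10.0-28-amd64", first_bad="6.1.0"))
print(should_install_agent("6.1.0-21-amd64", first_bad="6.1.0"))
```

Comparing parsed tuples instead of raw strings matters because string comparison would sort "5.9" after "5.10". The gate is meant to be temporary and removed once the vendor ships a fix.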

philipwhiuk
0 replies
2h34m

You should send this to every tech reporter you like.

not_wyoming
0 replies
2h52m

we're a civic tech lab

Obviously not the point of your post, but say more? This sounds like it could be pretty cool!

OskarS
0 replies
1h57m

Please tell me you've ended the contract with CrowdStrike after this?

Kye
0 replies
5h12m

I wonder if the changes they put in behind the scenes for your incident on Linux saved Linux systems in this situation and no one thought to see if Windows was also at risk.

casey2
26 replies
2h52m

Isn't Crowdstrike the same company that heavily lobbied to make all their features a requirement for government computers? https://www.opensecrets.org/federal-lobbying/clients/summary... They have plenty of money for congress, but it seems little for any kind of reasonable software development practices. This isn't the first time crowdstrike has pushed system breaking changes.

chambored
7 replies
2h30m

According to that link the most money they contributed to lobbying in the past 5 years was $600,000 most years around $200,000. That’s barely the cost of a senior engineer.

113
6 replies
2h25m

You'd be surprised how cheap politicians are.

engineer_22
1 replies
2h20m

IIRC Menendez was accused and found guilty of accepting around $30,000 per year from foreign governments?

smsm42
0 replies
1h36m

That's probably only the part they had the hard proof for.

Also, the press release[1] says:

between 2018 and 2022, Senator Menendez and his wife engaged in a corrupt relationship with Wael Hana, Jose Uribe, and Fred Daibes – three New Jersey businessmen who collectively paid hundreds of thousands of dollars of bribes, including cash, gold, a Mercedes Benz, and other things of value

and later:

Over $480,000 in cash — much of it stuffed into envelopes and hidden in clothing, closets, and a safe — was discovered in the home, as well as over $70,000 in cash in NADINE MENENDEZ’s safe deposit box, which was also searched pursuant to a separate search warrant

This seems to be more than $120K over 4 years. Of course, not all of the cash found may be the result of those bribes, but likely at least some of it is.

[1] https://www.justice.gov/usao-sdny/pr/us-senator-robert-menen...

dbalatero
1 replies
1h5m

I always half-jokingly think "should I buy a politician?"

I feel like a few friends could go in on it.

Kerbonut
0 replies
9m

It could be like an "insurance" where people pay for politician lobbying. Pool our resources and put it in the right spots.

ta1243
0 replies
29m

In the UK, a housing minister was bribed with £12,000 in return for a £45m tax break.

3750:1 return on investment, you don't get many investments that lucrative!

dwatson92
0 replies
1h43m

Ok but that point still defeats the premise that Crowdstrike are spending a large enough amount on lobbying that it is hampering their engineering dept.

lawlessone
4 replies
2h18m

Afaik didn't they hack republicans too? They only released democrat emails though.

meowface
3 replies
1h58m

Correct. Also, the DNC breach was investigated by FireEye and Fidelis as well (who also attributed it to Russia).

wewxjfq
0 replies
1h8m

So Ukraine's military and the app creator denied their artillery app was hacked by Russians, which might have caused them to lose some artillery pieces? Sounds like they aren't entirely unbiased. Ironically, DNC initially didn't believe they were hacked either.

laidoffamazon
0 replies
48m

Yeah this is the fringe view. The fact that the GRU is responsible is the closest thing you can get to settled in infosec.

Especially since the alternative scenarios described usually devolve into conspiracy theories about inside jobs

pjot
1 replies
2h23m

The DNC has since implemented many layers of protection including crowdstrike, hardware keys, as well as special auth software from Google. They learned many lessons from 2016.

laidoffamazon
0 replies
49m

If I were to hazard a guess I think the OP is attempting to say they are incompetent and wrong in fingering the GRU as the cause of the DNC hacks (even though they were one of many groups that made that very obvious conclusion).

yurlungur
2 replies
2h15m

Given its origin and involvement in these high profile cases I always thought Crowdstrike is a government subsidized company which barely has any real function or real product. I stand corrected I guess.

Aperocky
1 replies
1h30m

This still doesn't demonstrate that it has any real function tbf.

andrewstuart2
0 replies
57m

Business Continuity Plan chaos gorilla as a service.

brookst
2 replies
2h31m

On the bright side, they are living up to their aptronym.

tyingq
1 replies
2h29m

I wonder if it might start being a common turn of phrase. "Crowdstrike that directory", etc.

__MatrixMan__
0 replies
1h12m

There's a brokenness spectrum. Here are some points on it:

- operational and configured

- operational and at factory defaults

- broken, remote fixable

- crowdstruck (broken remotely by vendor, but not fixable remotely)

- bricked

Usage:

don't let them install updates or they'll crowdstrike it.

squigz
0 replies
8m

Isn't Crowdstrike the same company that heavily lobbied to make all their features a requirement for government computers?

Do you have any more sources on this specifically? The link you gave doesn't seem to reference anything specific.

bogzz
0 replies
2h29m

Corporate brainrot strikes again.

andrepd
0 replies
54m

Seems to be a perfectly rational decision to maximise short term returns for the owners of the company.

Now make of that what you will.

Aperocky
0 replies
1h31m

This demonstrated that Crowdstrike lacks the most basic of tests and staging environments.

cloin
21 replies
2h44m

I'm confused as to how this issue is so widespread in the first place. I'm unfamiliar with how Crowdstrike works; do organizations really have no control over when these updates occur? Why can't these airlines just apply the updates in dev first? Is it the organization's fault or does Crowdstrike just deliver updates like this and there's no control? If that's just how they do it, how do they get away with this?

commandlinefan
15 replies
2h8m

Can somebody summarize what CrowdStrike actually is/does? I can't figure it out from their web page (they're an "enterprise" "security" "provider", apparently). Is this just some virus scanning software? Or is it some bossware/spyware thing?

noduerme
13 replies
1h45m

It's both. Antivirus along with spyware to also watch for anything the user is doing that could introduce a threat, such as opening a phishing email, posting on HN, etc.

moffkalast
12 replies
1h36m

You know what, any workplace that thinks having such insane spyware on their machines is acceptable deserves all they're getting today and more. This is schizophrenic tier paranoia.

smsm42
4 replies
1h33m

Most corporate places I've encountered over the last N years mandate one kind of antivirus/spyware combo or another on every corporate computer. So it'd be pretty much every major workplace.

mindslight
3 replies
51m

Sounds like the all too common dynamic of centralized top-down government/corporate "security" mandates destroying distributed real security. See also TSA making me splay my laptops out into a bunch of plastic bins while showing everyone where and how I was wearing a money belt. (I haven't flown for quite some time, I'm sure it's much worse now)

There's a highly problematic underlying dynamic where 364 days out of the year, practitioners of actual security talk about the dangers of centralized control and proprietary software, and get flat out ignored as being overly paranoid and even weird (don't you know that normal people have zero ability or agency when it comes to anything involving computers?!). Then something like this happens and we get a day or two to say "I told you so". After which the managerial class goes right back to pushing ever-more centralized control.

noduerme
2 replies
42m

They fixed that. Now you can fly without taking your laptop out, or taking your shoes and belt off. You just have to give them fingerprints, a facial scan and an in-person interview. They give you a little card. It's nifty.

dingnuts
1 replies
21m

there's nothing socially repressive about having airline travel segregated into classes of passengers at all, nope, this is completely normal /s

I go through the regular TSA line out of solidarity and protest. Fuck the security theater.

noduerme
0 replies
12m

My response was intended as sarcasm. But eventually, I don't think it will be a two-tiered system. You simply won't be allowed to fly without what is currently required for precheck.

And fwiw, I don't think the strong argument against precheck has to do with social class... it's not terribly expensive, and anyone can do it. It's just a further invasion of privacy.

hbn
1 replies
1h31m

At my work in the past year or 2 they rolled out Zscaler onto all of our machines which I think is supposed to be doing a similar thing. All it's done is caused us regular network issues.

I wonder if they also have the capability to brick all our Windows machines like this.

commandlinefan
0 replies
44m

Ah, yeah, they gave us zscaler not too long ago. I wondered if it was logging my keystrokes or not, figured it probably was because my computer slowed _way_ down ever since it appeared.

kjkjadksj
0 replies
56m

Paranoid? Phishing is very successful.

dingnuts
0 replies
24m

This kind of thing is required by FedRAMP. Good luck finding a company without endpoint management software that is legally allowed to be a US government vendor.

If you stick to small privately held companies you might be able to avoid endpoint management but that's it... any big brand you can think of is going to be running this or something similar on their machines -- because they're required to

__MatrixMan__
0 replies
49m

So there's the control freak at the top who made this decision, and then there are the front lines who are feverishly booting into safe mode and removing the update, and then there are the people who can't get the data they need to safely perform surgeries.

So yeah, screw 'em. But let's be specific about it.

Wheaties466
0 replies
28m

i'd think you'd want some sort of controls/detection on infrastructure level machines.

above comment is very naive.

CrimsonCape
0 replies
1h22m

Once you get legal involved the employee becomes the liability, not the asset.

kube-system
0 replies
1h58m

Is this just some virus scanning software?

Essentially, yes. It is fancy endpoint protection.

rboyd
0 replies
2h31m

Companies operate on a high level of fear and trust. This is the security vendor, so in theory they want those updates rolled out as quickly as possible so that they don't get hacked. Heh.

prpl
0 replies
2h26m

I mean, they pay a lot of money to crowdstrike. A failure this widespread is a Crowdstrike dev issue.

mym1990
0 replies
1h42m

These updates happen automatically and as far as I can tell, there is no option to turn this feature off. From a security perspective, the vendor will always want you to be on the most recent software to protect from attack holes that may open up by operating on an older version. Your IT department will likely want this as well to avoid culpability. Just my 2 observations; whether it is the right way, or whether CS is effective at what it does, no idea.

jmsgwd
0 replies
1h6m

Presumably endpoint detection & response (EDR) agents need to do things like dynamically fetch new malware signatures at runtime, which is understandable. But you'd think that would be treated as new "content", something they're designed to handle in day-to-day operation, hence very low risk.

That's totally different to deploying new "code", i.e. new versions of the agent itself. You'd expect that to be treated as a software update like any other, so their customers can control the roll out as part of their own change management processes, with separate environments, extensive testing, staggered deployments, etc.

I wonder if such a content vs. code distinction exists? Or has EDR software gotten so complex (e.g. with malware sandboxing) that such a distinction can't easily be made any more?

In any case, vendors shouldn't be able to push out software updates that circumvent everyone's change management processes! Looking forward to the postmortem.

apitman
0 replies
1h40m

CrowdStrike is an endpoint detection and response (EDR) system. It is deeply integrated into the operating system. This type of security software is very common on company-owned computers, and often has essentially root privileges.

steelframe
16 replies
2h51m

Wow, this hits close to home. Doing a page fault where you can't in the kernel is exactly what I did with my very first patch I submitted after I joined the Microsoft BitLocker team in 2009. I added a check on the driver initialization path and didn't annotate the code as non-paged because frankly I didn't know at the time that the Windows kernel was paged. All my kernel development experience up to that point was with Linux, which isn't paged.

BitLocker is a storage driver, so that code turned into a circular dependency. The attempt to page in the code resulted in a call to that not-yet-paged-in code.

The reason I didn't catch it with local testing was because I never tried rebooting with BitLocker enabled on my dev box when I was working on that code. For everyone on the team that did have BitLocker enabled they got the BSOD when they rebooted. Even then the "blast radius" was only the BitLocker team with about 8 devs, since local changes were qualified at the team level before they were merged up the chain.

The controls in place not only protected Windows more generally, but they even protected the majority of the Windows development group. It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.

bonestamp2
7 replies
2h28m

without even the most basic level of qualification

That was my first thought too. Our company does firmware updates to hundreds of thousands of devices every month and those updates always go through 3 rounds of internal testing, then to a couple dozen real world users who we have a close relationship with (and we supply them with spare hardware that is not on the early update path in case there is a problem with an early rollout). Then the update goes to a small subset of users who opt in to those updates, then they get rolled out in batches to the regular users in case we still somehow missed something along the way. Nothing has ever gotten past our two dozen real world users.
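
A rough sketch of that kind of ring-by-ring rollout, for anyone who hasn't built one. This is not our actual tooling: the ring names, the `deploy` callback and the `max_failure_rate` threshold are all made up for illustration. The point is only that a failure in an early ring stops the update before the larger rings ever see it.

```python
from typing import Callable, Optional, Sequence, Tuple

Ring = Tuple[str, Sequence[str]]  # (ring name, device ids), both hypothetical


def staged_rollout(
    rings: Sequence[Ring],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> Tuple[list, Optional[str]]:
    """Deploy ring by ring and halt as soon as a ring fails too often.

    Returns (devices updated successfully, name of the ring that halted
    the rollout or None if every ring passed).
    """
    updated = []
    for name, devices in rings:
        failures = 0
        for device in devices:
            if deploy(device):
                updated.append(device)
            else:
                failures += 1
        # Check the ring as a whole before moving on to the next, larger ring.
        if devices and failures / len(devices) > max_failure_rate:
            return updated, name
    return updated, None
```

With three rings such as internal testers, trusted users and everyone else, a single bad device in the second ring means the third ring is never touched.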

rvnx
6 replies
2h2m

Or it could be made that Windows stops loading drivers that are crashing.

Third-party driver/module crashed more than 3 times in a row -> Third-party driver/module is punished and has to be manually re-enabled.
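
Something like this, as a toy model only. It is not how Windows actually loads drivers; `DriverLoader` and its methods are hypothetical names, and "more than 3 times in a row" is taken literally, so the driver is disabled on the fourth consecutive crash.

```python
from dataclasses import dataclass, field


@dataclass
class DriverLoader:
    """Toy boot-time policy: quarantine a driver that keeps crashing."""

    max_consecutive_crashes: int = 3
    crash_streak: dict = field(default_factory=dict)
    quarantined: set = field(default_factory=set)

    def should_load(self, driver: str) -> bool:
        return driver not in self.quarantined

    def record_boot(self, driver: str, crashed: bool) -> None:
        if not crashed:
            self.crash_streak[driver] = 0  # a clean boot resets the streak
            return
        streak = self.crash_streak.get(driver, 0) + 1
        self.crash_streak[driver] = streak
        if streak > self.max_consecutive_crashes:
            self.quarantined.add(driver)  # stays off until someone re-enables it

    def manually_reenable(self, driver: str) -> None:
        self.quarantined.discard(driver)
        self.crash_streak[driver] = 0
```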

Lx1oG-AWb6h_ZG0
3 replies
1h53m

Wouldn't this be an attack vector? Use some low-hanging bug to bring down an entire security module, allowing you to escalate?

SAI_Peregrinus
1 replies
1h22m

It's currently a DOS by the crashing component, so it's already broken the Availability part of Confidentiality/Integrity/Availability that defines the goals of security.

hunter2_
0 replies
54m

But a loss of availability is so much more palatable than the others, plus the others often result in manually restricting availability anyway when discovered.

sudosysgen
0 replies
1h16m

If you're planning around bugs in security modules, you're better off disabling them - malware routinely use bugs in drivers to escalate, so the bug you're allowing can make the escalation vector even more powerful as now it gets to Ring 0 early loading.

tatersolid
1 replies
1h18m

Because CrowdStrike is an EDR solution it likely has tamper-proofing features (scheduled tasks, watchdog services, etc.) that re-enables it. These features are designed to prevent malware or manual attackers from disabling it.

Wheaties466
0 replies
32m

it does. several crowdstrike alerts popped when i was remediating systems of the broken driver.

dralley
2 replies
2h10m

https://www.usenix.org/system/files/1311_05-08_mickens.pdf

"Perhaps the worst thing about being a systems person is that other, non-systems people think that they understand the daily tragedies that compose your life. For example, a few weeks ago, I was debugging a new network file system that my research group created. The bug was inside a kernel-mode component, so my machines were crashing in spectacular and vindictive ways. After a few days of manually rebooting servers, I had transformed into a shambling, broken man, kind of like a computer scientist version of Saddam Hussein when he was pulled from his bunker, all scraggly beard and dead eyes and florid, nonsensical ramblings about semi-imagined enemies. As I paced the hallways, muttering Nixonian rants about my code, one of my colleagues from the HCI group asked me what my problem was. I described the bug, which involved concurrent threads and corrupted state and asynchronous message delivery across multiple machines, and my coworker said, “Yeah, that sounds bad. Have you checked the log files for errors?” I said, “Indeed, I would do that if I hadn’t broken every component that a logging system needs to log data. I have a network file system, and I have broken the network, and I have broken the file system, and my machines crash when I make eye contact with them. I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS. My only logging option is to hire monks to transcribe the subjective experience of watching my machines die as I weep tears of blood.”"

qingcharles
0 replies
30m

Ah, the joys of trying to come up with creative ways to get feedback from your code when literally nothing is available. Can I make the beeper beep in morse code? Can I just put a variable delay in the code and time it with a stopwatch to know which value was returned from that function? Ughh.

Arrath
0 replies
30m

That's beautiful.

brcmthrowaway
2 replies
19m

What does this mean?

Windows kernel paged, linux non paged?

ww520
0 replies
5m

The memory used by the Windows kernel is either Paged or Non-Paged. Non-Paged means the memory is pinned in physical RAM. Paged means it might be swapped out to disk and paged back in from disk when needed. OP was working on BitLocker, a file system driver, which handles disk IO. It must be pinned in physical RAM to be available at all times; otherwise, if it's paged out, an incoming IO request would find the driver code missing from memory and try to page it in, which triggers another IO request, creating an infinite loop. The Windows kernel would usually crash at that point to prevent a runaway system, stopping at the point of failure to let you fix the problem.
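
If it helps, here is the circular dependency as a toy simulation. It is only a model of the idea, not of real kernel behaviour: `ToyKernel`, `BugCheck` and every method name are invented for the example.

```python
class BugCheck(Exception):
    """Stands in for the kernel deliberately stopping (a bluescreen)."""


class ToyKernel:
    def __init__(self, driver_is_pageable: bool):
        self.driver_is_pageable = driver_is_pageable
        self.driver_resident = True
        self.page_in_active = False  # True while a page-in is being serviced

    def memory_pressure(self) -> None:
        # Only pageable code can be evicted; non-paged code stays pinned in RAM.
        if self.driver_is_pageable:
            self.driver_resident = False

    def disk_read(self, what: str) -> str:
        # Every disk read has to run the storage driver's code.
        if not self.driver_resident:
            self.page_in_driver()
        return f"read {what}"

    def page_in_driver(self) -> None:
        if self.page_in_active:
            # Paging the driver in needs a disk read, which needs the driver.
            raise BugCheck("page fault while servicing a page fault")
        self.page_in_active = True
        try:
            self.disk_read("driver code")  # driver is still missing here
            self.driver_resident = True
        finally:
            self.page_in_active = False
```

With `driver_is_pageable=False` a read after `memory_pressure()` just works; with `True` the same read raises `BugCheck`.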

temac
0 replies
4m

"It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification."

It was my understanding that MS now sign 3rd party kernel mode code, with quality requirements. In which case why did they fail to prevent this?

brightlancer
0 replies
2h11m

Even then the "blast radius" was only the BitLocker team with about 8 devs, since local changes were qualified at the team level before they were merged up the chain.

Up the chain to automated test machines, right?

stainablesteel
16 replies
2h35m

its strange how people who work in professions that are considered crucial infrastructure are held to such a high standard but there's always some tech problem that cripples them the hardest

queuebert
13 replies
2h31m

And they all invariably use Windows instead of a high-reliability OS.

cypress66
12 replies
2h27m

Windows is high reliability. The problem here is what's basically a third party backdoor.

bestouff
5 replies
2h22m

"Windows" is the combination of the OS per se and all the things needed for it to run properly. That thing is a mess of proprietary drivers and pieces of software cobbled together. It can't be called "high reliability" with a straight face.

BuckRogers
3 replies
1h26m

That's a hell of a take that should not be taken seriously. Perhaps if you hold everything else to the same standard. Anything used on macOS or Linux or whatever else fully and completely represents that core platform, then I'd agree.

Anecdotally, I have zero stability problems on my non-ECC consumer-grade 11th gen Intel Windows 11 system. It'll stay up for months, until I decide to shut it down. I had a loose GPU power cable that was causing me problems at a point, but since I reseated everything I haven't had a single issue. That was my fault, things happen. The system is great.

More significantly, I see no difference in stability between our Windows Server platform and Red Hat Enterprise (Oracle) server platform at work either. Work being one of the top 3 largest city governments in the USA.

dagss
0 replies
48m

Not really disagreeing with you, but "staying up for months" isn't a serious bar to clear. It really provides no information: in 2024, everything you can install should clear that bar.

codebolt
0 replies
23m

Meanwhile, I'm lucky if the laptop I installed Ubuntu on will keep from crashing for over an hour of continuous use.

beginnings
0 replies
38m

its an accurate take, windows is a mess

didnt red hat have a massive DEI/anti white man scandal? I wouldnt trust their products

the smartest people use and maintain Arch, ergo everything should run on Arch for maximum stability

lupire
0 replies
1h59m

Crowdstrike is a multiplatform malware that chronically damages computers on all major desktop OSes. This is a Crowd strike problem and an admin problem.

OskarS
3 replies
2h3m

Can you say with a straight face that if you were designing a system that had extremely high requirements of reliability that you would choose Windows over Linux? Like, all other things being equal? I'm sorry, but that would be an insane choice.

TeMPOraL
1 replies
1h48m

Well, yes? Of course, not the consumer deployment of Windows. Part of ensuring reliability is establishing contracts with suppliers that shift liability to them, so they're incentivized to keep their stuff reliable. Can't exactly do that with Linux (RHEL notwithstanding) and open source in general, which is why large enterprises have been so reluctant to adopt them in the past - they had to figure out how to fit OSS into the flow of liability and responsibility.

bre1010
0 replies
1h21m

I guess it depends whether you want your system to work, or whether you just want it to be not your fault when it breaks

cthalupa
0 replies
1h57m

Well, with the proliferation of systemd and all the nightmares it's caused me over the past decade, I actually might. But thankfully BSD is an option.

But Linux isn't immune from this exact sort of issue, though - these overgrown antivirus solutions run as kernel drivers in linux as well, and I have seen them cause kernel panics.

lawlessone
0 replies
1h51m

Windows is high reliability.

Depends i think. When i was working as a super market cashier the tills had embedded XP. in 2 or 3 years it rarely had issues. The rare issues it did have were with the java POS running on top.

Windows 10 for my home desktop crashed a lot more and just seems to have gotten more "janky" with time.

DeepYogurt
0 replies
2h24m

Windows is high reliability.

lol no

basch
0 replies
2h26m

There are sooo many companies in the world, when snowflake or crowdstrike or solarwinds has an issue, it’s going to touch every industry.

TeMPOraL
0 replies
1h49m

The people working in those professions are; their bosses and their IT departments are not. IT security is treated as a solved problem - if you deploy enough well-known solutions that prevent your employees from working, everything will be Safe from CyberAttacks. There's an assumption of quality like you'd normally have with drugs or food in the store. But this isn't the case in this industry, doubly so in security. Quality solutions are almost non-existent, so companies should learn to operate under the principle of caveat emptor.

tbatchelli
11 replies
2h17m

This event is predicted in Sidney Dekker’s book “Drift into Failure”, which basically postulates that in order to prevent local failure we set up failure prevention systems that increase the complexity beyond our ability to handle, and introduce systemic failures that are global. It’s a sobering book to read if you ever thought we could make systems fault tolerant.

mym1990
3 replies
1h37m

Many systems are fault tolerant, and many systems can be made fault tolerant. But once you drift into a level of complexity spawned by many levels of dependencies, it definitely becomes more difficult for system A to understand the threats from system B and so on.

tbatchelli
2 replies
38m

Do you know of any fault tolerant system? Asking because in all the cases I know, when we make a system "fault tolerant" we increase the complexity and we introduce new systemic failure modes related to our fault-tolerant-making-system, making them effectively non fault tolerant.

In all the cases I know, we traded frequent and localized failure for infrequent but globalized catastrophic failures. Like in this case.

slt2021
0 replies
10m

  - This system has a single point of failure, it is not fault tolerant. Let's introduce these three things to make it fault-tolerant
  - Now you have three single points of failure...

lucianbr
0 replies
12m

You can make a system tolerant to certain faults. Other faults are left "untolerated".

A system that can tolerate anything, so have perfect availability, seems clearly impossible. So yeah, totally right, it's always a tradeoff. That's reasonable, as long as you trade smart.

I wonder if the people deciding to install Crowdstrike are aware of this. If they traded intentionally, and this is something they accepted, I guess it's fine. If not... I further wonder if they will change anything in the aftermath.

jessriedel
2 replies
1h5m

It's also in line with arguments made by Ted Kaczynski (the Unabomber)

Why must everything collapse? Because, [Kaczynski] says, natural-selection-like competition only works when competing entities have scales of transport and talk that are much less than the scale of the entire system within which they compete. That is, things can work fine when bacteria who each move and talk across only meters compete across an entire planet. The failure of one bacteria doesn’t then threaten the planet. But when competing systems become complex and coupled on global scales, then there are always only a few such systems that matter, and breakdowns often have global scopes.

https://www.overcomingbias.com/p/kaczynskis-collapse-theoryh...

https://en.wikipedia.org/wiki/Anti-Tech_Revolution

localfirst
1 replies
1h0m

crazy how much he was right. if he hadn't gone down the path of violence out of self-loathing and anger he might have lived to see a huge audience and following.

washadjeffmad
0 replies
30m

I suppose we wouldn't know whether an audience for those ideas exists today because they would be blacklisted, deplatformed, or deamplified by consolidated authorities.

There was a quote last year during the "Twitter files" hearing, something like, "it is axiomatic that the government cannot do indirectly what it is prohibited from doing directly".

Perhaps ironically, I had a difficult time using Google to find the exact wording of the quote or its source. The only verbatim result was from a NYPost article about the hearing.

COGlory
1 replies
1h41m

We need more local expertise is really the only answer. Any organization that just outsources everything is prone to this. Not that organizations that don't outsource aren't prone to other things, but at least their failures will be asynchronous.

bjelkeman-again
0 replies
1h5m

Funny thing is that for decades there were predictions about how there was a need for millions of more IT workers. It was assumed one needed local knowledge in companies. Instead what we got was more and more outsourced systems and centralized services. This today is one of the many downsides.

ricardo81
0 replies
42m

I haven't read it, but I'd take a leap to presume it's somewhere between the people that say "C is unsafe" and "some other language takes care of all of the things".

Basically delegation.

notNNT
0 replies
1h13m

Also a major point in the Black Swan. In the Black Swan, Taleb describes that it is better for banks to fail more often than for them to be protected from any adversity. Eventually they will become "too big to fail". If something is too big to fail, you are fragile to a catastrophic failure.

nimbius
11 replies
2h14m

I work for a diesel truck repair facility and just locked up the doors after a 40 minute day :( .

- lifts wont operate.

- cant disarm the building alarms. (have been blaring nonstop...)

- cranes are all locked in standby/return/err.

- laser aligners are all offline.

- lathe hardware runs but controllers are all down.

- cant email suppliers.

- phones are all down.

- HVAC is also down for some reason (its getting hot in here.)

the police drove by and told us to close up for the day since we dont have 911 either.

alarms for the building are all offline/error so we chained things as best we could (might drive by a few times today.)

we dont know how many orders we have, we dont even know whos on schedule or if we will get paid.

shoebham
7 replies
1h22m

wow, why do lifts require an OS?

thedrbrian
3 replies
23m

Why do lathes , cranes and laser alignment systems need a new copy of windows?

Kirth
1 replies
16m

and why do they run spyware?

recursive
0 replies
14m

Probably because some fraction of lift manufacturer's customer base has a compliance checklist requiring it.

olyjohn
0 replies
15m

Lathes probably have PCs connected to them to control them, and do CNC stuff (he did say the controllers). Laser alignment machines all have PCs connected to them these days.

The cranes and lifts though... I've never heard of them being networked or controlled by a computer. Usually it's a couple buttons connected to the motors and that's it. But maybe they have some monitoring systems in them?

kulikalov
1 replies
1h8m

the question is - why lifts require windows?

rudasn
0 replies
8m

Well, how else is the operator supposed to see outside?

warkdarrior
0 replies
1h15m

How else are you going to update your grocery list while operating the lift?

saganus
1 replies
1h20m

How come lifts and cranes are affected by this?

Are they somehow controlled remotely? or do they need to ping a central server to be able to operate?

I can see how alarms, email and phones are affected but the heavy machinery?

(Clearly not familiar with any of these things so I am genuinely curious)

lima
0 replies
6m

Lots and lots of heavy machinery uses Windows computers even for local control panels.

__MatrixMan__
0 replies
56m

Oh man, you work with some cool (and dangerous) stuff.

Outage aside, do you feel safe using it while knowing that it accepts updates based on the whims of far away people that you don't know?

upofadown
9 replies
3h3m

Perhaps a dumb question for someone who actually knows how Microsoft stuff works...

Why would an anti-malware program be allowed to install a driver automatically ... or ever for that matter?

Added: OK, from another post I now know Crowdstrike has some sort of kernel mode that allows this sort of catastrophe on Linux. So I guess there is a bigger question here...

atoav
3 replies
2h30m

Maybe I am in the minority, but it always puzzled me that anybody in IT would think a mega-privileged piece of software that looks into all files was a good idea.

If there is any place that historically was exploited more than all other things it was broken parsers. Congratulations: if such an exploit file is now read by your AV software, it sits at a position where it is allowed (expected) to read all files, and it would not surprise me if it could write them as well.

And you just doubled the number of places in which things can go wrong. Your system/software that reads a PNG image might do everything right, but do you know how well your AV-software parses PNGs?

This is just an example, but the question we really should ask ourselves is: why do we have systems where we expect malicious files to just show up in random places? The problem with IT security is not that people don't use AV software, it is that they run systems that are so broken by design that they are sprinkled on top.

This is like installing a sprinkler system in a house full of gasoline. Imagine gasoline everywhere including in some of the water piping — in the best case your sprinkler system reacts in time and kills the fire, in the worst case it sprays a combustive mix into it.

The solution is of course not to build houses filled with gasoline. Meanwhile AV-world wants to sell you ever more elaborate, AI-driven sprinkler systems. They are not the ones profiting from secure systems, just saying..

hnthrowaway0328
1 replies
54m

I wonder why and how does security software read a PNG file. Sure it's not tough to parse a PNG file, but what does it look for exactly?

Sohcahtoa82
0 replies
15m

Some file formats allow data to be appended or even prepended to the expected file data and will just ignore the extra data. This has been used to create executables that happen to also be a valid image file.

I don't know about PNG, but I'm fairly sure JPEG works this way. You can concatenate a JPEG file to the end of an executable, and any JPEG parser will understand it fine, as it looks for a magic string before beginning to parse the JPEG.

A JPEG that has something prepended might raise an eyebrow. A JPEG that has something executable prepended should raise alarms.
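
As a rough illustration of that last check, a scanner could look at what comes before the JPEG start-of-image marker. This is only a sketch, not how any real AV product works: `inspect_image_bytes` and the labels it returns are made up, and a real executable can contain those three bytes by coincidence, so a hit is a reason to look closer rather than proof.

```python
JPEG_SOI = b"\xff\xd8\xff"  # JPEG start-of-image marker plus first marker byte
EXECUTABLE_MAGICS = {
    b"MZ": "Windows PE/DOS executable",
    b"\x7fELF": "ELF executable",
}


def inspect_image_bytes(data: bytes) -> str:
    """Classify a blob by what sits in front of the first JPEG marker."""
    offset = data.find(JPEG_SOI)
    if offset == -1:
        return "no-jpeg"
    if offset == 0:
        return "plain-jpeg"
    # Something was prepended; check whether it looks like an executable.
    for magic in EXECUTABLE_MAGICS:
        if data.startswith(magic):
            return "executable-prefix"  # should raise alarms
    return "unknown-prefix"  # might raise an eyebrow
```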

Sohcahtoa82
0 replies
1h19m

but it always puzzled me that anybody in IT would think a mega-privileged piece of software that looks into all files was a good idea.

Because otherwise, a piece of malware that installs itself at a "mega-privileged" level can easily make itself completely invisible to a scanner running as a low-priv user.

Heck, just placing itself in /root and hooking a few system calls would likely be enough to prevent a low-priv process from seeing it.

red-iron-pine
1 replies
1h30m

Why would an anti-malware program be allowed to install a driver automatically ... or ever for that matter?

Because malware that gets into a system will do just that -- install its own backdoor drivers -- and will then erect defense to protect itself from future updates or security actions. e.g. change the path that Windows Updater uses to download new updates, etc.

Having a kernel module that answers to CrowdStrike makes it harder for that to happen, since CS has their own (non-malicious) backdoor to confirm that the rest of the stack is behaving as expected. And it's at the kernel level, so it has visibility into deeper processes that a user-space program might not (or that is easy to spoof).

sudosysgen
0 replies
41m

Or, much more likely, the malware will use a memory access bug in an existing, poorly written kernel module (say, CrowdStrike?) to load itself at the kernel level without anyone knowing, perhaps then flashing an older version of the BIOS/EFI and nestling there, or finding its way into a management interface. Hell, it might even go ahead and install an existing buggy driver by itself if it's not already there.

All of these invasive techniques end up making security even worse in the long term. Forget malware - there's freely available cheating software that does this. You can play around with it, it still works.

tonymet
0 replies
1h3m

Vendors are allowed to install drivers, even via Windows Update. Many vendors, like HP, install functionality like telemetry as drivers to make it more difficult for the users to remove the software.

So next time you think you are doing a "clean install", you are likely just re-installing the same software that came with the machine.

gusfoo
0 replies
1h47m

Why would an anti-malware program be allowed to install a driver automatically ... or ever for that matter?

While the files are named XXX.SYS they are apparently not drivers. The issue is that a corrupted XXX.SYS was loaded by the already-installed driver which promptly crashes.

TiredOfLife
0 replies
1h57m

As I understand it was a definition update that caused a crash inside already installed driver.
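
If that's right, the generic defence is for the loader to validate a content file before handing it to the engine, and to keep the last known-good definitions when validation fails. A sketch under heavy assumptions: the file layout here (magic `DEFS`, length, payload, CRC32) is invented for the example and has nothing to do with CrowdStrike's real format, and real kernel code would not be Python.

```python
import struct
import zlib

MAGIC = b"DEFS"                   # hypothetical file signature
HEADER = struct.Struct(">4sI")    # magic + big-endian payload length
TRAILER = struct.Struct(">I")     # CRC32 of the payload


def pack_definitions(payload: bytes) -> bytes:
    """Build a definition file in the made-up format above."""
    return HEADER.pack(MAGIC, len(payload)) + payload + TRAILER.pack(zlib.crc32(payload))


def load_definitions(blob: bytes, fallback: bytes) -> bytes:
    """Return the validated payload, or the last known-good one on any error."""
    try:
        if len(blob) < HEADER.size + TRAILER.size:
            raise ValueError("truncated file")
        magic, length = HEADER.unpack_from(blob)
        if magic != MAGIC:
            raise ValueError("bad magic")
        if len(blob) != HEADER.size + length + TRAILER.size:
            raise ValueError("length field does not match file size")
        payload = blob[HEADER.size:HEADER.size + length]
        (expected_crc,) = TRAILER.unpack_from(blob, HEADER.size + length)
        if zlib.crc32(payload) != expected_crc:
            raise ValueError("checksum mismatch")
        return payload
    except ValueError:
        return fallback  # reject the update instead of crashing on it
```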

belter
9 replies
1h14m

It's not the first time they pull something similar...1 month ago: "CrowdStrike bug maxes out 100% of CPU, requires Windows reboots" - https://www.thestack.technology/crowdstrike-bug-maxes-out-10...

75 Billion dollars valuation, CNBC Analysts praising the company this morning on how well the company is run!...When in reality they can't master the most basic of the phased deployment methodologies known for 20 years...

Hundreds of handsomely paid CTO's, at companies with billions of dollars in valuations, critical healthcare, airlines, who can't master the most basic of the concepts of "Everything fails all the time"...

This whole industry is depressing....

chronid
3 replies
1h3m

What I find definitely depressing is the fact we used to roll out progressively even OS upgrades (I guess now that is done through intune?) and was one point in favor of windows (on Linux you had to do things yourself at the time AFAIK, I don't think the situation has improved much).

Nowadays we get mandated random software upgrading at once on the entire company fleet and no one bats an eye - I counted more than a dozen agents installed for "security" and "monitoring" purposes in my previous company servers, many of those with hooks in the kernel obviously, and many of those installed with random policies to tick yet another compliance box...

consp
2 replies
1h0m

(on Linux you had to do things yourself at the time AFAIK, I don't think the situation has improved much)

You can schedule the updates any time you want, want to do it staggered then configure that, want to do it all at the same time then do that, want it with a random interval also possible. I don't see the "you need to do everything yourself" option as much as any managed environment.

chronid
0 replies
51m

Centralized management is very useful, just a random delay is not enough. One of the (big) companies I worked with had jury rigged something with chef I believe to show different machines different "repositories" and roll things out progressively (1% of the fleet, 5%...).
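
The core of that kind of percentage rollout is small. A minimal sketch, not the chef setup I mentioned: hash each machine id into a stable bucket from 0 to 99 and only serve the new "repository" to buckets below the current percentage. `rollout_bucket`, `in_rollout` and the `salt` parameter are names made up for the example.

```python
import hashlib


def rollout_bucket(machine_id: str, salt: str = "") -> int:
    """Map a machine to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{machine_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(machine_id: str, percent: int, salt: str = "") -> bool:
    """True if this machine should receive the update at the given stage."""
    return rollout_bucket(machine_id, salt) < percent
```

Because the bucket is deterministic, every machine that got the update at 1% is still included at 5%, and changing `salt` per release picks a different set of early machines each time.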

chefandy
0 replies
43m

I haven't been a sys admin in a very long time so my systems knowledge might be outdated, but I reckon functionality like intune's built-in monitoring of specific feature install failures would make a huge difference with a few dozen systems, let alone the hundreds of thousands you see in some of today's deployments. It's not like that stuff isn't possible on Linux, but if you're coordinating more than a few systems, that turns into a big, expensive project pretty quickly.

monkmartinez
1 replies
1h9m

This borked our dispatch/911 call center then as well. However, it wasn't as bad as this one. This outage put our entire public safety system into the stone age and with that we were at stone age efficiency.

BaldricksGhost
0 replies
58m

I work IT at a regional 911 center. We're fine but I sympathize with those who are back to pen and paper dispatching. Hard for most current dispatchers to realize the way we did it back in the day.

segasaturn
0 replies
1h0m

Those "CNBC analysts" truly know nothing, especially when it comes to tech. They're just cheerleaders who repeat talking points all day.

nothercastle
0 replies
49m

The worst part is that nobody will be held accountable. A F up like this should wipe out the entire company, but instead everyone will just shrug it off as an oopsie, a few low level employees will get punished, and nothing will change.

alexose
0 replies
45m

This whole industry is depressing....

I'll take it a step further and say that every industry is depressing when it comes to computers at scale.

Rather than build efficient, robust, fault-tolerant, deterministic systems built for correctness, we somehow manage to do the exact opposite. We have zettabytes and exaflops at our fingertips, and yet, we somehow keep making things slower. Our user interfaces are noisier than ever, and our helpdesks are less helpful than they used to be.

UniverseHacker
6 replies
2h3m

Why are so many mission critical hardware connected systems connected to the internet at all or getting automatic updates?

This is just basic IT common sense. You only do updates during a planned outage, after doing an easily reversible backup, or you have two redundant systems in rotation and update and test the spare first. Critical systems connected to things like medical equipment should have no internet connectivity, and need no security updates.

I follow all of this in my own home so a bad update doesn’t ruin my work day… how do big companies with professional IT not know this stuff?

basch
2 replies
1h41m

You do that for antivirus definition updates?

mckn1ght
0 replies
51m

Probably, implicitly. Have automated regular backups, and don’t let your AV automatically update, or even if it does, don’t log into all your computers simultaneously. If you update/login serially, then the first BSOD would maybe prevent you from doing the same thing on the other (or possibly, send you running to the other to accomplish your task, and BSODing that one too!)

But yeah this is one reason why I don’t have automatic updates enabled for anything, the other major one being that companies just can’t resist screwing with their UIs.

UniverseHacker
0 replies
29m

I’m not an IT professional, but I don’t use antivirus software on my personal macs and linux machines- I do regular rotated physical backups, and only install software digitally signed by trusted sources and well reviewed Pirate Bay accounts (that's a joke :-).

My only windows machine is what I would classify as a mission critical hardware connected/control device, an old Windows 8 tablet I use for car diagnostics- I do not connect it to the internet, and never perform updates on it.

I am an academic and use a lot of old multi-million dollar scientific instruments which have old versions of windows controlling them. They work forever if you don't network them, but the first time you do, someone opens up a browser to check their social media, and the entire system will fail quickly.

vel0city
1 replies
49m

CrowdStrike lets you create update strategies and rollout groups.

This update bypassed all of those settings.

gedy
0 replies
1h54m

Why are so many mission critical hardware connected systems connected to the internet at all or getting automatic updates?

Because it lets them "scale" by having fewer and cheaper offsite IT and contractors to manage vs hiring pesky onsite employees.

egberts1
3 replies
4h51m

Yet Lennart Poettering and Redhat (spelled that way as I am one of the original pre-IPO investors of RedHat via Alex Brown/Deutsche Bank) want to put networking of Linux into UEFI this quarter, inside the most sacrosanct PID 1.

They still won’t learn anything from Crowdstrike’s mistakes!

Maybe it is time for me to ditch that stock.

btreecat
2 replies
4h39m

Source of claim?

egberts1
0 replies
4h14m

Network sockets are in the systemd code repository.

1024core
3 replies
1h25m

Read on Mastodon: https://infosec.exchange/@littlealex/112813425122476301

The CEO of Crowdstrike, George Kurtz, was the CTO of McAfee back in 2010 when it sent out a bad update and caused similar issues worldwide.

If at first you don't succeed, .... ;-) j/k

localfirst
2 replies
1h2m

Kurtz's response on X, blaming the customer, is ridiculous. He will probably find another company to hire him as CEO, though. It's just an upside-down world in the C-suite.

jeffrallen
0 replies
13m

That guy is gonna fail all the way right up to the top. Sheesh.

BaldricksGhost
0 replies
51m

Don't forget the golden parachute. These guys always seem to fail upward.

whoisstan
2 replies
4h8m

Can someone with experience explain how integration tests did not detect that?

guax
1 replies
2h51m

Why are you assuming there were tests?

whoisstan
0 replies
2h19m

Right.

I just can't imagine how it passed tests for a common configuration that is exhibited by a large number of Windows machines. Stuff can always go wrong, but "the OS is not booting" should be caught?

remram
2 replies
3h25m

I can't wait to see the CloudFlare traffic report after this. All those computers going down must have affected traffic worldwide. Even from Linux systems as their owners couldn't run jobs from their bricked Windows laptops.

remram
0 replies
2h18m

Interesting! Thanks for that. I guess most servers and consumer endpoints are fine, and those are driving all the traffic.

markus_zhang
2 replies
5h1m

In pre-market, CRWD is 14% down. I think investors are a bit scared that THIS time there are going to be some consequences.

rjmunro
1 replies
4h50m

I'm amazed it's just 14%, not more like 75%-80%. Surely a lot of customers are going to uninstall and move to competitors. The remainders are at least going to demand much cheaper service with better guarantees going forward.

markus_zhang
0 replies
4h25m

Yeah, and now recovered to -9.39%. Let's see what happens. I guess CrowdStrike is backed by enough powerful people to NOT lose too much business.

vlan0
1 replies
2h9m

Anything that has root/kernel access is a risk. It always has been. When will we learn. Probably never. Because money runs this world. So sad. Time to open a bakery and move on from this world.

Sohcahtoa82
0 replies
1h24m

Considering what Crowdstrike is intended to do, it's not really possible for it to work without running at the kernel level.

sytelus
1 replies
4h28m

Genuine question: how the heck did crapware like CrowdStrike get into all these critical systems, from 911 to hospitals to airlines? My understanding was that all these critical systems are just super lazy to upgrade or install anything at all. I would love to know all the sales tactics CS used to get into millions of systems for money!

ncr100
0 replies
3h19m

Reading other comments here (sorry, I don't have the link), one CrowdStrike salesperson threatened to cancel them as a client, yes you read that right, if the client wasn't easier to work with. So they're bullies, or at least that one CrowdStrike salesperson is a bully.

Another article talked about CrowdStrike being required for compliance; people here are talking about checkbox compliance. So there's a systemic requirement, perhaps from insurers, for some kind of comprehensive, near-real-time updated antivirus solution.

Furthermore, the "haste makes waste" philosophy seems, in my opinion, not to be honored by the minds who drive the impacted sectors of our economy: hospitals, banks, airlines. This kind of vulnerability should not have been accepted. It's a single point of failure. Even on CrowdStrike's website they have this radar-ring, hotspot-target kind of graphic where they show, at the very center, one single client app (theirs), as if that one single client is the thing that's going to save us?

stevetron
1 replies
3h3m

Working late Thursday night in Florida, USA. I have someone in Australia wanting me to write a quick script in LSL for an object in Second Life. We were interrupted: Second Life kept running, but Discord went down, telling me to 'try another server', which doesn't make sense when you are 1-on-1 with someone. All my typing in Discord turned red. Additionally, I couldn't log into the email portal for outlook.com: I got a screen of tiny-fonted text all clinging to the left edge of the display, unreadable, unusable. Second Life, though, stayed online and kept working for me, but then I'm on Windows 7. My friend who had requested the collaboration froze in Second Life on his Windows 10 system, and I don't know what his Discord was doing. I ended the session since I couldn't get a go/no-go out of him for the latest script version.

_def
0 replies
2h59m

Wow I didn't know second life was still a thing. Literally yesterday I looked at a 20 year old archived version of a freeware portal which also listed a version of second life.

rboyd
1 replies
2h33m

Seems like a modern operating system would have an automatic rollback mechanism for cases like this.

Kye
0 replies
1h29m

Windows has restore points that do this in the event of a failed update, but this wasn't Windows.

qwerty456127
1 replies
48m

WTF is CrowdStrike and why is it affecting so many people and companies? I've never heard of it before. And apparently it isn't anything relevant to all Windows users as it didn't affect any computer of any person I personally know.

tonymet
0 replies
42m

Very popular corporate endpoint protection (malware detection and spyware) that runs telemetry and monitoring agents installed as kernel-mode drivers on Windows. Thus if there is a crash, it crashes the entire kernel (BSOD). And their drivers load at boot.

hughw
1 replies
5h31m

So, why did our little company's (little used) two Windows machines not BSOD overnight? They were just sitting idle. They run CS Falcon sensor. Did the update force a restart? Didn't seem to happen here.

Kye
0 replies
4h10m

It looks like a configuration file update is the culprit. The software presumably picks up the update, then BSODs.

franczesko
1 replies
2h0m

I just wanted to mention that Microsoft has 3 tiers of Windows beta releases before changes are pushed to production. I can't comprehend how this wasn't noticed before.

bloopernova
0 replies
1h56m

It didn't come from Microsoft or Windows Update. It was pushed by Crowdstrike to their corporate security kernel extension.

convivialdingo
1 replies
4h16m

Here’s my take as a security software dev for 15 years.

We put too much code in kernel simply because it’s considered more elite than other software. It’s just dumb.

Also - if a driver is causing a crash MSFT should boot from the last known-good driver set so the install can be backed out later. Reboot loops are still the standard failure mode in driver development…

erichocean
0 replies
4h13m

Not possible in this situation, the "driver" is fine, it's a file the driver loads during startup that is bad, causing the otherwise "good" driver to crash.

Going back to an earlier version, since the driver is "good", would just re-load the same driver, which would load the updated file and then crash again.
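
To make the point concrete: the rollback has to happen at the level of the content file, not the driver. Here is a minimal sketch in Python (not kernel code) of a loader that validates the new content and falls back to the last known-good copy. The file layout (a `CNT1` magic header followed by a length-prefixed payload) and the function names are assumptions made up for illustration; nothing here reflects CrowdStrike's real format or code.

```python
import struct

MAGIC = b"CNT1"  # hypothetical header, not CrowdStrike's real format


def parse_content(blob: bytes) -> bytes:
    """Validate a content blob and return its payload, or raise ValueError."""
    if len(blob) < 8 or blob[:4] != MAGIC:
        raise ValueError("bad header")
    (length,) = struct.unpack("<I", blob[4:8])
    payload = blob[8:]
    if length == 0 or length != len(payload):
        raise ValueError("length mismatch")
    return payload


def load_with_fallback(new_blob: bytes, last_good_blob: bytes) -> tuple[bytes, str]:
    """Prefer the new content; fall back to the last known-good copy if it fails validation."""
    try:
        return parse_content(new_blob), "new"
    except ValueError:
        # The new file is rejected instead of being trusted blindly.
        return parse_content(last_good_blob), "last-known-good"
```

The design choice is simply that a malformed update degrades to stale detection data rather than to a crash at boot.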

zzhelezc
0 replies
3h17m

From the BBC's cyber correspondent Joe Tidy [1]:

A "content update" is how it was described. So, it wasn’t a major refresh of the cyber security software. It could have been something as innocuous as the changing of a font or logo on the software design.

He can't be serious, right? Right?

[1] https://www.bbc.co.uk/news/live/cnk4jdwp49et?post=asset%3Abd...

xyst
0 replies
3h26m

CRWD dropped $50/share at market open. Wild.

Is this specific to only Windows machines “protected” with CS or is this impacting Linux/macOS as well?

tonymet
0 replies
1h36m

No rolling updates? How could a 100% repro BSOD pass QC? I'm more concerned about the deployment process than the crash itself. Everyone experiences a bad build from time to time. How did this possibly go live?

thomasjudge
0 replies
4h57m

Why are "security" patches not tested before they are deployed?

tgtaptarget
0 replies
2h28m

In my org, none of the essential systems went down (those used by labor). However all of management's individual PCs went down which got me wondering... Is this the beginning (or continuation) of whittling down what is "essential" human labor versus what could be done remotely (or eliminated completely)?

Or perhaps Microsoft is just garbage and soon will be as irrelevant as commercial real estate office parks and mega-call centers

tedajax
0 replies
5h37m

The first time I experienced crowdstrike in a corporate environment it seemed obvious that something like this would eventually happen.

swozey
0 replies
2h44m

I know there's a better word to be used here, but what initially looked like a massive cyberattack turning out to be a massive defender foot-broom is chef's kiss.

I saw it was Windows and went to bed. What a great feeling.

I'm sorry to those of you dealing with this. I've had to wipe 1200 computers over a weekend in a past life when a virus got in.

Did I receive any appreciation? Nope. I was literally sleeping under cubicle desks bringing up isolated rows one by one. I switched everything in that call center to Linux after that. Ironically it turns out it was a senior engineer's SSH key that got leaked somehow and was used to get in and dig around servers in our datacenter outside of my network. My filesystem logging (in Windows, coincidentally) alerted me.

IT is fun.

snappr021
0 replies
2h58m

“To err is human, but to really fuck things up requires a computer.” ~ Len Beattie

snailb
0 replies
5h34m

On a positive note, I'm in Morocco and getting money from ATMs wasn't working for the whole day, I believe because of this outage. I was at the till in a supermarket and people started asking if they could chip in to pay for some food I bought because I didn't have the cash.

Humanity 1 - Technology 0

Edit: The outage of all ATMs in Morocco was yesterday, not today, so I'm not sure how the two are related.

siliconc0w
0 replies
13m

The postmortem should be interesting. I can't imagine how even just basic integration testing didn't catch this, much less a basic best practice like canarying.

satisfice
0 replies
39m

I want to say the problem is that the industry has systematically devalued software testing in favor of continuous delivery and the strategy of hoping that any problems are easy to roll back.

But it's deeper than that: the industry realizes that, once you get to a certain size, no one can hurt you much. Crowdstrike will not pay a lasting penalty for what has just happened, which means executives will shrug and treat this as a random bolt of lightning.

sans_souse
0 replies
5h13m

Why would you name your company "CrowdStrike" anyway? What does Crowd Strike even mean?

rustcleaner
0 replies
24m

Thank Chronos I switched to Qubes OS almost two years ago!

rs999gti
0 replies
1h44m

So can crowdstrike be classified as malware now?

Currently waiting in line for 2 hours + waiting for Delta to tell me when my connecting leg can be booked. My current flight is delayed 5 hours.

rootforce
0 replies
1h56m

AWS has posted some instructions for those affected by the issue using EC2.

[AWS Health Dashboard](https://health.aws.amazon.com/health/status)

"First, in some cases, a reboot of the instance may allow for the CrowdStrike Falcon agent to be updated to a previously healthy version, resolving the issue.

Second, the following steps can be followed to delete the CrowdStrike Falcon agent file on the affected instance:

1. Create a snapshot of the EBS root volume of the affected instance

2. Create a new EBS volume from the snapshot in the same Availability Zone

3. Launch a new instance in that Availability Zone using a different version of Windows

4. Attach the EBS volume from step (2) to the new instance as a data volume

5. Navigate to the \windows\system32\drivers\CrowdStrike\ folder on the attached volume and delete "C-00000291*.sys"

6. Detach the EBS volume from the new instance

7. Create a snapshot of the detached EBS volume

8. Create an AMI from the snapshot by selecting the same volume type as the affected instance

9. Call replace root volume on the original EC2 Instance specifying the AMI just created"
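
Step 5 above is the only part that touches the file system, and it can be sketched as a small Python helper. This is a rough sketch, not an AWS or CrowdStrike tool: the function name, the `dry_run` flag, and the idea of passing the attached volume's root (for example a drive letter such as `D:\`) are assumptions for illustration. Only the folder path and the `C-00000291*.sys` pattern come from the instructions quoted above.

```python
from pathlib import Path


def delete_channel_files(volume_root: str, dry_run: bool = True) -> list[str]:
    """Find (and optionally delete) C-00000291*.sys under the CrowdStrike driver folder.

    volume_root is wherever the attached data volume is mounted (hypothetical,
    e.g. "D:\\"). With dry_run=True nothing is deleted; matches are only listed.
    """
    folder = Path(volume_root) / "windows" / "system32" / "drivers" / "CrowdStrike"
    matches = sorted(folder.glob("C-00000291*.sys"))
    if not dry_run:
        for path in matches:
            path.unlink()
    return [p.name for p in matches]
```

Running it once with the default `dry_run=True` and checking the returned names before deleting anything is the safer order of operations, since the volume is a copy of a production root disk.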

rewgs
0 replies
27m

The most concerning thing about this is the realization of just how many incredibly critical systems run on Windows.

resters
0 replies
2h56m

Any company that inserts itself so heavily into US politics cannot be counted on as a solid engineering organization.

raphar
0 replies
2h23m

I want to see the internal postmortem of why this happened to CrowdStrike (if they are still in business)

rajeshivivek
0 replies
3h2m

SARRRRRRRSSSSS!

rajeshivivek
0 replies
3h11m

DO NOT REDEEM SAARRRRRRRRSSSSS! BLODDY BASTARDS INVALID FORMATING SARRRRRRSSSSS!!

purpleblue
0 replies
3h1m

Do all the machines need to be manually fixed? It doesn't seem like an automatic update will work here...

piuantiderp
0 replies
1h34m

How come anyone still uses CrowdStrike?

nu11ptr
0 replies
57m

This whole thing likely would have been averted had microkernel architectures caught on during the early days (with all drivers in user mode). Performance would have likely been a non-issue, not only due to the state of the art L4 designs that came later, but mostly because had it been adopted everything in the industry would have evolved with it (async I/O more prevalent, batched syscalls, etc.).

I will admit we've done pretty well with kernel drivers (and better than I would have ever expected tbh), but given our new security focused environment it seems like now is the time to start pivoting again. The trade offs are worth it IMO.

nothercastle
0 replies
40m

I love how their company name foreshadows this exact event. It’s malware pretending to be a security suite.

ngneer
0 replies
5h35m

Security technology harming security? Shocker. We need less monoculture. Trouble is monoculture pays. Write the software once, deploy it everywhere - free money.

I manage a simple Tier-4 cloud application on Azure, involving both Windows and Linux machines. Crowdstrike, OMI, McAfee and endpoint protection in general has been the biggest thorn in my side.

ngneer
0 replies
5h15m

"There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult. It demands the same skill, devotion, insight, and even inspiration as the discovery of the simple physical laws which underlie the complex phenomena of nature."

"The most important property of a program is whether it accomplishes the intention of its user."

C.A.R. Hoare

low_tech_punk
0 replies
2m

Crowdstruck

localfirst
0 replies
2h55m

This is the first time I'm hearing about crowdstrike, what is it and why is this such a big deal?

kubov
0 replies
2h27m

Were cloud providers (AWS and Azure) so heavily impacted because they use CS internally or because so many users use CS?

kaladin-jasnah
0 replies
3h52m

Anecdote: my first job was IT at a small org. We had somehow gotten a 15 minute remote meeting with Kevin Mitnick, and asked him several questions about security best practices and software recommendations. I don't remember a lot about that meeting, but I do remember his strong recommendation of Crowdstrike. Interesting to see it brought up again in this context.

josephd79
0 replies
5h10m

Year of Linux

janalsncm
0 replies
1h51m

This outage may be more expensive and cause more damage than any cyberattack in history.

jacobgorm
0 replies
4h9m

The great clownstrike.

irusensei
0 replies
2h2m

Can we end the whole “loading a kernel rootkit” thing? AFAIK Apple already shuns kernel extensions. What’s preventing Microsoft from doing the same? As a bonus, shit like anti-cheat will go away too.

integricho
0 replies
4h43m

This should at the very least put them out of business by causing each and every client to abandon them as their security solution.

integricho
0 replies
4h51m

Ironic that the software intended to prevent exactly these kinds of outages ends up causing it.

insane_dreamer
0 replies
26m

How is it that these major companies aren't rolling out vendor updates to a small number of computers first to make sure that nothing broke, and then rolling out to the entire fleet? That's deployment 101.
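
For anyone who hasn't seen it spelled out, the "small number first, then the fleet" idea is roughly the following. This is a minimal sketch under stated assumptions: `apply_update` and `is_healthy` are hypothetical callbacks you would supply (push the update to one host, then check that it still reports in), and the 1% / 10% / 100% wave sizes are arbitrary example values.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 1.0),
) -> list[str]:
    """Update hosts in growing waves; stop at the first wave with an unhealthy host.

    Returns the hosts that were updated before the rollout finished or halted.
    """
    updated: list[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this wave.
        target = max(1, round(len(hosts) * fraction))
        wave = hosts[len(updated):target]
        for host in wave:
            apply_update(host)
        updated.extend(wave)
        if not all(is_healthy(host) for host in wave):
            break  # halt: the rest of the fleet never receives the update
    return updated
```

With 200 hosts and an update that breaks every machine it touches, this stops after the first wave of 2 hosts instead of taking down all 200.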

gonzo41
0 replies
2h19m

Do people not have test environments?

farceSpherule
0 replies
3h49m

I absolutely abhor these end point solutions that "auto update for your convenience and safety."

I can control and manage my own systems. I do not need nanny state auto updating for me.

Crowdstrike should be held liable for financial losses associated with this nonsense.

ezoe
0 replies
5h36m

This EDR software is implemented as a kernel driver.

It is a third-party, closed-source Windows kernel driver that can't be audited. It gathers a massive amount of activity data and sends it back to a central server (where it can be sold), and it also executes arbitrary payloads from that central server.

It becomes a single point of failure for your whole system.

If an attacker gains control of the sysadmin's PC, it's over.

If an attacker gains administrator privileges on an EDR-installed system, they run with the same privileges as the EDR, so they can hide their activities from it. There aren't many EDR products in the world, so it can be done.

I'd like to call it "full trust security model".

etc-hosts
0 replies
38m

Mission critical systems should be running something like ChromeOS.

Too bad ChromeOS seems to be on the way out at Google.

energy123
0 replies
1h36m

On the plus side this will help us develop an immune system against cyber attacks in any future war. Businesses will start thinking of contingencies.

egberts1
0 replies
5h21m

I am quite sure that they have had three precious timezone hours to detect a total failure of telemetry after their fateful midnight upgrade.

Like the most useful Canary Island in the Coal Mine.

dev1ycan
0 replies
3h54m

Crazy, isn't it? I had no issues because my group policy updates have been off since last year. I guess the "everyone must forcefully update" for "security reasons" ended up backfiring. Who could've thought?

ddgflorida
0 replies
5h0m

Do you suppose they test before pushing updates out?

dboreham
0 replies
3h3m

Looks like it affected the Crowdstrike stock, but not Microsoft.

daemonologist
0 replies
3h54m

My company has some bios bitlocker extension installed which prompts for a password on boot, so automatic updates (one of which tried to install last night) just get stuck there in jet engine mode. Normally this is extremely annoying but today I count myself lucky - aside from a couple of people with Chromebook thin clients I am the only person showing as online in Teams right now.

coderinsan
0 replies
31m

CrowdStrike today has shown why it's absolutely crucial to test code before deployment, say no to YOLO deployments with LLM powered software testing https://github.com/codeintegrity-ai/mutahunter

axelthegerman
0 replies
3h34m

Looks like crowdstrike are just delivering what their name promised, striking crowds around the world

apantel
0 replies
27m

This just in ‘CrowdStrike Strikes Crowd’

amai
0 replies
4h13m

It seems monocultures are not only bad for resilience in agriculture, but also in IT.

aktuel
0 replies
3h33m

Germany is not affected since it's Krautstrike only.

Zaskoda
0 replies
1h41m

I want to add something to the discussion but it's difficult for me to accurately summarize and cite things. In a nutshell, there appears to be a lot of tomfoolery with CrowdStrike and the stuff that happened with the DNC during the 2016 election. Here's some of what I'm talking about:

There's a strong link between the DNC, Hillary, and CrowdStrike. Here's one piece that links a cofounder of CrowdStrike with Hillary pretty far back: https://www.technologyreview.com/innovator/dmitri-alperovitc...

This 2017 piece talks about doubt behind CrowdStrike's analysis of the DNC hack being the result of Russian actors. One of the groups disputing CrowdStrike's analysis was Ukraine's military. https://www.voanews.com/a/crowdstrike-comey-russia-hack-dnc-...

This detailed analysis of CrowdStrike's explanation of the DNC hack goes so far as to say "this sounded made up" https://threatconnect.com/resource/webinar-guccifer-2-0-the-...

The Threat Connect analysis is also discussed here: https://thehill.com/business-a-lobbying/295670-prewritten-gu...

"For one, the vulnerability he claims to have used to hack the NGP VAN ... was not introduced into the code until an update more than three months after Guccifer claims to have entered the DNC system."

Noted at the end of this story, they mention that CrowdStrike installed its software on all of the DNC's systems: https://www.ft.com/content/5eeff6fc-3253-11e6-bda0-04585c31b...

Finally, there's this famous but largely forgotten story of the time Bernie's campaign was accused of accessing Hillary's data: https://www.npr.org/2015/12/18/460273748/bernie-sanders-camp...

"This was a very egregious breach and our data was stolen," Mook said. "We need to be sure that the Sanders campaign no longer has access to our data."

"This bug was a brief, isolated issue, and we are not aware of any previous reports of such data being inappropriately available," the company said in a blog post on its website.

(edited for spelling)

Ringz
0 replies
50m

By chance, I watched a few episodes of 911 and kept thinking that it was all completely unrealistic nonsense. Then there's an episode where the entire emergency call system for LA goes down, and even though there were different reasons in the episode (a transformer fire), I couldn't have imagined that it was actually possible to completely disable the emergency call system (and what else) of a city.

PaulHoule
0 replies
1h57m

People at my workplace were affected but I dodged the bullet because I left my computer turned on overnight because I always want to be able to RDP in the next morning in case I decide to stay home.

Kye
0 replies
1h37m

There's a workaround: reboot 10-15 times. I've seen two people say it independently, so maybe it's for real.

Kye
0 replies
5h16m

Maybe a silly question, but: why hasn't this affected Linux? I assume it uses a proprietary kernel module just like it does on Windows. I guess this will come out in a post-mortem if they publish one, but it's been on my mind.

edit: aha https://news.ycombinator.com/item?id=41005936

They did do this to Linux, but in the past. Maybe whatever they did to deal with it saved Linux this time around

KingOfCoders
0 replies
5h19m

24 years after Y2K we have Y2K.

GrumpyNl
0 replies
3h31m

How do they test this before they roll it out? Looks like a bug that's easy to spot. I would presume they test it on several configurations and, when it passes the test (a reboot), they roll it out. Has this been tested?

GirishSharma643
0 replies
2h10m

Who is responsible for this billion dollar mistake?

Geezus_42
0 replies
3h18m

"Incidents of this nature do occur in a connected world that is reliant on technology." - Mike Maddison, CEO, NCC Group

Until I see an explanation of how this got past testing, I will assume negligence. I wasn't directly affected, but it seems every single Windows machine running their software in my org was affected. With a hit rate that high I struggle to believe any testing was done.

CKMo
0 replies
2h44m

This is a good example of why you don't want ring0 level access for clients. Or just, you don't want client-based solutions. The provider just becomes another threat vector.