HN comments for: The CrowdStrike file that broke everything was full of null characters?

AndrewKemendo

113 replies

23h17m

2024-07-19 19:12:41 UTC

This should not have passed a competent C/I pipeline for a system in the critical path.

I’m not even particularly stringent when it comes to automated test across-the-board but for this level of criticality of system, you need exceptionally good state management

To the point where you should not roll to production without an integration test on every environment that you claim to support

Like it’s insane to me that this size and criticality of a company doesn’t have a staging or even a development test server that tests all of the possible target images that they claim to support.

Who is running stuff over there - total incompetence

martinky24

53 replies

23h13m

2024-07-19 19:17:23 UTC

A lot of assumptions here that probably aren't worth making without more info -- For example it could certainly be the case that there was a "real" file that worked and the bug was in the "upload verified artifact to CDN code" or something, at which point it passes a lot of things before the failure.

We don't have the answers, but I'm not in a rush to assume that they don't test anything they put out at all on Windows.

EvanAnderson

48 replies

23h11m

2024-07-19 19:19:13 UTC

I haven't seen the file, but surely each build artifact should be signed and verified when it's loaded by the client. The failure mode of bit rot / malice in the CDN should be handled.

gjsman-1000

15 replies

23h10m

2024-07-19 19:20:01 UTC

Perhaps - but if I made a list of all of the things your company should be doing and didn't, or even things that your side project should be doing and didn't, or even things in your personal life that you should be doing and haven't, I'm sure it would be very long.

EvanAnderson

10 replies

23h5m

2024-07-19 19:24:37 UTC

A company deploying kernel-mode code that can render huge numbers of machines unusable should have done better. It's one of those "you had one job" kind of situations.

They would be a gigantic target for malware. Imagine pwning a CDN to pwn millions of client computers. The CDN being malicious would be a major threat.

soraminazuki

6 replies

21h34m

2024-07-19 20:55:26 UTC

Oh, they have one job for sure. Selling compliance. All else isn't their job, including actual security.

Antiviruses are security cosplay that works by using a combination of bug-riddled custom kernel drivers and unsandboxed C++ parsers running with the highest level of privileges to tamper with every bit of data it can get its hands on. They violate every security common sense. They also won't even hesitate to disable or delay rollouts of actual security mechanisms built into browsers and OSes if it gets in the way.

The software industry needs to call out this scam and put them out of business sooner than later. This has been the case for at least a decade or two and it's sad that nothing has changed.

https://ia801200.us.archive.org/1/items/SyScanArchiveInfocon... https://robert.ocallahan.org/2017/01/disable-your-antivirus-...

heraldgeezer

5 replies

20h57m

2024-07-19 21:33:06 UTC

Nope, I have seen software like Crowdstrike, S1, Huntress and Defender E5 stop active ransomware attacks.

soraminazuki

2 replies

20h34m

2024-07-19 21:56:06 UTC

That anecdote doesn't justify installing gaping security holes into the kernel with those tools. Actual security requires knowledge, good practice, and good engineering. Antiviruses can never be a substitute.

lytedev

1 replies

19h5m

2024-07-19 23:24:48 UTC

You seem security-wise, so surely you can understand that in some (many?) cases, antivirus is totally acceptable given the threat model. If you are wanting to keep the script kiddies from metasploiting your ordinary finance employees, it's certainly worth the tradeoff for some organizations, no? It's but one tool with its tradeoffs like any tool.

TeMPOraL

0 replies

10h53m

2024-07-20 07:36:57 UTC

That's like pointing at the occasional petty theft and mugging, and using it to justify establishing an extraordinary secret police to run the entire country. It's stupid, and if you do it anyway, it's obvious you had other reasons.

Antivirus software is almost universally malware. Enterprise endpoint "protection" software like CrowdStrike is worse, it's an aggressive malware and a backdoor controlled by a third party, whose main selling points are compliance and surveillance. Installing it is a lot like outsourcing your secret police to a consulting company. No surprise, everything looks pretty early on, but two weeks in, smart consultants rotate out to bring in new customers, and bad ones rotate in to run the show.

Yeah, that's definitely a good tradeoff against script kiddies metasploiting your ordinary finance employees. Wonder if it'll look as good when loss of life caused by CrowdStrike this weekend gets tallied up.

jmb99

0 replies

18h52m

2024-07-19 23:38:19 UTC

How many attacks have they stopped that would have DoS’d a significant fraction of the world’s Windows machines roughly instantly?

The ends don’t justify the means.

jjav

0 replies

15h9m

2024-07-20 03:20:39 UTC

Nope, I have seen software like Crowdstrike, S1, Huntress and Defender E5 stop active ransomware attacks.

Yes, occasionally they do. This is not an either-or situation.

While they do catch and stop attacks, it is also true that crowdstrike and its ilk are root-level backdoors into the system that bypass all protections and thus will cause problems sometimes.

cduzz

2 replies

21h33m

2024-07-19 20:57:12 UTC

Which is their "One Job" ?

Options include:

1. protected the systems always work even if things are messed up

2. protected systems are always protected even when things are messed up

The two failure modes are exclusive; ideally you let the end user decide what to do if the protection mechanism is itself unstable.

One could suggest "the system must always work" but that's ignoring that sometimes things don't go to plan.

None of the systems in boot loops were p0wned by known exploits while they were boot looping. As far as we know anyhow.

(edited to add the obvious default of "just make a working system" which is of course both a given and not going to happen)

jmb99

0 replies

18h45m

2024-07-19 23:44:52 UTC

The failure mode here was a page fault due to an invalid definition file. That (likely) means the definition file was being used as-is without any validation, and pointers were being dereferenced based on that non-validated definition file. That means this software is likely vulnerable to some kind of kernel-level RCE through its definition files, and is (clearly) 100% vulnerable to DoS attacks through invalid definition files. Who knows how long this has been the case.

This isn’t a matter of “either your system is protected all the time, even if that means it’s down, or your system will remain up but might be unprotected.” It’s “your system is vulnerable to kernel-level exploits because of your AV software’s inability to validate definition files.”

The failure mode here should absolutely not be to soft-brick the machine. You can have either of your choices configurable by the sysadmin; definition file fails to validate? No problem, the endpoint has its network access blocked until the problem can be resolved. Or, it can revert to a known-good definition, if that’s within the organization’s risk tolerance.

But that would require competent engineering, which clearly was not going on here.

EvanAnderson

0 replies

17h42m

2024-07-20 00:47:33 UTC

Their "one job" is to not make things worse than the default. DoS'ing the OS with an unhandled kernel mode exception would be not doing that job.

How about a different analogy: First do no harm.

jjav

2 replies

22h23m

2024-07-19 20:06:54 UTC

all of the things your company should be doing and didn't

Processes need to match the potential risk.

If your company is doing some inconsequential social app or whatever, then sure, go ahead and move fast and break things if that's how you roll.

If you are a company, let's call them Crowdstrike, that has access to push root privileged code to a significant percentage of all machine on the internet, the minimum quality bar is vastly higher.

For this type of code, I would expect a comprehensive test suite that covers everything and a fleet of QA machines representing every possible combination of supported hardware and software (yes, possibly thousands of machines). A build has to pass that and then get rolled into dogfooding usage internally for a while. And then very slowly gets pushed to customers, with monitoring that nothing seems to be regressing.

Anything short of that is highly irresponsible given the access and risk the Crowdstrike code represents.

Denvercoder9

1 replies

20h59m

2024-07-19 21:31:02 UTC

A build has to pass that and then get rolled into dogfooding usage internally for a while. And then very slowly gets pushed to customers, with monitoring that nothing seems to be regressing.

That doesn't work in the business they're in. They need to roll out definition updates quickly. Their clients won't be happy if they get compromised while CrowdStrike was still doing the dogfooding or phased rollout of the update that would've prevented it.

jjav

0 replies

15h13m

2024-07-20 03:16:27 UTC

That doesn't work in the business they're in. They need to roll out definition updates quickly.

Well clearly we have incontrovertible evidence now (if it was needed) that YOLO-pushing insufficiently tested updates to everyone at once does not work either.

This is being called in many places (righfully) the largest IT outage in history. How many billions will be the cost? How many people died?

So yes, clearly not the correct way to operate.

bn-l

0 replies

22h55m

2024-07-19 19:34:28 UTC

I think in this case it’s reasonable for us to expect that they are doing what they should be doing.

AdamJacobMuller

15 replies

22h4m

2024-07-19 20:25:54 UTC

The file was just full of null bytes.

It's very possible the signature validation and verification happens after the bug was triggered.

usr1106

11 replies

21h39m

2024-07-19 20:51:23 UTC

Haven't used Windows for close to 15 years, but I read the file is (or rather supposed to be) a NT kernel driver.

Are those drivers signed? Who can sign them? Only Microsoft?

If it's true the file contained nothing but zeros that seems to be also kernel vulnerability. Even if signing were not mandatory, shouldn't the kernel check for some structure, symbol tables or the the like before proceeding?

dagaci

4 replies

21h7m

2024-07-19 21:22:56 UTC

Think more, imagine that the your CrowdStrike security layer detects an 'unexpected' kernel level data file.

Choice #1 Diable security software and continue. Choice #2 Stop. BSOD message contact you administrator

There may be nothing wrong with the drivers.

derefr

3 replies

20h34m

2024-07-19 21:56:07 UTC

Choice #3 structure the update code so that verifying the integrity of the update (in kernel mode!) is upstream of installing the update / removing the previous definitions package, such that a failed update (for whatever reason) results in the definitions remaining in their existing pre-update state.

(This is exactly how CPU microcode updates work — the CPU “takes receipt” of the new microcode package, and integrity-verifies it internally, before starting to do anything involving updating.)

warkdarrior

2 replies

19h59m

2024-07-19 22:30:52 UTC

a failed update (for whatever reason) results in the definitions remaining in their existing pre-update state

Fantastic solution! You just gave the attackers a way to stop all security updates to the system.

monocasa

0 replies

18h5m

2024-07-20 00:24:49 UTC

When you can't verify an update, rolling back atomically to the previous state is generally considered the safest option. Best run what you can verify was a complete package from whoever you trust.

JohnBooty

0 replies

14h1m

2024-07-20 04:28:47 UTC

No, that doesn't follow.

For most systems, a sensible algorithm would be "keep running the last known good definition, until we get the next known good definition"

In other words: ignore bad updates but keep checking for valid ones. That doesn't mean you've permanently lost the ability to update.

Of course, for some systems, different behavior might make more sense.

poizan42

2 replies

20h52m

2024-07-19 21:37:39 UTC

No the file is not a driver. It's a file loaded by a driver, some sort of threat/virus definition file I think?

And yes Windows drivers are signed. If it had been a driver it would just have failed to load. Nowadays they must be signed by Microsoft, see https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

MBCook

1 replies

19h58m

2024-07-19 22:32:14 UTC

That was my read.

The kernel driver was signed. The file it loaded as input with garbage data had seemingly no verification on it at all, and it crashed the driver and therefore the kernel.

usr1106

0 replies

19h35m

2024-07-19 22:54:25 UTC

Hmm, the driver must be signed (by Microsoft I assume). So they sign a driver which in turn loads unsigned files. That does not seem to be good security.

rahkiin

1 replies

21h3m

2024-07-19 21:26:54 UTC

The file was data used by the actual driver like some virus database. It is not code loaded by the kernel

kabdib

0 replies

19h5m

2024-07-19 23:24:27 UTC

Yet it was named ".sys", an extension normally reserved for driver executables AFAIK

Brillant! [sic]

anonymfus

0 replies

20h13m

2024-07-19 22:17:00 UTC

NT kernel drivers are Portable Executables, and kernel does such checks, displaying BSOD with stop code 0xC0000221 STATUS_IMAGE_CHECKSUM_MISMATCH if something went wrong.

https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

wk_end

2 replies

21h58m

2024-07-19 20:31:43 UTC

"Load a kernel module and then verify it" is not the way any remotely competent engineer would do things.

(...which doesn't rule out the possibility that CS was doing it.)

justinclift

1 replies

20h52m

2024-07-19 21:37:50 UTC

The ClownStrike Falcon software that runs on both Linux and macOS was incredibly flaky and a constant source of kernel problems at my previous work place. We had to push back on it regardless of the security team's (strongly stated) wishes, just to keep some of the more critical servers functional.

Pretty sure "competence" wasn't part of the job description of the ClownStrike developers, at least for those pieces. :( :( :(

soraminazuki

0 replies

19h50m

2024-07-19 22:39:29 UTC

ClownStrike left kernel panics unfixed for a year until macOS deprecated kernel extensions altogether. It was scary because crash logs indicated that memory was corrupted while processing network packets. It might've been exploitable.

xyst

11 replies

22h11m

2024-07-19 20:18:57 UTC

Hindsight is 20/20

This is a public company after all. In this market, you don’t become a “Top-Tier Cybersecurity Company At A Premium Valuation” with amazing engineering practices.

Priority is sales, increasing ARR, and shareholders.

StressedDev

4 replies

19h58m

2024-07-19 22:32:08 UTC

Not caring about the actual product will eventually kill a company. All companies have to constantly work to maintain and grow their customer base. Customers will eventually figure out if a company is selling snake oil, or a shoddy product.

Also, the tech industry is extremely competitive. Leaders frequently become laggards or go out of business. Here are some companies who failed or shrank because their products could not complete: IBM, Digital Equipment, Sun, Borland, Yahoo, Control Data, Lotus (later IBM), Evernote, etc. Note all of these companies were at some point at the top of their industry. They aren't anymore.

geodel

1 replies

19h40m

2024-07-19 22:49:32 UTC

Keyword is eventually. By then C-level would've been retired. Others in top management would've changed multiple jobs.

IMO point is not where are these past top companies now but where are top people in those companies now. I believe they end up being in very comfortable situation no matter which place.

Exceptions of course would be criminal prosecution, financial frauds etc.

feoren

0 replies

19h14m

2024-07-19 23:15:25 UTC

Bingo! It's the Principal Agent Problem. People focus too much on why companies do X and companies do Y, it's bad in the long term. The long term doesn't exist. No decision maker at these public companies gives a rat's ass about "the long term", because their goal is to parasitize from the company and fly off to another host before the damage they did becomes apparent. And they are very good at it: it's literally all they do. It's their entire profession.

worik

0 replies

19h54m

2024-07-19 22:35:31 UTC

Not caring about the actual product will eventually kill a company.

Eventually

By then the principles are all very rich, and no longer care.

Do you think Bill Gates sleeps well?

jjav

0 replies

15h17m

2024-07-20 03:12:45 UTC

Not caring about the actual product will eventually kill a company.

Eventually is a long time.

Unfortunately for all of us ("us" being not just software engineers, but everyone impacted by this and similar lack of proper engineering outcomes) it is a proven path to wealth and success to ignore engineering a good product. Build something adequate on the surface and sell it like crazy.

Yeah, eventually enough disasters might kill the company. Countless billions of dollars will have been made and everyone responsible just moves on to the next one. Rinse & repeat.

fsloth

3 replies

21h33m

2024-07-19 20:56:45 UTC

This is the market. Good engineering practices don’t hurt but they are not mandatory. If Boeing can wing it so can everybody.

StressedDev

1 replies

19h56m

2024-07-19 22:33:28 UTC

Boeing has been losing market share to AirBus for decades. That is what happens when you cannot fix your problems, sell a safe product, keep costs in line, etc.

bsaul

0 replies

19h10m

2024-07-19 23:19:57 UTC

i wonder how far from the edge a company driven by business people can go before they start to put the focus back on good engineering. Probably much too late in general. Business bonus are yearly, and good/bad engineering practices take years to really make a difference.

moomin

0 replies

19h9m

2024-07-19 23:21:01 UTC

The question then becomes: if the market is producing near-monopolies of stuff that is barely fit for purpose, how do we fix the market?

MBCook

1 replies

19h59m

2024-07-19 22:30:39 UTC

That’s too much of an excuse.

This isn’t hindsight. It’s “don’t blow up 101” level stuff they messed up.

It’s not that this got past their basic checks, they don’t appear to have had them.

So let’s ask a different question:

The file parser in their kernel extension clearly never expected to run into an invalid file, and had no protections to prevent it from doing the wrong thing in the kernel.

How much you want to bet that module could be trivially used to do a kernel exploit early in boot if you managed to feed it your “update” file?

I bet there’s a good pile of 0-days waiting to be found.

And this is security software.

This is “we didn’t know we were buying rat poison to put in the bagels” level dumb.

Not “hindsight is 20/20”.

SoftTalker

0 replies

19h31m

2024-07-19 22:58:45 UTC

Truly an "the emperor has no clothes" moment.

chatmasta

3 replies

20h0m

2024-07-19 22:29:34 UTC

The actual bug is not that they pushed out a data file with all nulls. It’s that their kernel module crashes when it reads this file.

I’m not surprised that there is no test pipeline for new data files. Those aren’t even really “build artifacts.” The software assumes they’re just data.

But I am surprised that the kernel module was deployed with a bug that crashed on a data file with all nulls.

(In fact, it’s so surprising, that I wonder if there is a known failing test in the codebase that somebody marked “skip” and then someone else decided to prove a point…)

Btw: is that bug in the kernel module even fixed? Or did they just delete the data file filled with nulls?

hansvm

1 replies

19h49m

2024-07-19 22:40:27 UTC

Btw: is that bug in the kernel module even fixed? Or did they just delete the data file filled with nulls?

Is that a real question? They definitely didn't do anything more than delete the file, perhaps just rename it.

chatmasta

0 replies

18h25m

2024-07-20 00:05:10 UTC

Yeah they have been very obfuscatory in calling this a “fix.” I watched the CEO on Cramer and he kind of danced around this point.

SoftTalker

0 replies

19h34m

2024-07-19 22:55:43 UTC

The instructions that my employer emailed were:

  1. Start Windows in Safe Mode or the Windows Recovery Environment (Windows 11 option).
  2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory.
  3. Locate the file matching C-00000291*.sys and delete it.
  4. Restart your device normally.

chrisjj

3 replies

21h26m

2024-07-19 21:03:59 UTC

it could certainly be the case that there was a "real" file that worked and the bug was in the "upload verified artifact to CDN code" or something

I.e. only one link in the chain wasn't tested.

Sorry, but that will not do.

We don't have the answers, but I'm not in a rush to assume that they don't test anything they put out at all on Windows.

The parent post did not suggest they don't test anything. It suggested they did not test the whole chain.

martinky24

2 replies

20h4m

2024-07-19 22:25:58 UTC

From the parent comment:

it’s insane to me that this size and criticality of a company doesn’t have a staging or even a development test server that tests all of the possible target images that they claim to support

I know nothing about Crowdstrike, but I can guarantee that "they need to test target images that they claim to support" isn't what went wrong here. The implication that they don't test against Windows is so incredulous, it's hard to take the poster of that comment seriously.

StressedDev

1 replies

19h51m

2024-07-19 22:39:03 UTC

Thank you for pointing this out. Whenever I read articles about security, or reliability failures, it seems like the majority of the commenters assume that the person or organization which made the mistake is a bunch of bozos.

The fact is mistakes happen (even huge ones), and the best thing to do is learn from the mistakes. The other thing people seem to forget is they are probably doing a lot of the same things which got CrowdStrike into trouble.

If I had to guess, one problem may be that CrowdStrike's Windows code did not validate the data it received from the update process. Unfortunately, this is very common. The lesson is to validate any data received from the network, from an update process, received as user input, etc. If the data is not valid, reject it.

Note I bet at least 50% of the software engineers commenting in this thread do not regularly validate untrusted data.

chrisjj

0 replies

18h23m

2024-07-20 00:06:56 UTC

I'll bet 50% aren't delivering code that can stiff millions of PCs.

And given Crowdstrike are, and data validation neglect is so common, why have they not already learned this lesson?

arp242

26 replies

21h36m

2024-07-19 20:53:59 UTC

Like it’s insane to me that this size and criticality of a company doesn’t have a staging or even a development test server that tests all of the possible target images that they claim to support.

Who is saying they don't have that? Who is saying it didn't pass all of that?

You're making tons of assumptions here.

martinky24

12 replies

21h35m

2024-07-19 20:55:01 UTC

Yeah... the comment above reads like someone who has read a lot of books on CI deployment, but has zero experience in a real world environment actually doing it. Quick to throw stones with absolutely no understanding of any of the nuances involved.

chrisjj

6 replies

21h29m

2024-07-19 21:01:07 UTC

So let's hear the "nuances" that excuse this.

arp242

3 replies

20h57m

2024-07-19 21:32:42 UTC

I am not defending of excusing anything. I am saying there is not enough information to make a judgement one way or the other. Right now, we have almost zero technical details.

Call me old-fashioned and boring, but I'd like to have some basic facts about the situation first. After this I decide who does and doesn't deserve a bollocking.

chrisjj

2 replies

20h24m

2024-07-19 22:06:00 UTC

I think we do have enough info to judge e.g. :This should not have passed a competent C/I pipeline for a system in the critical path."

Thay info includes that the faulty file consisted entirely of zeros.

arp242

1 replies

19h44m

2024-07-19 22:46:23 UTC

That info includes that the faulty file consisted entirely of zeros.

Even that is not certain. Some people are reporting that this isn't the case and that the all-zeroed file may be a "quick hack" to send out a no-op.

So no, we have very little info.

chrisjj

0 replies

18h21m

2024-07-20 00:08:31 UTC

But the all-zero file is version CS has IDed as the cause, right?

jacobr1

0 replies

20h57m

2024-07-19 21:33:15 UTC

Not an excuse - they should be testing for this exact thing - but Crowdstrike (and many similar security tools) have a separation between "signature updates" and "agent/code" updates. My (limited) reading of this situation is that this as a update of their "data" not the application. Now apparently the dynamic update included operating code, just just something the equivalent of a yaml file or whatever, but I can see how different kinds of changes like this go through different pipelines. Of course, that is all the more reason to ensure you have integration coverage.

cweld510

0 replies

21h13m

2024-07-19 21:16:54 UTC

It’s not a matter of excusing or not excusing it. Incidents like this one happen for a reason, though, and the real solution is almost never “just do better.”

Presumably crowdstrike employs some smart engineers. I think it’s reasonable to assume that those engineers know what CI/CD is, they understand its utility, and they’ve used it in the past, hopefully even at Crowdstrike. Assuming that this is the case, then how does a bug like this make it into production? Why aren’t they doing the things that would have prevented this? If they cut corners, why? It’s not useful or productive to throw around accusations or demands for specific improvements without answering questions like these.

AndrewKemendo

4 replies

21h0m

2024-07-19 21:29:50 UTC

There is no nuance needed - this is a giant corporation that sells kernel layer intermediation at global scale. You better be spending billions on bulletproof deployment automation because *waves hands around in the air pointing at whats happening just like with solarwinds*

Bottom line this was avoidable and negligent

For the record I owned global infrastructure as CTO for the USAF Air Operations weapons system - one of the largest multi-classification networked IT systems ever created for the DoD - even moreso during a multi-region refactor as a HQE hire into the AF

So I don’t have any patience for millionaires not putting the work in when it’s critical infrastructure

People need to do better and we need accountability for people making bad decisions for money saving

jonathanzufi

1 replies

19h19m

2024-07-19 23:10:48 UTC

You must have insanely cool stories :-)

What are your thoughts on MSFTs role in this?

They’ve been iterating Windows since 1985 - doesn’t it seem reasonable that their kernel should be able to survive a bad 3rd party driver?

AndrewKemendo

0 replies

17h25m

2024-07-20 01:04:33 UTC

1. System high/network isolation is a disaster in practice and is the root of MSFT and AD/ADFS architecture

2. The problem is the ubiquity of windows so it’s embedded in the infrastructure

We’ve put too many computers in charge of too much stuff for the level of combined capabilities of the computer and the human operator interface

arp242

0 replies

20h51m

2024-07-19 21:38:42 UTC

Almost everything that goes wrong in the world is avoidable one way or the other. Simply stating "it was avoidable" as an axiom is simplistic to the point of silliness.

Lots of very smart people have been hard at work to prevent airplanes from crashing for many decades now, and planes still crash for all sorts of reasons, usually considered "avoidable" in hindsight.

Nothing is "bulletproof"; this is a meaningless buzzword with no content. The world is too complex for this.

HL33tibCe7

0 replies

20h30m

2024-07-19 21:59:29 UTC

You better be spending billions on bulletproof deployment automation

There is no such thing.

JKCalhoun

11 replies

21h23m

2024-07-19 21:06:35 UTC

To be sure. But the fact is the release broke.

I'm not sure: is having test servers that it passed any better than none at all?

strken

7 replies

20h32m

2024-07-19 21:58:02 UTC

It is absolutely better to catch some errors than none.

In this case it gives me vibes of something going wrong after the CI pipeline, during the rollout. Maybe they needed advice a bit more specific than "just use a staging environment bro", like "use checksums to verify a release was correctly applied before cutting over to the new definitions" and "do staged rollouts, and ideally release to some internal canary servers first".

martinky24

3 replies

20h13m

2024-07-19 22:16:53 UTC

"Have these idiots even heard of CI/CD???" strangely seems to be a common condescending comment in this thread.

I honestly though HN was slightly higher quality than most of the comments here. I am proven wrong.

kristjansson

0 replies

20h0m

2024-07-19 22:30:12 UTC

Big threads draw a lot of people; we regress toward the mean

chuckadams

0 replies

19h40m

2024-07-19 22:49:47 UTC

I honestly though HN was slightly higher quality

HN reminds me of nothing so much as Slashdot in the early 2000's, for both good and ill. Fewer stupid memes about Beowulf Clusters and Natalie Portman tho.

StressedDev

0 replies

19h49m

2024-07-19 22:40:25 UTC

Agreed - The worst part is most of the people making these unhelpful comments are probably doing the same sorts of things which caused this outage.

chuckadams

1 replies

19h41m

2024-07-19 22:49:06 UTC

They almost certainly have such a process, but it got bypassed by accident, probably got put into a "minor updates" channel (you don't run your model checker every time you release a new signature file after all). Surprise, business processes have bugs too.

But naw, must be every random commentator on HN knows how to run the company better.

TeMPOraL

0 replies

10h42m

2024-07-20 07:48:18 UTC

(you don't run your model checker every time you release a new signature file after all)

Wonder if the higher-ups who mandated this software to be installed in their hospitals were informed about that fact.

exe34

0 replies

20h23m

2024-07-19 22:06:40 UTC

I don't understand why you wouldn't do staged roll outs at this scale. even a few hours delay might have been enough to stop the release from going global.

martinky24

0 replies

21h4m

2024-07-19 21:25:37 UTC

Yes, yes it is. Because there's tons more breakages that have likely been caught.

One uncaught downstream failure doesn't invalidate the effort into all the previously caught failures.

leptons

0 replies

19h55m

2024-07-19 22:35:10 UTC

A lot of the software industry focuses on strong types, testing of all kinds, linting, and plenty of other sideshows that make programmers feel like they're in control, but these things only account for the problems you can test for and the systems you control. So what if a function gets a null instead of a float? It shouldn't crash half the tech-connected world. Software resilience is kind of lacking in favor of trusting that strong types and tests will catch most bugs, and that's good enough?

chatmasta

0 replies

19h57m

2024-07-19 22:32:40 UTC

The release didn’t break. A data file containing nulls was downloaded by a buggy kernel module that crashed when reading the file.

For all we know there is a test case that failed and they decided to push the module anyway (“it’s not like anyone is gonna upload a file of all nulls”).

Btw: where are these files sourced from? Could a malicious Crowdstrike customer trick the system into generating this data file, by e.g. reporting it saw malware with these (null) signatures?

ikiris

0 replies

19h39m

2024-07-19 22:50:25 UTC

Dude, the fact that it breaks directly.

You sound like the guy that a few years ago tried to argue (the company in question) tested os code that didn't include any drivers for their gear's local storage. Its obvious it wasn't to anyone competent.

russdill

7 replies

21h58m

2024-07-19 20:31:50 UTC

You can have all the CI, staging, test, etc. If some bug after that process nulls the file, the rest doesn't matter

fabian2k

1 replies

21h50m

2024-07-19 20:40:09 UTC

Those signature files should have a checksum, or even a digital signature. I mean even if it doesn't crash the entire computer, a flipped bit in there could still turn the entire thing against a harmless component of the system and lead to the same result.

HL33tibCe7

0 replies

20h28m

2024-07-19 22:01:36 UTC

What happens when your mechanism for checksumming doesn't work? What happens when your mechanism for installing after the checksum is validated doesn't work?

It's just too early to tell what happened here.

The likelihood is that it _was_ negligence. But we need a proper post-mortem to be able to determine one way or another.

Jtsummers

1 replies

21h55m

2024-07-19 20:35:17 UTC

If a garbage file is pushed out, the program could have handled it by ignoring it. In this case, it did not and now we're (the collective IT industry) dealing with the consequences of one company that can't be bothered to validate its input (they aren't the only ones, but this is a particularly catastrophic demonstration of the importance of input validation).

russdill

0 replies

21h48m

2024-07-19 20:42:23 UTC

I'll agree that this appears to have been preventable. Whatever goes through CI should have a hash, deployment should validate that hash, and the deployment system itself should be rigorously tested to insure it breaks properly if the hash mismatches at some point in the process

jononor

0 replies

19h30m

2024-07-19 22:59:46 UTC

The issue is not that a file with nulls was produced. It is that an invalid file (or any kind) can trigger a blue screen of death.

LorenPechtel

0 replies

21h22m

2024-07-19 21:07:24 UTC

Yup. I had quite a battle with some sort of system bug (never fully traced) where I wrote valid data but what ended up on disk was all zero. It appeared to involve corrupted packets being accepted as valid.

It doesn't matter how much you test if something down the line zeroes out your stuff.

Cerium

0 replies

21h15m

2024-07-19 21:14:43 UTC

What sort of sane system modifies the build output after testing?

Our release process is more like: build and package, sign package, run CI tests on signed package, run manual tests on signed package, release signed package. The deployment process should check those signatures. A test process should by design be able to detect any copy errors between test and release in a safe way.

carterschonwald

6 replies

22h11m

2024-07-19 20:19:00 UTC

The strange thing is that when I interviewed there years ago with the team that owns the language that runs in the kernel, they said their ci has 20k or 40k machine os combinations/configurations. Surely some of them were vanilla windows!

dboreham

5 replies

21h41m

2024-07-19 20:49:02 UTC

They used synthetic test data in CI that doesn't consist of zeros.

dlisboa

4 replies

21h27m

2024-07-19 21:02:51 UTC

Fuzz testing would've saved the day here.

azemetre

2 replies

21h2m

2024-07-19 21:27:39 UTC

I’m sure some team had it in their backlog for years.

queuebert

0 replies

20h31m

2024-07-19 21:58:39 UTC

That team was probably laid off because they weren't shipping product fast enough.

0x6c6f6c

0 replies

20h38m

2024-07-19 21:51:37 UTC

Oh yeah, FEAT#927261? Would love to see that ticket go out

TeMPOraL

0 replies

10h34m

2024-07-20 07:55:53 UTC

Why not? It's unlikely it was the last null byte in the data file that killed the driver.

0xcafecafe

6 replies

23h7m

2024-07-19 19:23:07 UTC

They could even have done slow rollouts. Roll it out to a geographical region and wait an hour or so before deploying elsewhere.

xyst

2 replies

22h9m

2024-07-19 20:20:41 UTC

Or test in local environments first. Slow rollouts like this tend to make deployments very very painful.

koliber

1 replies

21h32m

2024-07-19 20:58:05 UTC

Slow rollouts can be quite quick. We used to do 3-day rollouts. Day one was a tiny fraction. Day two was about 20%. Day three was a full rollout.

It was ages ago, but from what I remember, the first day rollout did occasionally catch issues. It only affected a small number of users and the risk was within the tolerance window.

We also tested locally before the first rollout.

rplnt

0 replies

21h17m

2024-07-19 21:12:24 UTC

I don't know about this particular update, but when I used to work for an AV vendor we did like 4 "data" updates a day. It is/was about being quick a lot of the time, you can't stage those over 3 days. Program updates are different, drivers of this level were very different (Microsoft had to sign those, among many things).

Not thay it exuces anything, just that this probably wasn't treated as an update at all.

saati

1 replies

21h5m

2024-07-19 21:24:51 UTC

In theory CrowdStrike protects you from threats, leaving regions unprotected for an hour would be an issue.

Thaxll

0 replies

19h55m

2024-07-19 22:34:33 UTC

Not really, even for security updates are not needed by the minute. Do you think Microsoft rollout world wide updates to everyone?

daseiner1

0 replies

21h37m

2024-07-19 20:52:54 UTC

You say even (emphasis mine). Is this not industry standard?

dheera

2 replies

21h26m

2024-07-19 21:03:32 UTC

I don't know if people on Microsoft ecosystems even know what CI pipelines are.

Linux and Unix ecosystems in general work by people thoroughly testing and taking responsibility for their work.

Windows ecosystems work by blame passing. Blame Ron, the IT guy. Blame Windows Update. Blame Microsoft. That's how stuff works.

It has always worked this way.

But also, all the good devs got offered 3X the salary at Google, Meta, and Apple. Have you ever applied for a job at CrowdStrike? No? That's why they suck.

* A disproportionately large number of Windows IT guys are named Ron, in my experience.

kabdib

1 replies

20h59m

2024-07-19 21:31:15 UTC

That's a pretty broad brush.

jfyi

0 replies

18h45m

2024-07-19 23:45:01 UTC

Eh, it's not too broad... I think we should ask how Ron feels about the characterization though.

miki123211

1 replies

21h1m

2024-07-19 21:29:17 UTC

Keep in mind that this was probably a data file, not necessarily a code file.

It's possible that they run tests on new commits, but not when some other, external, non-git system pushes out new data.

Team A thinks that "obviously the driver developers are going to write it defensively and protect it against malformed data", team B thinks "obviously all this data comes from us, so we never have to worry about it being malformed"

I don't have any non-public info about what actually happened, but something along these lines seems to be the most likely hypothesis to me.

Edit: Now what would have helped here is a "staged rollout" process with some telemetry. Push the update to 0.01% of your users and solicit acknowledgments after 15 minutes. If the vast majority of systems are still alive and haven't been restarted, keep increasing the threshold. If, at any point, too many of the updated systems stop responding or indicate a failure, immediately stop the rollout, page your on-call engineers and give them a one-click process to completely roll the update back, even for already-updated clients.

This is exactly the kind of issue that non-invasive, completely anonymous, opt-out telemetry would have solved.

adzm

0 replies

20h39m

2024-07-19 21:50:26 UTC

This was a .dll in all but name fwiw.

sonotathrowaway

0 replies

22h16m

2024-07-19 20:14:16 UTC

That’s not even getting into the fuckups that must have happened to allow a bad patch to get rolled out everywhere all at once.

notabee

0 replies

22h33m

2024-07-19 19:56:36 UTC

Without delving into any kind of specific conspiratorial thinking, I think people should also include the possibility that this was malicious. It's much more likely to be incompetence and hubris, but ever since I found out that this is basically an authorized rootkit, I've been concerned about what happens if another Solarwinds incident occurs with Crowdstrike or another such tool. And either way, we have the answer to that question now: it has extreme consequences. We really need to end this blind checkbox compliance culture and start doing real security.

hnlmorg

0 replies

21h52m

2024-07-19 20:38:12 UTC

It seems unlikely that a file entirely full of null characters was the output of any automated build pipeline. So I’d wager something got built, passed the CI tests, then the system broke at some point after that when the file was copied ready for deployment.

But at this stage, all we are doing is speculating.

dagaci

0 replies

21h45m

2024-07-19 20:45:13 UTC

/* Acceptance criteria #1: do not allow machine to boot if invalid data signatures are present, this could indicate a compromised system. Booting could cause presidents diary to transmit to rival 'Country' of the week */

if(dataFileIsNotValid) { throw FatalKernelException("All your base are compromised"); }

EDIT+ Explanation:

With hindsight not booting may be exactly the right thing to do since a bad datafile would indicate a compromised distribution/ network.

The machines should not fully boot until file with valid signature is downloaded.*

ar_lan

0 replies

20h26m

2024-07-19 22:04:00 UTC

tests all of the possible target images that they claim to support.

Or even at the very least the most popular OS that they support. I'm genuinely imagining right now that for this component, the entirety of the company does not have a single Windows machine they run tests on.

0cf8612b2e1e

47 replies

23h5m

2024-07-19 19:24:54 UTC

On the plus side of this disaster, I am holding out some pico-sized hope that maybe organizations will rethink kernel level access. No, random gaming company, you are not good enough to write kernel level anti cheat software.

pvillano

37 replies

22h10m

2024-07-19 20:19:47 UTC

imo anti-cheat should mostly be server-side behavior based

gruez

36 replies

22h8m

2024-07-19 20:21:57 UTC

How are you going to catch wallhackers that aren't blatantly obvious?

mrguyorama

14 replies

21h54m

2024-07-19 20:35:40 UTC

The only reason wallhacking is possible in the first place is a server sending a client information on a competitor that the client should not know about.

IE the server sends locations and details about all players to your client, even if you are in the spawn room and can't see anyone else and your client has to hide those details from you. It is then trivial to just pull those details out of memory.

The solution forever has been to just not send clients information they shouldn't have. My copy of CS:GO should not know about a terrorist on the other side of the map. The code to evaluate that literally already exists, since the client will answer that question when it goes to render visuals and sound. They just choose to not do that testing server side.

Aimbotting however is probably impossible to stop. Your client has to know where the model for an enemy is to render it, so you know where the hitbox roughly should be, and most games send your client the hitbox info directly so it can do predict whether you hit them. I don't think you can do it behaviorally either.

snailmailman

3 replies

21h44m

2024-07-19 20:45:57 UTC

To some extent though- the games do need information about players that are behind walls. In CSGO/CS2, even if you can’t see the player you can hear their footsteps or them reloading, etc. the sound is very positional. Plus, you can shoot through some thin walls at these players. Even if they can’t be seen.

I don’t believe server side anti cheat can truly be effective against some cheats. But also Vanguard is trash and makes my computer bluescreen. I’ve stopped playing league entirely because of it.

pohuing

1 replies

20h34m

2024-07-19 21:55:52 UTC

You don't happen to have used some means to install win 11 on an unsupported device have you? People bypassing the windows install requirements and then vanguard making false assumptions have been a source of issues.

snailmailman

0 replies

2h23m

2024-07-20 16:06:28 UTC

I’m on windows 10. Vanguard always complains about some driver that it prevents from running. (Might be fan control I think? Hard to figure out) And in addition to that it’s caused several blue screens for me. (Kernel access violations I think?) The blue screens say the error was in vgk.sys which is vanguard. It’s not at all consistent, but when it happens it’s as I’m getting into a game, so I miss part of a match for it every time.

0cf8612b2e1e

0 replies

21h39m

2024-07-19 20:50:52 UTC

Nit, but surely hit detection happens on the server? Shooting wildly should always register a hit, regardless of what the client knows.

SigmundA

3 replies

21h24m

2024-07-19 21:06:08 UTC

So the server must render the 3d world from each players perspective to do these tests? Sounds ridiculously expensive.

Ukv

2 replies

20h58m

2024-07-19 21:31:35 UTC

So the server must render the 3d world from each players perspective to do these tests?

Just some raycasts through the geometry should be sufficient, which the server is already doing (albeit on likely-simplified collision meshes) constantly.

If you really do have a scenario where occlusion noticeably depends on more of the rendering pipeline (a window that switches between opaque and transparent based on a GPU shader?) you could just treat it as always transparent for occlusion checking and accept the tiny loss that wallhackers will be able to see through it, or add code to simulate that server-side and change the occlusion geometry accordingly.

SigmundA

1 replies

18h48m

2024-07-19 23:42:04 UTC

Server side hit detection is nowhere near as complex as occlusion, you need to make sure you account for latency or you get people popping into existence after they round the corner.

Here is one for CS:GO, server occlusion geometry must be made separately as auto generating from map is another hard problem.

Not "just some raycasts" otherwise everyone would be doing it:

https://github.com/87andrewh/CornerCullingSourceEngine

Ukv

0 replies

17h40m

2024-07-20 00:49:51 UTC

Not "just some recast" otherwise everyone would be doing it:

Only needing raycasts through the geometry was in response to the idea that the server would need to "render the 3d world from each players perspective to do these tests". I don't intend to claim that it's as a whole an easy engineering problem, but rather that the tests don't need to involve much/any of the rendering pipeline.

kjkjadksj

1 replies

21h15m

2024-07-19 21:14:46 UTC

Of course you can. You can measure telemetry like where the aimpoint is on a hitbox. Is it centered or at least more accurate than your globabl population? Hacker, ban. How about time to shoot after hitting target? Are they shooting instantly, is the delay truly random? If not then banned. You can effectively force the hacking tools to only be about as good as a human player, at which point it hardly matters whether you have hackers or not.

Of course, no one handles hacking like this because its cheaper to just ship fast and early and never maintain your servers. Not even valve cares about their games and they are the most benevolent company in the industry.

nemothekid

0 replies

20h53m

2024-07-19 21:37:05 UTC

Valve does not have kernel level anticheat. Faceit does. Most high ranked players prefer to play on Faceit because of the amount of cheaters in normal CS2 matchmaking.

bsder

0 replies

21h29m

2024-07-19 21:00:43 UTC

The only reason wallhacking is possible in the first place is a server sending a client information on a competitor that the client should not know about.

Some information is required to cover the network and server delays.

The client predicts what things should look like and then corrects to what they actually are if there is a discrepancy with the server. You cannot get around this short of going back to in-person LAN games.

anonymoushn

0 replies

21h43m

2024-07-19 20:46:39 UTC

You may have players complain that when they walk around a corner, the enemy who they should be able to see immediately is briefly invisible.

andy81

0 replies

21h33m

2024-07-19 20:56:38 UTC

Aside from aimbots, there's plenty of abusable legitimate information exposed to the client.

E.g. For CS:GO, the volume of footsteps and gunshots vary by distance so you could use them to triangulate an enemy's position.

Bognar

0 replies

17h39m

2024-07-20 00:50:47 UTC

You have no idea what you're talking about. Even mediocre FPS players use positional sound as a huge cue for how to react, which means the client is going to know the positions of every player and non-player entity in a medium range. That's a _huge_ advantage to any hacked client.

Even if the FPS had no sound, an aimbot-free hacked client that knows the positions of only the players that you can see would still provide a significant benefit because a high fidelity radar and pointers can be placed on the screen. No one can sneak up on you. And no you can't base it on which way the player is looking because you can turn around faster than network latency.

Can you limit the impact of hacked clients? Sure, people might not be able to do egregious stuff like X-ray your position from across the map. Locally, though, game clients need a large amount of information to provide a good experience.

JasonSage

7 replies

22h1m

2024-07-19 20:28:33 UTC

You may not, and that's ok.

chowells

6 replies

21h48m

2024-07-19 20:42:19 UTC

It's not ok for people playing those games. They'll quit playing that game and go to one with invasive client-side anti-cheat instead.

The incentives and priorities are very different for people who want to play fair games than they are for people who want to maximize their own freedom.

kjkjadksj

3 replies

21h19m

2024-07-19 21:10:43 UTC

This is a solved issue already. Vote kicks or server admin intervention. Aimbotting was never an issue for the old primitive fps games I would play because admins could spectate and see you are aimbotting.

A modern game need only telemetry that captures what a spectating admin picks up, rather than active surveillance.

Hackers are only a problem when servers are left unmoderated and players can’t vote kick.

nemothekid

1 replies

20h56m

2024-07-19 21:33:38 UTC

You can't have vote kicks/server admins/hosted servers with competitive ranked ladders. If your solution is "don't have competitive ranked ladders" then you are just telling the majority of people who even care about anti-cheat to just not play their preferred game mode.

kjkjadksj

0 replies

17h13m

2024-07-20 01:16:36 UTC

Why can’t you have that with competitive ladders? Presumably theres still mechanisms to kick people in game if they start for example spewing racist messages in the game. What difference is it to kick someone one way or another? Not to mention plenty of games with vote kick mechanisms did have strong competitive scenes.

chowells

0 replies

20h52m

2024-07-19 21:37:43 UTC

That stopped being a solution when winning online started mattering. There are real money prizes for online game tournaments. Weekly events can have hundreds of dollars in their prize pools. Big events can have thousands.

Suddenly vote kicking had to go, because it was abused. Not in the tournaments themselves, but in open ranked play which serves as qualifiers. An active game can rack up thousands of hours of gameplay per day, far beyond the ability of competent admins to validate. Especially because cheating is often subtle. An expert can spend more than real time looking for subtle patterns that automated tools haven't been built to detect.

Games aren't between you and your 25 buddies for bragging rights anymore. They're between you and 50k other active players for cash prizes. The world has changed. Anti-cheat technology followed that change.

JasonSage

1 replies

20h13m

2024-07-19 22:17:19 UTC

I play one of those games that doesn’t strongly enforce anti-cheating, and I agree with you that it’s a huge detraction compared to games with strong anti-cheat.

But I strongly disagree about the use of invasive client-side anti-cheat. Server-side anti-cheat can reduce the number of cheaters to an acceptably low level.

See for example how lichess detects and aids in detection of cheaters: https://github.com/clarkerubber/irwin

And chess is a game where I feel like it would be relatively hard to detect cheating. An algorithm looking at games with actors moving in 3D space and responding to relative positions and actions of multiple other actors should have a great many more ways to detect cheating over the course of many games.

JasonSage

0 replies

20h6m

2024-07-19 22:23:24 UTC

And frankly, I think the incentive structure has nothing to do with whether tournaments are happening with money on the line, and a great deal more whether the company has the cash and nothing better to do.

Anti-cheat beyond a very basic level is nothing to these companies except a funnel optimization to extract the maximum lifetime value out of the player base. Only the most successful games will ever have the money or reach the technical capability to support this. Nobody making these decisions is doing it for player welfare.

Ukv

6 replies

21h40m

2024-07-19 20:49:33 UTC

Minimize the possible advantage by not sending the client other players' positions until absolutely necessary (either the client can see the other player, or there's a movement the client could make that would reveal the other player before receiving the next packet), and eliminate the cheaters you can with server-side behavior analysis and regular less-invasive client-side anticheat.

Ultimately even games with kernel anticheat have cheating issues; at some point you have to accept that you cannot stop 100.0% of cheaters. The solution to someone making an aimbot using a physically separate device (reading monitor output, giving mouse input) cannot be to require keys to the player's house.

lutoma

5 replies

21h33m

2024-07-19 20:56:57 UTC

not sending the client other players' positions until absolutely necessary (either the client can see the other player, or there's a movement the client could make that would reveal the other player before receiving the next packet)

I think the problem with this is sounds like footsteps or weapons being fired that need to be positional.

Which makes me wonder if you could get away with mixing these sounds server-side and then streaming them to the client to avoid sending positions. Probably infeasible in practice due to latency and game server performance, but fun to think about.

Ukv

4 replies

21h22m

2024-07-19 21:08:05 UTC

To whatever extent the sound is intended to only give a general direction, I'd say quantize the angle and volume of the sound before it's sent such that cheaters also only get that same vague direction. Obviously don't send inaudible/essentially-inaudible sounds to the client at all.

Workaccount2

3 replies

21h4m

2024-07-19 21:26:02 UTC

They need to just make CPU's, GPU's, and memory modules with hardware level anti-cheat. Totally optional purchase, but grants you access to very-difficult-to-cheat-in servers.

didntcheck

1 replies

20h56m

2024-07-19 21:33:43 UTC

That sort of already exists - I believe a small number of games demand that you have Secure Boot enabled, meaning you should only have a Microsoft-approved kernel and drivers running. And then the anticheat is itself probably kernel level, so can see anything in userspace

It may still be possible to get round this by using your own machine owner key or using PreLoader/shim [1] to sign a hacked Windows kernel

[1] https://wiki.archlinux.org/title/Unified_Extensible_Firmware...

quenix

0 replies

18h22m

2024-07-20 00:08:01 UTC

Another way to get around it is reading the memory directly with direct memory access (DMA) hardware.

bpye

0 replies

20h21m

2024-07-19 22:08:44 UTC

I guess you’ve just invented an Xbox/PlayStation.

kjkjadksj

2 replies

21h22m

2024-07-19 21:07:29 UTC

Did their hitbox clip through the wall? Yes? Banned. You could do it with telemetry.

Arnavion

1 replies

21h3m

2024-07-19 21:26:47 UTC

You're confusing wallhacking with noclipping. Wallhacking is being able to see through walls, like drawing an outline around all characters that renders with highest z-order, or making wall textures transparent.

It does not result in any server-side-detectable difference in behavior other than the hacker seemingly being more aware of their surroundings than they should, which can be hard to determine for sure. Depending on how the hack is done, it may not be detectable by the client either, eg by intercepting the GPU driver calls to render the outlines or switch the wall textures.

kjkjadksj

0 replies

46m

2024-07-20 17:44:03 UTC

if wall between client and another client == true; do dont render other client; fi

Am4TIfIsER0ppos

1 replies

21h31m

2024-07-19 20:58:45 UTC

Standalone servers. Run your own then you can ban anyone you like, or better still only allow anyone you like.

cobalt60

0 replies

21h23m

2024-07-19 21:07:09 UTC

Nothing like sourcemodded server! Good old days!

josephcsible

0 replies

21h16m

2024-07-19 21:13:51 UTC

Stop thinking about trying to catch wallhackers. Instead, make wallhacking impossible. Do that by fixing the server to, instead of sending all player positions to everyone, only send player positions to clients that they have an unobstructed view of.

majormajor

5 replies

22h16m

2024-07-19 20:13:36 UTC

I can't imagine gaming software being affected at all, unless MS does a ton of cracking down (and would still probably give hooks for gaming since they have gaming companies in their umbrella).

No corporate org is gonna bat an eye at Riot's anti-cheat practices, because they aren't installing LoL on their line of business machines anyway.

InitialLastName

2 replies

21h45m

2024-07-19 20:44:59 UTC

Right, MS just paid $75e9 for a company whose main products are competitive multiplayer games. They are never going to be incentivized to compromise that sector by limiting what anti-cheat they can do.

Y_Y

1 replies

21h26m

2024-07-19 21:03:28 UTC

That's 7.5e10 USD in SI.

InitialLastName

0 replies

19h37m

2024-07-19 22:53:07 UTC

Engineering notation used for prefix convenience.

tgsovlerkhgsel

0 replies

19h43m

2024-07-19 22:46:36 UTC

because they aren't installing LoL on their line of business machines anyway

But if their business is incompatible with strict software whitelisting, their employees might...

minetest2048

0 replies

20h24m

2024-07-19 22:05:59 UTC

Until the malware bring their own compromised signed anti cheat driver on their own, like what happened with Genshin Impact anti cheat mhyprot2

frizlab

1 replies

21h14m

2024-07-19 21:15:25 UTC

Unless I’m mistaken on macOS at least kernel access is just not possible, so at least there’s that.

heraldgeezer

0 replies

18h57m

2024-07-19 23:32:45 UTC

And Valorant, CS2, CoD dont run on macOS due to anti cheat :)

pityJuke

0 replies

20h59m

2024-07-19 21:31:04 UTC

The problem you’re fighting is cheat customers who go “random kernel-level driver? no problem!”

kragen

32 replies

21h11m

2024-07-19 21:18:27 UTC

this seems like the second or third test file any qa person would have tried, after an empty file and maybe a minimal valid file. the level of pervasive incompetence implied here is staggering

in a market where companies compete by impressing nontechnical upper management with presentations, it should be no surprise that technically competent companies have no advantage over incompetent ones

i recently read through the craig wright decision https://www.judiciary.uk/judgments/copa-v-wright/ (the guy who fraudulently claimed to be satoshi nakamoto) and he lacked even the most basic technical competence in the fields where he was supposedly a world-class specialist (decompiling malware to c); he didn't know what 'unsigned' meant when questioned on the witness stand. he'd been doing infosec work for big companies going back to the 90s. he'd apparently been tricking people with technobabble and rigged demos and forged documents for his entire career

george kurtz, ceo and founder of crowdstrike, was the cto of mcafee when they did the exact same thing 14 years ago: https://old.reddit.com/r/sysadmin/comments/1e78l0g/can_crowd... https://en.wikipedia.org/wiki/George_Kurtz

crowdstrike itself caused the same problem on debian stable three months ago: https://old.reddit.com/r/debian/comments/1c8db7l/linuximage6...

it's horrifying that pci compliance regulations have injected crowdstrike (and antivirus) into virtually every aspect of today's it infrastructure

GardenLetter27

10 replies

21h4m

2024-07-19 21:25:54 UTC

Also ironic that the compliance ended up introducing the biggest vulnerability as a massive single point of failure.

But that's government regulation for you.

kragen

3 replies

20h54m

2024-07-19 21:35:47 UTC

pci-dss is not a government agency but it might as well be; it's a collusion between visa, mastercard, american express, discover, and jcb to prevent them from having different data security standards (and therefore being able to compete on security)

derefr

2 replies

20h41m

2024-07-19 21:48:52 UTC

You mean “and therefore requiring businesses that take credit cards to enforce the union of all the restrictions imposed by all six companies (which might not even be possible—the restrictions might be contradictory) in order to accept all six types of cards”

kragen

1 replies

19h22m

2024-07-19 23:08:18 UTC

you could imagine that if visa and mastercard had incompatible restrictions, such that merchants had to choose which one to accept, people would gradually shift to the card that got their money stolen less often, or whose (merchants') websites worked more reliably

since, instead, every business that uses credit cards at all is required to use the same set of counterproductive 'security' policies that make them more vulnerable than before, there is no way for market reality to ground these 'security' policies in real-world security. that's exactly the same problem that happens with government regulation

jfyi

0 replies

18h34m

2024-07-19 23:55:39 UTC

I've seen several merchants stop using amex because of restrictions. Admittedly those restrictions were very much in the form of excessive fees and nobody would have cared about their customer's stolen money.

Anyway, there isn't total solidarity within the credit card cartel.

babypuncher

1 replies

20h9m

2024-07-19 22:21:22 UTC

this had nothing to do with government regulation, thank private sector insurance companies.

czbond

0 replies

19h59m

2024-07-19 22:30:47 UTC

Pci-dss is a method for card companies to allay the risk onto the merchant and away from the card companies - just like insurance.

phatfish

0 replies

19h57m

2024-07-19 22:32:52 UTC

It seems government IT systems in general faired pretty well the last 12 hrs, but loads of large private companies were effectively taken offline, so there's that.

moandcompany

0 replies

20h1m

2024-07-19 22:28:55 UTC

It's definitely ironic, and compatible with the security engineering world joke that the most secure system is one that cannot be accessed or used at all.

I suppose one way to "stop breaches" is to shut down every host entirely.

In the military world, there is a concept of an "Alpha Strike" which generally relates to a fast-enough and strong-enough first-strike that is sufficient to disable the adversary's ability to respond or fight back (e.g. taking down an entire fleet at once). Perhaps people that have been burned by this event will start calling it a Crowdstrike.

bandyaboot

0 replies

19h17m

2024-07-19 23:12:26 UTC

What’s frustrating is that after falsely attributing pci to government regulation and being corrected, you’re probably not going to re-examine the thought process that led you to that false belief.

acdha

0 replies

20h34m

2024-07-19 21:56:12 UTC

But that's government regulation for you.

You misspelled “private sector”. Use of endpoint monitoring software is coming out of private auditing companies driven by things like PCI or insurers’ requirements – almost nobody wants to pay for highly-skilled security people so they’re outsourcing it to the big auditing companies and checklists so that if they get sued they can say they were following industry practices and the audit firms okayed it.

martin-t

5 replies

20h31m

2024-07-19 21:59:21 UTC

We as a society need to start punishing incompetence the same way we punish malice.

Of course, we also need to first start punishing individuals for intentionally causing harm through their decisions even if the harm was caused indirectly through other people. Power allows people to distance themselves from the act. Distance should not affect the punishment.

chuckadams

2 replies

19h12m

2024-07-19 23:17:40 UTC

So which individual engineer are you locking up and telling their kids that they don't get to see their mommy or daddy again?

intentionally causing harm

Maybe we need to criminalize groundless accusations too.

martin-t

0 replies

11h19m

2024-07-20 07:11:08 UTC

Don't pretend you don't understand.

Failures like this aren't an individual programmer's responsibility. I can guarantee you programmers have been asking to implement staggered releases and management shut them down. In fact another poster in this thread showed that even customers asked for that and were rejected with 50 pages of corporate BS which basically amounted to "we don't wanna".

Same goes for other safety mechanisms like testing.

Maybe we need to criminalize groundless accusations too.

I never accused them of intentionally causing harm. Please read my comment more carefully. I said what i said because in the current environment, even doing this intentionally would not lead to sufficient punishment to top management.

SturgeonsLaw

0 replies

14h16m

2024-07-20 04:13:57 UTC

I hear it was that Ron guy's fault

yourapostasy

0 replies

19h18m

2024-07-19 23:12:21 UTC

> Of course, we also need to first start punishing individuals for intentionally causing harm through their decisions even if the harm was caused indirectly through other people.

Legal teams are way ahead of you here, by design. They’ve learned to factor out responsibility in such a diffuse manner the “indirectly” term loses nearly all its power the closer one gets to executive ranks and is a fig leaf when you reach C-levels and BoD levels. We couldn’t feasibly operate behemoth corporations otherwise in litigious environments.

I personally suspect we cannot enshrine earnestness and character as metrics and KPI’s, even indirectly. From personal experience selling into many client organizations and being able to watch them make some sausage, this is where leadership over managing becomes a key distinction, and organizational culture over a long period measured in generations of decisions and actions implemented by a group with mean workforce tenure measured in decades becomes one of the few reliable signals. Would love to hear others’ experiences and observations.

worik

0 replies

19h58m

2024-07-19 22:32:21 UTC

We as a society need to start punishing incompetence the same way we punish malice.

Yes

But competence is marketed

The trade names like "Crowdstrike" and "Microsoft "

baxtr

5 replies

19h39m

2024-07-19 22:50:32 UTC

I think the worst part of the incident is that state actors now have a clear blueprint for a large scale infrastructure attack.

IAmNotACellist

2 replies

19h36m

2024-07-19 22:53:34 UTC

I can think of a lot better things to put in a kernel-level driver installed on every critical computer ever than a bunch of 0s.

kragen

0 replies

19h11m

2024-07-19 23:18:55 UTC

technically the 0s were a data file read by the driver, not the driver itself

BigParm

0 replies

19h22m

2024-07-19 23:08:11 UTC

Like what? Kernels and drivers are out of my wheelhouse, I'm just curious.

Guthur

1 replies

19h17m

2024-07-19 23:12:31 UTC

You're assuming it wasn't an attack.

Just the same week Kaspersky gets kicked from the us market...

jfyi

0 replies

18h13m

2024-07-20 00:16:58 UTC

They are assuming that crowdstrike is telling them the truth, but yeah I keep thinking about Kaspersky too. That's a whole other pile of assumptions though.

worik

3 replies

20h0m

2024-07-19 22:29:31 UTC

george kurtz, ceo and founder of crowdstrike, was the cto of mcafee when they did the exact same thing 14 years ago: https://old.reddit.com/r/sysadmin/comments/1e78l0g/can_crowd...

I find it amusing that the people commenting on that link are offended this called a "Microsoft " outage, when it is "Crowdstrike's fault".

This is just as much a Microsoft failure.

This is even more, another industry failure

How many times does this have to happen before we get some industry reform that lets us do our jobs and build the secure reliable systems we have spent seven decades researching?

1988 all over again again again

TeMPOraL

1 replies

19h37m

2024-07-19 22:53:15 UTC

It's simple: the failure is not specific to the OS.

Crowdstrike runs on MacOS and Linux workstations too. And it's just as dangerous there; the big thread has stories of Crowdstrike breaking Linux systems in the past months.

Crowdstrike isn't needed by/for Windows, it's mandated by corporate and government bureaucracies, where it serves as a tool of employee control and a compliance checkbox to check.

That's why it makes no sense to blame Microsoft. If the world run on Linux, ceteris paribus, Crowdstrike would be on those machines too, and would fuck them up just as bad globally.

BHSPitMonkey

0 replies

19h21m

2024-07-19 23:08:44 UTC

That's why it makes no sense to blame Microsoft. If the world run on Linux, ceteris paribus, Crowdstrike would be on those machines too, and would fuck them up just as bad globally.

CrowdStrike _is_ running on a massive number of Linux servers/endpoints, with the same kind of auto-update policies.

kragen

0 replies

18h58m

2024-07-19 23:31:28 UTC

the reforms you're hoping for are not going to happen until the countries with bad infosec get conquered by the countries with good infosec. that is going to be much worse than you can possibly imagine

it's not microsoft's fault at all; crowdstrike caused the same problem on debian systems three months ago. the only way it could be microsoft's fault is if letting users install kernel drivers is microsoft's fault

emporas

3 replies

18h57m

2024-07-19 23:33:19 UTC

he lacked even the most basic technical competence in the fields where he was supposedly a world-class specialist

Pretty ironic Craig also suggests to put digital certificates on the blockchain instead of them issued by a computer or a cluster of computers, in one place, which makes them a target. A peer to peer network is much more resilient to attacks.

Was the problem, or a part of the problem, that the root certificate of all computers was replaced by CrowdStrike? I didn't follow that part closely. If it was, then certificates registered on the blockchain might be a solution.

I'm totally not an expert in cryptography, but that seems plausible.

As a side note, it is only when i heard Craig's Wright explanation of the blockchain that the bitcoin whitepaper started making a lot of sense. He may not be the greatest of coders, but are mathematicians useless in security?

kragen

2 replies

18h23m

2024-07-20 00:06:38 UTC

none of your comment makes any sense. possibly it was generated by a large language model with no understanding of what is being talked about

emporas

1 replies

17h54m

2024-07-20 00:35:33 UTC

is it not the problem, that CrowdStrike takes hold of the root certificate authority of a computer? There is an update, and after that what is causing the problem?

If it is the certificate, that could be solved in the future by putting the certificates on the blockchain. If it is just incompetent management, then i don't see a great deal of technical solutions.

Craig Wright, claims to be a mathematician not a stellar coder.

kragen

0 replies

17h13m

2024-07-20 01:17:18 UTC

that comment does make some sense, but it is still wrong from beginning to end; everything it says is close to the opposite of the truth

worstspotgain

0 replies

20h41m

2024-07-19 21:48:51 UTC

I don't mean to sound conspiratorial, but it's a little early to rule out malfeasance just because of Hanlon's Razor just yet. Most fuckups are not on a ridonkulous global scale. This is looking like the biggest one to date, the Y2K that wasn't.

millero

15 replies

23h14m

2024-07-19 19:16:15 UTC

Yes, this fits in with what I heard on the grapevine about this bug from a friend who knows someone working for Crowdstrike. The bug had been sitting there in the kernel driver for years before being triggered by this flawed data, which actually was added in a post-processing step of the configuration update - after it had been tested but before being copied to their update servers for clients to obtain.

Apparently, Crowdstrike's test setup was fine for this configuration data itself, but they didn't catch it before it was sent out in production, as they were testing the wrong thing. Hopefully they own up to this, and explain what they're going to do to prevent another global-impact process failure, in whatever post-mortem writeup they may release.

finaard

13 replies

22h6m

2024-07-19 20:24:04 UTC

You need to be a very special kind of stupid to think postprocessing anything after you've tested it is a good idea.

dgfitz

5 replies

21h27m

2024-07-19 21:02:26 UTC

Hmm, I post-process autonomous vehicle logs probably daily.

Why is this stupid? It’s pretty useful to see a graph of coolant temp vs ambient temp vs motor speed vs roll/pitch.

I must be especially stupid I suppose. Nuts.

Flockster

3 replies

21h19m

2024-07-19 21:10:55 UTC

That is not remotely what was meant..

dgfitz

2 replies

21h10m

2024-07-19 21:20:20 UTC

Perhaps word choice and sentence structure are important then.

jmull

0 replies

20h39m

2024-07-19 21:50:27 UTC

I don't think people should restate the basic context of the thread for every post... That's a lot of work and noise, and probably the same people who ignore the thread context would also ignore any context a post provided.

heylook

0 replies

20h48m

2024-07-19 21:42:12 UTC

What you have just said is one of the most insanely idiotic things I have ever heard. At no point in your rambling, incoherent response were you even close to anything that could be considered a rational thought. Everyone in this room is now dumber for having listened to it. I award you no points, and may God have mercy on your soul.

wri321

0 replies

21h6m

2024-07-19 21:23:57 UTC

This is comparable to modifying the system under test after it has been validated and not simply looking at recorded data.

dist-epoch

4 replies

20h52m

2024-07-19 21:37:24 UTC

"We need to ship this by Friday. Just add a quick post-processing step, and we'll fix it next week properly" - how these things tend to happen.

yard2010

3 replies

19h42m

2024-07-19 22:47:58 UTC

In my first engineering job ever, I worked with this snarky boss who was mean to everyone and just said NO every time to everything. She also had a red line: NO RELEASES ON THE LAST DAY OF THE WEEK. I couldn't understand why. Now, 10 years later, I understand I just had the best boss ever. I miss you, Vik.

arp242

1 replies

19h36m

2024-07-19 22:53:39 UTC

I still have a 10-year old screenshot from a colleague breaking production on a Friday afternoon and posting "happy weekend everyone!" just as the errors from production started to flood in on the same chat. And he really did just fuck off leaving us to mop up the hurricane of piss he unleashed.

He was not my favourite colleague.

lovehashbrowns

0 replies

13h17m

2024-07-20 05:12:43 UTC

There’s someone from a company I worked at a few years ago that pumped out some trash on the last week before their two month sabbatical (or three?). I remember how pissed I was seeing the username in the commit because I recognized it from their “see you in a few months” email lmao

kmxr

0 replies

17h1m

2024-07-20 01:28:29 UTC

How would the situation have been any different if this was released on a Monday?

arp242

1 replies

19h40m

2024-07-19 22:49:35 UTC

"I heard on the grapevine from a friend who knows someone working for Crowdstrike" is perhaps not the most reliable source of information, due to the game of telephone effect if nothing else.

And post-processing can mean many things. Could be something relatively simple such as "testing passed, so lets mark the file with a version number and release it".

finaard

0 replies

11h26m

2024-07-20 07:03:49 UTC

Could be something relatively simple such as "testing passed, so lets mark the file with a version number and release it".

I'd argue you shouldn't even do that. When I've been building CI systems pushing stuff to the customer in the past we've always been automatically versioning everything, and the version that's in the artifact you've found to be good is just the one you're going to be releasing.

swat535

0 replies

13h50m

2024-07-20 04:40:14 UTC

I think Crowdstrike is due for more than just "owning up to it". I sincerely hope for serious investigation, fines and legal pursuits despite having "limited liabilities" in their agreements.

Seriously, I don't even know how to do the math on the amount of damage this caused (not including the TIME wasted for businesses as well as ordinary people, for instance those taking flights)

There has to be consequences for this kind of negligence

hn_throwaway_99

13 replies

23h3m

2024-07-19 19:27:01 UTC

On a related note, I don't think that it's a coincidence that 2 of the largest tech meltdowns in history (this one and the SolarWinds hack from a few years ago) were both the result of "security software" run amok. (Also sad that both of these companies are based in Austin, which certainly gives Austin's tech scene a black eye).

IMO, I think a root cause issue is that the "hacker types" who are most likely to want to start security software companies are also the least likely to want to implement the "boring" pieces of a process-oriented culture. For example, I can't speak so much for CrowdStrike, but it came out the SolarWinds had an egregiously bad security culture at their company. When the root cause comes out about this issue dollars-to-donuts it was just a fast and loose deployment process.

ajsnigrutin

2 replies

21h18m

2024-07-19 21:11:43 UTC

Security software needs kernel level access.. if something breaks, you get boot loops and crashes.

Most other software doesn't need that low level of access, and even if it crashes, it doesn't take the whole system with it, and a quick, automated upgrade process is possible.

rahkiin

1 replies

20h59m

2024-07-19 21:30:25 UTC

Security software needs kernel level access.. *on Windows. macOS has an Endpoint Security userland extension api

sharkjacobs

0 replies

20h39m

2024-07-19 21:50:25 UTC

This seems like a pretty clear example of the philosophical divide between MacOS and Windows.

A good developer with access to the kernel can create "better" security software which does less context switching and has less performance impact. But a bad (incompetent or malicious) developer can do a lot more harm with direct access to the kernel.

We see the exact same reasoning with restricting execution of JIT-compiled code in iOS.

moandcompany

1 replies

21h21m

2024-07-19 21:08:28 UTC

Crowdstrike was originally founded and headquartered in Irvine, CA (Southern California). In those days, most of its engineering organization was either remote/WFH or in Irvine, CA.

As they got larger, they added a Sunnyvale office, and later moved the official headquarters to Austin, TX.

They've also been expanding their engineering operations overseas which likely includes offshoring in the last few years.

nullify88

0 replies

20h4m

2024-07-19 22:26:13 UTC

They bought out Humio in Aarhus, Denmark. Now Falcon Logscale.

cedws

1 replies

20h21m

2024-07-19 22:09:21 UTC

the "hacker types" who are most likely to want to start security software companies are also the least likely to want to implement the "boring" pieces of a process-oriented culture

I disagree, security companies suffer from "too big to fail" syndrome where the money comes easy because they have customers who want to check a box. Security is NOT a product you pay for, it's a culture that takes active effort and hard work to embed from day one. There's no product on the market that can provide security, only products to point a finger at when things go wrong.

Andrex

0 replies

20h12m

2024-07-19 22:17:40 UTC

The market is crying for some kind of '10s "agile hype" equivalent for security evangelism and processes.

NegativeK

1 replies

21h27m

2024-07-19 21:03:23 UTC

Alternate hypothesis that's not necessarily mutually exclusive: security software tends to need significant and widespread access. That means that fuckups and compromises tend to be more impactful.

hn_throwaway_99

0 replies

20h22m

2024-07-19 22:08:03 UTC

100% agree with that. The thing that baffles me a bit, then, is that if you are writing software that is so critical and can have such a catastrophic impact when things go wrong, that you double and triple check everything you do - what you DON'T do is use the same level of care you may use with some social media CRUD app (move fast and break things and all that...)

To emphasize, I'm really just thinking about the bad practices that were reported after the SolarWinds hack (password of "solarwinds123" and a bunch of other insider reports), so I can't say that totally applies to CrowdStrike, but in general I don't feel like these companies that can have such a catastrophic impact take appropriate care of their responsibilities.

worstspotgain

0 replies

19h6m

2024-07-19 23:24:04 UTC

The industry's track record is indeed quite peculiar. Just off the top of my head:

- CrowdStrike

- SolarWinds

- Kaspersky

- https://en.wikipedia.org/wiki/John_McAfee

I'm sure I'm forgetting a few. Maybe it's the same self-selection scheme that afflicts the legal cannabis industry?

koliber

0 replies

21h30m

2024-07-19 21:00:10 UTC

Don't forget heartbleed, a vulnerability in OpenSSL, the software that secures pretty much everything.

heraldgeezer

0 replies

18h59m

2024-07-19 23:30:55 UTC

I would add Kaseya VSA Ransomware attack in 2021 to that list.

compacct27

0 replies

21h26m

2024-07-19 21:03:26 UTC

The Austin tech culture is…interesting. I stopped trying to find a job here and went remote Bay Area, and talking to tech workers in the area gave me the impression it’s a mix of slacker culture and hype chasing. After moving back here, tech talent seems like a game of telephone, and we’re several jumps past the original.

When I heard CrowdStrike was here, it just kinda made sense

breadwinner

12 replies

22h3m

2024-07-19 20:26:37 UTC

I blame Microsoft. Why? Because they rely on third parties to fill in their gaps. When I buy a Mac it already has drivers for my printers, but not if I buy a Windows PC. Some of these printers drivers are 250 MB, which is a crazy size for a driver. If it is more than a few 100 KB it means the manufacturer does not know how to make a driver software. Microsoft should make it unnecessary to rely on crappy third party software so much.

luuurker

8 replies

21h45m

2024-07-19 20:45:19 UTC

CrowdStrike's mess up is CrowdStrike's fault, not Microsoft's. We might not like the way Windows works, but it usually works fine and more restrictive systems also have downsides. In any case, it was CrowdStrike who dropped the ball and created this mess.

I don't like what Microsoft is doing with Windows and only use it for gaming (I'm glad Linux is becoming a good option for that), so I'm far from being a "Microsoft fan", but Windows is very good at installing the software needed. Plug a GPU, mouse, etc, from any well known brand and it should work without you doing much.

I didn't have to install anything on my Windows PC (or my MBP) last time I bought a new printer (Epson). The option to let Windows install the drivers needed is enabled though... some people disable that.

taftster

3 replies

19h27m

2024-07-19 23:03:02 UTC

Windows should NEVER blue screen for third-party software. This is definitely as much Microsoft's fault as anyone's.

Maybe Microsoft doesn't offer an ABI or whatever other access is needed for the CS module. So there's some work that needs to be done on the kernel to enable the functionality to run in user space. Third-party libraries should not have privileged access to the kernel internals.

In this case, the CloudStrike kernel module attempted to read a bogus data file and crashed. The Windows kernel should detect the module crash, inform you in some meaningful way, but continue to boot. Maybe even in some limited execution model, but not a blue screen.

CloudStrike should have tested better. A LOT better. Multiple layers of tests better. They should most definitely take serious their obligations to the millions of customers that are affected by this. But Microsoft needs to own some of the responsibility too, by allowing a 3rd party kernel module to even be able to bring down its system. Just because the program was signed, doesn't mean when it crashes, Windows should crash with it.

blahyawnblah

1 replies

18h57m

2024-07-19 23:32:24 UTC

Windows should NEVER blue screen for third-party software. This is definitely as much Microsoft's fault as anyone's.

It's Microsoft's fault that someone wrote drivers for its OS?

breadwinner

0 replies

17h29m

2024-07-20 01:00:48 UTC

It is Microsoft's fault that computers running Windows caused 911 systems to fail, surgeries to be postponed, flights to be canceled and so on. Yes, a third party was involved, but why didn't Microsoft have a system for vetting third parties before giving them kernel access?

Apple won't let you add an app to their app store without a vetting process. Microsoft lets you alter the kernel without a vetting process? How many people died today because they didn't get surgeries and because of failed 911 calls?

luuurker

0 replies

18h58m

2024-07-19 23:32:19 UTC

Windows should NEVER blue screen for third-party software.

Agreed, it's a Windows weakness. It should be improved. But if the system boots normally even if something like CrowdStrike fails, then you create other problems, which may be more serious than a BSOD.

I don't put the blame on Microsoft because enterprise customers should know how Windows works. Should we use Windows and some low level, 3rd party, remotely updated software (without a slow, controlled rollout) if you can't have BSODs? Yeah...

It's a big "cock up". I blame mostly CrowdStrike because they're the ones who messed up. Then we have those who decided to use Windows with CrowdStrike even on very sensitive systems. Microsoft is at fault for creating an OS with this weakness, but this is how Windows works... if it's not good enough for the task, don't use it.

breadwinner

3 replies

21h12m

2024-07-19 21:17:44 UTC

CrowdStrike's mess up is CrowdStrike's fault, not Microsoft's.

Disagree. It is everyone's fault. It is CrowdStrike's fault for not testing their product. It is Microsoft's fault for allowing CrowdStrike to mess with kernel and not vetting such critical third parties. It is the end customers' fault for installing crapware and not vetting the vendor.

yoavm

1 replies

20h19m

2024-07-19 22:10:29 UTC

so now we're vouching for more restrictive operating systems? the last thing I want is an operating system that can only install vetted apps, and that these apps are restricted even if I provide my root password.

taftster

0 replies

19h21m

2024-07-19 23:08:37 UTC

Allowing an application to run with privilege is different than allowing an application to crash the kernel. These are two different things.

CloudStrike is effectively running as a Windows kernel module. Which in Windows, might as well be the Windows kernel itself. There should be a deliberate difference between things that a bare operating system needs to function vs. things which are allowed to execute in a privileged manner.

This isn't a restrictive operating system. You have to trust your operating system vendor just a little more than you trust the 3rd party software installed on it.

luuurker

0 replies

19h36m

2024-07-19 22:53:54 UTC

We expect different things from the OS we use, I guess.

My main machine is a Macbook Pro and one thing that annoys me a lot is the way Apple handles apps that are not notarized. I don't use iPhones because of the system restrictions (file access, background running, etc) and because I can only install what Apple allows on their store. You can see why I don't want Microsoft to hold my hand when I use Windows... it's my machine, I paid for it, I should be able to install crapware and extend the system functionality if that's what I want especially when I pick an OS that allows me to do that.

In this case, enterprise customers decided to use an OS that allows them to also use CrowdStrike. Maybe Microsoft could handle this stuff better and not show a BSOD? I guess so, but I won't blame them for allowing these tools to exist.

Don't get me wrong, there's a place for very restrictive operating systems like iOS or ChromeOS, but they're not for everyone or enough for all tasks. Windows is a very capable OS, certainly not the best option for everyone, but the day Microsoft cripples Windows like that, it's the day I am forced to stop using it.

Alghranokk

1 replies

21h42m

2024-07-19 20:47:56 UTC

I think this is unfair; m$ does provide perfectly usable generic printer drivers, as long as you only use basic universal features. The problem is that the printer producers each want to provide a host of features on top of that, each in their own proprietary way, with post print hole-punching, 5 different paper trays, user boxes, 4 different options for duplex printing.

Also, label printers, why the heck does zebra only do EPL or ZPL? Why not pcl6 or PS like the rest of the universe?

The point is that printers are bullshit. Nobody knows how they work, and assuming that microsoft should just figure it out on its own is at least in my opinion, unreasonable.

breadwinner

0 replies

21h10m

2024-07-19 21:20:05 UTC

What Windows was known for in the 1990s, is good quality 1st party drivers. Then after Windows achieved monopoly status they shifted driver responsibility to device manufacturers. I have never had to install a third party driver on a Mac, but on Windows I do. If Apple can do it Microsoft can too.

mardifoufs

0 replies

19h42m

2024-07-19 22:48:07 UTC

Which printer did you try it with? I've never had issues with printing out of the box with windows or mac. At least not for the past 5 years.

Also, I'm glad Microsoft doesn't provide an easy way to get what is essentially complete control over a machine, and every single event/connection/process that it has.

bryanlarsen

11 replies

23h15m

2024-07-19 19:14:53 UTC

Segue: What the heck is about Windows files and null characters? I've been almost exclusively dealing with POSIX file systems for the last 30 years, but I'm currently shipping a cross-platform app and a lot of my Windows users are encountering corrupted files which exhibit a bunch of NULs in random places in the file. I've added ugly hacks to deal with them but it'd be nice to get down to root causes. Is there a handy list of things that are safe on POSIX but not on Windows so I can figure out what I'm doing wrong?

I'm at the stage where I'm thinking "%$#@ this, I'm never going to write to the Windows file system again, I'm just going to create an SQLite DB and write to that instead". Will that fix my problems?

nradov

5 replies

23h8m

2024-07-19 19:21:29 UTC

The Windows NTFS is safe and reliable. It doesn't corrupt files. You have probably misunderstood the problem.

omoikane

3 replies

21h57m

2024-07-19 20:32:42 UTC

I have observed a file filled NULs that was caused by a power loss in the middle of a write -- my UPS alerted me that utility power is gone, I tried to shutdown cleanly, but the battery died before a clean shutdown completed. This was NTFS on a HDD and not a SSD.

I am not saying it happens often, but it does happen once in a while.

dist-epoch

0 replies

20h47m

2024-07-19 21:43:12 UTC

NTFS guarantees file-system metadata integrity, not file data integrity. Subtle but important difference.

The file was corrupted, but the file-system remained consistent.

bryanlarsen

0 replies

21h52m

2024-07-19 20:38:23 UTC

Yes, corruption does appear to be correlated with power cycles.

ale42

0 replies

21h30m

2024-07-19 21:00:04 UTC

Had the same on journaled ext4 on Linux. Lots of NULL bytes in the middle of the syslog because of unclean shutdown.

bryanlarsen

0 replies

23h3m

2024-07-19 19:27:10 UTC

However I'm using it it certainly isn't. It's certainly quite likely the problem is me, not Windows.

Most of the files that are getting corrupted are being written to in an append-only fashion, which is generally one of the mechanisms for writing to files to avoid corruption, at least on POSIX.

dist-epoch

1 replies

20h43m

2024-07-19 21:47:03 UTC

A lot of your users encountering file corruption on Windows either means that somehow your users are much more likely to have broken hardware, or more realistically that you have a bug in your code/libraries.

bryanlarsen

0 replies

17h58m

2024-07-20 00:32:07 UTC

I think omoikane pinpointed the primary cause, hard power loss. I definitely do some writing when receiving the shutdown signal, and I bet some people initiate a normal shutdown but remove power before it finishes. I think trying to avoid writes upon receipt of a shutdown signal will help.

tatersolid

0 replies

22h9m

2024-07-19 20:21:03 UTC

Returning all zeros randomly is one of the failure modes of crappy consumer SSDs with buggy controllers. Especially those found in cheap laptops and on Amazon. If it’s a fully counterfeit drive it might even be maliciously reporting a size larger than it has flash chips to support. It will accept writes without error but return zeros for sectors that are “fake”. This can appear random due to wear-leveling in the controller.

o11c

0 replies

19h3m

2024-07-19 23:26:48 UTC

Normally, there are 3 notable things that are unique about Windows files:

* It's standard for "security" software to be installed and hook into the filesystem. Normally this only means giving up on several orders of magnitude of performance, but as this incident reminds us: security software is indistinguishable from malware.

* The system has a very long history of relying on exclusive access to files. This is one of the major reasons for Windows needing to restart so many times, but also leads to a lot of programs being badly written, among other things making false assumptions regarding atomicity in the presence of system failure.

* There's a difference between text files and binary files, and thus a whole pain of adding/removing \r before \n (which among other things means byte counts are wrong and checksums mismatching).

alexisread

0 replies

22h2m

2024-07-19 20:27:38 UTC

I'd guess yes, and you get the SQL goodness to boot :)

Sounds like you have an encoding issue somewhere, windows has it's own charset - Windows-1252, so I'd vet all your libs that touch the file (including eg. .Net libs etc). If one of them defaults to that encoding you may get it either mislabelling the file encoding, or adding in null after each append etc.

SQLite is tested cross-platform so 100% the file will be cross-platform compatible.

Retr0id

11 replies

23h24m

2024-07-19 19:05:44 UTC

While I won't discount it entirely, I think the people acting like this (alone) implies malice are being very silly.

midtake

5 replies

20h27m

2024-07-19 22:02:31 UTC

I disagree. In the current day the stakes are too highly to naively attribute software flaws to incompetence. We should assume malice until it is ruled out, otherwise it will become a vector for software implants. These are matters of national security at this point.

gerdesj

2 replies

19h38m

2024-07-19 22:51:28 UTC

"These are matters of national security at this point."

Which nation exactly? Who on earth "wins" by crashing vast numbers of PCs worldwide?

Many of the potential foes you might be thinking of are unlikely to actually run CS locally but its bad for business if your prey can't even boot their PCs and infra so you can scam them.

I might allow for a bunch of old school nihilists getting off on this sort of rubbish but it won't last and now an entire class of security software, standards and procedures will be fixed up. This is no deliberate "killer blow".

Who knew that well meaning security software running in Ring 0 could fuck up big style if QA takes a long walk off a short plank? Oh, anyone who worked in IT during the '90s and '00s! I remember Sophos and McAfee (now Trellix) and probably others managing to do something similar, back in the day.

Mono-cultures are subject to pretty catastrophic failures, by definition. If you go all in with the same thing as everyone else then if they sneeze, you will catch the 'flu too.

SoftTalker

1 replies

19h9m

2024-07-19 23:20:39 UTC

Who on earth "wins" by crashing vast numbers of PCs worldwide?

Anyone who needs a big distraction so he can more likely achieve his real objective unnoticed.

gerdesj

0 replies

18h38m

2024-07-19 23:52:23 UTC

You nearly managed a perfect comment: (t)he(y)

Fair enough but if you are going to take over the world you don't crash everything, indiscriminately unless you are a really sad old school nihilist.

This is not subtle. It is so stupid that the only explanation is stupidity or a silly experiment.

The first thing I did is start to analyze my logs, just in case. I don't use CS but therefore by the Grace of God, go I!

If you have a particular target, you don't fuck up the entire world.

Retr0id

1 replies

19h57m

2024-07-19 22:33:07 UTC

You can (and should) want to identify the root cause, without assuming malice.

midtake

0 replies

17h35m

2024-07-20 00:54:59 UTC

Of course RCA should be conducted. But malice should be considered a likelihood until ruled out entirely. Without assuming malicious intent, we will overlook many avenues of attack.

Think about it this way. At the present time, CrowdStrike can accept dark money from an entity in order to halt the world's compute capacity in a plausibly deniable way. When the dust settles, they will be found guilty of negligence at most. This will allow this perverse incentive to persist.

If the stakes for bringing down a great deal of American compute infrastructure were greater, this would happen far less often. It is simple economics of risk.

Furthermore, CrowdStrike, being an entity that enjoys access to privileged sales channels thanks to its relationship with the government, is exactly the sort of entity that should be viewed as skeptically as possible, if the interests of the US Government are paramount. Why should the US Government create a small monopoly for CrowdStrike and not expect the strictest controls? That is not only bizarre, it is stupid.

Hope for the best and plan for the worst. That is how an entity should be viewed when it becomes critical to national security.

JumpCrisscross

4 replies

23h18m

2024-07-19 19:12:05 UTC

It demonstrates terrible QC.

Retr0id

2 replies

22h48m

2024-07-19 19:41:26 UTC

Clearly something failed catastrophically, but it could well be post-QC

cjbprime

1 replies

22h9m

2024-07-19 20:20:46 UTC

There should be no "post-QC". You do gradual rollout across the fleet, while checking your monitoring to ensure the fleet hasn't gone down.

Retr0id

0 replies

20h46m

2024-07-19 21:43:26 UTC

Non-gradual-rollout updates are an exacerbating factor, but it isn't a root cause.

loloquwowndueo

0 replies

23h14m

2024-07-19 19:15:34 UTC

Don’t mess with Quebec :D

TillE

8 replies

23h30m

2024-07-19 19:00:16 UTC

That probably explains how it got past internal testing. Something went wrong after that, during deployment.

zimpenfish

1 replies

23h23m

2024-07-19 19:06:31 UTC

Someone on the Fediverse conjectured that it might have been down to the Azure glitch earlier in the day. An empty file would fit that if they weren't doing proper error checking on their downloads, etc.

homero

0 replies

23h13m

2024-07-19 19:16:55 UTC

It's crazy if they weren't signing and verifying downloads

janice1999

1 replies

23h15m

2024-07-19 19:15:21 UTC

It's still crazy that a security tool does not validate content files it loads from disk that get regularly updated. Clearly fuzzing was not a priority either.

Zigurd

0 replies

21h38m

2024-07-19 20:52:15 UTC

How many years has this Crowdstrike code been running without issues? You have put your finger on it: Fuzzing should have been part of a test plan. Even TDD isn't a bastard test engineer writing tests that probe edge cases. Even observing that your unit tests have good code coverage isn't a substitute for fuzzing. There is even a counter-argument that something that been reliable in the field should not be fixed for reasons like failing a test case never seen in real deployments, so why go making trouble.

behnamoh

1 replies

23h25m

2024-07-19 19:04:34 UTC

It could be as simple as cosmic radiation that flipped a bit (it has happened before: https://www.independent.co.uk/news/science/subatomic-particl...), or as sophisticated as an adversarial hacking.

rvnx

0 replies

23h17m

2024-07-19 19:13:17 UTC

The same cosmic radiation that flips the bits to make some specific political party win.

password4321

0 replies

23h8m

2024-07-19 19:21:53 UTC

https://news.ycombinator.com/item?id=41006104#41006555

the flawed data was added in a post-processing step of the configuration update, which is after it's been tested internally but before it's copied to their update servers

gbin

0 replies

21h47m

2024-07-19 20:43:15 UTC

I don't understand, how the signature even worked? Please please tell me those drivers are signed... Right? ...

bloopernova

6 replies

23h21m

2024-07-19 19:08:46 UTC

Note to self: on Monday, add a null character check to pre-commit hooks, and add the same check to pipelines.

Retr0id

2 replies

23h19m

2024-07-19 19:10:27 UTC

It's perfectly normal for binary artifacts to contain null bytes, even long runs of them.

bloopernova

1 replies

23h15m

2024-07-19 19:14:30 UTC

Yeah, I'd need to figure it out properly, but for unicode text files it should be OK. Good point about the binaries though, thank you!

hawski

0 replies

21h20m

2024-07-19 21:09:43 UTC

You say Unicode, but you mean UTF-8. Now for 16 bit Unicode the story is different :)

MilStdJunkie

1 replies

22h20m

2024-07-19 20:10:04 UTC

I mentioned it in a separate parent, but null purge is - for the stuff I work with - completely non-negotiable. Nulls seem to break virtually everything, just by existing. Furthermore, old-timey PDFs are chock full of the things, for God knows what reason, and a huge amount of data I work with are old-timey PDF.

toast0

0 replies

22h9m

2024-07-19 20:21:04 UTC

Furthermore, old-timey PDFs are chock full of the things, for God knows what reason, and a huge amount of data I work with are old-timey PDF.

Probably UCS-2/UTF-16 encoding with ascii data.

mrguyorama

0 replies

21h48m

2024-07-19 20:41:54 UTC

The problem ISN'T the null character though. The problem is that they tested the system, THEN changed stuff, then uploaded the changed stuff.

Their standard methodology was to deploy untested stuff.

snowwrestler

5 replies

19h11m

2024-07-19 23:18:44 UTC

I have flagged this because it is wrong, and there is no other way to kick it off the front page.

Finding a file full of zeroes on a broken computer does not mean it was shipped as all zeroes!

https://x.com/craiu/status/1814339965347610863

https://x.com/cyb3rops/status/1814329155833516492

execveat

3 replies

18h53m

2024-07-19 23:36:40 UTC

CrowdStrike does this trick where it replaces the file (being transferred over a network socket) with zeroes if it matches the malware signature. Assuming that these are the malware signature files themselves, a match wouldn't be surprising.

bombcar

1 replies

17h48m

2024-07-20 00:42:02 UTC

CrowStrike foot gunning itself would be amusing, if expected.

kristjansson

0 replies

15h36m

2024-07-20 02:53:43 UTC

Far and away the most entertaining of the possible root causes.

tsavo

0 replies

15h32m

2024-07-20 02:57:32 UTC

This actually makes the most sense, and would help explain how the error didn't occur during testing (in good faith, I assume it was tested).

In testing, the dev may have worked from their primary to deploy the update to a series of secondary drives, then sequential performed a test boot from each secondary drive configured for each supported OS version. A shortcut/quick way to test that would've bypassed how their product updates in customer environments, also bypassing checks their software may have performed (in this case, overwriting their own file's contents).

dang

0 replies

23m

2024-07-20 18:06:47 UTC

Ah thanks. I've made the title questionable now.

MilStdJunkie

4 replies

23h10m

2024-07-19 19:19:55 UTC

Holy smokes. I'm no programmer, but I've built out bazillions of publishing/conversion/analysis systems, and null purge is pretty much the first thing that happens, every time. x00 breaks virtually everything just by existing - like, seriously, markup with one of these damn things will make the rest of the system choke and die as soon as it looks at it. Numpy? Pytorch? XSL? Doesn't matter. cough cough cough GACK

And my systems are all halfass, and I don't really know what I'm doing. I can't imagine actual real professionals letting that moulder its way downstream. Maybe their stuff is just way more complex and amazing than I can possibly imagine.

wormlord

2 replies

22h2m

2024-07-19 20:27:58 UTC

Not a C programmer, why is 0x00 so bad? It's the string terminator character right?

tedunangst

0 replies

21h29m

2024-07-19 21:00:55 UTC

It's a byte like any other. You're more likely to see big files full of 0x0 than 0x1, but it's really not so different.

bagful

0 replies

21h44m

2024-07-19 20:45:37 UTC

Indeed, '\0' is the sentinel character for uncounted strings in C, and even if your own counted string implementation is “null-byte clean”, aspects of the underlying system may not be (Windows and Unix filesystems forbid embedded null characters, for example).

hawski

0 replies

21h15m

2024-07-19 21:14:24 UTC

Binary files are full of null bytes it is one of the main criteria of binary file recognition. Also large swaths of null bytes are also common, common enough we have sparse files - files with holes in them. Those holes are all zeroes, but are not allocated in the file system. For an easy example think about a disk image.

thrill

3 replies

23h20m

2024-07-19 19:10:00 UTC

So ... the checksum of all the Seinfeld episodes?

geor9e

2 replies

23h12m

2024-07-19 19:17:27 UTC

Explain joke please

themagicteeth

0 replies

23h9m

2024-07-19 19:21:10 UTC

Seinfeld is a show about nothing

http://seinfeldscripts.com/ThePitch.htm

WhyCause

0 replies

23h8m

2024-07-19 19:21:28 UTC

Seinfeld was "a show about nothing."

neffy

3 replies

23h15m

2024-07-19 19:15:22 UTC

There's a claim over on Mastodon from Kevin Beaumont that the file is different on every customer he´s received the file from.

https://cyberplace.social/@GossiTheDog/112812454405913406

(scroll down a little)

drewg123

2 replies

21h45m

2024-07-19 20:44:49 UTC

I thought windows required all kernel modules to be signed..? If there are multiple corrupt copies, rather than just some test escape, how could they have passed the signature verification and been loaded by the kernel?

dist-epoch

1 replies

20h55m

2024-07-19 21:34:50 UTC

This is not even a valid executable.

Most likely is not loaded as a driver binary, but instead is some data file used by the CrowdStrike driver.

Izkata

0 replies

19h17m

2024-07-19 23:12:28 UTC

Seems like a data file: https://news.ycombinator.com/item?id=41004103

motohagiography

2 replies

21h17m

2024-07-19 21:13:20 UTC

conspiracy prediction: I don't think CS will give a complete public RCA on it, but I do think the impact and crisis will be a pretext for granting new internet governance powers, maybe via EO, or a co-ordinated international response via the UN/ITU and the EU.

cherryteastain

1 replies

21h6m

2024-07-19 21:24:01 UTC

EU recently passed a law in this domain: https://www.eiopa.europa.eu/digital-operational-resilience-a...

motohagiography

0 replies

1h0m

2024-07-20 17:29:48 UTC

DORA is coincidental, but worth watching as a potential source of authority under some partisan disinformation initiatives. security in general is governance, where previously it had a high bar for competence to practice, and you can see it getting stacked in institutions of people without practitioner competence who can be relied on to defer to narrative.

as soon as security can come fully uncoupled from physical competence and the factual constraints of engineering, it will get subsumed into the managerialist hivemind as a rhetorical lever on narrative actuated cogs, and become as meaningless as everything else they do in service to administrative power for its own sake. I have a real problem with these people. :)

keketi

2 replies

18h23m

2024-07-20 00:06:59 UTC

Found this post on 4chan's technology board:

What actually happened was two-folded:

Step 1: CSAgent.sys is a certified durgasoft kernel driver that parses clownstrike virus definition files

Step 2: Clownstrike never fixed cases for malformed virus definition files that could trigger an invalid memory access in CSAgent.sys

Step 3: Clownstrike ships the buggy CSAgent.sys and it works for months without problems on customer computers with good virus definition files

Step 4: For some reason the webserver serving virus definition files to all endpoints started serving malformed virus definition files, sometimes they were all blank or filled with random bytes

Step 5: All clownstrike updaters now download malformed LE NEXT GEN APT PROTECTION CLOUD AI LIVE UPDATES into C:\Windows\System32\drivers\clownstrike

Step 6: CSAgent.sys reloads virus definition files

Step 7: CSAgent.sys parses them and crashes with PAGE_FAULT_IN_NONPAGED_AREA (in kernel that means memory at an oopsie address was accessed)

Step 8: Computer BSOD and reboots

Step 9: CSAgent.sys loads virus definition files

Step 10: Goto Step 7

The kernel driver was a ticking timebomb just waiting for a bug in the CDN. I think it was some funny caching bug with le cloud http loadbalancer. Users reported that their C-00000291-00000000-00000032.sys contained random shit and actual real files that were a completely different part of the software, like US localization files.

You can see it in the diff between CSAgent.sys_18511.sys and CSAgent.sys_18513.sys, they changed size checks and increased sizes of buffers so that future malformed virus definition files wouldnt crash.

justinclift

0 replies

9h18m

2024-07-20 09:12:12 UTC

For some reason the webserver serving virus definition files to all endpoints started serving malformed virus definition files, sometimes they were all blank or filled with random bytes

That sounds like some virus protection on the server was stopping the reads from disk, and instead of throwing an error it was providing 0's for the output data instead.

It'll be funny if there's actually some antivirus package that caused it. ;)

denisdamico

0 replies

13h43m

2024-07-20 04:46:47 UTC

Step 4. If this has occurred so many times, it is a hacker attack, targeting the poorly written Kernel driver. (But hey, right now at this election moment, with Trump's beloved company. It could be a show of strength)

fourteenfour

1 replies

23h21m

2024-07-19 19:08:41 UTC

At least it compressed well, which must have saved network resources during the update. :)

sitkack

0 replies

22h30m

2024-07-19 20:00:09 UTC

Resources like wall clock time.

cmrdporcupine

1 replies

21h49m

2024-07-19 20:40:35 UTC

Is there not responsibility at some level as well to Microsoft for having a kernel which even loaded this? Not just because of the apparent corruption, but also ... it was, I heard.. signed and given a bit of an MS blessing.

This crap shouldn't be run in kernel space. But putting that aside, we need kernels that will be resilient to and reject this stuff.

ale42

0 replies

21h31m

2024-07-19 20:59:03 UTC

The thing is that, despite the file has a confusing .sys extension, it's not the driver, but rather a data file loaded by the Crowdstrike driver.

bb88

1 replies

20h14m

2024-07-19 22:16:23 UTC

We've had security software in the past break software compilation in this method by replacing entire files with zeros. I'm not saying this is the case, but it wouldn't surprise me if it were.

Basically the linker couldn't open the file on windows (because it was locked by another process scanning it), and didn't error. Just replaced the object code to be linked with zeros.

People couldn't figure out what was wrong until they opened a debugger and saw large chunks of object code replaced with zeros.

heywire

0 replies

19h26m

2024-07-19 23:04:15 UTC

The number of times I encounter issues with Visual Studio not being able to write to the file I just compiled, ugh... We've even got an mt.exe wrapper that adds retries because AV likes to grab hold of newly created binaries right around the time mt.exe needs to write to it. Even our CI build process ends up failing randomly due to this.

OutOfHere

1 replies

23h2m

2024-07-19 19:28:05 UTC

This looks like a test file that got deployed. Perhaps a QA test was newly added which ran and overwrote the build. This is all I can think of.

OutOfHere

0 replies

18h25m

2024-07-20 00:05:01 UTC

This could happen if there was a test for the above QA test. Such a meta test would necessarily have to write a test file to the same path as the build path. Unfortunately, the meta tests could have got incorporated into the same suite as the regular tests. Someone then could have decided to activate a test order randomizer for fun. The random order finally hit the anti-jackpot.

yamumsahoe

0 replies

19h41m

2024-07-19 22:49:13 UTC

thats a lot of prod to test in.

slashdave

0 replies

20h13m

2024-07-19 22:16:30 UTC

I don't get it. Shouldn't the file have a standard format, with a required header and checksum (among other things), that the driver checks before executing?

olliej

0 replies

21h1m

2024-07-19 21:29:01 UTC

We can argue all we want about CI infrastructure, manual testing, test nets/deployment, staged deployment.

All of that is secondary: they wrote and shipped code that blindly loaded and tried to parse content from the network, and crashed when that failed. In kernel mode.

Honestly it’s probably good that this happened, because presumably someone malicious could use this level of broken logic to compromise kernel space.

Certainly the trust they put in the safety of parsing content downloaded from the internet makes me wonder about the correctness of their code for reading data from userspace.

markus_zhang

0 replies

22h10m

2024-07-19 20:19:55 UTC

I'm starting to think that the timing (Friday) and the scale as well as other things (like this finding) might -- just might point to a bad actor.

We will probably have to wait for CS' own report.

kristjansson

0 replies

23h17m

2024-07-19 19:12:28 UTC

Something like

   zero_output_file(fh, len(file))

   flush()

   fill_output_file(fh, data)

with an oops in line 3?

jmspring

0 replies

19h32m

2024-07-19 22:57:37 UTC

Poor testing. But we also need to stop CISOs, etc doing "checkbox compliance" and installing every 3rd party thing on employee laptops. My prior employer, there were literally 13 things installed for "laptop security" - 1/2 of them overlapped. Developers had the same policy as an AE and as a Sales Engineer as well as an HR person. Crowdstrike was one of the worst. Updating third party packages in go was 30-40% faster in an emulated arm64 VM (qemu) - virtualized disk / disk just a large file - on an Intel MBP compared to doing the same operation on the native system in OSX.

j-wags

0 replies

23h8m

2024-07-19 19:22:15 UTC

It's possible that these aren't the original file contents, but rather the result of a manual attempt to stop the bleeding.

Someone may have hoped that overwriting the bad file with an all-0 file of the correct size would make the update benign.

Or following the "QA was bypassed because there was a critical vulnerability" hypothesis, stopping distribution of the real patch may be an attempt to reduce access to the real data and slow reverse-engineering of the vulnerability.

havkom

0 replies

13h9m

2024-07-20 05:20:28 UTC

Lots of very strong opinions here what they should have done in terms of code/pipelines/etx from people who have no idea how CrownStrikes stack or build systems works and people who have no clue how to write that type of software themselves (including Windows kerneln modules).

fhub

0 replies

20h4m

2024-07-19 22:26:11 UTC

Anytime critical infrastructure goes down I always have a fleeting thought back to “Spy Game” movie where the CIA cut power to part of a Chinese city to help with a prison escape.

denisdamico

0 replies

14h34m

2024-07-20 03:55:43 UTC

I assume the Kernel driver itself has been signed, tested and verified by MS. The channel update files would be the responsibility of the CS and contain most of the information, including urgent (untested) information. This CS kernel driver is there to inspect, collect everything, receive CS settings and commands, and act in privileged mode. It's not a so complex system, the intelligence is in CS, that's why MS partners with them. But the code in the driver could at least do a perfect input check. Shits happens. Blaming MS: I want to believe they do extensive sandbox testing, but in this case they could do better canary deployments, staged updates and not affect worldwide.

commandersaki

0 replies

13h40m

2024-07-20 04:50:20 UTC

I wonder if this is linked to the Azure outage that happened just prior to the Crowdstrike incident.

ThinkBeat

0 replies

21h41m

2024-07-19 20:48:48 UTC

Maybe Crowdstrike has adopted the modern ethos Move fast and break things With continuous integration we ship a thousand times a day. fuck QA.

Thaxll

0 replies

19h50m

2024-07-19 22:39:57 UTC

I'm not versed enough into windows loading dll / driver, but isn't the caller able to handle that situation? Or windows itself? Does loading an empty file driver can be handled in a way that it does not make the OS crash?