
Initial details about why CrowdStrike's CSAgent.sys crashed

delta_p_delta_x
113 replies
11h50m

The moment I read 'it is a content update that causes the BSOD, deleting it solves the problem', I was immediately willing to bet a hundred quid (for the non-British, that's £100) that it was a combination of said bad binary data and a poorly-written parser that didn't error out correctly upon reading invalid data (in this case, reading an array of pointers without verifying that each one was both non-null and pointed to valid data/code).
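
For illustration, here is a minimal C sketch of the kind of validation that theory implies was missing. The layout (a magic value, a count, then a table of offsets), the field names, and the reject-on-first-bad-entry policy are all hypothetical; this is not CrowdStrike's actual format or code.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical layout: [u32 magic][u32 count][count x u32 offsets] */
    typedef struct {
        uint32_t magic;
        uint32_t count;
        uint32_t offsets[];
    } content_header;

    /* Returns 0 if the blob is safe to hand to the real parser, -1 to reject it. */
    static int validate_content(const uint8_t *blob, size_t len)
    {
        if (len < sizeof(content_header))
            return -1;                                /* too short to hold the header */

        const content_header *hdr = (const content_header *)blob;
        if (hdr->magic != 0xAAAAAAAAu)
            return -1;                                /* wrong magic, reject */

        /* The offset table itself must fit inside the blob (division avoids overflow). */
        if (hdr->count > (len - sizeof(content_header)) / sizeof(uint32_t))
            return -1;

        for (uint32_t i = 0; i < hdr->count; i++) {
            uint32_t off = hdr->offsets[i];
            if (off == 0 || off >= len)
                return -1;                            /* null or out-of-bounds entry */
        }
        return 0;                                     /* only now dereference anything */
    }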

In the past ten years or so of doing somewhat serious computing and zero cybersecurity whatsoever, here is what my mind has concluded; feel free to disagree.

Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures. All because a) human programmers forget to account for edge cases, and b) imperative programming languages allow us to do so.

This includes everything from: decompression algorithms; font outline readers; image, video, and audio parsers; video game data parsers; XML and HTML parsers; the various certificate/signature/key parsers in OpenSSL (and derivatives); and now, this CrowdStrike content parser in its EDR program.

That wager stands, by the way, and I'm happy to up the ante by £50 to account for my second theory.

mtlmtlmtlmtl
64 replies
9h36m

There are at least five different things that went wrong simultaneously.

1. Poorly written code in the kernel module crashed the whole OS and kept trying to parse the corrupted files on every boot, causing a boot loop, instead of handling the error gracefully and deleting/marking the files as corrupt (a sketch of what that could look like follows this list).

2. Either the corrupted files slipped through internal testing, or there is no internal testing.

3. Individual settings for when to apply such updates were apparently ignored. It's unclear whether this was a glitch or standard practice. Either way I consider it a bug (it's just a matter of whether it's a software bug or a bug in their procedures).

4. This was pushed out everywhere simultaneously instead of staggered to limit any potential damage.

5. Whatever caused the corruption in the first place, which is anyone's guess.
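
To make "handling the error gracefully" in point 1 concrete, here is a hedged C sketch of the load-time flow only: if a content file fails validation, quarantine it and carry on with the previous known-good content instead of faulting, so a bad file cannot wedge the machine into a boot loop. Every name here is a hypothetical stand-in, and this is plain userspace C rather than real driver code.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for whatever the real driver would use. */
    static int  validate_content(const unsigned char *buf, size_t len) { (void)buf; return len > 0 ? 0 : -1; }
    static void quarantine_file(const char *path) { printf("quarantined: %s\n", path); }
    static void log_event(const char *msg)        { printf("%s\n", msg); }

    /* Returns true only if the file validated and was handed to the engine. */
    static bool load_channel_file(const char *path, const unsigned char *buf, size_t len)
    {
        if (validate_content(buf, len) != 0) {
            /* Bad data is an expected condition, not a reason to take the OS down. */
            log_event("channel file failed validation, quarantining it");
            quarantine_file(path);
            return false;            /* keep running on the previous known-good content */
        }
        log_event("channel file accepted");
        /* ... hand the validated blob to the detection engine ... */
        return true;
    }

    int main(void)
    {
        const unsigned char bad[4] = {0, 0, 0, 0};
        load_channel_file("example-channel.sys", bad, 0);    /* simulate a rejected file */
        return 0;
    }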

rco8786
20 replies
6h3m

Number 4 continues to be the most surprising bit to me. I could not fathom having a process that involves deploying to 8.5 million remote machines simultaneously.

Bugs in code I can almost always understand and forgive, even the ones that seem like they’d be obvious with hindsight. But this is just an egregious lack of the most basic rollout standards.

gitfan86
7 replies
5h41m

They probably don't get to claim agile story points until the ticket is in a finished state. And they probably have a culture where vanity metrics like "velocity" are prioritized.

nmg
5 replies
5h21m

This would answer the question that I've not heard anyone asking:

what incentivized the bad decisions that led to this catastrophic failure?

phs318u
4 replies
5h10m

My understanding is that the culture (as reported by some customers) is quite aggressive and pushy. They are quite vocal when customers don’t turn on automatic updates.

It makes sense in a way - given their fast growth strategy (from nowhere to top 3) and desire to “do things differently” - the iconoclast upstarts that redefine the industry.

Or to summarise - hubris.

77pt77
2 replies
2h51m

They are quite vocal when customers don’t turn on automatic updates.

I'm sorry but this is the customer's fault.

If I'm using your services you work for me and you don't get to bully me into doing whatever you think needs to be done.

People that chose this solution need to be penalized, but they won't.

mbreese
1 replies
2h38m

Customers don’t always have a choice here. They could be restricted by compliance programs (PCI, et al) and be required under those terms to have auto updates on.

Compliance also has to share some of the blame here, if best practices (local testing) aren’t allowed to be followed in the name of “security”.

nerdjon
0 replies
2h24m

This needs to keep being repeated anytime someone wants to blame the company.

Many don’t have a choice, a lot of compliance is doing x to satisfy a checkbox and you don’t have a lot of flexibility in that, or you may not be able to do things like process credit cards, which is kinda unacceptable depending on your company. (Note: I didn’t say all)

CrowdStrike automatic update happens to satisfy some of those checkboxes.

hello_moto
0 replies
4h6m

To catch 0day quickly, EDR needs to know "how".

The "how" here is AV definition or a way to identify the attack. In CS-speak: content.

Catching 0day quickly results in good reputation that your EDR works well.

If people turn off their AV definition auto-update, they are at risk. Why use EDR if folks don't want to stop attacks quickly?

cruffle_duffle
0 replies
2h51m

Oh, the games I have to play with story points that have personal performance metrics attached to them. Splitting tickets to span sprints so there aren’t holes in some dude's “effort” because they didn’t complete some task they committed to.

I never thought such stories were real until I encountered them…

thundershart
6 replies
4h22m

Surely, CrowdStrike's safety posture for update rollouts is in serious need of improvement. No argument there.

But is there any responsibility for the clients consuming the data to have verified these updates prior to taking them in production? I haven't worn the sysadmin hat in a while now, but back when I was responsible for the upkeep of many thousands of machines, we'd never have blindly consumed updates without at least a basic smoke test in a production-adjacent UAT type environment. Core OS updates, firmware updates, third party software, whatever -- all of it would get at least some cursory smoke testing before allowing it to hit production.

On the other hand, given EDR's real-world purpose and the speed at which novel attacks propagate, there's probably a compelling argument for always taking the latest definition/signature updates as soon as they're available, even in your production environments.

I'm certainly not saying that CrowdStrike did nothing wrong here, that's clearly not the case. But if conventional wisdom says that you should kick the tires on the latest batch of OS updates from Microsoft in a test environment, maybe that same rationale should apply to EDR agents?

kiitos
2 replies
2h27m

But is there any responsibility for the clients consuming the data to have verified these updates prior to taking them in production

In the boolean sense, yes. United Airlines (for example) is ultimately responsible for their own production uptime, so any change they apply without validation is a risk vector.

In pragmatic terms, it's a bit fuzzier. Does CrowdStrike provide any practical way for customers to validate, canary-deploy, etc. changes before applying them to production? And not just changes with type=important, but all changes? From what I understand, the answer to that question is no, at least for the type=channel-update change that triggered this outage. In which case I think the blame ultimately falls almost entirely on CrowdStrike.

suzzer99
0 replies
34m

Yeah, one of the major problems seems to be CrowdStrike's assumption that channel files are benign. Which isn't true if there's a bug in your code that only gets triggered by the right virus definition.

I don't know how you could assert that this is impossible; hence, channel files should be treated as code.

cozzyd
0 replies
45m

Arguably United Airlines shouldn't have chosen a product they can't test updates of, though maybe there are no good options.

stoolpigeon
1 replies
3h52m

I think point 3 of the grandparent indicates admins were not given an opportunity to test this.

My company had a lot of Azure VMs impacted by this and I'm not sure who the admin was who should have tested it. Microsoft? I don't think we have anything to do with CrowdStrike software on our VMs. (I think - I'm sure I'll find out this week.)

Edit: I just learned the Azure central region failure wasn't related to the larger event - and we weren't impacted by the CrowdStrike issue - I didn't know it was two different things. So the second part of my comment is irrelevant.

thundershart
0 replies
3h30m

Oh, I'd missed point #3 somehow. If individual consumers weren't even given the opportunity to test this, whether by policy or by bug, then ... yeesh. Even worse than I'd thought.

Exactly which team owns the testing is probably left up to each individual company to determine. But ultimately, if you have a team of admins supporting the production deployment of the machines that enable your business, then someone's responsible for ensuring the availability of those machines. Given how impactful this CrowdStrike incident was, maybe these kinds of third-party auto-update postures need to be reviewed and potentially brought back into the fold of admin-reviewed updates.

volkl48
0 replies
1h29m

It's not an option. While the admins at the customer have the ability to control when/how revisions of the client software go out (and thus can, and generally do, do their own testing, can decide to stay one rev back as default, etc.), there is no control over updates to the kind of update/definition files that were the primary cause here.

Which is also why you see every single customer affected - what you are suggesting is simply not something that is available to them at present.

At least for now - I imagine that some kind of staggered/slowed/ringed option will have to be implemented in the future if they want to retain customers.

mbreese
1 replies
2h41m

For me, number 1 is the worst of the bunch. You should always expect that there will be bugs in processes, input files, etc… the fact that their code wasn’t robust enough to recognize a corrupted file and not crash is inexcusable. Especially in kernel code that is so widely deployed.

If any one of the five points above hadn’t happened, this event would have been avoided. However, if number 1 had been addressed - any of the others could have happened (or all at the same time) and it would have been fine.

I understand that we should assume that bugs will be present anywhere, which is why staggered deployments are also important. If there had been staggered deployments, the damage would still have happened, but it would have been localized. I think security people would argue against a staged deployment though, as if it were discovered what the new definitions protected against, an exploit could be developed quickly to put those servers that aren’t in the “canary” group at risk. (At least in theory — I can’t see how staggering deployment over a 6-12 hour window would have been that risky).
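
For what a staggered rollout can look like mechanically, here is a hedged C sketch of one common technique: each machine is assigned to a ring by hashing a stable machine ID, and a ring only becomes eligible once its delay after the release time has elapsed. The ring count, delays, and names are made up for illustration, and this says nothing about how CrowdStrike's (or anyone's) pipeline actually works.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* FNV-1a hash -- just a stable way to spread machines across rings. */
    static uint32_t fnv1a(const char *s)
    {
        uint32_t h = 2166136261u;
        while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
        return h;
    }

    /* Ring 0 gets the update at release_time, ring 1 after one delay, ring 2 after two, ... */
    static bool update_enabled_for(const char *machine_id,
                                   uint64_t now, uint64_t release_time,
                                   unsigned rings, uint64_t hours_per_ring)
    {
        unsigned ring = fnv1a(machine_id) % rings;
        return now >= release_time + (uint64_t)ring * hours_per_ring * 3600;
    }

    int main(void)
    {
        uint64_t release = 1721363340;   /* an arbitrary release timestamp (seconds) */
        /* One hour after release: only machines hashed into ring 0 are eligible. */
        printf("%d\n", update_enabled_for("host-0042", release + 3600, release, 4, 3));
        return 0;
    }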

timmytokyo
0 replies
2h15m

They're all terrible, but I agree #1 is particularly egregious for a company ostensibly dedicated to security. A simple fuzz tester would have caught this type of bug, so they clearly don't perform even a minimal amount of testing on their code.

mrbombastic
0 replies
8m

And here I thought shipping a new version on the app store was scary.

Is there anything we can take from other professions/tradecraft/unions/legislation to ensure shops can’t skip the basic best practices we are aware of in the industry, like staged rollouts? How do we set up incentives to prevent this? Seriously, the App Store was raking in $$ from us for years with no support for staged rollouts and no other options.

layer8
0 replies
2h8m

Malware signature updates are supposed to be deployed ASAP, because every minute may count when a new attack is spreading. The mistake may have been to apply that policy indiscriminately.

avree
0 replies
1m

A lot of snarky replies to this comment, but the reality is that if you were selling an anti-virus, identified a malicious virus, and then chose not to update millions of your machines with that virus’s signature, you’d also be in the wrong.

rwmj
17 replies
9h25m

Zero effort to fuzz test the parser too. I mean, we know how to harden parsers against bugs and attacks, and any semi-competent fuzzer would have caught such a trivial bug.

chrisjj
11 replies
9h12m

The triggering file was all zeros.

Is it not possible that only this pattern caused the crash, and that fuzzing omitted to try this unfuzzy pattern?

gliptic
3 replies
8h47m

No, it wasn't. Crowdstrike denied it had to do with zeros in the files.

jojobas
2 replies
8h4m

At this point I wouldn't be paying too much attention to what Crowdstrike is saying.

hello_moto
1 replies
4h3m

They have to speak the truth, albeit the bare minimum, in case of legal action...

kchr
0 replies
2h37m

Which also explains why they confirm or deny details being shared on social and mass media only when needed to cover their backs legally.

Retr0id
1 replies
7h23m

Competent fuzzers don't just use random bytes, they systematically explore the state-space of the target program. If there's a crash state to be found by feeding in a file full of null bytes, it's probably going to be found quickly.

A fun example is that if you point AFL at a JPEG parser, it will eventually "learn" to produce valid JPEG files as test cases, without ever having been told what JPEG file is supposed to look like. https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...

rwmj
0 replies
6h10m

AFL is really "magical". It finds bugs very quickly and with little effort on our part except to leave it running and look at the results occasionally. We use it to fuzz test a variety of file formats and network interfaces, including QEMU image parsing, nbdkit, libnbd, hivex. We also use clang's libfuzzer with QEMU which is another good fuzzing solution. There's really no excuse for CrowdStrike not to have been using fuzzing.
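
For reference, the entire cost of getting started with libFuzzer is roughly a harness like the one below; parse_channel_file is a hypothetical stand-in for whatever parser is under test.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the parser under test, linked in from its own file. */
    int parse_channel_file(const uint8_t *data, size_t size);

    /* Build: clang -g -O1 -fsanitize=fuzzer,address harness.c parser.c -o fuzz_parser
       Run:   ./fuzz_parser corpus/
       The contract: this must never crash or trip the sanitizer, whatever the bytes are. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        parse_channel_file(data, size);
        return 0;
    }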

watwut
0 replies
8h47m

Possible? Yes. Likely? No.

omeid2
0 replies
8h28m

The files in question have a magic number of 0xAAAAAAAA, so it is not possible that the file was all zeros.

monsieurbanana
0 replies
8h38m

In my limited experience, I thought any serious fuzzing program does test for all "standard" patterns like only null bytes, empty strings, etc...

formerly_proven
0 replies
8h33m

Instrumented fuzzing (like AFL and friends) tweaks the input to traverse unseen code paths in the target, so they're super quick to find stuff like "heyyyyy, nobody is actually checking if this offset is in bounds before loading from that address".

mavhc
3 replies
9h10m

AV software is a great target for malware: it's badly written, probably runs too much stuff in the kernel, and tries to parse everything.

Comfy-Tinwork
2 replies
7h3m

And exploiting it gives, at the very least, straight-to-system-level access, if not more.

londons_explore
0 replies
5h41m

AV software needs kernel privileges to have access to everything it needs to inspect, but the actual inspection of that data should be done with no privileges.

I think most AV companies now have a helper process to do that.

If you successfully exploit the helper process, the worst damage you ought to be able to do is falsely find files to be clean.
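
A hedged POSIX sketch of that split follows (it is not any particular vendor's design): the privileged side only shuttles bytes, a de-privileged child does the risky parsing, and a crash in the child is just a bad verdict rather than a dead machine.

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Stand-in for the real (and much riskier) content/file parser. */
    static int risky_parse(const char *buf, size_t len)
    {
        return (len > 0 && memchr(buf, 'X', len)) ? 1 : 0;   /* 1 = suspicious */
    }

    int main(void)
    {
        const char sample[] = "bytes captured by the privileged side";
        int fds[2];
        if (pipe(fds) != 0) return 1;

        pid_t pid = fork();
        if (pid == 0) {                        /* child: the sandboxed worker */
            close(fds[1]);
            /* real code would also drop uid/gid, chroot, apply seccomp, etc. */
            char buf[4096];
            ssize_t n = read(fds[0], buf, sizeof(buf));
            _exit(n > 0 ? risky_parse(buf, (size_t)n) : 2);
        }

        close(fds[0]);
        (void)write(fds[1], sample, sizeof(sample));
        close(fds[1]);

        int status = 0;
        waitpid(pid, &status, 0);
        if (!WIFEXITED(status))
            puts("worker crashed: treat the input as bad, but the host keeps running");
        else
            printf("worker verdict: %d\n", WEXITSTATUS(status));
        return 0;
    }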

MyFedora
0 replies
6h26m

Anti-cheats also whitelist legit AV drivers, even though cheaters exploit them to no end.

jatins
0 replies
5h25m

You are seriously overestimating the engineering practices at these companies. I have worked in "enterprise security" previously, though not at this scale. In a previous life I worked with one of the engineering leaders currently at CrowdStrike.

I'll bet you this company has some arbitrary unit test coverage requirements for PRs, which developers game by mocking the heck out of dependencies. I am sure they have some vanity SonarQube integration to ensure great "code quality". This likely also went through manual QA.

However, I am sure the topic of fuzz testing would not have come up once. These companies sell checkbox compliance, and they themselves develop their software the same way: checking all the "quality engineering" boxes with very little regard for long-term engineering initiatives that would provide real value.

And I am not trying to kick Crowdstrike when they are down. It's the state of any software company run by suits with myopic vision. Their engineering blogs and their codebases are poles apart.

dartos
8 replies
6h26m

Bugs happen.

Not staggering the updates is what blew my mind.

londons_explore
7 replies
5h37m

Since the issue manifested at 04:09 UTC, which is 11pm where CrowdStrike's HQ is, I would guess someone was working late at night and skipped the proper process so they could get the update done and go to bed.

They probably considered it low risk, had done similar things hundreds of times before, etc.

hello_moto
4 replies
4h1m

Companies these days are global btw.

Not everyone is working on the same timezone.

londons_explore
3 replies
3h32m

They don't appear to have engineering jobs in any location where that would be considered regular office hours...

londons_explore
1 replies
1h36m

04:09 UTC is 07:09 AM in Israel. Doubt an engineer was doing a push then either...

All the other engineering locations seem even less likely.

vitus
0 replies
25m

On Friday, no less. (Israel's weekend is Friday / Saturday instead of the usual Saturday / Sunday.)

kchr
0 replies
2h31m

A good reminder of the fact that your Thursday might be someone else's Friday.

dartos
0 replies
5h22m

They probably considered it low risk

Wild that anyone would consider anything in the “critical path” low risk. I would bet that they just don’t do rolling releases normally since it never caused issues before.

ratorx
5 replies
7h39m

I’d also maybe add another one on the Windows end:

6) some form of sandboxing/error handling/api changes to make it possible to write safer kernel modules (not sure if it already exists and was just not used). It seems like the design could be better if a bad kernel module can cause a boot loop in the OS…

layer8
2 replies
2h2m

It’s a tough problem, because you also don’t want the system to start without the CrowdStrike protection. Or more generally, a kernel driver is supposedly installed for a reason, and presumably you don’t want to keep the system running if it doesn’t work. So the alternative would be to shut down the system upon detection of the faulty driver without rebooting, which wouldn’t be much of an improvement in the present case.

ratorx
1 replies
1h0m

I can imagine better defaults. Assuming the threat vector is malicious programs running in userspace (probably malicious programs in kernel space is game over anyway right?), then you could simply boot into safe mode or something instead of crashlooping.

One of the problems with this outage was that you couldn’t even boot into safe mode without having the bit locker recovery key.

layer8
0 replies
57m

You don’t want to boot into safe mode with networking enabled if the software that is supposed to detect attacks from the network isn’t running. Safe mode doesn’t protect you from malicious code in userspace, it only “protects” you from faulty drivers. Safe mode is for troubleshooting system components, not for increasing security.

I don’t know the exact reasoning why safe mode requires the BitLocker recovery key, but presumably not doing so would open up an attack vector defeating the BitLocker protection.

leosarev
1 replies
6h14m

There is sandboxing API in Windows. It's called running programs in userspace.

hello_moto
0 replies
3h44m

Run what in userspace?

rainsford
1 replies
5h38m

2. Either the corrupted files slipped through internal testing, or there is no internal testing.

This is the most interesting question to me because it doesn't seem like there is an obviously guessable answer. It seems very unlikely to me that a company like CrowdStrike pushes out updates of any kind without doing some sort of testing, but the widespread nature of the outage would also seem to suggest any sort of testing setup should have caught the issue. Unless it's somehow possible for CrowdStrike to test an update that was different than what was deployed, it's not obvious what went wrong here.

bloopernova
0 replies
5h18m

I had read somewhere that the definition file was corrupted after testing, during the final CI/CD pipeline.

hulitu
1 replies
7h44m

6. No development process, no testing.

krisoft
0 replies
6h54m

How is that different from point 2?

dcuthbertson
1 replies
3h47m

I wonder if it was pushed anywhere that didn't crash, as an extension of "It works on my machine. Ship it!"

I've built a couple of kernel drivers over the years and what I know is that ".sys" files are to the kernel as ".dll" files are to user-space programs in that the ones with code in them run only after they are loaded and a desired function is run (assuming boilerplate initialization code is good).

I've never made a data-only .sys file, but I don't see why someone couldn't. In that case, I'd guess that no one ever checked it was correct, and the service/program that loads it didn't do any verification either -- why would it, the developers of said service/program would tend to trust their own data .sys file would be valid, never thinking they'd release a broken file or consider that files sometimes get corrupted -- another failure mode waiting to happen on some unfortunate soul's computer.

kchr
0 replies
2h33m

The file extension is `sys` by convention; there's nothing magical about it, and it's not handled in any special way by the OS. In the case of CrowdStrike, there seems to be some confusion as to why they use this file extension, since it's only supposed to be a config/data file to be used by the real kernel driver.

simonh
0 replies
9h12m

There is a story out that the problem was introduced in a post processing step after testing. That makes more sense than that there was no testing. If true it means they thought they’d tested the update, but actually hadn’t.

pclmulqdq
0 replies
4h34m

Number 4 is what everyone will fixate on, but I have the biggest problem with number 1. Anything like this sort of file should have (1) validation on all its pointers and (2) probably >2 layers of checksumming/signing. They should generally expect these files to get corrupted in transit once in a while, but they didn't seem to plan for anything other than exactly perfect communication between their intent and their kernel driver.
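
To make the "expect corruption in transit" point concrete, here is a minimal C sketch of one such layer: a plain CRC-32 over the payload, checked before the parser ever sees the bytes. The trailer layout (last four bytes holding the expected CRC) is hypothetical, and a real design would add a cryptographic signature on top, since a CRC only catches accidental corruption, not tampering.

    #include <stddef.h>
    #include <stdint.h>

    /* Plain CRC-32 (the zlib/PKZIP polynomial, reflected form). */
    static uint32_t crc32_of(const uint8_t *p, size_t n)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < n; i++) {
            crc ^= p[i];
            for (int b = 0; b < 8; b++)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    /* Hypothetical trailer: the last 4 bytes of the file hold the expected CRC (little-endian). */
    static int content_intact(const uint8_t *file, size_t len)
    {
        if (len < 4) return 0;
        uint32_t expected = (uint32_t)file[len - 4]
                          | (uint32_t)file[len - 3] << 8
                          | (uint32_t)file[len - 2] << 16
                          | (uint32_t)file[len - 1] << 24;
        return crc32_of(file, len - 4) == expected;   /* refuse to parse on mismatch */
    }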

cynicalsecurity
0 replies
2h1m

I'm betting on them having no internal testing.

bradley13
10 replies
9h34m

No bet. There are two failures here. (1) Failing to check the data for validity, and (2) Failing to handle an error gracefully.

Both of these are undergraduate-level techniques. Heck, they are covered in most first-semester programming courses. Either of these failures is inexcusable in a professional product, much less one that is running with kernel-level privileges.

Bet: CrowdStrike has outsourced much of its development work.

ahoka
8 replies
8h58m

What do you mean by outsourced?

Rinzler89
5 replies
8h42m

He probably means work was sent offshore to offices with cheaper labor that's less skilled or less invested in delivering quality work. Though there's no proof of that yet; people just like to throw the blame on offshoring whenever $BIG_CORP fucks up, as if all programmers in the US are John Carmack and they can never cause catastrophic fuckups with their code or processes.

jojobas
3 replies
8h2m

Not everyone in the US might be Carmack, but it's ridiculously nearsighted to assert that cultural differences don't play into people's desire and ability to Do It Right.

Rinzler89
2 replies
6h17m

It's not cultural differences that make the difference in output quality, it's pay and the quality standards set by the team/management, which are also mostly a function of pay, since underpaid and unhappy developers tend not to care at all beyond doing the bare minimum to not get fired (#notmyjob, the lying-flat movement, etc).

You think everyone writing code in the US would give two shits about the quality of their output if they see the CEO pocketing another private jet while they can barely make big-city rent?

Hell, even well paid devs at top companies in the US can be careless and lazy if their company doesn't care about quality. Have you seen some of the vulnerabilities and bugs that make it into the Android source code and on Pixel devices? And guess what, that code was written by well paid developers in the US, hired at Google leetcode standards, yet would give far-east sweatshops a run for their money in terms of carelessness. It's what you get when you have a high barrier of entry but a low barrier of output quality where devs just care about "rest and vest".

bradley13
1 replies
2h24m

I was talking about outsourcing (and not necessarily offshoring). Too many companies like CrowdStrike are run by managers who think that management, sales, and marketing are the important activities. Software development is just an unpleasant expense that needs to be minimized. Hence: outsourcing.

That said, I have had some experience with classic offshoring. Cultural differences make a huge difference!

My experience with "typical" programmers from India, China, et al is that they do exactly what they are told. Their boss makes the design decisions down to the last detail, and the "programmers" are little more than typists. I specifically remember one sweatshop where the boss looped continually among the desks, giving each person very specific instructions of what they were to do next. The individual programmers implemented his instructions literally, with zero thought and zero knowledge of the big picture.

Even if the boss was good enough to actually keep the big picture of a dozen simultaneous activities in his head, his non-thinking minions certainly made mistakes. I have no idea how this all got integrated and tested, and I probably don't want to know.

Rinzler89
0 replies
1h30m

>That said, I have had some experience with classic offshoring. Cultural differences make a huge difference!

Sure, but there's no proof yet that was the case here. That's just massive speculation based on anecdotes on your side. There's plenty of offshore devs that can run rings around western devs.

ahoka
0 replies
7h19m

Offshoring and outsourcing are very different. It would also be very hard to talk about offshoring at a company claiming to provide services in 170 countries.

spotplay
1 replies
7h6m

It's probably just the common US-centric bias that external development teams, particularly those overseas, may deliver subpar software quality. This notion is often veiled under seemingly intellectual critiques to avoid overt xenophobic rhetoric like "They're taking our jobs!".

Alternatively, there might be a general assumption that lower development costs equate to inferior quality, which is a flawed yet prevalent human bias.

chuckadams
0 replies
4h3m

“You get what you pay for” is still a reasonable metric, even if it is more a relative scale than an absolute one.

danielPort9
0 replies
8h16m

Either of these failures is inexcusable in a professional product

Don’t we have those kinds of failures in almost every professional product? I’ve been working in the industry for over a decade and in every single company we had those bugs. The only difference was that none of those companies were developing kernel modules or whatever. Simple SaaS. And no, none of the bugs were outsourced (the companies I worked for hired only locals and people in the range of +-2h time zones).

bostik
4 replies
10h39m

Approximately 100% of CVEs, crashes, bugs, [...], deserialising binary data

I'd make that 98%. Outside of rounding errors in the margins, the remaining two percent is made up of logic bugs, configuration errors, bad defaults, and outright insecure design choices.

Disclosure: infosec for more than three decades.

epanchin
1 replies
10h29m

They forgot to account for those edge cases

delta_p_delta_x
0 replies
10h28m

Heh, touché.

delta_p_delta_x
1 replies
10h27m

I feel vindicated but also a bit surprised that my gut feeling was this accurate.

bostik
0 replies
9h25m

Not really a surprise, to be honest. "Deserialisation" encapsulates most forms of injection attacks.

OWASP top-10 was dominated by those for a very long time. They have only recently been overtaken by authorization failures.

throw0101d
3 replies
6h56m

Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures.

For the record, the top 25 common weaknesses for 2023 are listed at:

* https://cwe.mitre.org/top25/archive/2023/2023_top25_list.htm...

Deserialization of Untrusted Data (CWE-502) was number fifteen. Number one was Out-of-bounds Write (CWE-787), Use After Free (CWE-416) was number four.

CWEs that have been in every list since they started doing this (2019):

* https://cwe.mitre.org/top25/archive/2023/2023_stubborn_weakn...

lioeters
2 replies
6h13m

# Top Stubborn Software Weaknesses (2019-2023)

Out-of-bounds Write

Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’)

Improper Neutralization of Special Elements used in an SQL Command (‘SQL Injection’)

Use After Free

Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection')

Improper Input Validation

Out-of-bounds Read

Improper Limitation of a Pathname to a Restricted Directory (‘Path Traversal’)

Cross-Site Request Forgery (CSRF)

NULL Pointer Dereference

Improper Authentication

Integer Overflow or Wraparound

Deserialization of Untrusted Data

Improper Restriction of Operations within Bounds of a Memory Buffer

Use of Hard-coded Credentials

TeMPOraL
1 replies
2h49m

Yup. Almost all of them are various flavors of fucking up a parser or misusing it (in particular, all the injection cases are typically caused by writing stupid code that glues strings together instead of doing proper parsing).

lolinder
0 replies
3m

That's not parsing, that's the inverse of parsing. It's taking untrusted data and injecting it into a string that will later be parsed into code without treating the data as untrusted and adapting accordingly. It's compiling, of a sort.

Parsing is the reverse—taking an untrusted string (or binary string) that is meant to be code and converting it into a data structure.

Both are the result of taking untrusted data and assuming it'll look like what you expect, but both are not parsing issues.
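
A small C illustration of that distinction, using SQLite (the table and column names are made up): the commented-out version builds code by gluing untrusted bytes into the SQL text, while the safe version keeps them as data via a bound parameter, so they can never become SQL syntax.

    #include <sqlite3.h>

    /* Returns 1 if a user with the given name exists, 0 if not, -1 on error. */
    int find_user(sqlite3 *db, const char *untrusted_name)
    {
        /* BAD (injection): snprintf(sql, sizeof sql,
               "SELECT id FROM users WHERE name = '%s'", untrusted_name); */

        sqlite3_stmt *stmt = NULL;
        if (sqlite3_prepare_v2(db, "SELECT id FROM users WHERE name = ?1",
                               -1, &stmt, NULL) != SQLITE_OK)
            return -1;

        /* The untrusted bytes are bound as a value; the statement text never changes. */
        sqlite3_bind_text(stmt, 1, untrusted_name, -1, SQLITE_TRANSIENT);

        int found = (sqlite3_step(stmt) == SQLITE_ROW) ? 1 : 0;
        sqlite3_finalize(stmt);
        return found;
    }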

Sakos
3 replies
10h16m

I can't decide what's more damning. The fact that there was effectively no error/failure handling or this:

Note "channel updates ...bypassed client's staging controls and was rolled out to everyone regardless"

A few IT folks who had set the CS policy to ignore the latest version confirmed this was, yeah, bypassed, as this was a "content" update (vs. a version update).

If your content updates can break clients, they should not be able to bypass staging controls or policies.

vladvasiliu
1 replies
9h46m

The way I understand it, the policies the users can configure are about "agent versions". I don't think there's a setting for "content versions" you can toggle.

sateesh
0 replies
8h9m

Maybe there isn't a switch that says "content version", but from the end user's perspective it is a new version. Whether it was a content change or just a fix for a typo in documentation (say), the change being pushed is different from what currently exists. And for the end user, the configuration implies that they have a chance to decide whether to accept any new change being pushed or not.

SoftTalker
0 replies
2h28m

If your content updates can break clients

This is going to be what most customers did not realize. I'm sure CrowdStrike assured them that content updates were completely safe ("it's not a change to the software", etc.).

Well they know differently now.

noobermin
2 replies
7h18m

So, I also have near-zero cybersecurity expertise (I took an online intro course on cryptography out of curiosity) and no expertise in writing kernel modules either, but why, if ever, would you parse an array of pointers... in a file... instead of any other way of serializing data that doesn't include hardcoded array offsets in an on-disk file...

Even ignoring this failure, which was catastrophic, this was a bad design asking to be exploited by criminals.

deaddodo
0 replies
6h27m

I'm curious, how else would you store direct memory offsets? No matter how you store/transmit them, eventually you're going to need those same offsets.

The problem wasn't storing raw memory offsets, it was not having some way to validate the data at runtime.

Jare
0 replies
6h47m

Performance, I assume. Right now it may look like the wrong tradeoff, but every day in between incidents like this we're instead complaining that software is slow.

Of course it doesn't have to be either/or; you can have fast + secure, but it costs a lot more to design, develop, maintain and validate. What you can't have is a "why don't they just" simple and obvious solution that makes it cheap without making it either less secure, less performant, or both.

Given all the other mishaps in this story, it is very well possible that the software is insecure (we know that), slow, and also still very expensive. There's a limit to how high you can push the triangle, but there's no bottom to how bad it can get.

1992spacemovie
2 replies
6h57m

Interesting observation. As a non-developer, what can one do to enhance coverage for these types of scenarios? Fuzz testing?

rwmj
1 replies
6h4m

Fuzz testing absolutely should be used whenever you parse anything.

SoftTalker
0 replies
2h31m

Yeah, even if you are only parsing "safe" inputs such as ones you created yourself. Other bugs and sometimes even truly random events can corrupt data.

stefan_
1 replies
3h52m

People are target fixating too much. Sure, this parser crashed and caused the system to go down. But in an alternative universe they push a definition file that rejects every openat() or connect() syscall. Your system is now equally as dead, except it probably won't even have the grace to restart.

The whole concept of "we fuck with the system in kernel based on data downloaded from the internet" is just not very sound and safe.

hello_moto
0 replies
3h40m

It's not and that's the sad state of AV in Windows

smackeyacky
1 replies
10h21m

Hmmm. Most common problems these days are certificate related I would have thought. Binary data transfers are pretty rare in an age of base64 json bloat

madaxe_again
0 replies
9h34m

There are plenty of binary serialisation protocols out there, many proprietary - maybe you’ll stuff that base64’d in a json container for transit, but you’re still dealing with a binary decoder.

miohtama
1 replies
10h15m

I was immediately willing to bet a hundred quid this was C/C++ code :)

formerly_proven
0 replies
8h30m

Not that interesting a bet considering we know it's a Windows driver.

lol768
1 replies
7h0m

I'm happy to up the ante by £50 to account for my second theory

What's that, three pints in a pub inside the M25? :P

Completely agree with this sentiment though, we've known that handling of binary data in memory unsafe languages has been risky for yonks. At the very least, fuzzing should've been employed here to try and detect these sorts of issues. More fundamentally though, where was their QA? These "channel files" just went out of the door without any idea as to their validity? Was there no continuous integration check to just .. ensure they parsed with the same parser as was deployed to the endpoints? And why were the channel files not deployed gradually?

TeMPOraL
0 replies
2h36m

FWIW, before someone brings up JSON, GP's bet only makes sense when "binary" includes parsing text as well. In fact, most notorious software bugs are related to misuse of textual formats like SQL or JS.

teeheelol
0 replies
9h35m

Yep.

Looking at how this whole thing is pasted together, there's probably a regex engine in one of those sys files somewhere that was doing the "parsing"...

xxs
0 replies
2h53m

(for the non-British, that's £100)

next time you should add /s to your posts

variadix
0 replies
9h11m

More or less. Binary parsers are the easiest place to find exploits because of how hard it is to do correctly. Bounds checks, overflow checks, pointer checks, etc. Especially when the data format is complicated.

seymore_12
0 replies
6h35m

Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures. All because a) human programmers forget to account for edge cases, and b) imperative programming languages allow us to do so.

This. One year ago UK air traffic control collapsed due to inability to properly parse "faulty" flight plan: https://news.ycombinator.com/item?id=37461695

eru
0 replies
6h53m

Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures. All because a) human programmers forget to account for edge cases, and b) imperative programming languages allow us to do so.

I wouldn't blame imperative programming.

E.g. Rust is imperative, and pretty good at telling you off when you forget a case in your switch.

By contrast, the variant of Scheme I used twenty years ago was functional, but didn't have checks for covering all cases. (And Haskell's GHC didn't have that check turned on by default a few years ago. Not sure if they changed that.)

cedws
0 replies
5h38m

I’d say that it is a bug by definition if your program ungracefully crashes when it’s passed malformed data at runtime.

back_to_basics
0 replies
28m

"human programmers forget to account for edge cases"

Which is precisely the rationale which led to Standard Operating Procedures and Best Practices (much like any other Sector of business has developed).

I submit to you, respectfully, that a corporation shall never rise to a $75 Billion Market Cap without a bullet-proof adherence to such, and thus, this "event" should be properly characterized and viewed as a very suspicious anomaly, at the least

https://news.ycombinator.com/item?id=41023539 fleshes out the proper context.

G3rn0ti
79 replies
12h38m

Bypassing the discussion of whether one actually needs rootkit-powered endpoint surveillance software such as CS, perhaps an open-source solution would be a killer way to move this whole sector to more ethical standards. The main tool would be open source, so it would be transparent what exactly it does and that it is free of backdoors or really bad bugs. It could be audited by the public. On the other hand, it could still be a business model to supply malware signatures as a security team feeding this system.

imiric
44 replies
12h5m

I'd say no. Kolide is one such attempt, and their practices, and how it's used in companies, are as insidious as those from a proprietary product. As a user, it gives me no assurance that an open source surveillance rootkit is better tested and developed, or that it has my best interests in mind.

The problem is the entire category of surveillance software. It should not exist. Companies that use it don't understand security, and don't trust their employees. They're not good places to work at.

WA
17 replies
11h9m

Companies that use it don't understand security

What should these companies understand about security exactly?

And aren’t they kinda right to not trust their employees if they employ 50,000 people with different skills and intentions?

InsideOutSanta
14 replies
10h22m

"And aren’t they kinda right to not trust their employees if they employ 50,000 people with different skills and intentions?"

Yes, in a 50k employee company, the CEO won't know every single employee and be able to vouch for their skills and intentions.

But in a non-dysfunctional company, you have a hierarchy of trust, where each management level knows and trusts the people above and below them. You also have siloed data, where people have access to the specific things they need to do their jobs. And you have disaster mitigation mechanisms for when things go wrong.

Having worked in companies of different sizes and with different trust cultures, I do think that problems start to arise when you add things like individual monitoring and control. You're basically telling people that you don't trust them, which makes them see their employer in an adversarial role, which actually makes them start to behave less trustworthy, which further diminishes trust across the company, harms collaboration, and eventually harms productivity and security.

kemotep
6 replies
7h28m

Setting aside the possibility of deploying an EDR like Crowdstrike just being a box ticking exercise for compliance or insurance purposes, can something like an EDR be used not because of a lack of trust but a desire to protect the environment?

A user doesn’t have to do anything wrong for the computer to become compromised, or even if they do, being able to limit the blast radius and lock down the computer or at least after the fact have collected the data to be able to identify what went wrong seems important.

How would you secure a network of computers without an agent that can do anti-virus, detect anomalies, and remediate them? That is to say, how would you manage to secure it without doing something that has monitoring and lockdown capabilities? In your words, signaling that you do not trust the users?

kchr
5 replies
5h58m

This. From all the comments I've seen in the multiple posts and threads about the incident, this simple fact seems to be the least discussed. How else to protect a complex IT environment with thousands of assets in form of servers and workstations, without some kind of endpoint protection? Sure, solutions like CrowdStrike et al are box-checking and risk-transferring exercises in one sense, but they actually work as intended when it comes to protecting endpoints from novel malware and TTPs. As long as they don't botch their own software, that is :D

imiric
3 replies
5h31m

How else to protect a complex IT environment with thousands of assets in form of servers and workstations, without some kind of endpoint protection?

There is no straightforward answer to this question. Assuming that your infrastructure is "secure" because you deployed an EDR solution is wrong. It only gives you a false sense of security.

The reality is that security takes a lot of effort from everyone involved, and it starts by educating people. There is no quick bandaid solution to these problems, and, as with anything in IT, any approach has tradeoffs. In this case, and particularly after the recent events, it's evident that an EDR system is as much of a liability as it is an asset—perhaps even more so. You give away control of your systems to a 3rd party, and expect them to work flawlessly 100% of the time. The alarming thing is how much this particular vendor was trusted with critical parts of our civil infrastructure. It not only exposes us to operational failures due to negligence, but to attacks from actors who will seek to exploit that 3rd party.

matwood
1 replies
2h57m

starts by educating people

Any security certification has a section on regularly educating employees on the topic.

To your point, I agree that companies are attempting to bypass the hard work by deploying a tool and thinking they are done.

kchr
0 replies
1h34m

Absolutely, training is key. Alas, managers don't seem to want their employees spending time on anything other than delivering profit and so the training courses are zipped through just to mark them as completed.

Personally, I don't know how to solve that problem.

kchr
0 replies
1h39m

I totally agree. In my current work environment, we do deploy EDR but it is primarily for assets critical for delivering our main service to customers. Ironically, this incident caused them all to be unavailable and there is for sure a lesson to be learned here!

It is not considered a silver bullet by the security team, rather a last-resort detection mechanism for suspicious behavior (for example if the network segmentation or access control fails, or someone managed to get foothold by other means). It also helps them identify which employees need more training as they keep downloading random executables from the web.

morning-coffee
0 replies
4h30m

It is a good question. Is there a possibility of fundamentally fixing software/hardware to eliminate the vectors that malware exploits to gain a foothold at all? E.g. not storing the return address on the stack or letting it be manipulated by the callee? Memory bounds enforcement, either statically at compile time or with the help of hardware, to prevent writing past memory that isn't yours? (Not asking about feasibility of coexisting with or migrating from the current world, just about the possibility of fundamentally solving this at all...)

protomolecule
3 replies
9h4m

"But in a non-dysfunctional company, you have a hierarchy of trust, where each management level knows and trusts the people above and below them. "

Even in a company of two sometimes a husband or a wife betrays the trust. Now multiply that probability by 50000.

TeMPOraL
2 replies
6h38m

Yet we don't apply total surveillance to people. The reason isn't just ethics and US constitution, but also that it's just not possible without destroying society. Same perhaps applies to computer systems.

protomolecule
1 replies
6h33m

Which is a completely different argument

TeMPOraL
0 replies
6h29m

I think it doesn't. I think that the kind of security the likes of CrowdStrike promise is fundamentally impossible to have, and pursuing it is a fool's errand.

mylastattempt
1 replies
7h10m

I disagree. You seem to start from a premise that all people are honest, except those that aren't, but you don't work with or meet dishonest people, unless the employer sets himself up in an adversarial role?

As the other reply to your comment said: the world is not 'fair' or 'honest', that's just a lie told to children. Apart from genuinely evil people, there are unlimited variables that dictate people's behavior. Culture, personality, nutrition, financial situation, mood, stress, bully coworkers, intrinsic values, etc etc. To think people are all fair and honest "unless" is a really harmful worldview to have, and in my opinion the reason a lot of bad things are allowed to happen and continue (throughout all of society, not just work).

Zero-trust in IT is just the digitized version of "trust is earned". In computers you can be more crude and direct about it, but it should be the same for social connections and interactions.

matwood
0 replies
2h48m

You seem to start from a premise that all people are honest

You have to start with that premise otherwise organizations and society fail. Every hour of every day, even people in high security organizations have opportunities to betray the trust bestowed on them. Software and processes are about keeping honest people honest. The dishonest ones you cannot do too much about but hope you limit the damage they can cause.

If everyone is treated as dishonest then there will eventually be an organizational breakdown. Creativity, high productivity, etc... do not work in a low/zero trust environment.

snotrockets
0 replies
9h36m

That’s a lie we tell children so they think the world is fair.

A Marxist reading would suggest alienation, but a more modern one would realize that it is a bit more than that: to enable modern business practices (both good and bad!) we designed systems of management to remove or reduce trust and accountability in the org, yet maintain as similar results to a world that is more in line with the one you believe is possible.

A security professional though would tell you that even in such a world, you can not expect even the most diligent folks to be able to identify all risks (e.g. phishing became so good, even professionals can’t always discern the real from fake), or practice perfect opsec (which probably requires one to be a psychopath).

Voultapher
1 replies
10h24m

Security is a process not a product. Anyone selling you security as a product is scamming you.

These endpoint security companies latch onto people making decisions, those people want security and these software vendors promise to make the process as easy as possible. No need to change the way a company operates, just buy our stuff and you're good. That's the scam.

imiric
0 replies
10h10m

Exactly, well said.

Truthfully, it must be practically infeasible to transform security practices of a large company overnight. Most of the time they buy into these products because they're chasing a security certification (ISO 27001, SOC2, etc.), and by just deploying this to their entire fleet they get to sidestep the actually difficult part.

The irony is that at the end of this they're not anymore "secure" than they were before, but since they have the certification, their customers trust that they are. It's security theater 101.

chii
8 replies
11h27m

Whether you morally agree with surveillance software's purpose is not the same as whether a particular piece of surveillance software works well or not.

I would imagine an open source version of crowdstrike would not have had such a bad outcome.

imiric
7 replies
11h15m

I disagree with the concept of surveillance altogether. Computer users should be educated about security, given control of their devices, and trusted that they will do the right thing. If a company can't do that, that's a sign that they don't have good security practices to begin with, and don't do a good job at hiring and training.

The only reason this kind of software is used is so that companies can tick a certification checkbox that gives the appearance of running a tight ship.

I realize it's the easy way out, and possibly the only practical solution for a large corporation, but then this type of issues is unavoidable. Whether the product is free or proprietary makes no difference.

sooper
4 replies
10h55m

Most people do not understand, or care to understand, what "security" means.

You highlight training as a control. Training is expensive - to reduce cost and enhance effectiveness, how do you focus training on those that need it without any method to identify those that do things in insecure ways?

Additionally, I would say a major function of these systems is not surveillance at all - it is preventive controls to prevent compromise of your systems.

Overall, your comment strikes me as naive and not based on operational experience.

TeMPOraL
3 replies
10h44m

This type of software is notorious for severely degrading employees' ability to do their jobs, occasionally preventing it entirely. It's a main reason why "shadow IT" is a thing - bullshit IT restrictions and endpoint security malware can't reach third-party SaaS' servers.

This is to say, there are costs and threats caused by deploying these systems too, and they should be considered when making security decisions.

jpc0
2 replies
8h18m

Explain exactly how any AV prevents a user from checking e-mails and opening word?

The years I spent doing IT at that level, every time, every single time I got a request for admin privileges to be granted to a user or for software to be installed on an endpoint, we already had a solution in place for exactly what the user wanted, installed and tested on their workstation, that was taught in onboarding and they simply "forgot".

Just like the users whose passwords I had to reset every Monday because they forgot them. It's an irritation, but that doesn't mean they didn't do their job well. They met all performance expectations; they just needed to be hand-held with technology.

The real world isn't black and white and this isn't Reddit.

TeMPOraL
1 replies
6h54m

Explain exactly how any AV prevents a user from checking e-mails and opening word?

For example by doing continuous scans that consume so much CPU the machine stays thermally throttled at all times.

(Yes, really. I've seen a colleague raising a ticket about AV making it near-impossible to do dev work, to which IT replied the company will reimburse them for a cooling pad for the laptop, and closed the issue as solved.)

The problem is so bad that Microsoft, despite Defender being by far the lightest and least bullshit AV solution, created "dev drive", a designated drive that's excluded by design from Defender scanning, as a blatant workaround for corporate policies preventing users and admins from setting custom Defender exclusions. Before that, your only alternative was to run WSL2 or a regular VM, which are opaque to AVs, but that tends to be restricted by corporate too, because "sekhurity".

And yes, people in these situations invent workarounds, such as VMs, unauthorized third-party SaaS, or using personal devices, because at the end of the day, the work still needs to be done. So all those security measures do is reduce actual security.

kchr
0 replies
5h52m

Most AV and EDR solutions support exceptions, either on specific assets or fleets of assets. You can make exceptions for some employees (for example developers or IT) while keeping (sane) defaults for everybody else. Exceptions are usually applied on file paths, executable image names, file hashes, signature certificates or the complete asset. It sounds like people are applying these solutions wrong, which of course has a negative outcome for everybody and builds distrust.

chrisjj
1 replies
9h6m

Computer users should be educated about security, given control of their devices, and trusted that they will do the right thing.

Imagine you are a bank. Imagine you have no way to ensure no employee is a crook.

It does happen.

matwood
0 replies
2h45m

Imagine you have no way to ensure no employee is a crook.

Wait, are you saying we have gotten rid of all the crooks in a bank/or those that handle money?

echoangle
6 replies
10h39m

If your company is large enough, you can’t really trust your employees. Do you really think Google can trust that not a single one of their employees does something stupid or is even actively malicious?

iforgotpassword
5 replies
10h31m

Limit their abilities using OS features? Have the vendor fix security issues rather than a third party incompetently slapping on band-aid?

It's like you let one company build your office building and then bring in another contractor to randomly add walls and have others removed while having never looked at the blueprints and then one day "whoopsie, that was a supporting wall I guess".

Why is it not just completely normal but even expected that an OS vendor can't build an OS properly, or that the admins can't properly configure it, but instead you need to install a bunch of crap that fucks around with OS internals in batshit crazy ways? I guess because it has a nice dashboard somewhere that says "you're protected". Checkbox software.

lyu07282
3 replies
9h11m

The sensor basically monitors everything that's happening on the system and then uses heuristics and known attack vectors and behavior to, for example, lock compromised systems down. For example, a fileless malware that connects to a C&C and then begins to upload all local documents and stored passwords, then slowly enumerates every service the employee has access to for vulnerabilities.

If you manage a fleet of tens of thousands of systems and you need to protect against well funded organized crime? Employees running malicious code under their user is a given and can't be prevented. Buying crowdstrike sensor doesn't seem like such a bad idea to me. What would you do instead?

iforgotpassword
2 replies
3h20m

What would you do instead?

As said, limit the user's abilities as much as possible with features of the OS and software in use. Maybe if you want those other metrics, use a firewall, but not a TLS-breaking virus-scanning abomination that has all the same problems, just a simple one that can warn you about unusual traffic patterns. If someone from accounting starts uploading a lot of data, or connects to Google Cloud when you don't use any of their products, that should be odd.

If we're talking about organized crime, I'm not convinced crowdstrike in particular doesn't actually enlarge the attack surface. So we had what now as the cause, a malformed binary ruleset that the parser, running with kernel privileges, choked on and crashed the system. Because of course the parsing needs to happen in kernel space and not a sandboxed process. That's enough for me to make assumptions about the quality of the rest of the software, and answer the question regarding attack surface.

Before this incident nobody ever really looked at this product at all from a security standpoint, maybe because it is (supposed to be) a security product and thus cannot have any flaws. But it seems now security researchers all over the planet start looking at this thing and are having a field day.

Bill Gates sent that infamous email in the early 2000s, I think after Sasser hit the world, that security should be made the No. 1 priority for Windows. As much as I dislike Windows for various reasons, I think overall Microsoft does a rather good job of this. Maybe it's time the companies behind these security products start taking security seriously too?

lyu07282
1 replies
1h28m

Before this incident nobody ever really looked at this product at all from a security standpoint

If you only knew how absurd of a statement that is. But in any case, there are just too many threats network IDS/IPS solutions won't help you with, any decent C2 will make it trivial to circumvent them. You can't limit the permissions of your employees to the point of being effective against such attacks while still being able to do their job.

iforgotpassword
0 replies
1h13m

If you only knew how absurd of a statement that is.

You don't seem to know either since you don't elaborate on this. As said, people are picking this apart on Twitter and mastodon right now. Give it a week or two and I bet we'll see a couple CVEs from this.

For the rest of your post you seem to ignore the argument regarding attack surface, as well as the fact that there are companies not using this kind of software and apparently doing fine. But I guess we can just claim they are fully infiltrated and just don't know because they don't use crowdstrike. Are you working for crowdstrike by any chance?

But sure, at the end of the day you're just gonna weigh the damage this outage did to your bottom line and the frequency you expect this to happen with, against a potential hack - however you even come up with the numbers here, maybe crowdstrike salespeople will help you out - and maybe tell yourself it's still worth it.

7952
0 replies
8h53m

In a sense the secure platform already exists. You use web apps as much as possible. You store data in cloud storage. You restrict local file access and execute permissions. Authenticate using passkeys.

The trouble is that people still need local file access, and use network file shares. You have hundreds of apps used by a handful of users that need to run locally. And a few intranet apps that are mission critical and have dubious security. That creates the necessity for wrapping users in firewalls, vpns, tls interception, end point security etc. And the less well it all works the more you need to fill the gaps.

pxc
4 replies
11h29m

I'm curious about this bad 'news' about Kolide. Could you tell me more about your experience with it?

imiric
3 replies
9h43m

I don't have first-hand experience with Kolide, as I refused to install it when it was pushed upon everyone in a company I worked for.

Complaints voiced by others included false positives (flagging something as a threat when it wasn't, or alerting that a system wasn't in place when it was), being too intrusive and affecting their workflow, and privacy concerns (reading and reporting all files, web browsing history, etc.). There were others I'm not remembering, as I mostly tried to stay away from the discussion, but it was generally disliked by the (mostly technical) workforce. Everyone just accepted it as the company deemed it necessary to secure some enterprise customers.

Also, Kolide's whole spiel about "honest security"[1] reeks of PR mumbo jumbo whose only purpose is to distance themselves from other "bad" solutions in the same space, when in reality they're not much different. It's built by Facebook alumni, after all, and relies on FB software (osquery).

[1]: https://honest.security/

DrRobinson
2 replies
7h32m

I think some of the information here is misleading and a bit unfair.

being too intrusive and affecting their workflow

Kolide is a reporting tool, it doesn't for example remove files or put them in quarantine. You also cannot execute commands remotely like in Crowdstrike. As you mentioned, it's based on osquery which makes it possible to query machine information using SQL. Usually, Kolide is configured to send a Slack message or email if there is a finding, which I guess can be seen as intrusive but IMO not very.
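
For illustration, here's a minimal sketch of the kind of osquery lookup this model builds on (it assumes the osqueryi CLI is installed locally; this is not Kolide's actual configuration, just a demonstration of the query style):

    # Minimal sketch: run a one-off osquery SQL query and print the JSON rows.
    # Assumes the osqueryi binary is on PATH; os_version is a standard osquery
    # table, nothing Kolide-specific.
    import json
    import subprocess

    def run_osquery(sql: str) -> list:
        out = subprocess.run(
            ["osqueryi", "--json", sql],
            capture_output=True, text=True, check=True,
        )
        return json.loads(out.stdout)

    if __name__ == "__main__":
        print(run_osquery("SELECT name, version, platform FROM os_version;"))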

reading and reporting all files

It does not read and report all files as far as I know, but I think it's possible to make SQL queries to read specific files. But all files or file names aren't stored in Kolide or anything like that. And that live query feature is audited (end users can see all queries run against their machines) and can be disabled by administrators.

web browsing history

This is not directly possible as far as I know, but maybe via a file read query but it's not something built-in out of the box/default. And again, custom queries are transparent to users and can be disabled.

Kolide's whole spiel about "honest security"[1] reeks of PR mumbo jumbo whose only purpose is to distance themselves from other "bad" solutions in the same space

While it's definitely a PR thing, they might still believe in it and practice what they preach. To me it sounds like a good thing to differentiate oneself from bad actors.

Kolide gives users full transparency of what data is collected via their Privacy Center, and they allow end users to make decisions about what to do about findings (if anything) rather than enforcing them.

It's built by Facebook alumni, after all, and relies on FB software (osquery).

For example, React and Semgrep are also built by Facebook/Facebook alumni, but I don't really see the relevance other than some ad hominem.

Full disclosure: No association with Kolide, just a happy user.

madeofpalk
0 replies
7h17m

Great news - Kolide has a new integration with Okta that'll prevent you from logging into anything if Kolide has a problem with your device!

imiric
0 replies
2h17m

I concede that I may be unreasonably biased against Kolide because of the type of software it is, but I think you're minimizing some of these issues. My memory may be vague on the specifics, but there were certainly many complaints in the areas I mentioned in the company I worked at.

That said, since Kolide/osquery is a very flexible product, the complaints might not have been directed at the product itself, but at how it was configured by the security department as well. There are definitely some growing pains until the company finds the right balance of features that everyone finds acceptable.

Re: intrusiveness, it doesn't matter that Kolide is a report-only tool. Although, it's also possible to install extensions[1,2] that give it a deeper control over the system.

The problem is that the policies it enforces can negatively affect people's workflow. For example, forcing screen locking after a short period of inactivity has dubious security benefits if I'm working from a trusted environment like my home, yet it's highly disruptive. (No, the solution is not to track my location, or give me a setting I have to manage...) Forcing automatic system updates is also disruptive, since I want to update and reboot at my own schedule. Things like this add up, and the combination of all of them is equivalent to working in a babyproofed environment where I'm constantly monitored and nagged about issues that don't take any nuance into account, and at the end of the day do not improve security in the slightest.

Re: web browsing history, I do remember one engineer looking into this and noticing that Kolide read their browser's profile files, and coming up with a way to read the contents of the history data in SQLite files. But I am very vague on the details, so I won't claim that this is something that Kolide enables by default. osquery developers are clearly against this kind of use case[3]. It is concerning that the product can, in theory, be exploited to do this. It's also technically possible to pull any file from endpoints[4], so even if this is not directly possible, it could easily be done outside of Kolide/osquery itself.

Kolide gives users full transparency of what data is collected via their Privacy Center

Honestly, why should I trust what that says? Facebook and Google also have privacy policies, yet have been caught violating their users' privacy numerous times. Trust is earned, not assumed based on "trust me, bro" statements.

For example React and Semgrep is also built by Facebook/Facebook alumni, but I don't really see the relevance other than some ad-hominem.

Facebook has historically abused their users' privacy, and even has a Wikipedia article about it.[5] In the context of an EDR system, ensuring trust from users and handling their data with the utmost care w.r.t. their privacy are two of the most paramount features. Actually, it's a bit silly that Kolide/osquery is so vocal in favor of preserving user privacy, when this goes against working with employer-owned devices where employee privacy is definitely not expected. In any case, the fact this product is made by people who worked at a company built by exploiting its users is very relevant considering the type of software it is. React and Semgrep have an entirely different purpose.

[1]: https://github.com/trailofbits/osquery-extensions

[2]: https://github.com/hippwn/osquery-exec

[3]: https://github.com/osquery/osquery/issues/7177

[4]: https://osquery.readthedocs.io/en/stable/deployment/file-car...

[5]: https://en.wikipedia.org/wiki/Privacy_concerns_with_Facebook

ironbound
4 replies
10h13m

Next you'll be saying "I dont need an immune system..."

Fun fact: an attacker only needs to steal credentials from the home directory to jump into a company's AWS account where all the juicy customer data lives, so there are reasons we want this control.

Frankly I'd like to see the smart people complaining help write better solutions rather than hinder.

pavel_pt
3 replies
9h49m

If that’s all it takes for an attacker, you’re doing AWS wrong.

snotrockets
0 replies
9h30m

Problem is that many do.

Doing it right requires very capable individuals and a significant effort. Less than it used to take, more than most companies are ready to invest.

ironbound
0 replies
5h42m

people get lazy

hello_moto
0 replies
3h1m

This is the real world, everyone is doing something wrong.

The alternative is to replace you with AI yes?

matheusmoreira
20 replies
11h56m

There are no "ethical standards" to move to. Nobody should be able to usurp control of our computers. That should simply be declared illegal. Creating contractual obligations that require people to cede control of their computers should also be prohibited. Anything that does this is malware and malware does not become justified or "ethical" when some corporation does it. Open source malware is still malware.

cqqxo4zV46cp
9 replies
10h32m

Oh stop it. It’s not your machine, it’s your employer’s machine. You’re the user of the machine. You’re cargo-culting some ideological take that doesn’t apply here at all.

imiric
8 replies
9h18m

It’s not your machine, it’s your employer’s machine.

Agreed. I'm fine with this, as long as the employer also accepts that I will never use a personal device for work, that I will never use a minute of personal time for work, and that my productivity is significantly affected by working on devices and systems provided and configured by the employer. This knife cuts both ways.

fragmede
6 replies
9h6m

If only that were possible. Luckily for my employer, I end up thinking about problems to be solved during my off hours like when I'm sleeping and in the shower. Then again, I also think about non-work life problems sitting at my desk when I'm supposed to be working, so (hopefully) it evens out.

imiric
5 replies
7h44m

I don't think it's possible either. But the moment my employer forces me to install a surveillance rootkit on the machine I use for work—regardless of who owns the machine—any trust that existed in the relationship is broken. And trust is paramount, even in professional settings.

valicord
2 replies
4h35m

Setting aside the question whether these security tools are effective at their stated goal, what does this have to do with trust at all? Does the existence of a bank vault break the trust between the bank and the tellers? What is the mechanism that would prevent your computer from getting infected by a 0-day if only your employer trusted you?

imiric
1 replies
1h59m

Does the existence of a bank vault break the trust between the bank and the tellers?

That's a strange analogy, since the vault is meant to safeguard customer assets from the public, not from bank employees. Besides, the vault doesn't make the teller's job more difficult.

What is the mechanism that would prevent your computer from getting infected by a 0-day if only your employer trusted you?

There isn't one. What my employer does is trust that I take care of their assets and follow good security practices to the best of my abilities. Making me install monitoring software is an explicit admission that they don't trust me to do this, and with that they also break my trust in them.

valicord
0 replies
1h30m

You mean like AV software is meant to safeguard the computer from malware? I'm sure banks have a lot of annoying security-related processes that make tellers' jobs more difficult.

mr_mitm
1 replies
4h13m

If you don't already have an antivirus on your work machine, you're in an extremely small minority. As a consultant with projects that last about a week, I've experienced the onboarding process of over a hundred orgs first-hand. They almost all hand out a Windows laptop, and every single Windows laptop had an AV on it. It's considered negligent not to have some AV solution in the corporate world. And these days, almost all the fancy AVs live in the kernel.

imiric
0 replies
2h12m

I don't doubt that to be the case, but I'm happy to not work in corporate environments (anymore...). :)

kchr
0 replies
5h39m

My experience is that in these workplaces where EDR is enforced on all devices used for work, your hypothetical is true (i.e. you are not expected to work on devices not provided by your employer - on the contrary, that is most likely forbidden).

callalex
8 replies
11h33m

What does “our computer” mean when it is not owned by you, but issued to you to perform a task with by your employer? Does that also apply to the operator at a switchboard in a nuclear missile launch facility?

derefr
6 replies
2h38m

What does “our computer” mean when it is not owned by you, but issued to you to perform a task with by your employer?

Well, presuming that:

1. the employee is issued a computer, that they have possession of even if not ownership (i.e. they bring the computer home with them, etc.)

2. and the employee is required to perform creative/intellectual labor activities on this computer — implying that they do things like connecting their online accounts to this computer; installing software on this computer (whether themselves or by asking IT to do it); doing general web-browsing on this computer; etc.

3. and where the extent of their job duties, blurs the line between "work" and "not work" (most salaried intellectual-labor jobs are like this) such that the employee basically "lives in" this computer, even when not at work...

4. ...to the point that the employee could reasonably conclude that it'd be silly for them to maintain a separate "personal" computer — and so would potentially sell any such devices (if they owned any), leaving them dependent on this employer-issued computer for all their computing needs...

...then I would argue that, by the same chain of reasoning as in the GP post, employers should not be legally permitted to “issue” employees such devices.

Instead, the employer should either purchase such equipment for the employee, giving it to them permanently as a taxable benefit; or they should require that the employee purchase it themselves, and recompense them for doing so.

Cyberpunk analogy: imagine you are a brain in a vat. Should your employer be able to purchase an arbitrary android body for you; make you use it while at work; and stuff it full of monitoring and DRM? No, that'd be awful.

Same analogy, but with the veil stripped off: imagine you are paraplegic. Should your employer be allowed to issue you an arbitrary specific wheelchair, and require you to use it at work, and then monitor everything you do with it / limit what you can do with it because it’s “theirs”? No, that’d be ridiculous. And humanity already knows that — employers already can't do that, in any country with even a shred of awareness about accessibility devices. The employer — or very much more likely, the employer's insurance provider — just buys the person the chair. And then it's the employee's chair.

And yes, by exactly the same logic, this also means that issuing an employee a company car should be illegal — at least in cases where the employee lives in a non-walkable area, and doesn't already have another car (that they could afford to keep + maintain + insure); and/or where their commute is long enough that they'd do most non-employment-related car-requiring things around work and thus using their company car. Just buy them a car. (Or, if you're worried they might run away with it, then lease-to-own them a car — i.e. where their "equity in the car" is in the form of options that vest over time, right along-side any equity they have in the company itself.)

Does that also apply to the operator at a switchboard…

Actually, no! Because an operator of a switchboard is not a “user” of the computer that powers the switchboard, in the same sense that a regular person sitting at a workstation is a "user" of the workstation.

The system in this case is a “kiosk computer”, and the operator is performing a prescribed domain-specific function through a limited UX they’re locked into by said system. The operator of a nuclear power plant is akin to a customer ordering food from a fast-food kiosk — just providing slightly more mission-critical inputs. (Or, for a maybe better analogy: they're akin to a transit security officer using one of those scanner kiosk-handhelds to check people's tickets.)

If the "computer" the nuclear-plant operator was operating, exposed a purely electromechanical UX rather than a digital one — switches and knobs and LEDs rather than screens and keyboards[1] — then nothing about the operator's workflow would change. Which means that the operator isn't truly computing with the computer; they're just interacting with an interface that happens to be a computer.

[1] ...which, in fact, "modern" nuclear plants are. The UX for a nuclear power plant control-center has not changed much since the 1960s; the sort of "just make it a touchscreen"-ification that has infected e.g. automotive has thankfully not made its way into these more mission-critical systems yet. (I believe it's all computers under the hood now, but those computers are GPIO-relayed up to panels with lots and lots of analogue controls. Or maybe those panels are USB HID devices these days; I dunno, I'm not a nuclear control-systems engineer.)

Anyway, in the general case, you can recognize these "the operator is just interacting with an interface, not computing on a computer" cases because:

• The machine has separate system administrators who log onto it frequently — less like a workstation, more like a server.

• The machine is never allowed to run anything other than the kiosk app (which might be some kind of custom launcher providing several kiosk apps, but where these are all business-domain specific apps, with none of them being general-purpose "use this device as a computer" apps.)

• The machine is set up to use domain login rather than local login, and keeps no local per-user state; or, more often, the machine is configured to auto-login to an "app user" account (in modern Windows, this would be a Mandatory User Profile) — and then the actual user authentication mechanism is built into the kiosk app itself.

Hopefully, the machine is using an embedded version of the OS, which has had all general-purpose software stripped out of it to remove vulnerability surface.

valicord
4 replies
1h18m

the employee could reasonably conclude that it'd be silly for them to maintain a separate "personal" computer — and so would potentially sell any such devices

What a bizarre leap of logic. Can FedEx employees reasonably sell their non-uniform clothes? Just because the employer in this scenario didn't 100% lock down the computer (which is a good thing, because the alternative would be incredibly annoying for day-to-day work) doesn't mean the employee can treat it as their own. Even from the privacy perspective, it would be pretty silly. Are you going to use the employer-provided computer to apply to your next job?

derefr
3 replies
57m

People do do it, though. Especially poor people, who might not use their personal computers very often.

Also, many people don't own a separate "personal" computer in the first place. Especially, again, poor people. (I know many people who, if needing to use "a PC" for something, would go to a public library to use the computers there.)

Not every job is a software dev position in the Bay Area, where everyone has enough disposable income to have a pile of old technology laying around. Many jobs for which you might be issued a work laptop still might not pay enough to get you above the poverty line. McDonald's managers are issued work laptops, for instance.

(Also, disregarding economic class for a moment: in the modern day, most people who aren't in tech solve most of their computing problems by owning a smartphone, and so are unlikely to have a full PC at home. But their phone can't do everything, so if they have a work computer they happen to be sat in front of for hours each day — whether one issued to them, or a fixed workstation at work — then they'll default to doing their rare personal "productivity" tasks on that work computer. And yes, this does include updating their CV!)

---

Maybe you can see it more clearly with the case of company cars.

People sometimes don't own any other car (that actually works) until they get issued a company car; so they end up using their company car for everything. (Think especially: tradespeople using their company-logo-branded work box-truck for everything. Where I live, every third vehicle in any parking lot is one of those.)

And people — especially poorer people — also often sell their personal vehicle when they are issued a company car, because this 1. releases them from the need to pay a lease + insurance on that vehicle, and 2. gets them possibly tens of thousands of dollars in a lump sum (that they don't need to immediately reinvest into another car, because they can now rely on the company car.)

valicord
2 replies
51m

The point is that if you do do it, it's on you to understand the limitations of using someone else property. Just like the difference between rental vs owned housing.

There are also fairly obvious differences between work-issued computers and all of your other analogies:

1. A car (and presumably the cyberpunk android body) is much more expensive than a computer, so the downside of owning both a personal and a work one is much higher.

2. A chair or a wheelchair doesn't need security monitoring because it's a chair (I guess you could come up with an incredibly convoluted scenario where it would make sense to put GPS tracking in a wheelchair, but come on).

just buys the person the chair. And then it's the employee's chair.

It's not because there's a law against loaning chairs, it's because the chair is likely customized for a specific person and can't be reused. Or if you're talking about WFH scenarios, they just don't want to bother with return shipping.

derefr
1 replies
43m

No, it's the difference between owned housing vs renting from a landlord who is also your boss in a company town, where the landlord has a vested interest in e.g. preventing you from using your apartment to also do work for a competitor.

Which is, again, a situation so shitty that we've outlawed it entirely! And then also imposed further regulations on regular, non-employer landlords, about what kinds of conditions they can impose on tenants. (E.g. in most jurisdictions, your landlord can't restrict you from having guests stay the night in your room.)

Tenants' rights are actually a great analogy for what I'm talking about here. A company-issued laptop is very much like an apartment, in that you're "living in it" (literally and figuratively, respectively), and that you therefore should deserve certain rights to autonomous possession/use, privacy, freedom from restriction/compromise in use, etc.

While you don't literally own an apartment you're renting, the law tries to, as much as possible, give tenants the rights of someone who does own that property; and to restrict the set of legal justifications that a landlord can use to punish someone for exercising those (temporary) rights over their property.

IMHO having the equivalent of "tenants' rights" for something like a laptop is silly, because that'd be a lot of additional legal edifice for not-much gain. But, unlike with real-estate rental, it'd actually be quite practical to just make the "tenancy" case of company IT equipment use impossible/illegal — forcing employers to do something else instead — something that doesn't force employees into the sort of legal area that would make "tenants' rights" considerations applicable in the first place.

valicord
0 replies
35m

No, that would be more like sleeping at the office (purely because of employee preferences, not because the employer forces you to or anything like that) and complaining about security cameras.

derefr
0 replies
1h23m

Tangent — a question you didn't ask, but I'll pretend you did:

If employers allowed employees to "bring their own devices", and then didn't force said employees to run MDM software on those devices, then how in the world could the employer guarantee the integrity of any line-of-business software the employee must run on the device; impose controls to stop PII + customer-shared data + trade secrets from being leaked outside the domain; and so forth?

My answer to that question: it's safe to say that most people in the modern day are fine with the compromise that your device might be 100% yours most of the time; but, when necessary — when you decide it to be so — 99% yours, 1% someone else's.

For example, anti-cheat software in online games.

The anti-cheat logic in online games is this little nugget of code that runs on a little sub-computer within your computer (Intel SGX or equivalent.) This sub-computer acts as a "black box" — it's something the root user of the PC can't introspect or tamper with. However:

• Whenever you're not playing a game, the anti-cheat software isn't loaded. So most of the time, your computer is entirely yours.

You get to decide when to play an online game, and you are explicitly aware of doing so.

• When you are playing an online game, most of your computer — the CPU's "application cores", and 99% of the RAM — is still 100% under your control. The anti-cheat software isn't actually a rootkit (despite what some people say); it can't affect any app that doesn't explicitly hook into it.

• In a brute-force sense, you still "control" the little sub-computer as well — in that you can force it to stop running whatever it's running whenever you want. SGX and the like aren't like Intel's Management Engine (which really could be used by a state actor to plant a non-removable "ring -3" rootkit on your PC); instead, SGX is more like a TPM, or an FPGA: it's something that's ultimately controlled by the CPU from ring 0, just with a very circumscribed API that doesn't give the CPU the ability to "get in the way" of a workload once the CPU has deployed that workload to it, other than by shutting that workload off.

As much as people like Richard Stallman might freak out at the above design, it really isn't the same thing as your employer having root on your wheelchair. It's more like how someone in a wheelchair knows that if they get on a plane, then they're not allowed to wheel their own wheelchair around on the plane, and a flight attendant will instead be doing that for them.

How does that translate to employer MDM software?

Well, there's no clear translation currently, because we're currently in a paradigm that favors employer-issued devices.

But here's what we could do:

• Modern PCs are powerful enough that anything a corporation wants you to do, can be done in a corporation-issued VM that runs on the computer.

• The employer could then require the installation of an integrity-verification extension (essentially "anti-cheat for VMs") that ensures that the VM itself, and the hypervisor software that runs it, and the host kernel the hypervisor is running on top of, all haven't been tampered with. (If any of them were, then the extension wouldn't be able to sign a remote-attestation packet, and the employer's server in turn wouldn't return a decryption key for the VM, so the VM wouldn't start.)

• The employer could feel free to MDM the VM guest kernel — but they likely wouldn't need to, as they could instead just lock it down in much-more-severe ways (the sorts of approaches you use to lock down a server! or a kiosk computer!) that would make a general-purpose PC next-to-useless, but which would be fine in the context of a VM running only line-of-business software. (Remember, all your general-purpose "personal computer" software would be running outside the VM. Web browsing? Outside the VM. The VM is just for interacting with Intranet apps, reading secure email, etc.)

(Why yes, I am describing https://en.wikipedia.org/wiki/Multilevel_security.)

z3phyr
0 replies
9h12m

Does the switchboard in a nuclear missile launch facility run Crowdstrike? I picture it as a high quality analog circuit board that does 1 thing and 1 thing only. No way to run anything else.

Globally networked personal computers were kind of a cultural revolution against the setting you describe. Everyone had their own private compute and compute time, and everyone could share their own opinion. Computers became our personal extensions. This is what IBM, Atari, Commodore, Be, Microsoft and Apple (and later desktop Linux) sold. Now given this ideology, can a company own my limbs? If not, they can't own my computers.

eptcyka
0 replies
11h28m

Yes, that is why the owners of the computers (corps) use these tools - to maintain control over their hardware (and IP accessible on it). The end user is not the customer or user here.

giantpotato
5 replies
9h21m

By-passing the discussion whether one actually needs root kit powered endpoint surveillance software such as CS perhaps an open-source solution would be a killer to move this whole sector to more ethical standards.

As a red teamer developing malware for my team to evade EDR solutions we come across, I can tell you that EDR systems are essential. The phrase "root kit powered endpoint surveillance" is a mischaracterization, often fueled by misconceptions from the gaming community. These tools provide essential protection against sophisticated threats, and they catch them. Without them, my job would be 90% easier when doing a test where Windows boxes are included.

So the main tool would be open source and it would be transparent what it does exactly and that it is free of backdoors or really bad bugs.

Open-source EDR solutions, like OpenEDR [1], exist but are outdated and offer poor telemetry. Assembling various GitHub POCs that exist for production EDR is impractical and insecure.

The EDR sensor itself becomes the targeted thing. As a threat actor, the EDR is the only thing in your way most of the time. Open sourcing them increases the risk of attackers contributing malicious code to slow down development or introduce vulnerabilities. It becomes a nightmare for development, as you can't be sure who is on the other side of the pull request. TAs will do everything to slow down the development of a security sensor. It is a very adversarial atmosphere.

On the other hand it could still be a business model to supply malware signatures as a security team feeding this system.

It is actually the other way around. Open-source malware heuristic rules do exist, such as Elastic Security's detection rules [2]. Elastic also provides EDR solutions that include kernel drivers and is, in my experience, the harder one to bypass. Again, please make an EDR without drivers for Windows, it makes my job easier.

It could be audited by the public.

The EDR sensors already do get "audited" by security researchers and the threat actors themselves. Reverse engineering and debugging the EDR sensors to spot weaknesses that can be "abused." If I spot things like the EDR just plainly accepting kernel mode shellcode and executing it, I will, of course, publicly disclose that. EDR sensors are under a lot of scrutiny.

[1] https://github.com/ComodoSecurity/openedr

[2] https://github.com/elastic/detection-rules

manquer
4 replies
8h32m

Open sourcing them increases the risk of attackers contributing malicious code to slow down development or introduce vulnerabilities.

This is such a tired non-sequitur argument, with no evidence whatsoever to back up the claim that the risk is actually higher for open source versus closed source.

I can just as easily argue that a state or non-state actor could buy[1], bribe, or simply threaten to get weak code into a proprietary system, without users having any means to ever find out. On the other hand, it is always easier (easier, not easy) to discover compromise in open source, like it happened with xz[2], and to verify such reports independently.

If there is no proof that compromise is less likely with closed source, and it is far easier to discover compromises in open source, the logical conclusion is simply that open source is better for security libraries.

Funding defensive security infrastructure that is open source and freely available for everyone to use, even with 1/100th of the NSA's budget (which is effectively only offensive), would improve info-security enormously for everyone, not just against nation-state actors but also against scammers etc. Instead we get companies like CS that have an enormous vested interest in seeing that never happen, and in trying to scare the rest of us into believing that open source is bad for security.

[1] https://en.wikipedia.org/wiki/Dual_EC_DRBG

[2] https://en.wikipedia.org/wiki/XZ_Utils_backdoor

mardifoufs
1 replies
7h53m

I could see an open source solution with "private" or vendor specific definition files. But I think I'd disagree with the statement that open sourcing everything wouldn't cause any problem. Engineering isn't necessarily about peer reviewed studies, it's about empirical observations and applying the engineering method (which can be complemented by a more scientific one but shouldn't be confused for it). It's clear that this type of stuff is a game of cat and mouse. Attackers search for any possible vulnerability, bypass etc. It does make sense that exposing one side's machinery will make it easier for the other side to see how it works. A good example of that is how active hackers are at finding different ways to bypass Windows Defender by using certain types of Office file formats, or certain combinations of file conversions to execute code. Exposing the code would just make all of those immediately visible to everyone.

Eventually that's something that gets exposed anyways, but I think the crucial part is timing and being a few steps ahead in the cat and mouse game. Otherwise I'm not sure what kind of proof would even be meaningful here.

manquer
0 replies
4h43m

open sourcing everything wouldn't cause any problem

That is not what I am saying. I am saying open sourcing doesn't cause more problems than proprietary systems, which is the argument OP was making.

Open source is not a panacea, it is just not objectively worse as OP implies.

jpc0
1 replies
8h5m

I have a different take on this.

I feel having the solution open sourced isn't bad from a code security standpoint, but rather that it is simply not economically viable. To my knowledge most of the major open source technologies are currently funded by FAANG, purely because they are needed to conduct business, and the moment it becomes inconvenient to support them they fork them or develop their own; see Terraform/Redis...

I also cannot get behind a government funding model purely because it will simply become a design by committee nightmare because this isn't flashy tech. Just see how many private companies have beaten NASA to market in a pretty well funded and very flashy industry. The very government you want to fund these solutions are currently running on private companies infrastructure for all their IT needs.

Yes, open sourcing is definitely amazing and if executed well will be better, just like communism.

manquer
0 replies
4h49m

Plenty of fundamental research and development happens in academia fairly effectively.

Government has to fund it, not run it, just like any other grant works today. The existing foundations and non-profits like Apache, or even mixed ones like Mozilla, are fairly capable of handling the grants.

Expecting private companies or dedicated volunteers to maintain mission-critical libraries like xz is not a viable option, which is what we are doing now.

plantain
1 replies
11h46m

There is an open source alternative. GRR:

https://github.com/google/grr

Every Google client device has it.

G3rn0ti
0 replies
10h54m

It sounds really interesting. But the one thing it does not do is scan for viruses/malware, although this could be implemented using GRR, I guess. How does Google mitigate malware threats in-house?

intelVISA
1 replies
7h45m

Security isn't really a product you can just buy or outsource, but here we are.

kemotep
0 replies
7h9m

Crowdstrike is a gun. A tool. But not the silver bullet. Or training to be able to fire it accurately under pressure at the werewolf.

You can very easily shoot your own foot off instead of slaying the monster, use the wrong ammunition to be effective, or in this case a poorly crafted gun can explode in your hand when you are holding it.

ymck
0 replies
40m

There are a number of OSS EDRs. They all suck.

DAT-style content updates and signature-based prevention are very archaic. Directly loading content into memory and a hard-coded list of threats? I was honestly shocked that CS was still doing DAT-style updates in an age of ML and real-time threat feeds. There are a number of vendors who've offered it for almost a decade. We use one. We have to run updates a couple of times a year.

SMH. The 90's want their endpoint tech back.

ndr_
0 replies
9h19m

There used to be Winpooch Watchguard, based on ClamAV. Stopped using it when it caused Bluescreens. A "Killer" indeed.

cedws
0 replies
5h33m

The value CrowdStrike provides is the maintenance of the signature database, and being able to monitor attack campaigns worldwide. That takes a fair amount of resources that an open source project wouldn’t have. It’s a bit more complicated than a basic hash lookup program.

golemiprague
49 replies
16h4m

But how come they didn't catch it in the testing deployments? What was the difference that caused it to happen when they deployed to the outside world? I find it hard to believe that they didn't test it before deployment. I also think companies should all have a testing environment before deploying 3rd party components. I mean, we all install some packages during development that fail or cause some problems, but nobody thinks it is a good idea to do it directly in their production environment before testing, so how is this different?

jmb99
30 replies
14h20m

I find it hard to believe that they didn't test it before deployment.

I’m not sure why you find that hard to believe - based on the (admittedly fairly limited) evidence we have right now, it’s highly unlikely that this deployment was tested much, if at all. It seems much more likely to me that they were playing fast and loose with definition updates to meet some arbitrary SLAs[1] on zero-day prevention, and it finally caught up with them. Much more likely than somehow every single real-world pc running their software being affected but their test machines somehow all impervious.

[1] When my company was considering getting into endpoint security and network anomaly detection, we were required on multiple occasions by multiple potential clients to provide a 4-hour SLA on a wide number of CVE types and severities. That would mean 24/7 on-call security engineers and a sub-4-hour definition creation and deployment. Yes, that 4 hours was for the deployment being available on 100% of the targets. Good luck writing and deploying a high-quality definition for a zero day in 4 hours, let alone running it through a test pipeline, let alone writing new tests to actually cover it. We very quickly noped out of the space, because that was considered “normal” (at least to the potential clients we were discussing). It wouldn’t shock me if CS was working in roughly the same way here.

drooopy
22 replies
12h49m

This whole f*up was a failure of management and processes at Crowdstrike. "Intern Steve" pushing faulty code to production on a Friday is only a couple of cm of the tip of an enormous iceberg.

chronid
20 replies
12h6m

I wrote this in another thread already, but the fuck-up was both at CrowdStrike (they borked a release) and, more importantly, at their customers. Shit happens even with the best testing in the world.

You do not deploy anything, ever on your entire production fleet at the same time and you do not buy software that does that. It's madness and we're not talking about small companies with tiny IT departments here.

d1sxeyes
6 replies
11h6m

That’s a tricky one. CrowdStrike is cybersecurity. Wait until the first customer complains that they were hit by WannaCry v2 because CrowdStrike wanted to wait a few days after they updated a canary fleet.

The problem here is that this type of update (a content update) should never be able to cause this, no matter how badly it goes. In case the software receives a bad content update, it should fall back to the last known good content update (potentially with a warning fired off to CS, the user, or someone else about the failed update).

In principle, updates that could go wrong and cause this kind of issue should absolutely be deployed slowly, but per my understanding, that’s already the practice for non-content updates at CrowdStrike.
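
To make that concrete, here's a minimal sketch of falling back to the last known good content file when a new one fails a sanity check (file names and the validation step are assumptions for illustration, not CrowdStrike's actual design):

    # Sketch: try the new content file; if it fails a sanity check, warn and
    # fall back to the last known good file instead of taking the host down.
    # File names and the "parse" check are assumptions made for illustration.
    import shutil
    from pathlib import Path

    CURRENT = Path("content/current.bin")
    LAST_GOOD = Path("content/last_good.bin")

    def parse(path: Path) -> bytes:
        blob = path.read_bytes()
        if not blob or blob.count(0) == len(blob):  # e.g. reject an all-zero file
            raise ValueError("content file looks corrupt")
        return blob

    def load_content() -> bytes:
        try:
            content = parse(CURRENT)
            shutil.copyfile(CURRENT, LAST_GOOD)  # promote to known-good
            return content
        except Exception as exc:
            print(f"content update rejected ({exc}); reverting to last known good")
            return parse(LAST_GOOD)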

chronid
5 replies
10h51m

Windows updates are also cybersecurity, but the customer has (had?) a choice in how to roll those out (with Intune nowadays?). The customer should decide when to update; they own the fleet, not the vendor!

You do not know if a content update will screw you over and mark all the files of your company as malware. The "It should never happen" situations are the thing you need to prepare for, the reason we talk about security as an onion, the reason we still do staggered production releases with baking times even after tests and QA have passed...

"But it's cybersecurity" is not a justification. I know that security departments and IT departments and companies in general love dropping the "responsibility" part on someone else, but in the end of the day the thing getting screwed over is the company fleet. You should retain control and make sure things work properly, the fact those billion dollar revenue companies are unable to do so is a joke. A terrible one, since IT underpins everything nowadays.

chrisjj
4 replies
8h16m

The customer should decide when to update, they own the fleet not the vendor!

The CS customer has decided to update whenever CS says to, 24/7. The alternative is to arrive on Monday morning to an infected fleet.

chronid
3 replies
7h56m

Sorry, this is untrue. Enterprises have SOCs and on-call staff; if there is a high risk they can do at least minimal testing (which would have found this issue, as it has a 100% BSOD rate) and then do the fleet rollout. It would have been rolled out by Friday evening in this case without crashing hundreds of thousands of servers.

The CS customer has decided to offload the responsibility of its fleet to CS. In my opinion that's bullshit and negligence (it doesn't mean I don't understand why they did it), particularly at the scale of some of the customers :)

chrisjj
1 replies
4h50m

they can do at least minimal testing (which would have found this issue as it has a 100% bsod rate)

Incorrect, I believe, given they did not and could not get advance sight of the offending forced update.

Kwpolska
0 replies
4h29m

I doubt CrowdStrike had done any testing of the update.

chrisjj
0 replies
4h47m

they can do at least minimal testing (which would have found this issue as it has a 100% bsod rate)

Incorrect, I believe, given they could and did not get advance sight of the offending forced update.

stef25
3 replies
11h3m

You'd think that the software would sit in a kind of sandbox so that it couldn't nuke the whole device but only itself. It's crazy that this is possible.

echoangle
2 replies
10h36m

The software basically works as a kernel module as far as I understand, I don’t think there’s a good way to separate that from the OS while still allowing it to have the capabilities it needs to have to surveil all other processes.

temac
0 replies
8h37m

Something like eBPF.

layer8
0 replies
15m

And even then, you wouldn’t want the system to continue running if the security software crashes. Such a crash might indicate a successful security breach.

owl57
2 replies
11h20m

> you do not buy software that does that

Note how the incident disproportionately affected highly regulated industries, where businesses don't have a choice to screw "best practice".

TeMPOraL
1 replies
10h36m

Only highlighting that "best practice" of cybersecurity is, charitably, total bullshit; less charitably, a racket. This is apparent if you look at the costs to the day-to-day ability of employees to do work, but maybe it'll be more apparent now that people got killed because of it.

badgersnake
0 replies
9h35m

It’s absolutely a racket.

KaiserPro
1 replies
10h12m

You do not deploy anything, ever on your entire production fleet at the same time and you do not buy software that does that

I am sympathetic to that, but it's only possible if both policy and staffing allow.

for policy, there are lots of places that demand CVEs be patched within x hours depending on severity. A lot of times, that policy comes from the payment integration systems provider/third party.

However, you are also dependent on programs you install not auto-updating. Now, most have an option to flip that off, but it's not always 100% effective.

chronid
0 replies
9h1m

I am sympathetic to that, but its only possible if both policy and staffing allow.

We are not talking about small companies here. We're talking about massive billion-dollar-revenue enterprises with enormous IT teams and, in some cases, multiple NOCs and SOCs and probably thousands of consultants all around, at minimum.

I find it hard to be sympathetic to this complete disregard of ownership just to ship responsibility somewhere else (because that is the point at the end of the day, let's not joke around). I can understand it, sure, and I can believe - to a point - someone did a risk calculation (possibility of a CrowdStrike upgrade killing all systems vs. a hack if we don't patch a CVE in <4h), but it's still madness from a reliability standpoint.

for policy, there are lots of places that demand CVEs be patched within x hours depending on severity.

I'm pretty sure that when leadership needs to choose between production being down for an unspecified amount of time and taking the risk of delaying the patching (by hours in this case), they will choose the delay. Partners and payment integration providers can be reasoned with; contracts are not code. A BSOD you cannot talk away.

Sure, leadership is also now saying "but we were doing the same thing as everyone else, the consultants told us to, and how could we have known this random software with root on every machine we own could kill us?!" to cover their asses. The problem is solved already, since it impacted everyone, and they're not the ones spending their weekend hammering systems back to life.

However you are also dependent on programs you install not autoupdating. Now, most have an option to flip that off, but its not always 100% effective.

You choose what to install on your systems, and you have the option to refuse to engage with companies that don't provide such options. If you don't, you accept the risk.

sateesh
0 replies
9h41m

Disagree with the part where you put the onus on the customer. As has been mentioned in another HN thread [1], this update was pushed ignoring whatever settings the customer had configured. The original mistake of the customer, if any, was that they didn't read this in the fine print of the contract (if this point about updates was explicitly mentioned in the contract at all).

1. https://news.ycombinator.com/item?id=41003390

perbu
0 replies
11h56m

Shit might happen with the best testing, but with decent testing it would not be this serious.

chrisjj
0 replies
8h14m

You do not deploy anything, ever on your entire production fleet at the same time

And if an attacker does??

jmb99
0 replies
11h39m

Oh absolutely. There’s many levels of failure here. A few that I see as being likely:

- Lack of testing of a deployment
- Lack of required procedures to validate a deployment
- Engineering management prioritizing release pace over stability/testing
- Management prioritizing tech debt/pentests/etc far too low
- Sales/etc promising fast turnarounds that can’t be feasibly met while following proper standards
- Lack of top-down company culture of security and stability first, which should be a must for any security company

This outage wasn’t caused only by “the intern pushing release.” It was caused by a poor company culture (read: incorrect direction from the top) resulting in a lack of testing of the program code, lack of testing environment for deployments, lack of formal deployment process, and someone messing up a definition file that was caught by 0 other employees or automated systems.

qaq
3 replies
12h8m

While true, the agent should roll back to the previous content version if it keeps crashing.

Kwpolska
2 replies
11h8m

Detecting system crashes would be hard. You could try logging and comparing timestamps on agent startups and see if the difference is 5 minutes or less. Buggy kernel drivers crash Windows hard and fast.

qaq
0 replies
9h59m

Loading content is a pretty specific step, so your solution is more or less valid.

kchr
0 replies
1h23m

Detecting system crashes would be hard.

Store something like an `attemptingUpdate` flag before updating, and remove it if the update was successful. Upon system startup, if the flag is present, revert to the previous config and mark the new config bad.
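
A minimal sketch of that flag approach (paths, file names, and the quarantine step are illustrative assumptions, not any vendor's real mechanism):

    # Sketch: mark an update attempt before loading new content; if the machine
    # dies mid-load, the next boot sees the flag and reverts to the old content.
    # Paths and the quarantine step are illustrative assumptions.
    from pathlib import Path

    FLAG = Path("state/attemptingUpdate")
    NEW = Path("content/new.bin")
    GOOD = Path("content/good.bin")

    def load_content(path: Path) -> None:
        ...  # placeholder for the real (potentially crash-prone) content loader

    def apply_update() -> None:
        FLAG.parent.mkdir(parents=True, exist_ok=True)
        FLAG.touch()       # set before touching the new content
        load_content(NEW)  # if this crashes the box, the flag survives the reboot
        NEW.replace(GOOD)  # success: promote new content to known-good
        FLAG.unlink()      # clear the flag only after success

    def on_startup() -> None:
        if FLAG.exists():  # previous attempt never finished: assume the update is bad
            if NEW.exists():
                NEW.rename(NEW.with_suffix(".bad"))  # quarantine the new file
            FLAG.unlink()
        load_content(GOOD)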

_moof
2 replies
12h24m

I can't speak to its veracity but there's a screenshot making its way around in which Crowdstrike discouraged sites from testing due to the urgency of the update.

jmb99
0 replies
11h35m

It’s kind of hard to pitch “zero-day prevention” if you suggest people roll out definitions slowly, over the course of days/weeks. Thus making it a lot harder to charge to the moon for your service.

Now, if these sorts of things were battle tested before release, and had a (ideally decade+-long) history of stability with well-documented processes to ensure that stability, you can more easily make the argument that it’s worth it. None of those things are close to true though (and more than likely will never be for any AV/endpoint solution), so it is very hard to justify this sort of configuration.

AmericanChopper
0 replies
11h59m

I don’t work with CS products atm, but my experience with a big CS deployment was exactly like this. They were openly quite hostile to any suggestion of testing their products; we were frequently rebuked for running our prod sensors on version n-1. I talked about it a bit in this comment.

https://news.ycombinator.com/item?id=41002864

Very much not surprised to see this now.

albert_e
5 replies
11h55m

My guess -- there are two separate pipelines: one for code changes and one for data files.

Pipeline 1 --

Code updates to their software are treated as material changes that require non-production and canary testing before global roll-out of a new "Version".

Pipeline 2 --

Content / channel updates are handled differently -- via a separate pipeline -- because only new malware signatures and the like are distributed via this route. The new files are just data files -- they are supposed to be in a standard format and only read, not "executed".

This pipeline itself must have been tested originally and found to be working satisfactorily -- but inside the pipeline there is no "test" stage that verifies the integrity of the data file so generated, nor - more importantly - checks whether this new data file works without errors when deployed to the latest versions of the software in use.

The agent software that reads these daily channel files must have been "thoroughly" tested (as part of pipeline 1) for all conceivable data file sizes and simulated contents before deployment. (any invalid data files should simply be rejected with an error ... "obviously")

But the exact scenario here -- possibly caused by a broken pipeline in the second path (pipeline 2) -- created invalid data files with some quirks. And THAT specific scenario was not imagined or tested in the software version dev-test-deploy pipeline (pipeline 1).

If this is true --

The lesson obviously is that even for "data" only distributions and roll-outs, however standardized and stable their pipelines may be, testing is still an essential part before large scale roll-outs. It will increase cost and add latency sure, but we have to live with it. (similar to how people pay for "security" software in the first place)

Same lesson for enterprise customers as well -- test new distributions on non-production within your IT setup, or have a canary deployment in place before allowing full roll-outs into production fleets.
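
If this guess is right, even a crude gate at the end of pipeline 2 would help. A minimal sketch (the magic header and size threshold are invented for illustration; the real gate should feed the file to the same parser the shipping agent uses, ideally on a throwaway machine):

    # Sketch of a pre-publish sanity check for a generated content/channel file.
    # The magic header and minimum size are invented for illustration.
    import sys
    from pathlib import Path

    MAGIC = b"\xaa\xaa\xaa\xaa"  # hypothetical file header
    MIN_SIZE = 1024              # hypothetical minimum plausible size

    def validate_channel_file(path: Path) -> None:
        blob = path.read_bytes()
        if len(blob) < MIN_SIZE:
            raise ValueError("file is implausibly small")
        if not blob.startswith(MAGIC):
            raise ValueError("bad header")
        if blob.count(0) == len(blob):
            raise ValueError("file is all zero bytes")

    if __name__ == "__main__":
        try:
            validate_channel_file(Path(sys.argv[1]))
        except Exception as exc:
            print(f"channel file rejected: {exc}")
            sys.exit(1)
        print("channel file passed basic checks")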

sateesh
2 replies
11h16m

Same lesson for enterprise customers as well -- test new distributions on non-production within your IT setup, or have a canary deployment in place before allowing full roll-outs into production fleets.

It was mentioned in one of the HN threads that the update was pushed overriding the settings the customer had [1]. What recourse can any customer have in such a case?

1. https://news.ycombinator.com/item?id=41003390

teeheelol
0 replies
8h50m

Ah that was me. We don’t accept “content updates” and they are staged.

We got this update pushed right through.

perryizgr8
0 replies
9h36m

What recourse any customer can have in in such a case ?

Sue them and use something else.

rramadass
0 replies
9h9m

Nice.

But the problem here is that the code runs in kernel mode. As such, any data it may consume should have been tested with the same care as the code itself, which has never been the case in this industry.

Wytwwww
0 replies
6h42m

It will increase cost

And of course that cost would be absolutely insignificant relative to the potential risk...

treflop
4 replies
13h49m

I’ve seen places where failed releases are just “part of normal engineering.” Because no one is perfect, they say.

slenk
2 replies
13h18m

I really dislike this mentality. Don't even get me started on celebrating when your rocket blows up.

galangalalgol
0 replies
12h48m

If it is a standard production rocket, I agree. If it is a first of kind or even third of kind launch, celebrating the lessons learned from a failure is a healthy attitude. This production software is not the same thing at all.

Heliosmaster
0 replies
12h33m

SpaceX celebrating when their rocket blows up after a certain milestone is like us devs celebrating when our branch with that new big feature only fails a few tests. Did it pass? No. Are you satisfied with that as a first try? Probably.

photonthug
0 replies
9h27m

Even on hn, comments advocating engineering excellence or just quality in general are frequently looked down on, which probably also tells you a lot about the wider world.

This is why we can’t have nice things, but maybe we just don’t want them anyway? “Mistakes will be made” is way less true if you actually put the effort in to prevent them, but I am beginning to think this has become code for quiet-quitters to telegraph an “I want to get paid for no effort and sympathize with others who feel the same” sentiment and appear compassionate and grimly realistic all at the same time.

Yes, billion-dollar companies are going to make mistakes, but almost always because of cost cutting, willful ignorance, or negligence. If average people are apologizing for them and excusing that, there has to be some reason that it’s good for them.

someonehere
2 replies
14h50m

That’s what a lot of us are wondering. There’s a lot of thinking outside the box about this right now in certain circles.

IAmGraydon
1 replies
12h50m

There’s no point in leaving vague allusions. Can you expand on this?

kbar13
0 replies
12h21m

security industry's favorite language is nothingspeak

itronitron
1 replies
12h17m

for all we know, the deployment was the test

owl57
0 replies
9h12m

As the old saying goes, everyone has a test environment, and some also have a separate production one.

usrusr
0 replies
13h33m

One possible explanation could be automated testing deployments for definitions updates that don't run the current version of the definition consumer, and the old one they do run is unaffected.

masfuerte
0 replies
9h13m

I find it hard to believe they didn't do any testing. I wonder if they tested the virus signatures against the engine, but didn't check the final release artefact (the .sys file) and the bug was somehow introduced in the packaging step.

This would have been poor, but to have released it with no testing would have been the most staggering negligence.

qmarchi
37 replies
16h59m

Meta conversation: X hid both of the responses behind "Show Probable Spam" even though they were pretty valid, with one even getting a reply from the creator.

I just don't understand how they still have users.

hipadev23
21 replies
16h51m

There’s literally not a better alternative and nobody seems to be earnestly trying to fill that gap. Threads is boomer chat with an instagram requirement. Every Mastodon instance is slow beyond reason and it’s still confusing to regular users in terms of how it works. And is Bluesky still invite only? Honestly haven’t heard about it in a long time.

fragmede
7 replies
16h11m

Threads is boomer chat with an instagram requirement.

You're being too dismissive of Threads. It's fine, there are adults there.

What weirdo doesn't have an insta?

II2II
2 replies
15h54m

raises hand

Some people don't jump on every fad out there. Most of the people who miss out on fads quickly realize that they aren't losing out on much simply because fads are so ephemeral. As far as I can tell, this is normal (though different people will come to that realization at different stages of their life).

fragmede
1 replies
15h49m

Facebook is going to run threads for as long as it wants, time will tell if it's a fad or not. Is ChatGPT a fad?

II2II
0 replies
15h16m

While a fad (in this context) depends upon a company maintaining a product, the act of maintaining a product is not a measure of how long the fad lasts. Take Facebook, the product. I'm fairly certain that it is long past its peak as a communications tool between family, friends, and colleagues. Facebook, the company, remains relevant for other reasons.

As for ChatGPT, I'm sure time will prove it is a fad. That doesn't mean that LLMs are a fad (though it is too early to tell).

zdragnar
0 replies
12h57m

I don't have any social media of any kind, unless you count HN.

My wife only uses Facebook, and even then pretty sparingly.

shzhdbi09gv8ioi
0 replies
10h5m

I never had insta. Why would anyone use that.

mardifoufs
0 replies
7h49m

Sadly enough the "average" instagram user doesn't use threads. It's just a weird subset of them that use it, and imo it's not the subset that makes Instagram great lol. (It's a lot of pre 2021 twitter refugees, and that's an incredibly obnoxious and self centered crowd in my experience)

macintux
0 replies
16h6m

Some of us stay far, far away from Facebook.

ric2b
3 replies
16h43m

Mastodon doesn't feel any slower to me than Twitter, maybe I got lucky, according to you?

r2vcap
1 replies
16h17m

Maybe the experience varies depending on where the user is located. Users near Mastodon servers (possibly on the US East or West Coast) may not feel the slowness as much as users in other parts of the world. I see noticeably slower response times when I use Mastodon from my location (Korea).

robjan
0 replies
15h36m

I think a lot of people use Hetzner. I notice slowness, especially with media, in Hong Kong. A workaround I've found is to use VPNs which seem to utilise networks with better peering with local ISPs

MBCook
0 replies
16h29m

Same. I have no issues at all on Mastodon. I’m quite happy with it.

TechSquidTV
2 replies
16h14m

Mastodon is a PERFECT replacement. But it'll never win because there isn't a business propping it up and there is inherent complexity, mixed with the biggest problem, cost.

No one wants to pay for anything, and that's the true root of every issue around this. People complain YouTube has ads, but wont buy premium. People hate Elon and Twitter but won't take even an ounce of temporary inconvenience to try and solve it.

Threads exists, and I'm happy they integrate with ActivityPub, which should give us the best of both worlds. Why don't people use Threads? It's a little more popular outside the US, but personally I think the "algorithm" pushes a lot of engagement-bait nonsense.

jnurmine
0 replies
8h57m

Mastodon - mixed feelings.

In my experience, Mastodon is nice until you want to partake in discussions. To do so, you need an account.

With an account you can engage in civilized discussions. Some people don't agree with you, and you don't agree with some people. That's fine, maybe you'll learn something new. It's a discussion.

And then, suddenly, a secret court convenes and kills your account just like that; no reason will be given, no recourse will be available, admins won't reply, and you can do two things: go away for good, or try again on a different server.

I'm happy with a read-only Mastodon via a web interface.

But read-write? Never again, I probably don't have the correct ideology for it.

doodlebugging
0 replies
14h29m

No one wants to pay for anything, and that's the true root of every issue around this. People complain YouTube has ads, but wont buy premium.

Perhaps if buying into a service guaranteed that they would not be sold out then there would be more engagement. When someone signs up it is pretty much a rock-hard guarantee that their personal information will be marketed and sold to any entity with the money and interest to buy it - paying customers, free-loaders, etc.

When someone chooses to buy your app or SaaS then they should be excluded from the list of users that you sell or trade between "business partners".

When paying for a service guarantees that you're selling all details of your engagement with that service to unrelated business entities you have a disincentive to pay.

People are wising up to all this PII harvesting and those clowns who sold everyone out need to find a different model or quit bitching when real people choose to avoid their "services" since most of these things are not necessary for people to enjoy life anyway. They are distractions.

EDIT: This is not intended as a personal attack on you but is instead a general observation from the perspective of someone who does not use or pay for any apps or SaaS services and who actively avoids handing out accurate personal information when the opportunity arises.

lutoma
1 replies
16h28m

Every Mastodon instance is slow beyond reason and it’s still confusing to regular users in terms of how it works.

I'll concede the confusing part but all the major Mastodon servers I interact with regularly are pretty quick so I'm not sure where that part comes from.

Lt_Riza_Hawkeye
0 replies
16h22m

It is not so bad with Mastodon but much fedi software gets slower the longer it's been running. "Akkoma Rot" is the one that's typically most talked about but the universe of misskey forks experiences the same problems, and Mastodon can sometimes absolutely crunch to a halt on 4GB of ram even for a single user instance.

shzhdbi09gv8ioi
0 replies
10h8m

Strange take... Mastodon is where a lot of the IT discussion happens these days.

The quality-to-crap ratio is stellar on Mastodon. Not so much anywhere else.

honeybadger1
0 replies
16h47m

It is the best internet social feed to me as well. I use Pro a lot for following different communities, and there is nothing that comes close today to being on the edge of change online.

cageface
0 replies
16h32m

All the people I know that are still active on Twitter because they need to be "informed" are constantly sending me alarmist "news" that breaks on Twitter that, far more often than not, turns out to be wrong.

add-sub-mul-div
0 replies
16h24m

And is Bluesky still invite only?

Not since February. But it's for the best that the Eternal September has remained quarantined on Twitter.

ants_everywhere
5 replies
16h21m

Relatedly, it's crazy to me how many people still get their news from X. I mean serious people, not just Joe Schmoe.

The probable spam thing was nuts to me too. My guess was it's maybe trying to detect users with lower engagement. Like people who aren't moving the investigation forward but are trying to follow it and be in the discussion.

pyinstallwoes
1 replies
15h58m

Relatedly, it’s crazy to me how many people still get news from the Sunday times!

jen729w
0 replies
15h46m

Relatedly, it's crazy to me how many people still read the news!

AnthonyMouse
1 replies
15h43m

One of the things to keep in mind is that Twitter had most of these misfeatures before Musk bought it.

The basic problem is, no moderation results in a deluge of spam and algorithmic moderation is hot garbage that can only filter out the bulk of the spam by also filtering out like half of the legitimate comments. Human moderation is prohibitively expensive unless you want to hire Mechanical Turk-level moderators and not give them enough time to do a good job, in which case you're back to hot garbage.

Nobody really knows how to solve it outside of the knob everybody knows about that can improve the false negative rate at the expense of the false positive rate or vice versa. Do you want less ham or more spam?

ants_everywhere
0 replies
5h41m

I agree the problem is hard from a technical level.

The problem is also getting significantly worse because it's trivial to generate entire pages of inorganic content with LLMs.

The backstories of inorganic accounts are also much more convincing now that they can be generated by LLMs. Before LLMs, backstories all focused on a small handful of topics (e.g. sports, games) because humans had to generate them from playbooks of best practices. Now they can be into almost anything.

ungreased0675
0 replies
15h29m

When something big happens, Twitter is probably the best place to get real time information from people on location.

Most everything else goes through a filter and pasteurization before public consumption.

mardifoufs
2 replies
15h48m

If bad spam detection was such a big issue for a social platform, YouTube wouldn't be used by anyone ;). In fact it's even worse on YouTube, it's the same pattern of accounts with weird profile pictures copy pasting an existing comment as is and posting it, for thousands of videos, and it's been going on for a year now. It's actually so basic that I really wonder if there's some other secret sauce to those bots to make them undetectable.

omoikane
1 replies
15h19m

Well if it's just the comments, I think a lot of people just don't read those. In fact, it's a fair bit of effort just to read the descriptions with the YouTube app on some devices (e.g. smart TVs), and it's really not worth the effort to read the comments when users can just move on to the next video.

mardifoufs
0 replies
13h56m

I don't necessarily think that's true anymore. YouTube comments are important to the algorithm so creators are more and more active in the comment section, and the comments in general have been a lot more alive and often add a lot of context or info for some type of videos. YouTube has also started giving the comments a lot more visibility in the layout (more than say, the video description). But you're probably right w.r.t platforms like TVs.

Before this wave of insane bot spam, the comments had started to be so much better than what they used to be (low effort, boomer spam). In fact I think they were much better than the absolute cringy mess that comments on dedicated forums like Reddit turned into

wrycoder
0 replies
15h50m

When I see that, I usually upvote it.

honeybadger1
0 replies
16h53m

I believe that is dependent on your account settings. I block all comments on accounts that do not have a verified phone number as an example and they get dropped into that.

fireflies_
0 replies
16h53m

I just don't understand how they still have users.

Because this post is here and not somewhere else. Strong network effects.

dclowd9901
0 replies
16h15m

I had to log in to see responses. Pretty sure that’s how they still have users.

ascorbic
0 replies
11h54m

I'd go so far as to say that almost all the responses I see under "probable spam" are legitimate. Meanwhile real spam is everywhere in the replies, and most ads are dropshipped crap and crypto scams with community notes. It's far worse than it's ever been before.

Jimmc414
0 replies
16h13m

I use X solely for the AI discussions and I actively curate who I follow, but where is there a better platform to join in conversations with the top 500 people in a particular field?

I always assumed that the reason legit answers often fall under "Show probable spam" is because of the inevitable reports coming in on controversial topics. It seems like the community notes feature works well most of the time.

siscia
35 replies
9h5m

The thing I don't understand about all of this is another, much less technical and much more important.

Why was the blast radius so huge?

I have deployed much less important services much more slowly with automatic monitoring and rollback in place.

You first deploy to beta, where you don't get customer traffic; if everything goes right, to a small part of your fleet; and then slowly increase the percentage of hosts that receives the update.

This would have stopped the issue immediately, and somehow I thought it was common practice...
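
As a sketch of the mechanics (names and percentages made up, not how any particular vendor does it): hash a stable host ID together with the update ID and compare against the current rollout percentage, so each stage just widens the cohort and a halted rollout stops expanding.

    // staged_rollout.cpp -- minimal sketch of percentage-based cohorting.
    #include <cstdint>
    #include <functional>
    #include <string>

    // True if this host should receive the given update at the current stage.
    bool inRolloutCohort(const std::string& hostId, const std::string& updateId, uint32_t percent) {
        // Hashing host+update keeps each host's cohort stable for one update,
        // while different updates get different cohorts.
        const auto h = std::hash<std::string>{}(hostId + ":" + updateId);
        return (h % 100) < percent;   // stages might be 1%, 10%, 50%, 100%
    }

Pair that with automated health signals between stages (crash rates, agents failing to check back in) and an automatic halt, and a bad artifact hurts a sliver of the fleet instead of all of it.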

vbezhenar
28 replies
8h51m

It wasn't software update. It was signature database update. It's supposed to roll out as fast as possible. When you learn about new virus, it's already in the wild, so every minute counts. You don't want to delay update for a day just to find out that your servers were breached 20 hours ago.

TeMPOraL
13 replies
8h17m

We can see clearly now that this is a stupid approach. Viruses don't move that fast.

This situation is akin to the immune system overreacting and melting the patient in response to a papercut. This sometimes happens, but it's considered a serious medical condition, and I believe the treatment is to nuke someone's immune system entirely with hard radiation, and reinstall a less aggressive copy. Take from that analogy what you want.

orf
11 replies
7h38m

Viruses don't move that fast

Yes they do? And it’s more akin to a shared immune system than a single organism.

In this case, it’s not like viruses move fast relative to the total population of machines, but within the population of machines being targeted they do move fast.

proveitbh
9 replies
5h54m

Cite one virus that crashed the supposed 10 or 100 million machines in 70 minutes.

Just one.

hello_moto
2 replies
3h36m

The malware doesn't need to infect 100 million machines.

It just needs to infect 200k devices to get to the pot: a hundred million dollars of ransomware.

TeMPOraL
1 replies
3h4m

It's a trivial cost to pay if the alternative is CrowdStrike inflicting billions of dollars of damage and loss of life across several countries.

(I expect this to tally up to double-digit billions and thousands of lives lost directly to the outages when the dust settles.)

hello_moto
0 replies
2h50m

Trivial cost to pay from which side?

The organization like MGM and London Drugs?

echoangle
2 replies
4h33m

Can you explain why you find this idea of fast moving viruses so improbable? Just from the way the internet works, I wouldn’t be surprised if every reachable host could be infected in a few hours if the virus can infect a machine in a short time (a few seconds) and would then begin infecting other machines. Why is that so hard to imagine?

SoftTalker
1 replies
2h12m

Proper firewalling for one. "Every reachable host" should be a fairly small set, ideally an empty set, when you're on the outside looking in.

And operating systems aren't that bad anymore. You don't have services out of the box opening ports on all the interfaces, no firewalls, accepting connections from everywhere, and using well-known default (or no) credentials.

Even stuff like the recent OpenSSH bug that is remotely exploitable and grants root access wasn't anything close to this kind of disaster because (a) most computers are not running SSH servers on the public internet (b) the exploit is rather difficult to actually execute. Eventually it might not be, but that gives people a bit of breathing space to react.

Most cyberattacks use old, unpatched vulnerabilities against unprotected systems, combined with social engineering to get the payload past the network boundary. If you are within a pretty broad window of "up to date" on your OS and antivirus updates, you are pretty safe.

echoangle
0 replies
1h49m

The focus seems to have been the time limit though. All the reasons you mention are just that there aren’t even that many targets.

orf
0 replies
5h27m

Microsoft puts the count at 8.5 million computers. So, percentage-wise, the MyDoom virus in 2004 infected a far greater % of computers in a month, which in the context of internet penetration, availability, and speeds in 2004 (40kb/s average, 450kb/s fastest) was about as fast as it could have spread. So it might as well have been 70 minutes, given that downloading a 50mb file on dial-up would take way longer than 70 mins.

To the smart people below:

It’s clear to everyone that 70 minutes is not 1 month. The point is that it’s not a fair comparison: it would simply not have been possible to infect that many computers in 70 minutes: the internet infrastructure just wasn’t there.

It’s like saying “the Spanish flu didn’t do that much damage because there were fewer people on the planet” - it’s a meaningless absolute comparison, whereas the relative comparison is what matters.

nullindividual
0 replies
55m

https://www.caida.org/catalog/papers/2003_sapphire/

[SQL] Slammer spread incredibly quickly, even though the vulnerability was patched in the prior year.

As it began spreading throughout the Internet, it doubled in size every 8.5 seconds. It infected more than 90 percent of vulnerable hosts within 10 minutes.

Worms are not technically viruses, but they can have similar impacts/perform similar tasks on an infected host.

8organicbits
0 replies
5h21m

ILOVEYOU is a pretty decent contender, although the Internet was smaller back then and it didn't "crash" computers, it did different damage. Computer viruses and worms can spread extremely quickly.

infected millions of Windows computers worldwide within a few hours of its release

See: https://en.wikipedia.org/wiki/Timeline_of_computer_viruses_a...

TeMPOraL
0 replies
7h2m

Still, better to let them spread a bit and deal with the localized damage than risk nuking everything. There is such a thing as treatment that's very effective, but not used because of a low probability risk of terminal damage.

pyeri
6 replies
8h11m

But why does a signature database update have to mess with the kernel in any kind of way? Shouldn't such a database stay in the user land?

theshrike79
4 replies
7h21m

The scanner is a Ring 0[0] program. Windows only has two options, 0 and 3. Ring 3 won't work for any kind of security scanner, so they're forced to use 0.

The proper place would be Ring 1, which doesn't exist on Windows.

And being a kernel-level operation, it has the capability to crash the whole system before the actual OS has any chance to intervene.

[0] https://en.wikipedia.org/wiki/Protection_ring

leosarev
3 replies
6h6m

Why is so?

hello_moto
0 replies
3h26m

That's a question for Microsoft OS architects

benchloftbrunch
0 replies
2h40m

Historical reasons. Windows NT was designed to support architectures with only two privilege rings.

vbezhenar
0 replies
7h35m

Because the kernel needs to parse the data in some way, and that parser apparently was broken enough. Whether it could be done in a more resilient manner, I don't know; you need to remember that an antivirus works in a hostile environment and can't necessarily trust userspace, so they probably need to verify signatures and parse payloads in kernel space.

jrochkind1
1 replies
5h36m

Yup. If they were delaying the update for half of their customers by 24 hours, and in that 24 hours some of their customers got hacked by a zero day, say leading to ransomware, the comment threads would be demanding their heads for that!

sateesh
0 replies
5h22m

Even if it is a staged rollout, why would one do it in 24-hour phases? It could be an hourly (say) staggered rollout too.

Ensorceled
1 replies
6h27m

Surely there is a happy medium between zero (nil, none, nada, zilch) staging and 24 hours of rolling updates? A single 30-second or so VM test would have revealed this issue.

layer8
0 replies
25m

There should have been a test catching the error before rollout. However, this doesn’t require a staged rollout as suggested by the GP comment (testing the update on some customers, who would still be hosed in that case); it only requires executing the test before the rollout.

siscia
0 replies
7h13m

Thanks for the clarification, this makes more sense.

sateesh
0 replies
5h25m

It doesn't matter what kind of update it was: signature, content, etc. The only thing that matters is whether the update has the potential to disrupt the user's normal activity (let alone brick the host); if yes, ensure it either works or has a staged rollout with a remediation plan.

LeonB
0 replies
6h52m

It’s quite impressive really — crowdstrike were deploying a content update to all of their servers to warn them of the “nothing but nulls, anti-crowdstrike virus”

Their precognitive intelligence suggested that a world wide attack was only moments away. The same precognitive system showed that the virus was so totally incapacitating that the only safe response was to incapacitate the server.

Knowing that the virus was capable of taking down every crowdstrike server, they didn’t waste time trying it on a subset of servers.

When you know you know.

rplnt
1 replies
3h38m

It's answered in the post (in the thread) as well. But for comparison, when I worked for an AV vendor we pushed maybe 4 updates a day to a much bigger customer base (if the numbers reported by MS are true).

kchr
0 replies
2h26m

I'm curious, what did your deployment plan look like? Phased/staggered, if so how?

robxorb
0 replies
8h49m

"Blast radius" seems... apt.

It would be rather easier to understand and explain if it were intentional. Likely not able to be discussed though.

Anyone able to do that here?

moogly
0 replies
9h1m

They don't seem to dogfood their own software. They don't seem to think it's very useful software in their own org, I guess.

andy81
0 replies
8h54m

Even if there was a canary release process for code updates, the config updates seem to have been on a separate channel.

The expectation being that people want up-to-date virus detection rules constantly even if they don't want potentially breaking changes.

The missed edge case being an untested config that breaks existing code.

Source: Pure speculation, don't quote this in news articles.

INTPenis
0 replies
8h59m

Considering the impact this incident had, they definitely should have a large staging environment of Windows clients to deploy to first.

There are so many ways to avoid this issue, or at least minimize the risk of it happening, but as always profits come before people.

Fr0styMatt88
34 replies
16h33m

The scarier thought I've had -- if a black hat had discovered this crash case, could it have been turned into a widely deployed code execution vulnerability?

phire
17 replies
16h3m

No.

To trigger the crash, you need to write a bad file into C:\Windows\System32\drivers\CrowdStrike\

You need Administrator permissions to write a file there, which means you already have code execution permissions, and don't need an exploit.

The only people who can trigger it over network are CrowdStrike themselves... Or a malicious entity inside their system who controls both their update signing keys, and the update endpoint.

cyrnel
9 replies
15h53m

Anyone know if the updates use outbound HTTPS requests? If so, those companies that have crappy TLS terminating outbound proxies are looking juicy. And if they aren't pinning certs or using CAA, I'm sure a $5 wrench[1] could convince one of the lesser certificate authorities to sign a cert for whatever domain they're using.

[1]: https://xkcd.com/538/

phire
4 replies
15h47m

The update files are almost certainly signed.

Even if the HTTPS channel is compromised with a man-in-the-middle attack, the attacker shouldn't be able to craft a valid update, unless they also compromised CrowdStrke's keys.

However, the fact that this update apparently managed to bypass any internal testing or staging release channels makes me question how good CrowdStrike's procedures are about securing those update keys.

cyrnel
2 replies
15h21m

Depends on when/how the signature is checked. I could imagine a signature being embedded in the file itself, or the file being partially parsed before the signature is checked.

It's wild to me that it's so normal to install software like this on critical infrastructure, yet questions about how they do code signing are a closely guarded/obfuscated secret.
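
For what it's worth, the safe ordering is easy to state even without seeing the vendor's code: authenticate the raw bytes first, then parse. A stubbed-out sketch (none of these function names are real APIs, and the crypto and parser are stand-ins; only the ordering matters):

    // Sketch only: verify-then-parse, with placeholder implementations.
    #include <cstdint>
    #include <vector>

    // Placeholder for a real check, e.g. RSA or Ed25519 against a pinned public key.
    static bool verifySignature(const std::vector<uint8_t>& bytes,
                                const std::vector<uint8_t>& sig) {
        return !sig.empty();   // stand-in body
    }

    // Placeholder for the actual (complex, bug-prone) definition parser.
    static bool parseDefinitions(const std::vector<uint8_t>& bytes) {
        return !bytes.empty(); // stand-in body
    }

    bool loadContentFile(const std::vector<uint8_t>& bytes, const std::vector<uint8_t>& sig) {
        if (!verifySignature(bytes, sig)) return false; // authenticate the raw bytes first...
        return parseDefinitions(bytes);                 // ...and only then interpret them
    }

Note that even the correct ordering only protects against tampering in transit; it does nothing if the vendor signs and ships a broken file, which is what appears to have happened here.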

phire
0 replies
13h42m

Sure, it's certainly possible.

Though, I prefer to give people benefit of doubt for this type of thing. IMO, the level of incompetence to parse a binary file before checking the signature is significantly higher (or at least different) than simply pushing out a bad update (even if the latter produces a much more spectacular result).

Besides, we don't need to speculate. We have the driver. We have the signature files [1]. Because of the publicity, I bet thousands of people are throwing it into binary RE tools right now, and if they are doing something as stupid as parsing a binary file before checking its signature (or not checking a signature at all), I'm sure we will hear about it.

We can't see how it was signed because that's happening on CrowdStrike's infrastructure, but checking the signature verification code is trivial.

[1] Both in this zip file: https://drive.google.com/file/d/1OVIWLDMN9xzYv8L391V1ob2ghp8...

jmb99
0 replies
14h5m

Kind of a side tangent, but I’m currently (begrudgingly) working on a project with a Fortune 20 company that involves a complicated mess of PKI management, custom (read: non-standard) certificates, a variety of management/logging/debugging keys, and (critically) code signing. It’s taken me months of pulling teeth just to get details about the hierarchy and how the PKI is supposed to work from my own coworkers in a different department (who are in charge of the project), let alone from the client. I still have absolutely 0 idea how they perform code signing, how it’s validated, or how I can test that the non-standard certificates can validate this black-hole-box code signing process. So yeah, companies really don’t like sharing details about code signing.

emmelaich
2 replies
15h11m

My speculation is that the bit of code/data that was broken is added after the build and testing, precisely to avoid the $5 wrench attack.

That is, the data is signed and they don't want to use the real signing key during testing / in the continuous build because then it is too exposed.

So it's added after as something that "could not break". But it of course did.

phire
1 replies
13h7m

I can think of a bunch of different answers:

This wasn't a code update, just a configuration update. Maybe they don't put config updates through QA at all, assuming they are safe.

It's possible that QA is different enough from production (for example debug builds, or signature checking disabled) that it didn't detect this bug.

Might be an ordering issue, and that they tested applying update A then update B, but pushed out update B first.

The fact that it instantly went out to all channels is interesting. Maybe they tested it for the beta channel it was meant for (and it worked, because that version of the driver knew how to cope with that config) but then accidentally pushed it out to all channels, and the older versions had no idea what to do with it.

Or maybe they thought they were only sending it to their QA systems but pushed the wrong button and sent it out everywhere.

emmelaich
0 replies
9h46m

This wasn't a code update, just a configuration update

Configuration is data, data is code.

gruez
0 replies
15h44m

that's assuming they don't do cert pinning. Moreover despite all the evil things you can supposedly do with a $5 wrench, I'm not aware of any documented cases of this sort of attack happening. The closest we've seen are misissuances seemingly caused by buggy code.

Animats
5 replies
15h40m

How does it validate the updates, exactly?

Microsoft supposedly has source IP addresses known by their update clients, so that DNS spoofing won't work.

FreakLegion
2 replies
15h10m

Microsoft signs its updates. There's no restriction on where you can get them from.

ffhhj
1 replies
14h52m

Microsoft has previously leaked their keys.

FreakLegion
0 replies
13h31m

Not that I recall.

Microsoft has leaked keys that weren't used for code signing. I've been on the receiving end of this actually, when someone from the Microsoft Active Protections Program accidentally sent me the program's email private key.

Microsoft has been tricked into signing bad code themselves, just like Apple, Google, and everyone else who does centralized review and signing.

Microsoft has had certificates forged, basically, through MD5 collisions. Trail of Bits did a good write-up of this years ago.

But I can't think of a case of Microsoft losing control of a code signing key. What are you referring to?

Randor
1 replies
14h35m

As a former member of the Windows Update software engineering team, I can say this is absolutely false. The updates are signed.

Animats
0 replies
12h3m

I know they are signed. But is that enough?

Attackers today may be willing to spend a few million dollars to access those keys.

jackjeff
0 replies
10h2m

If you have a privilege escalation vulnerability, there are worse things you can do. Just make the system unbootable by destroying the boot sector/EFI partition and overwriting system files. No more rebooting into safe mode and no more deleting a single file to fix the boot.

This would probably be classified as a terrorist attack, and frankly it’s just a matter of time until we get one some day. A small dedicated team could pull it off. It just so happens that the people with the skills currently either opt for cyber criminality (crypto lockers and such), work for a state actor (think Stuxnet), or play defense in a cybersecurity firm.

naveen99
9 replies
16h15m

The hard part is the deploying. Yes, if you can get control of the CrowdStrike deployment machinery, you can do whatever you want on hundreds of millions of machines. But you don’t need any vulnerabilities in the deployed CrowdStrike software for that, only control of the deployment servers.

tranceylc
7 replies
15h58m

Call me crazy but that is a real worry for me, and has been for a while. How long until we see some large corporate software have their deployment process hijacked, and have it affect a ton of computers that auto-update?

btown
3 replies
15h48m

One of the most dangerous versions of this IMO is someone who compromises a NPM/Pypi package that's widely used as a dependency. If you can make it so that the original developer doesn't know you've compromised their accounts (spear-phished SIM swap + email compromise while the target is traveling, for instance, or simply compromising the developer themselves), you don't need every downstream user to manually update - you just need enough projects that aren't properly configured with lockfiles, and you've got code execution on a huge number of servers.

I'm hopeful that the fallout from Crowdstrike will be a larger emphasis on software BOM risk - when your systems regularly phone home for updates, you're at the mercy of the weakest link in that chain, and that applies to CI/CD and end user devices alike.

IncreasePosts
1 replies
15h9m

It makes me wonder how many core software libraries to modern infrastructure could be compromised by merely threatening a single person.

jmb99
0 replies
14h1m

As always, a relevant xkcd[1]. I would not be surprised if the answer to “how many machines can be compromised in 24 hours by threatening one person” was less than 8 figures. If you can find the right person, probably 9+.

[1] https://xkcd.com/2347/

leni536
0 replies
9h41m

Just compromise one popular vim plugin and you have dev access to half of the industry.

spydum
0 replies
15h54m

I mean, isn't that roughly the solarwinds story? There is no real shortage of supply chain incidents in the last few years. The reality is we are all mostly okay with that tradeoff.

inferiorhuman
0 replies
15h26m

   if you can get control of the crowdstrike deployment machinery
Or combine a lack of certificate pinning with BGP hijacking.

plorkyeran
3 replies
16h21m

Shockingly it turns out that installing a rootkit can have some negative security implications.

llm_trw
2 replies
15h54m

Trying to explain to execs that giving someone root access to your computers means they have root access to your computers is surprisingly difficult.

tonetegeatinst
0 replies
15h15m

I mean, kernel-level access does provide features not accessible in userspace. Is it also overused when other solutions exist? You bet.

Most people don't need this stuff. Just keep shit up to date. No, not on the nightly build branch, but like installing Windows updates at least a day or two after they come out. Or maybe regular antivirus scans.

But let's be honest, your kernel drivers are useless if your employees fall for phishing or social engineering. See, then it's not malware, it's an authorized user on the system... just copying data onto a USB drive, or a rogue employee taking your customer list to your competition. That fancy-pants kernel driver might be really good at stopping sophisticated threats, and I'm sure the marketing majors at any company cram products full of buzzwords. But remember, you can't fix incompetent or malicious employees unless you're taking steps to prevent it.

What's more likely: some foreign government hacking Kohl's? Or a script kiddie social-engineering some poor worker by pretending to be the support desk?

Not here to shit on this product; it has its place and it obviously does a good job... (heard it's expensive, but most XDR/EDR is)

Seems like we are learning how vulnerable certain things are once again. As a fellow security fellow, I must say that Jia Tan must be so envious that he couldn't have this level of market impact.

rdtsc
0 replies
12h40m

Start a story for them: "and then, the hackers managed to install a rootkit which runs in kernel mode. The rootkit has a sophisticated C2 mechanism, with configuration files pretending to be drivers suffixed with .sys extensions. And then, they used that to prevent hospitals and 911 systems around the world from working, resulting in delayed emergency responses, injuries, possibly deaths".

After they cuss the hackers under their breath exclaiming something like: "they should be locked up in jail for the rest of their lives!...", tell them that's exactly what happened, but CS were the hackers, and maybe they should reconsider mandating installing that crap everywhere.

Murky3515
0 replies
16h9m

Probably would've been used to mine bitcoin before it was patched.

MBCook
0 replies
16h30m

I had that same one. If loading a file crashed the kernel module, could it have been exploitable? Or was there a different exploitable bug in there?

Did any nation states/other groups have 0-days on this?

Did this event reveal something known to the public, or did this screw up accidentally protect us from someone finding + exploiting this in the future?

nickm12
31 replies
14h45m

It's really difficult to evaluate the risk the CrowdStrike system imposed. Was this a confluence of improbable events or an inevitable disaster waiting to happen?

Some still-open questions in my mind:

- was the broken rule in the config file (C-00000291-...32.sys) human authored and reviewed or machine-generated?

- was the config file syntactically or semantically invalid according to its spec?

- what is the intended failure mode of the kernel driver that encounters an invalid config (presumably it's not "go into a boot loop")?

- what automated testing was done on both the file going out and the kernel driver code? Where would we have expected to catch this bug?

- what release strategy, if any, was in place to limit the blast radius of a bug? Was there a bug in the release gates or were there simply no release gates?

Given what we know so far, it seems much more likely that this was a "disaster waiting to happen" but I still think there's a lot more to know. I look forward to the public post-mortem.

Guthur
13 replies
13h48m

The glaring question is how and why it was rolled out everywhere all at once?

Many corporations have pretty strict rules on system update scheduling so as to ensure business continuity in situations like this, but all of those were completely circumvented and we had a fully synchronised global failure. It really does not seem like a business-as-usual situation.

xvector
4 replies
13h30m

CrowdStrike's reasoning is that an instantaneous global rollout helps them protect against rapidly spreading malware.

However, I doubt they need an instantaneous rollout for every deployment.

kijin
1 replies
12h49m

Well, millions of PCs bluescreening at the same time does help stop a rapidly spreading malware.

Only this time, crowdstrike itself has become indistinguishable from malware.

imtringued
0 replies
11h3m

When I first saw news about the outage I was wondering what this malware "CrowdStrike" was. I mean, the name kind of sounds hostile.

slenk
0 replies
13h20m

I feel like they need to at least roll out to themselves first.

TeMPOraL
0 replies
10h17m

They say that, but all I hear is immune system triggering a cytokine storm and killing you because it was worried you may catch a cold.

inejge
4 replies
13h13m

The glaring question is how and why it was rolled out everywhere all at once?

Because the point of these updates is to be rolled out quickly and globally. It wasn't a system/driver update, but a data file update: think antivirus signature file. (Yes, I know it can get complicated, and that AV signatures can be dynamic... not the point here.)

Why those data updates skipped validity testing at the source is another question, and one that CrowdStrike better be prepared to answer; but the tempo of redistribution can't be changed.

Brybry
2 replies
12h37m

But is there a need for quick global releases?

Is it realistic that there's a threat actor that will be attacking every computer on the whole planet at once?

I can understand that it's most practical to update everyone when pushing an update to protect a few actively under attack but I can also imagine policies where that isn't how it's done, while still getting urgent updates to those under attack.

padjo
1 replies
11h9m

Is there a need? Maybe, possibly, depends on circumstances.

Is this what people are paying CS for? Absolutely.

RowanH
0 replies
9h50m

After this I imagine there will be an option "do you want updates immediately, or updates when released - n, or n+2, n+6, n+24, n+48 hrs?"

Given the choice, I bet there's going to be a surprisingly large number of orgs going "we'll take n+24hrs, thanks".

maeil
0 replies
12h36m

A customer should be able to test an update, whether a signature file or literally any kind of update, before rolling it out to production systems. Anything else is madness. Being "vulnerable" for an extra few hours carries less risk than auto-updates (of any kind) on production systems. As we've seen here. If you can point to hard evidence to the contrary, where many companies were saved just in time because of a signature update and would have been exploited if they'd waited a few hours, I'd love to read about it. It would have to have happened on a rather large scale for all of the instances combined to have had a larger positive impact than this single instance.

hmottestad
0 replies
11h58m

From the article on The Verge it seems that this kind of update is downloaded automatically even if you disable automatic updates. So those users who took this kind of issue seriously would have thought that everything was configured correctly to not automatically update.

danielPort9
0 replies
7h56m

The glaring question is how and why it was rolled out everywhere all at once?

Because it has worked well for them so far? There are plenty of companies that do the same and we don’t hear about them until something goes wrong.

chii
0 replies
13h30m

strict rules on system update scheduling

which CrowdStrike gets to bypass because they claim to be an antivirus and malware detection platform - at least, this is what the executives they've wined and dined into the purchase contracts have been told. The update schedule is independently controlled by CrowdStrike rather than by a system admin, I believe.

hdhshdhshdjd
8 replies
14h24m

Was somebody trying to install an exploit or back door and fucked up?

TechDebtDevin
7 replies
14h6m

Everything is a conspiracy now eh?

hdhshdhshdjd
5 replies
13h57m

You do remember Solarwinds right? This is an obvious high value target, so it is reasonable to entertain malicious causes.

Given the number of systems infected, if you could push code that rebooted every client into a compromised state you’d still have run of some % of the lot until it was halted. That time window could be invaluable.

Now, imagine if you screw up the code and just boot loop everything.

I’d say business-wise it’s better for CrowdStrike to let people think it’s an own-goal.

The truth may be mundane but a hack is as reasonable a theory as “oops we pushed boot loop code to world+dog”.

saagarjha
4 replies
13h44m

The truth may be mundane but a hack is as reasonable a theory as “oops we pushed boot loop code to world+dog”.

No it's not. There are many signs that point to this being a mistake. There are very few that point to it being a hack. You can't just go "oh it being a hack is one of the options therefore it is also something worth considering".

azinman2
1 replies
13h14m

Especially because if it was, CrowdStrike wouldn’t be apologizing and accepting blame.

owl57
0 replies
11h1m

Why? They are in a very specific business and have more incentive to cover up successful attacks than most other companies.

And while I'm 99% for Hanlon's razor here, I don't see a reason to be sure it wasn't even a completely successful DoS attack.

Huggernaut
1 replies
12h14m

Look there are two options on the table so it's 50/50. Ipso facto.

bunabhucan
0 replies
11h37m

I believe the flying spaghetti monster touched the file with His invisible noodly appendage so now it's a three way split.

refulgentis
4 replies
14h36m

Would any of these, or even a collection of these, resolving in some direction make it highly improbable that it'll happen again?

Seems to me 3rd-party code, running in the kernel, on parsed inputs, that can be remotely updated is enough to be a disaster waiting to happen (gestures breezily at Friday).

That's, in Taleb parlance, a Fat Tony argument, but barring it being a cosmic ray causing an uncorrected bit flip during deploy, I don't think there's room to call it anything but "a disaster waiting to happen".

slt2021
1 replies
14h0m

The kernel driver could have a data check on the channel file and fail gracefully / ignore a bad file instead of BSODing.

This code is executed only once during driver initialization, so it shouldn't add much overhead, but it would greatly improve resilience against a broken channel file.
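
Something like this, very roughly (made-up names, not real driver code), is all "fail gracefully" needs to mean:

    // Sketch: validate the channel file up front and degrade instead of trusting it.
    #include <cstddef>
    #include <cstdint>

    enum class LoadResult { Ok, Rejected };

    LoadResult loadChannelFile(const uint8_t* data, size_t len) {
        // Cheap structural checks first; an all-zero or truncated file fails here.
        if (data == nullptr || len < 2 * sizeof(uint32_t)) return LoadResult::Rejected;
        // ...then magic/version/entry-count checks, with every offset bounds-checked against len...
        return LoadResult::Ok;
    }

    void onContentUpdate(const uint8_t* data, size_t len) {
        if (loadChannelFile(data, len) == LoadResult::Rejected) {
            // Keep running on the previous known-good rules and report the failure,
            // rather than dereferencing whatever the bad file implied.
            return;
        }
        // ...activate the new rules...
    }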

refulgentis
0 replies
12h29m

This is going to code as radical, but I always assumed it was derivable from bog-standard first principles that would fit in any economics class I sat in for my 40 credits:

the natural cost of these bits we sell is zero, so in the long run, if the bar is "just write a good & tested kernel driver", there will always be one more subsequent market entrant who will go too cheap on engineering. Then, they touch the hot wire and burn down the establishment.

That doesn't mean capitalism bad, but it does mean I expect only Microsoft is capable of writing and maintaining this type of software in the long run.

Ex. The dentist and dental hygienist were asking me who was attacking Microsoft on Friday, and they were not going to get through to the subtleties of 3rd-party kernel driver release gating strategy.

MS has a very strong incentive to fix this. I don't know how they will. But I love when incentives align and assume they always will, in the long run.

nickm12
0 replies
12h29m

Yes, if CrowdStrike was following industry best practices and this happened, it would teach us something novel about industry practices that we could learn from and use to reduce the risk of a similar scale outage happening again.

If they weren't following these practices, this is kind of a boring incident with not much to be learned, despite how dramatic the scale is. Practices like staged rollout of changes exist precisely because we've learned these lessons before.

YZF
0 replies
12h1m

Well, kernel code is kernel code, and kernel code in general takes input from outside the kernel. An audio driver takes audio data, a video driver might take drawing instructions, a file system interacts with files, etc. Microsoft, and others, have been releasing kernel code since forever and for the most part, not crashlooping their entire install base.

My Tesla remote updates ... hmph.

It doesn't feel like this is inherently impossible. It feels more like not enough design/process to mitigate the risks.

7952
1 replies
9h28m

In a world of complex systems, a "confluence of improbable events" is the same thing as "a disaster waiting to happen". It's the Swiss cheese model of failure.

k8sToGo
0 replies
8h33m

Every system can only survive so many improbable events. Even in aviation.

YZF
0 replies
12h8m

It seems like a none-of-the-above situation, because each of those should have really minimized the chances of something like this happening. But this is pure speculation. Even the most perfect organization and engineering culture can still have one thing get through... (Wasn't there some Linux incident a little while back, though?)

Quality starts with good design, good people, etc.; the process parts come much after that. I'd like to think that if you do this "right" then this sort of stuff simply can't happen.

If we have organization/culture/engineering/process issues then we're likely not going to get an in-depth public post-mortem. I'd love to get one, just for all of us to learn from it. Let's see. Given the cost/impact, having something like the Challenger investigation with some smart uninvolved people would be good.

blirio
23 replies
17h6m

So is unmapped address another way of saying null pointer?

two_handfuls
18 replies
17h5m

It’s an invalid pointer yes, but it doesn’t say whether it’s null specifically.

jeffbee
8 replies
16h46m

"Attempt to read from address 0x9c" doesn't strike me as "null pointer". It's an invalid address and it doesn't really matter if it was null or not.

jmb99
2 replies
14h30m

As an example to illustrate the sibling comments’ explanations:

char *array = NULL;

int pos = 0x9C;

char a = array[pos]; // equivalent to *(array + 0x9C) - dereferencing NULL + 0x9C, which is just address 0x9C

This will segfault (or equivalent) due to reading invalid memory at address 0x9C. Most people would casually call array[pos] a null pointer dereference, even though it’s actually a 0x9C pointer dereference, because there’s very little effective difference between them.

Now, whether this case was actually something like this (dereferencing some element of a null array pointer) or something like type confusion (value 0x9C was supposed to be loaded into an int, or char, or some other non-pointer type) isn’t clear to me. But I haven’t dug into it really, someone smarter than me could probably figure out which it is.

jeffbee
0 replies
3h2m

What we are witnessing quite starkly in this thread is that the majority of HN commenters are the kinds of people exposed to anti-woke/DEI culture warriors on Twitter.

GeneralMayhem
1 replies
16h44m

0x9c (156 dec) is still a very small number, all things considered. To me that sounds like attempting to access an offset from null - for instance, using a null pointer to a struct type, and trying to access one of its member fields.

Aloisius
0 replies
16h10m

Could just as easily be accessing an uninitialized pointer, especially given there is a null check immediately before.

stravant
0 replies
13h54m

Such an invalid access of a very small address probably does result from a nullptr error:

    struct BigObject {
        char stuff[0x9c]; // random fields; puts field at offset 0x9c
        int field;
    };
    BigObject* object = nullptr;
    printf("%d", object->field); // reads *(nullptr + 0x9c)
That will result in "Attempt to read from address 0x9c". Just because it's not trying to read from literal address 0x0 doesn't mean it's not a nullptr error.

loeg
0 replies
15h12m

It is pretty common for null pointers to structures to have members dereferenced at small offsets, and people usually consider those null dereferences despite not literally being 0. (However, the assembly generated in this case does not match that access pattern, and in fact there was an explicit null check before the dereference.)

Dwedit
0 replies
15h57m

9C means that it's a NULL address plus some offset of 9C. Like a particular field of a struct.

phire
0 replies
12h44m

Probably not.

R8 is 0x9c in that example, which is somewhat typical for null+offset, but in the twitter thread it's 0xffff9c8e0000008a.

So the actual bug is further back. It's not a null pointer dereference, but it somehow results in the mov r8, [rax+r11*8] instruction reading random data (could be anything) into r8, which then gets used as a pointer.

Maybe this is a use-after-free?
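
Purely as an illustration of the shape such a bug could take (a guess, not CrowdStrike's actual code): an index read from the content file selecting an entry from a table of pointers, with no bounds check, compiles to exactly that kind of mov r8, [rax + r11*8] load, and whatever garbage it fetches then gets dereferenced.

    // Illustrative only: a guess at the *shape* of the bug, not the real code.
    #include <cstdint>

    struct Rule {
        uint32_t (*handler)(const uint8_t* payload, uint32_t len);
    };

    uint32_t dispatch(Rule** table, uint64_t indexFromFile,
                      const uint8_t* payload, uint32_t len) {
        // With no bounds check on indexFromFile, this load is roughly
        // `mov r8, [rax + r11*8]` and can fetch anything at all.
        Rule* r = table[indexFromFile];
        return r->handler(payload, len);   // faults if r points at unmapped memory
    }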

blirio
6 replies
16h57m

Oh wait, I just remembered null is normally 0 in C and C++. So probably not that if it is not 0.

chongli
2 replies
15h21m

NULL isn't always the integer 0 in C. It's implementation-defined.

loeg
1 replies
15h11m

In every real world implementation anyone cares about, it's zero. Also I believe it is defined to compare equal to zero in the standard, but don't quote me on that.

tzs
0 replies
13h46m

Also I believe it is defined to compare equal to zero in the standard, but don't quote me on that.

That's true for the literal constant 0. For 0 in a variable it is not necessarily true. Basically when a literal 0 is assigned to a pointer or compared to a pointer the compiler takes that 0 to mean whatever bit pattern represents the null pointer on the target system.
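
A tiny sketch of that distinction (pedantic standard-C territory; on mainstream platforms today the two agree):

    #include <cstring>

    void nullVsZeroBytes() {
        int* p = 0;                         // literal 0 in pointer context means "the null pointer",
                                            // whatever bit pattern the platform uses for it
        int* q;
        unsigned char zeros[sizeof q] = {}; // all-zero *bytes*
        std::memcpy(&q, zeros, sizeof q);   // q is null only if the platform's null pointer
                                            // happens to be represented as all-zero bits
        (void)(p == q);                     // true on every mainstream platform today
    }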

taspeotis
1 replies
16h42m

What? If you have a null pointer to a class, and try to reference the member that starts 156 bytes from the start of the class, you’ll dereference 0x9c (0 + 156).

emmelaich
0 replies
15h14m

Strangely, not necessarily on every implementation on every processor.

It's not guaranteed that NULL is 0.

Still, I don't think you'd find a counterexample in the wild these days.

cmpxchg8b
0 replies
13h50m

If you have a page mapped at address 0, accessing address 0 is valid.

leeter
1 replies
15h45m

No, this is kernelspace, and so while all addresses are 'virtual', an unmapped address is an address that hasn't been mapped in the page tables. Normally critical kernel drivers and data are marked as non-pageable (note: the Linux kernel doesn't page kernel memory, the NT kernel does, a legacy of when it was first written and the memory constraints of the time). So if a driver needs to access pageable data it must not be part of the storage flow (and CrowdStrike is almost certainly part of it), and it must be at the correct IRQL (the interrupt priority level; anything above dispatch, a.k.a. the scheduler, has severe restraints on what can happen there).

So no, an unmapped address is a completely different BSOD, usually PAGE_FAULT_IN_NONPAGED_AREA, which is a very bad sign.

jkrejcha
0 replies
13h21m

PAGE_FAULT_IN_NONPAGED_AREA[1]... was the BSOD that occurred in this case. That's basically the first sign that it was a bad pointer dereference in the first place.

(DRIVER_)IRQL_NOT_LESS_OR_EQUAL[2][3] is not this case, but it's probably one of the most common reasons drivers crash the system generally. Like you said it's basically attempting to access pageable memory at a time that paging isn't allowed (i.e. when at DISPATCH_LEVEL or higher).

[1]: https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

[2]: https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

[3]: https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

loeg
0 replies
15h14m

No; lots of virtual addresses are not mapped. Null is a subset of all unmapped addresses.

iwontberude
15 replies
12h20m

CrowdStrike isn’t a company anymore; this is probably their end. The litigation will be death by a thousand cuts.

asynchronous
2 replies
11h29m

They really are delusional. Speaking as a security person, CrowdStrike was overvalued before this event, and to everyone in tech this shows how bad their engineering practices are.

chii
1 replies
11h25m

but they are able to insert themselves into this many enterprise machines! So regardless of your security credentials, they made good business decisions.

On the other hand, this may open the veil for a lot of companies to dump them.

bni
0 replies
10h43m

For another similar product from a competitor that there is no reason to believe is any better.

Osiris
1 replies
11h28m

Wow. Cause a global meltdown and only lose 18% of your stock value? They must be doing something that investors like.

imtringued
0 replies
10h57m

They are probably pivoting to charging ransoms aka "consulting fees" to fix crashing systems and those are priced in.

aflag
0 replies
8h11m

The stock market only had a day to react and they were also heavily affected by the issue. Let's see where the stock price goes in the following week.

t0mas88
5 replies
12h11m

Has anyone looked into their terms and conditions? Usually any resulting damage from software malfunctioning is excluded. Only the software itself being unavailable may be an SLA breach.

Typically there would also be some clauses where CS is the only one that is allowed to determine an SLA breach, SLA breaches only result in future licence credits no cash, and if you disagree it's limited to mandatory arbitration...

The biggest impact is probably only their reputation taking a huge hit: losing some customers over this and making it harder to win future business.

iwontberude
3 replies
12h8m

They will still need to hire lawyers to prove this. Thousands of litigants. I am sure there is some tort not covered by the arbitration agreement that would give plaintiffs standing, no?

A commenter on Stack Exchange had an interesting counter: in some jurisdictions, any attempt to sidestep consumer law may be interpreted by the courts as conspiracy, which can prove more serious than merely accepting the original penalties.

chii
2 replies
11h26m

Thousands of litigants

i would imagine a class action suit instead of individual cases if this were to happen.

iwontberude
0 replies
11h2m

Potentially we will see some, but this occurred in many jurisdictions across the world.

disgruntledphd2
0 replies
9h31m

They'll be sued by the insurance companies probably.

clwg
0 replies
8h41m

No big company is going to agree to the terms and conditions that are listed on their website, they'll have their own schedules for indemnification that CS would agree to, not the other way around. Those 300 of the Fortune 500 companies are going to rip CS apart.

markus_zhang
0 replies
5h14m

I'd bet $100 that Crowdstrike won't pay out more than $100m for that dozens of billions of damage.

ai4ever
0 replies
1h42m

Software vendors should be required to face the consequences of shipping a poor product.

One possibility: clawbacks or refunds of past payments equal to the business damage caused by the flawed product.

Gazoche
14 replies
4h57m

What really blew my mind about this story is learning that a single company (CrowdStrike) has the power to push random kernel code to a large part of the world's IT infrastructure, at any time, at their will.

Correct me if I'm wrong but isn't kernel-level access essentially God Mode on every computer their software is installed on? Including spying on the entire memory, running any code, deleting data, installing ransomware? This feels like an insane amount of power concentrated into the hands of a single entity, on the level of a nuclear submarine. Wouldn't that make them a prime target for all sorts of nation-state actors?

This time the damage was (likely) unintentional and no data was lost (save for lost BitLocker keys), but were we really all this time one compromised employee away from the largest-ever ransomware attack, or even worse?

andix
3 replies
4h43m

It's not perfectly clear yet if CrowdStrike is able to push executable code via those updates. It looks like they updated some definition files and not the kernel driver itself.

But the kernel driver obviously contains some bugs, so it's possible that those definition updates can inject code. There might be a bug inside the driver that allows code execution (it happens all the time that some file parsing code can be tricked into executing parts of the data). I'm not sure, but I guess a lot of kernel memory is not fully protected by NX bits.

I still have the gut feeling that this incident was connected to some kind of attack. Maybe a distraction from another attack while everyone is busy fixing all the clients. During this incident security measures were certainly lowered, with lists of BitLocker keys printed out for service technicians to fix the systems. Even the fix itself was to remove some parts of the CrowdStrike protection. I would really like to know what was inside the C-00000291*.sys file before the update replaced it with all zeros. Maybe it was a cleanup job to remove something concerning that went wrong. But Hanlon's razor tells me not to trust my gut: "Never attribute to malice that which is adequately explained by stupidity."
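
To make the parsing concern concrete, here is a minimal C sketch of the general hazard (the "content" format, names, and dispatch table are invented; this is not CrowdStrike's actual channel-file layout): a parser that trusts a count and indices read straight from a data file can be steered through an out-of-range table entry, and at kernel privilege that means branching through whatever pointer happens to sit there.

    #include <stdio.h>
    #include <stdint.h>

    static void rule_a(void) { puts("rule A"); }
    static void rule_b(void) { puts("rule B"); }

    static void (*dispatch[])(void) = { rule_a, rule_b };

    /* The naive pattern: trust a count and indices read straight from the file. */
    static void run_content(const uint8_t *buf, size_t len) {
        if (len < 1) return;
        uint8_t count = buf[0];
        for (uint8_t i = 0; i < count; i++) {
            uint8_t idx = buf[1 + i];   /* no check against len...              */
            dispatch[idx]();            /* ...and no check against table size:
                                           a malformed file indexes past the
                                           table and calls through garbage.     */
        }
    }

    int main(void) {
        uint8_t well_formed[] = { 2, 0, 1 };   /* count=2, indices 0 and 1 */
        run_content(well_formed, sizeof well_formed);
        /* A corrupted file such as { 1, 200 } would read dispatch[200],
           far out of bounds, and branch through whatever is there. */
        return 0;
    }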

milkshakes
1 replies
3h21m

Falcon absolutely has a remote code execution function as part of Falcon Response

andix
0 replies
3h14m

So CrowdStrike has direct access to a lot of critical infrastructure? LOL.

neom
0 replies
4h32m

For what it's worth, I 10000% agree with your gut feeling. Mine is a gut feeling too, so I didn't mention it on HN, because we typically don't talk about these types of gut feelings given the speculative directions they take (+ the razor), but what you wrote is exactly what is in my head, fwiw.

neom
2 replies
4h37m

Well, kernel agents and drivers are not uncommon; however, anyone doing anything at scale that touches a kernel typically makes sure it is well understood on the system they're implementing it on. That aside, I gather from skimming around (so I might be wrong here) that people were deploying this for a business case rather than a technical case - I read it's mostly used to achieve compliance (I think via shifted liability). So I think it was probably too easy for this to happen, and so it happened - in that - someone in the bizniz dept said "if we run this software we are compliant with whatever, enabling XYZ multiples of new revenue, clear business case!!!", and the tech people went "bizniz people want this, bizniz case is clear, this seems like a relatively advanced business who know what they're doing, it doesn't really do much on my system and I'm mostly deploying it to innocuous edge user systems, so seems fine shrug" - and then a bad push happened, after lots and lots of IT departments had had that same conversation.

Could be wrong here, so if anyone knows better and can correct me... plz do!

lyu07282
0 replies
3h3m

implementing this because of a business case not a technical case

There are certification requirements to do pentests/red teaming, and those security folks will all tell them to install an EDR, so they picked CrowdStrike - but the security people have a very valid technical case for that recommendation.

It doesn't shift liability to CrowdStrike; that's not how this works. In this specific case they are very likely liable due to gross negligence, but that is different.

hello_moto
0 replies
4h12m

A lot of people, especially the non cybersecurity ones, are way off the mark so you're not the only one.

SoftTalker
1 replies
3h13m

The OS vendors themselves (Microsoft, Apple, all the linux distros) have this power as well via their automatic update channels. As do many others who have automatically-updating applications. So it's not a single company, it's many companies.

Gazoche
0 replies
2h51m

That's true; I suppose it doesn't feel as bad because they're much larger companies and more in the public's eye. It's still scary to think about the amount of power they wield.

wellknownfakts
0 replies
3h43m

It is a well known fact that these companies, which hold huge sway over the world's IT landscape, are commonly infiltrated at the top levels by intelligence agents.

vimbtw
0 replies
3h7m

This is the mini existential crisis I have randomly. The attack surface for a modern IT-managed computer is mind-bogglingly massive. Computers are pulling and executing code from a vast array of "trusted" sources without a sandbox. If any one of those "trusted" sources is compromised (package managers, CDNs, OS updates, security software updates, app updates in general, even specific utilities like xz) then you're absolutely screwed.

It’s hard not to be a little nihilistic about security.

nightowl_games
0 replies
4h20m

no data was lost

Data was lost in the knock on effects of this, I assure you.

largest-ever ransomware attack

A ransomware attack would be a terrible use of this power. A terrorist attack or cover while a country invades another country is a more appropriate scale of potential damage here. Perhaps even worse.

Shorel
0 replies
3h12m

What blew my mind is that a single company has such a good sales team to sell an unnecessary product to a large part of the world's IT.

And if any part of it is necessary, then that's a failure of the operating system. It should be a feature of Active Directory or Windows.

So, great job sales team, you earned your commissions, now get ready to jump ship, 'cause this one is sinking.

ChoGGi
0 replies
3h20m

"What really blew my mind about this story is learning that a single company (CrowdStrike) has the power to push random kernel code to a large part of the world's IT infrastructure, at any time, at their will."

Isn't that every antivirus software and game anticheat?

brcmthrowaway
9 replies
15h13m

How did it pass CI?

voidfunc
4 replies
15h8m

I suspect some engineer has discovered their CI scripts were just "exit 0"

01HNNWZ0MV43FF
2 replies
14h49m

Ah, the French mutation testing. Has never been celebrated for its excellence. </orson>

dehugger
1 replies
14h44m

What is French mutation testing? A casual Kagi search seems to imply it's a type of genetic testing, or perhaps just tests that have been done in France?

zerocrates
0 replies
14h15m

They're referencing an (in)famous video of a drunk/drugged/tired Orson Welles attempting to do a commercial; his line is "Ahhh, the... French... champagne has always been celebrated for its excellence..."

I don't think there's anything more to the inclusion of "French" in their comment beyond it being in the original line.

https://www.youtube.com/watch?v=VFevH5vP32s

and the successful version: https://www.youtube.com/watch?v=qb1KndrrXsY

Too
0 replies
12h22m

lol, I’ve lost count of how many CI systems I’ve seen that are essentially no-ops, letting through all errors, because somewhere there was a bash script without set -o errexit.

Osiris
1 replies
11h30m

It wasn't a code update. It was a data file update. It certainly seems that they don't include adequate testing for data file updates.

bni
0 replies
10h53m

In my experience, testing data and config is very rare in the whole industry. Feeding software corrupted config files or corrupted content from its own database often makes it crash. Most often this content is "trusted" to be "correct".

xyst
0 replies
14h9m

Bold of you to assume there is CI to begin with

Taniwha
8 replies
13h44m

Really, the underlying problem here is that their software is loading external data into their kernel driver and not correctly sanitising its inputs.
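
For contrast, a hedged sketch of what sanitising the same invented toy format (from the earlier example) could look like: validate the length, the count, and every index before use, and fail closed by refusing to load the content. In a kernel driver that is the difference between a skipped definition update and a bugcheck.

    #include <stdio.h>
    #include <stdint.h>

    static void rule_a(void) { puts("rule A"); }
    static void rule_b(void) { puts("rule B"); }

    static void (*dispatch[])(void) = { rule_a, rule_b };
    enum { DISPATCH_SIZE = sizeof dispatch / sizeof dispatch[0] };

    /* Returns 0 on success, -1 if the content is malformed; the caller can then
       quarantine the file and keep running instead of faulting. */
    static int run_content_checked(const uint8_t *buf, size_t len) {
        if (buf == NULL || len < 1)
            return -1;

        uint8_t count = buf[0];
        if ((size_t)count + 1 > len)        /* every index must be inside the buffer */
            return -1;

        for (uint8_t i = 0; i < count; i++) {
            uint8_t idx = buf[1 + i];
            if (idx >= DISPATCH_SIZE)       /* every index must be inside the table */
                return -1;
            dispatch[idx]();
        }
        return 0;
    }

    int main(void) {
        uint8_t corrupted[] = { 1, 200 };
        if (run_content_checked(corrupted, sizeof corrupted) != 0)
            puts("content rejected, not loaded");
        return 0;
    }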

xvector
5 replies
13h31m

I find it absolutely insane they wouldn't be doing this. At the level their software operates, it's sheer negligence to not sanitize inputs.

blackeyeblitzar
4 replies
13h27m

I wonder if it’s for performance reasons.

silisili
0 replies
13h5m

I'm not overly familiar with crowdstrike processes, but assume they are long running. If it's all loaded into memory, e.g. a config, I can't see how you'd get any performance gain at all. It just seems lazy.

prisenco
0 replies
13h24m

Maybe, maybe, but if it's not in a hot loop, why would the performance gain be worth it?

dboreham
0 replies
12h10m

It's for incompetence reasons.

0xDEADFED5
0 replies
12h22m

Wild speculation aside, I'd say a little less performance is preferable to this outcome.

Taniwha
1 replies
13h13m

The other issue is that they push to everyone. As someone who at my last job had a million boxes in the wild, and was very aware that bricking them all would kill the company, we would NEVER push to them all at once: we'd push to a few 'friends and family' (i.e. practice each release on ourselves first), then do a few % of the customer base and wait for problems, then maybe 10%, wait again, then the rest.

Of course we didn't have any third party loading code onto our boxes outside of our control (and we ran Linux).
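
A minimal sketch of that kind of staggering (machine IDs, thresholds, and the hash choice are all invented for illustration): hash a stable machine identifier into a 0-99 bucket so each machine's position in the rollout is deterministic, then raise the percentage stage by stage. The important part, pausing between stages to watch for problems, is exactly the part code can't show.

    #include <stdio.h>
    #include <stdint.h>

    /* FNV-1a: a simple, stable hash so a given machine always lands in the same bucket. */
    static uint32_t fnv1a(const char *s) {
        uint32_t h = 2166136261u;
        while (*s) {
            h ^= (uint8_t)*s++;
            h *= 16777619u;
        }
        return h;
    }

    /* A machine receives the update only once the rollout percentage reaches its bucket. */
    static int in_rollout(const char *machine_id, unsigned rollout_percent) {
        return (fnv1a(machine_id) % 100) < rollout_percent;
    }

    int main(void) {
        const char *fleet[]  = { "host-0001", "host-0002", "host-0003", "host-0004" };
        unsigned    stages[] = { 1, 10, 100 };   /* friends-and-family, canary, everyone */

        for (size_t s = 0; s < 3; s++) {
            printf("stage %u%%:", stages[s]);
            for (size_t i = 0; i < 4; i++)
                if (in_rollout(fleet[i], stages[s]))
                    printf(" %s", fleet[i]);
            printf("\n");
        }
        return 0;
    }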

szundi
0 replies
13h0m

Same here. Also, before the first phase, we test whether we can remotely downgrade after an upgrade.

Anonymityisdead
7 replies
15h34m

Where is a good place to start practicing disassembly in 2024, and what is a good way to go about it?

nophunphil
0 replies
15h30m

Take this with a grain of salt as I’m not an SME, but there is a need for volunteers on reverse-engineering projects such as the Zelda decompilation projects[1]. This would probably give you some level of exposure, particularly if you have an interest in videogames.

[1] https://zelda64.dev/

mauvia
0 replies
13h34m

First you need to learn assembly. Second, you can download Ghidra and start decompiling some simple programs you use, to see what they do.

commandersaki
0 replies
14h15m

I found https://pwn.college to be excellent, even though they mostly focus on exploitation, pretty much everything involves disassembly.

Scene_Cast2
0 replies
15h21m

Try solving some crackmes. They're binary executables of various rated difficulty, where the goal ranges from finding a hardcoded password, to making a keygen, to patching the executable. They used to be more popular, but I'm guessing you can still find tutorials on how to get started and solve a simple one.

CodeArtisan
0 replies
11h33m

As a very first step, you may start playing with https://godbolt.org/ to see how code is translated into lower-level instructions.

13of40
0 replies
13h52m

Writing your own simple programs and debugging/disassembling them is a solid option. WinDbg and IDA are good tools to start with. Reading a disassembly is a lot easier than coding in assembly, and once you know what things like function calls and switch statements look like, you can get a feel for what the original program was doing.
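
For example, a toy along those lines (nothing beyond standard C) is small enough to follow instruction by instruction; building it once without optimisation and once with -O2 and comparing the listings in WinDbg, IDA, Ghidra, or godbolt is a gentle way to learn what call sites, loops, and (for dense switches) jump tables look like.

    #include <stdio.h>

    static int classify(int x) {
        switch (x) {            /* with enough dense cases, compilers usually emit
                                   a jump table you can spot in the disassembly */
        case 0:  return 10;
        case 1:  return 20;
        case 2:  return 30;
        case 3:  return 40;
        default: return -1;
        }
    }

    int main(void) {
        for (int i = 0; i < 5; i++)
            printf("classify(%d) = %d\n", i, classify(i));  /* a plain call site */
        return 0;
    }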

0xDEADFED5
0 replies
12h18m

You can compile your own hello world and look at the executable with x64dbg. Press space on any instruction and you can assemble your own instruction in its place (optionally filling the leftover bytes with NOPs).

canistel
5 replies
16h15m

Out of curiosity: in the old days, SoftICE, a kernel-mode debugger, could have been used. What tool can be used these days?

Dwedit
2 replies
15h59m

You'd use WinDBG today. It allows you to do remote kernel debugging over a network. This also includes running Windows in a virtual machine, and debugging it through the private network connection.

gonesilent
1 replies
13h43m

FireWire is also still used to dump kernel debug output.

the8472
0 replies
6h27m

Shouldn't IOMMUs block that these days?

mauvehaus
0 replies
16h6m

SoftIce predates me, but when I was doing filesystem filter driver work, the tool of choice was WinDbg. Been out of the trade for a bit, but it looks to still be in use. We had it set up between a couple of VMs on VMware.

hannasm
3 replies
11h34m

Do these customers of CrowdStrike even have a say in these updates going out, or do they all just bend over and let CrowdStrike have full RCE on every machine in their enterprise?

I sure hope the certificate authorities and other crypto folks get to keep that stuff off their systems at least.

raincole
0 replies
10h55m

In our lifetime we'll see an auto-update to self-driving cars that kills millions.

Well, it's likely we won't see it, because we might be one of the millions.

Centigonal
0 replies
11h24m

I don't know if there's a way to outsource ongoing endpoint security to a third party like Crowdstrike without giving them RCE (and ring 0 too) on all endpoints to be secured. Having Crowdstrike automate that part is kind of the point of their product.

andix
3 replies
15h50m

How sure are we, that this was not a cyberattack?

It seems really scary to me that CrowdStrike is able to push updates in real time to most of their customers' systems. I don't know of any other system that provides a similar method to inject code at the kernel level. Not even Windows updates, as they always roll out with some delay and not to all computers at the same time.

If you want to attack high profile systems, crowdstrike would be one of the best possible targets.

Grimblewald
2 replies
11h46m

The amount of self-pwning that goes on in both corporate and personal devices these days is insane. The number of games that want you to install kernel-level anti-cheat is astounding. The number of companies that have centralized remote surveillance and control of all devices, where access to this is through a great number of sloppily managed accounts, is beyond spooky.

padjo
0 replies
11h1m

I mean centralized control of devices is great for the far more common occurrence of Bob from accounting leaving his laptop on the train with his password on post-it note stuck to the screen.

andix
0 replies
5h44m

Exactly. It's ridiculous to open up all/most of a company's systems to such a single point of failure. We install redundant PSUs, backup networks, generators, and many more things. But one single automatic update can bring down all systems within minutes. Without any redundancy.

taormina
2 replies
3h40m

Imagine if Microsoft sold you a secure operating system, like Apple does. A staggering portion of the existing cybersecurity industry would be irrelevant if that ever happened.

natdempk
1 replies
3h10m

Most enterprises these days also run stuff like Crowdstrike (or literally Crowdstrike) on their macOS deployments. Similarly Windows these days is bundled with OS-level antivirus which is sufficient for non-enterprise users.

Not in the security industry, but my take is that basically the desktop OS permissions and security model is wrong for a lot of these devices, but there is no alternative that is suitable or that companies are willing to invest in. Probably many of the highest-profile affected machines (airport terminals, signage, medical systems, etc.) should just resemble a phone/iPad/Chromebook in terms of security/trust, but for historical/cost/practical reasons are Windows PCs with Crowdstrike.

kchr
0 replies
1h19m

CrowdStrike uses eBPF on Linux and System Extensions on macOS, neither of which needs kernel-level presence. Microsoft should move towards offering these kinds of mechanisms to make AV and EDR more resilient on Windows devices, without jeopardising system integrity and availability.

mianos
2 replies
14h34m

A 'channel file' is a file interpreted by their signature detection system. How far is this from a bytecode compiled domain specific language? Javascript anyone?

eBPF, much the same thing, is actually thought about and well designed. If it wasn't it would be easy to crash linux.

This is what they do, and they are doing it badly. I bet it's just shit on shit under the hood, developed by somewhat competent engineers who are all gone or promoted to management.

broknbottle
1 replies
14h8m

Oddly enough, there was an issue last month with CrowdStrike and RHEL 9 kernel where they were triggering a kernel panic when attempting to load a bpf program from their newer bpf sensor. One of the workarounds was to switch to their kernel driver mode.

This was obviously a bug in the RHEL kernel, because even if the BPF program was bunk it should not cause the kernel to panic. However, it's almost like CrowdStrike does zero testing of their software and treats their end users as Test/QA.

https://access.redhat.com/solutions/7068083

4bb7ea946a37 bpf: fix precision backtracking instruction iteration

CaliforniaKarl
0 replies
9h3m

The kernel update in question was released as part of a RHEL point release (9.3 or 9.4, I forget which).

I’m not sure how much early warning RH gives to folks when a kernel change comes in via a point release. Looking at https://www.redhat.com/en/blog/upcoming-improvements-red-hat..., it seems like it’s changing for 9.5. I hope CrowdStrike will be able to start testing against those beta kernels.

anothername12
2 replies
12h43m

I found Windows confusing. In Linux speak, was this some kind of kernel module thing that CS installed? It's all I can think of for why the machines BSOD.

G3rn0ti
1 replies
12h32m

It was a binary data file (supposedly invalid) that caused the actual CS driver component to BSOD. However, they used the ".sys" suffix to make it look just like a driver, supposedly so Windows would protect it from a malicious actor simply deleting it. AFAIU.

stevekemp
0 replies
11h50m

Windows filesystem protection doesn't rely upon the filename, but on the location.

They could have named their files "foo.cfg", "foo.dat", "foo.bla" and been equally protected.

The use of ".sys" here is probably related to the fact it is used by their system driver. I don't think anybody was trying to pretend the files there are system drivers themselves, and a quick look at the exports/disassembly would make that apparent anyway.

wasabinator
1 replies
8h0m

I wonder what privilege level this service runs at. If it's less than ring 0, I think some blame needs to go to Windows itself. If it's ring 0, did it really need to be that high??

Surely an OS doesn't have to go completely kaput due to one service crashing.

Kwpolska
0 replies
4h23m

It's not a service, it's a driver. "Anti"malware drivers typically run with a lot of permissions to allow spying on all processes. Driver failures likely mean the kernel state is borked as well, so Windows errs on the side of caution and halts.

meindnoch
1 replies
1h16m

Because it wasn't written in Rust!

hatsunearu
1 replies
10h32m

So was the totally empty channel file just a red herring?

Kwpolska
0 replies
4h18m

I think the file with all zeros was the fix that CS pushed out after they learned of their mistake.

hatsunearu
1 replies
10h2m

I see a paradox that the null bytes are "not related" to the current situation and yet deleting the file seems to cure the issue. Perhaps the CS official statement that "This is not related to null bytes contained within Channel File 291 or any other Channel File." is poorly worded.

My opinion is that CS is trying to say the null bytes themselves aren't the actual root cause of the issue, but merely a trigger for the actual root cause, which is that CSAgent.sys has a problem where malformed input vectors can cause it to crash. Well designed programs should error out gracefully for foreseeable errors, like corrupted config files.

If we interpret that quoted sentence such that "this" is referring to "the logical error", and that "the logical error" is the error in CSAgent.sys that causes it to crash upon reading a bad channel file, then that statement makes sense.

This is a bit of a stretch, but so far my impression with CS corporate communication regarding this issue has been nothing but abject chaos, so this is totally on-brand for them.

chrisjj
0 replies
8h25m

My opinion is that CS is trying to say the null bytes themselves aren't the actual root cause of the issue, but merely a trigger for the actual root cause,

My opinion is they say "unrelated" because they are trying to say unrelated - and hence no, this was not a trigger.

flappyeagle
1 replies
4h35m

The only thing I know about crowdstrike is they hired a large percentage of the underperforming engineers we fired at multiple companies I’ve worked at

codeulike
1 replies
11h22m

'Analysis' of the null pointer is completely missing the point. The simple fact of the matter is they didn't do anywhere near enough testing before pushing the files out. Auto-update comes with big responsibility; this was criminally reckless.

CaliforniaKarl
0 replies
9h0m

There are enough people in the world that some can examine how this happened while others simultaneously examine why this happened.

webprofusion
0 replies
10h11m

The girl on the supermarket checkout said she hoped her computer wouldn't be affected. I knowingly laughed and said "you probably don't have it on your own computer unless you're a bank".

She said, "I installed it before for my cybersecurity course but I think it was just a trial"

Assumptions, eh?

throwyhrowghjj
0 replies
9h51m

This is a pretty brief 'analysis'. The poster traces back one stack frame in assembler; it basically amounts to just reading out a stack dump from gdb. It's a good starting point, I guess.

system2
0 replies
12h15m

Maybe one day people will learn what a blog is.

switch007
0 replies
8h33m

Is there commercial pressure to push out "content" updates asap so you can say you're quicker than your competition at responding to emerging threats?

peter_retief
0 replies
9h59m

I don't do windows either.

ok123456
0 replies
1h48m

When your snake oil is poisonous.

nesas
0 replies
3h53m

Nesa

mkl95
0 replies
11h21m

How feasible would it be to implement blue green deployments in that kind of system?

minhoryang
0 replies
8h27m

Can we find an uptime(availability) graph for the CrowdStrike agent? Don't you think this graph should be included in the postmortem?

m0llusk
0 replies
15h21m

Ended up being forced because it was a "content update". This is the update of our discontent!

heraldgeezer
0 replies
8h56m

How strange to cite ResetEra, a gaming forum with a certain kind of community, which may not be considered a reliable source.

donatj
0 replies
10h30m

I am genuinely curious what their CI process that passed this looks like, as well as if they're doing any sort of dogfooding or manual QA? Are changes just CI/CD'd out to production right away?

dallas
0 replies
6h6m

Those who have spent time writing NDIS/TDI drivers are those who know the minefield!

cybervegan
0 replies
2h34m

Boy is crowdstrike's software going to get seriously fuzz tested now. All their vulns will be on public display in the next week or so.

cedws
0 replies
5h17m

Does anybody know if these "channel files" are signed and verified by the CS driver? Because if not, that seems like a gaping hole for a ring 0 rootkit. Yeah, you need privileges to install the channel files, but once you have those privileges you can hide yourself much deeper in the system. If the channel files can cause a segfault, they can probably do more.

Any input for something that runs at such high privilege should be at least integrity checked. That’s the basics.

And the fact that you can simply delete these channel files suggests there isn’t even an anti-tamper mechanism.
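
To sketch the flow being asked about (hypothetical C; verify_blob_signature is a placeholder that only checks a magic tag so the sketch compiles and runs, where a real driver would verify a cryptographic signature against a key baked into the binary): authenticate the blob first, parse it second, and fail closed on anything that doesn't verify, so an all-zero or truncated file is simply refused rather than handed to the parser.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Placeholder for a real cryptographic check. The body below only looks for a
       magic tag so the sketch runs; it is NOT a real signature verification. */
    static int verify_blob_signature(const uint8_t *buf, size_t len) {
        return len >= 4 && memcmp(buf, "CSIG", 4) == 0;
    }

    /* Hypothetical loader flow: authenticate first, parse second, fail closed. */
    static int load_channel_file(const uint8_t *buf, size_t len) {
        if (buf == NULL || len == 0)
            return -1;                  /* missing or empty file: refuse to load */

        if (!verify_blob_signature(buf, len))
            return -1;                  /* corrupted or tampered content: refuse */

        /* Only now hand the bytes to the parser, which still has to
           bounds-check everything it reads. */
        return 0;
    }

    int main(void) {
        uint8_t zeros[32] = { 0 };      /* e.g. a channel file of all null bytes */
        printf("all-zero file accepted? %s\n",
               load_channel_file(zeros, sizeof zeros) == 0 ? "yes" : "no");
        return 0;
    }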

calrain
0 replies
5h13m

This reminds me of the vulnerability that hit JWT tokens a few years ago, when you could set the 'alg' to 'none'.

Surely CrowdStrike encrypts and signs their channel files, and I'm wondering if a file full of 0's inadvertently signaled to the validating software that a 'null' or 'none' encryption algo was being used.

This could imply the file full of zeros is just fine, as the null encryption passes, because it's not encrypted.

That could explain why it tried to reference the null memory location: the file full of zeroes, treated as unencrypted, just forced it to jump to memory location zero.

The risk is, if this is true, then their channel loading verification system is critically exposed by being able to load malicious channel drivers through disabled encryption on channel files.

Just a hunch.

ai4ever
0 replies
1h51m

Why are OpenAI/Anthropic letting this crisis go to waste?

Where are the tweets from sama and amodei on how AGI is going to fix these issues?

JSDevOps
0 replies
11h30m

Hasn’t this been debunked?