The moment I read 'it is a content update that causes the BSOD, deleting it solves the problem', I was immediately willing to bet a hundred quid (for the non-British, that's £100) that it was a combination of said bad binary data and a poorly-written parser that didn't error out correctly upon reading invalid data (in this case, reading an array of pointers without verifying that all of them were both non-null and pointed to valid data/code).
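To make the bug class concrete, here's a minimal sketch in C of what I'm picturing. This is purely illustrative: I obviously haven't seen CrowdStrike's code, and the struct and field names are invented.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk layout: a count followed by offsets into the blob. */
struct content_header {
    uint32_t count;
    uint32_t offsets[];   /* each entry supposedly points at a rule record */
};

/* The failure mode I'm betting on: trust the count and offsets blindly. */
const uint8_t *get_rule_unchecked(const uint8_t *blob, size_t blob_len, uint32_t i)
{
    (void)blob_len;       /* the length isn't even consulted */
    const struct content_header *hdr = (const struct content_header *)blob;
    return blob + hdr->offsets[i];            /* corrupt offset -> wild pointer */
}

/* What a defensive parser does instead: validate everything before use. */
const uint8_t *get_rule_checked(const uint8_t *blob, size_t blob_len, uint32_t i)
{
    if (blob_len < sizeof(struct content_header))
        return NULL;
    const struct content_header *hdr = (const struct content_header *)blob;
    if (i >= hdr->count)
        return NULL;
    size_t table_end = sizeof(struct content_header)
                     + (size_t)hdr->count * sizeof(uint32_t);
    if (table_end > blob_len)
        return NULL;                          /* count lies about the table size */
    uint32_t off = hdr->offsets[i];
    if (off < table_end || off >= blob_len)
        return NULL;                          /* offset points outside the blob */
    return blob + off;
}
```

The unchecked version is one corrupt byte away from dereferencing garbage in kernel space; the checked one degrades to "skip this rule".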
In the past ten years or so of having done somewhat serious computing and zero cybersecurity whatsoever, here is what my mind has concluded; feel free to disagree.
Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures. All because a) human programmers forget to account for edge cases, and b) imperative programming languages allow us to do so.
This includes everything from: decompression algorithms; font outline readers; image, video, and audio parsers; video game data parsers; XML and HTML parsers; the various certificate/signature/key parsers in OpenSSL (and derivatives); and now, this CrowdStrike content parser in its EDR program.
That wager stands, by the way, and I'm happy to up the ante by £50 to account for my second theory.
There are at least five different things that went wrong simultaneously.
1. Poorly written code in the kernel module crashed the whole OS and kept trying to parse the corrupted files, causing a boot loop, instead of handling the error gracefully and deleting/marking the files as corrupt.
2. Either the corrupted files slipped through internal testing, or there is no internal testing.
3. Individual settings for when to apply such updates were apparently ignored. It's unclear whether this was a glitch or standard practice. Either way I consider it a bug (it's just a matter of whether it's a software bug or a bug in their procedures).
4. This was pushed out everywhere simultaneously instead of staggered to limit any potential damage.
5. Whatever caused the corruption in the first place, which is anyone's guess.
Number 4 continues to be the most surprising bit to me. I could not fathom having a process that involves deploying to 8.5 million remote machines simultaneously.
Bugs in code I can almost always understand and forgive, even the ones that seem like they’d be obvious with hindsight. But this is just an egregious lack of the most basic rollout standards.
They probably don't get to claim agile story points until the ticket is in a finished state. And they probably have a culture where vanity metrics like "velocity" are prioritized.
This would answer the question that I've not heard anyone asking:
what incentivized the bad decisions that led to this catastrophic failure?
My understanding is that the culture (as reported by some customers) is quite aggressive and pushy. They are quite vocal when customers don't turn on automatic updates.
It makes sense in a way, given their fast growth strategy (from nowhere to top 3) and their desire to "do things differently" as the iconoclast upstarts that redefine the industry.
Or to summarise - hubris.
I'm sorry but this is the customer's fault.
If I'm using your services you work for me and you don't get to bully me into doing whatever you think needs to be done.
People that chose this solution need to be penalized, but they won't be.
Customers don’t always have a choice here. They could be restricted by compliance programs (PCI, et al) and be required under those terms to have auto updates on.
Compliance also has to share some of the blame here, if best practices (local testing) aren’t allowed to be followed in the name of “security”.
This needs to keep being repeated anytime someone wants to blame the company.
Many don't have a choice; a lot of compliance is doing x to satisfy a checkbox, and you don't have a lot of flexibility in that, or you may not be able to do things like process credit cards, which is kinda unacceptable depending on your company. (Note: I didn't say all)
CrowdStrike automatic update happens to satisfy some of those checkboxes.
To catch 0day quickly, EDR needs to know "how".
The "how" here is AV definition or a way to identify the attack. In CS-speak: content.
Catching 0days quickly results in a good reputation that your EDR works well.
If people turn off their AV definition auto-updates, they are at risk. Why use EDR if folks don't want to stop attacks quickly?
Oh the games I have to play with story points that have personal performance metrics attached to them. Splitting tickets to span sprints so there aren't holes in some dude's "effort" because they didn't complete some task they committed to.
I never thought such stories were real until I encountered them…
Surely, CrowdStrike's safety posture for update rollouts is in serious need of improvement. No argument there.
But is there any responsibility for the clients consuming the data to have verified these updates prior to taking them in production? I haven't worn the sysadmin hat in a while now, but back when I was responsible for the upkeep of many thousands of machines, we'd never have blindly consumed updates without at least a basic smoke test in a production-adjacent UAT type environment. Core OS updates, firmware updates, third party software, whatever -- all of it would get at least some cursory smoke testing before allowing it to hit production.
On the other hand, given EDR's real-world purpose and the speed at which novel attacks propagate, there's probably a compelling argument for always taking the latest definition/signature updates as soon as they're available, even in your production environments.
I'm certainly not saying that CrowdStrike did nothing wrong here, that's clearly not the case. But if conventional wisdom says that you should kick the tires on the latest batch of OS updates from Microsoft in a test environment, maybe that same rationale should apply to EDR agents?
In the boolean sense, yes. United Airlines (for example) is ultimately responsible for their own production uptime, so any change they apply without validation is a risk vector.
In pragmatic terms, it's a bit fuzzier. Does CrowdStrike provide any practical way for customers to validate, canary-deploy, etc. changes before applying them to production? And not just changes with type=important, but all changes? From what I understand, the answer to that question is no, at least for the type=channel-update change that triggered this outage. In which case I think the blame ultimately falls almost entirely on CrowdStrike.
Yeah, one of the major problems seems to be CrowdStrike's assumption that channel files are benign, which isn't true if there's a bug in your code that only gets triggered by the right virus definition.
I don't know how you could assert that this is impossible, hence channel files should be treated as code.
Arguably United Airlines shouldn't have chosen a product they can't test updates of, though maybe there are no good options.
I think point 3 of the grandparent indicates admins were not given an opportunity to test this.
My company had a lot of Azure VMs impacted by this and I'm not sure who the admin was who should have tested it. Microsoft? I don't think we have anything to do with CrowdStrike software on our VMs. (I think - I'm sure I'll find out this week.)
Edit: I just learned the Azure central region failure wasn't related to the larger event - and we weren't impacted by the CrowdStrike issue - I didn't know it was two different things. So the second part of my comment is irrelevant.
Oh, I'd missed point #3 somehow. If individual consumers weren't even given the opportunity to test this, whether by policy or by bug, then ... yeesh. Even worse than I'd thought.
Exactly which team owns the testing is probably left up to each individual company to determine. But ultimately, if you have a team of admins supporting the production deployment of the machines that enable your business, then someone's responsible for ensuring the availability of those machines. Given how impactful this CrowdStrike incident was, maybe these kinds of third-party auto-update postures need to be reviewed and potentially brought back into the fold of admin-reviewed updates.
It's not an option. While the admins at the customer have the ability to control when/how revisions of the client software go out (and thus can and generally do their own testing, can decide to stay one rev back as default, etc.), there is no control over updates to the kind of update/definition files that were the primary cause here.
Which is also why you see every single customer affected - what you are suggesting is simply not an available thing to do at present for them.
At least for now - I imagine that some kind of staggered/slowed/ringed option will have to be implemented in the future if they want to retain customers.
For me, number 1 is the worst of the bunch. You should always expect that there will be bugs in processes, input files, etc… the fact that their code wasn’t robust enough to recognize a corrupted file and not crash is inexcusable. Especially in kernel code that is so widely deployed.
If any one of the five points above hadn’t happened, this event would have been avoided. However, if number 1 had been addressed - any of the others could have happened (or all at the same time) and it would have been fine.
I understand that we should assume that bugs will be present anywhere, which is why staggered deployments are also important. If there had been staggered deployments, the damage would have happened, but it would have been localized. I think security people would argue against a staged deployment though, as if it were discovered what the new definitions protected against, an exploit could be developed quickly to put those servers that aren't in the "canary" group at risk. (At least in theory; I can't see how staggering deployment over a 6-12 hour window would have been that risky.)
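For what it's worth, the staggering mechanism itself is cheap to build. A minimal sketch of deterministic rollout rings, assuming nothing more than a stable per-device ID (the percentages and names are made up, not anything CrowdStrike documents):

```c
#include <stdint.h>
#include <stdio.h>

/* Hash a device ID with FNV-1a so each device lands in a stable bucket. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Example policy: ring 0 is a 1% canary, ring 1 the next 9%, ring 2 the rest.
 * Each ring only receives the update a few hours after the previous ring has
 * been observed to be healthy. */
static int rollout_ring(const char *device_id)
{
    uint32_t bucket = fnv1a(device_id) % 100;
    if (bucket < 1)  return 0;
    if (bucket < 10) return 1;
    return 2;
}

int main(void)
{
    printf("device-1234 -> ring %d\n", rollout_ring("device-1234"));
    return 0;
}
```

Even a crude scheme like this, spread over a 6-12 hour window, would have turned a global outage into a bad day for 1% of fleets.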
They're all terrible, but I agree #1 is particularly egregious for a company ostensibly dedicated to security. A simple fuzz tester would have caught this type of bug, so they clearly don't perform even a minimal amount of testing on their code.
And here I thought shipping a new version on the app store was scary.
Is there anything we can take from other professions/tradecraft/unions/legislation to ensure shops can't skip the basic best practices we are aware of in the industry, like staged rollouts? How do we set up incentives to prevent this? Seriously, the App Store was raking in $$ from us for years with no support for staged rollouts and no other options.
Malware signature updates are supposed to be deployed ASAP, because every minute may count when a new attack is spreading. The mistake may have been to apply that policy indiscriminately.
A lot of snarky replies to this comment, but the reality is that if you were selling an anti-virus, identified a malicious virus, and then chose not to update millions of your machines with that virus’s signature, you’d also be in the wrong.
Zero effort to fuzz test the parser too. I mean, we know how to harden parsers against bugs and attacks, and any semi-competent fuzzer would have caught such a trivial bug.
The triggering file was all zeros.
Is it not possible that only this pattern caused the crash, and that fuzzing omitted to try this unfuzzy pattern?
No, it wasn't. Crowdstrike denied it had to do with zeros in the files.
At this point I wouldn't be paying too much attention to what Crowdstrike is saying.
They have to speak the truth, albeit at a minimum, in case legal...
Which also explains why they confirm or deny details being shared on social and mass media only when needed to cover their back legally.
Competent fuzzers don't just use random bytes, they systematically explore the state-space of the target program. If there's a crash state to be found by feeding in a file full of null bytes, it's probably going to be found quickly.
A fun example is that if you point AFL at a JPEG parser, it will eventually "learn" to produce valid JPEG files as test cases, without ever having been told what JPEG file is supposed to look like. https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...
AFL is really "magical". It finds bugs very quickly and with little effort on our part except to leave it running and look at the results occasionally. We use it to fuzz test a variety of file formats and network interfaces, including QEMU image parsing, nbdkit, libnbd, hivex. We also use clang's libfuzzer with QEMU which is another good fuzzing solution. There's really no excuse for CrowdStrike not to have been using fuzzing.
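For anyone who hasn't tried these tools, the entry cost really is tiny. Here's roughly what a minimal libFuzzer harness for a hypothetical parse_content() function looks like (the parser name is made up; build with clang -fsanitize=fuzzer,address):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parser under test: takes an untrusted blob, returns 0 or -1. */
int parse_content(const uint8_t *data, size_t size);

/* libFuzzer calls this entry point with mutated inputs; with ASan enabled,
 * any out-of-bounds read or wild pointer dereference aborts immediately,
 * so a crash on a malformed file tends to surface within minutes. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    parse_content(data, size);
    return 0;
}
```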
Possible? Yes. Likely? No.
The files in question have a magic number of 0xAAAAAAAA, so it is not possible that the file was all zeros.
In my limited experience, I thought any serious fuzzing program does test for all "standard" patterns like only null bytes, empty strings, etc...
No, it wasn’t all zeros: https://x.com/patrickwardle/status/1814782404583936170
Instrumented fuzzing (like AFL and friends) tweaks the input to traverse unseen code paths in the target, so they're super quick to find stuff like "heyyyyy, nobody is actually checking if this offset is in bounds before loading from that address".
AV software is a great target for malware: it's badly written, probably runs too much stuff in the kernel, and tries to parse everything.
And exploiting it gives you, at the very least, straight system-level access, if not more.
AV software needs kernel privileges to have access to everything it needs to inspect, but the actual inspection of that data should be done with no privileges.
I think most AV companies now have a helper process to do that.
If you successfully exploit the helper process, the worst damage you ought to be able to do is falsely find files to be clean.
Anti-cheats also whitelist legit AV drivers, even though cheaters exploit them to no end.
You are seriously overestimating the engineering practices at these companies. I have worked in "enterprise security" previously, though not at this scale. In a previous life I worked with one of the engineering leaders currently at CrowdStrike.
I'll bet you this company has some arbitrary unit test coverage requirements for PRs, which developers game by mocking the heck out of dependencies. I am sure they have some vanity sonarqube integration to ensure great "code quality". This likely also went through manual QA.
However, I am sure the topic of fuzz testing would not have come up once. These companies sell checkbox compliance, and they themselves develop their software the same way: checking all the "quality engineering" boxes with very little regard for long-term engineering initiatives that would provide real value.
And I am not trying to kick Crowdstrike when they are down. It's the state of any software company run by suits with myopic vision. Their engineering blogs and their codebases are poles apart.
Bugs happen.
Not staggering the updates is what blew my mind.
Since the issue manifested at 04:09 UTC, which is 11pm where CrowdStrike's HQ is, I would guess someone was working late at night and skipped the proper process so they could get the update done and go to bed.
They probably considered it low risk, had done similar things hundreds of times before, etc.
Companies these days are global btw.
Not everyone is working on the same timezone.
They don't appear to have engineering jobs in any location where that would be considered regular office hours...
https://crowdstrike.wd5.myworkdayjobs.com/crowdstrikecareers
I see remote, Israel, Canada.
https://crowdstrike.wd5.myworkdayjobs.com/en-US/crowdstrikec...
This one specifically Spain and Romania
I know they bought companies all over the globe from Denmark to other locations.
04:09 UTC is 07:09 AM in Israel. Doubt an engineer was doing a push then either...
All the other engineering locations seem even less likely.
On Friday, no less. (Israel's weekend is Friday / Saturday instead of the usual Saturday / Sunday.)
A good reminder of the fact that your Thursday might be someone else's Friday.
Wild that anyone would consider anything in the “critical path” low risk. I would bet that they just don’t do rolling releases normally since it never caused issues before.
I’d also maybe add another one on the Windows end:
6) some form of sandboxing/error handling/API changes to make it possible to write safer kernel modules (not sure if it already exists and was just not used). It seems like the design could be better if a bad kernel module can cause a boot loop in the OS…
It’s a tough problem, because you also don’t want the system to start without the CrowdStrike protection. Or more generally, a kernel driver is supposedly installed for a reason, and presumably you don’t want to keep the system running if it doesn’t work. So the alternative would be to shut down the system upon detection of the faulty driver without rebooting, which wouldn’t be much of an improvement in the present case.
I can imagine better defaults. Assuming the threat vector is malicious programs running in userspace (probably malicious programs in kernel space is game over anyway right?), then you could simply boot into safe mode or something instead of crashlooping.
One of the problems with this outage was that you couldn't even boot into safe mode without having the BitLocker recovery key.
You don’t want to boot into safe mode with networking enabled if the software that is supposed to detect attacks from the network isn’t running. Safe mode doesn’t protect you from malicious code in userspace, it only “protects” you from faulty drivers. Safe mode is for troubleshooting system components, not for increasing security.
I don’t know the exact reasoning why safe mode requires the BitLocker recovery key, but presumably not doing so would open up an attack vector defeating the BitLocker protection.
There is sandboxing API in Windows. It's called running programs in userspace.
Run what in userspace?
This is the most interesting question to me because it doesn't seem like there is an obviously guessable answer. It seems very unlikely to me that a company like CrowdStrike pushes out updates of any kind without doing some sort of testing, but the widespread nature of the outage would also seem to suggest any sort of testing setup should have caught the issue. Unless it's somehow possible for CrowdStrike to test an update that was different than what was deployed, it's not obvious what went wrong here.
I had read somewhere that the definition file was corrupted after testing, during the final CI/CD pipeline.
6. No development process, no testing.
How is that different from point 2?
I wonder if it was pushed anywhere that didn't crash, as an extension of "It works on my machine. Ship it!"
I've built a couple of kernel drivers over the years and what I know is that ".sys" files are to the kernel as ".dll" files are to user-space programs in that the ones with code in them run only after they are loaded and a desired function is run (assuming boilerplate initialization code is good).
I've never made a data-only .sys file, but I don't see why someone couldn't. In that case, I'd guess that no one ever checked it was correct, and the service/program that loads it didn't do any verification either -- why would it, the developers of said service/program would tend to trust their own data .sys file would be valid, never thinking they'd release a broken file or consider that files sometimes get corrupted -- another failure mode waiting to happen on some unfortunate soul's computer.
The file extension is `sys` by convention; there's nothing magical about it, and it's not handled in any special way by the OS. In the case of CrowdStrike, there seems to be some confusion as to why they use this file extension, since it's only supposed to be a config/data file to be used by the real kernel driver.
There is a story out that the problem was introduced in a post processing step after testing. That makes more sense than that there was no testing. If true it means they thought they’d tested the update, but actually hadn’t.
Well, Microsoft led by example with #2: https://news.ycombinator.com/item?id=20557488
Number 4 is what everyone will fixate on, but I have the biggest problem with number 1. Anything like this sort of file should have (1) validation on all its pointers and (2) probably >2 layers of checksumming/signing. They should generally expect these files to get corrupted in transit once in a while, but they didn't seem to plan for anything other than exactly perfect communication between their intent and their kernel driver.
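A sketch of what even one such layer could look like: refuse to hand the blob to the parser at all unless the magic and a whole-file checksum match. (The trailer layout and magic here are hypothetical, and a CRC only catches corruption; defending against a malicious file needs a signature on top.)

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Plain bitwise CRC-32: enough to detect corruption in transit or at rest. */
static uint32_t crc32(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/* Hypothetical layout: 4-byte magic at the front, CRC of everything else in
 * the last 4 bytes. Reject the file before the real parser ever sees it. */
static bool content_file_looks_intact(const uint8_t *blob, size_t len)
{
    static const uint8_t magic[4] = { 0xAA, 0xAA, 0xAA, 0xAA };
    if (len < sizeof(magic) + 4)
        return false;
    if (memcmp(blob, magic, sizeof(magic)) != 0)
        return false;
    uint32_t stored;
    memcpy(&stored, blob + len - 4, sizeof(stored));
    return stored == crc32(blob, len - 4);
}
```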
I'm betting on them having no internal testing.
No bet. There are two failures here. (1) Failing to check the data for validity, and (2) Failing to handle an error gracefully.
Both of these are undergraduate-level techniques. Heck, they are covered in most first-semester programming courses. Either of these failures is inexcusable in a professional product, much less one that is running with kernel-level privileges.
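To spell out what "gracefully" means in this context, the load path only has to treat a bad file as a non-event instead of a fatal one. A rough sketch, with the helper names entirely made up:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers assumed to exist elsewhere in the driver. */
bool validate_content(const uint8_t *blob, size_t len);
void quarantine_content_file(const char *path);
void log_event(const char *msg);

/* Graceful: a corrupt update means "keep running on the previous rules and
 * flag the file", not "dereference garbage and take the machine down". */
int load_content_update(const char *path, const uint8_t *blob, size_t len)
{
    if (!validate_content(blob, len)) {
        log_event("content update failed validation; keeping previous rules");
        quarantine_content_file(path);
        return -1;   /* caller carries on with the last known-good definitions */
    }
    /* ... apply the new rules ... */
    return 0;
}
```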
Bet: CrowdStrike has outsourced much of its development work.
What do you mean by outsourced?
He probably means work was sent offshore to offices with cheaper labor that's less skilled or less vested in delivering quality work. Though there's no proof of that yet, people just like to throw the blame on offshoring whenever $BIG_CORP fucks up, as if all programmers in the US are John Carmack and they can never cause catastrophic fuckups with their code or processes.
Not everyone in the US might be Carmack, but it's ridiculously nearsighted to assert that cultural differences don't play into people's desire and ability to Do It Right.
It's not cultural differences that make the difference in output quality, it's pay and the quality standards set by the team/management, which is also mostly a function of pay, since underpaid and unhappy developers tend not to care at all beyond doing the bare minimum to not get fired (#notmyjob, the lying-flat movement, etc.).
You think everyone writing code in the US would give two shits about the quality of their output if they see the CEO pocketing another private jet while they can barely make big-city rent?
Hell, even well paid devs at top companies in the US can be careless and lazy if their company doesn't care about quality. Have you seen some of the vulnerabilities and bugs that make it into the Android source code and on Pixel devices? And guess what, that code was written by well paid developers in the US, hired at Google leetcode standards, yet would give far-east sweatshops a run for their money in terms of carelessness. It's what you get when you have a high barrier of entry but a low barrier of output quality where devs just care about "rest and vest".
I was talking about outsourcing (and not necessarily offshoring). Too many companies like CrowdStrike are run by managers who think that management, sales, and marketing are the important activities. Software development is just an unpleasant expense that needs to be minimized. Hence: outsourcing.
That said, I have had some experience with classic offshoring. Cultural differences make a huge difference!
My experience with "typical" programmers from India, China, et al is that they do exactly what they are told. Their boss makes the design decisions down to the last detail, and the "programmers" are little more than typists. I specifically remember one sweatshop where the boss looped continually among the desks, giving each person very specific instructions of what they were to do next. The individual programmers implemented his instructions literally, with zero thought and zero knowledge of the big picture.
Even if the boss was good enough to actually keep the big picture of a dozen simultaneous activities in his head, his non-thinking minions certainly made mistakes. I have no idea how this all got integrated and tested, and I probably don't want to know.
>That said, I have had some experience with classic offshoring. Cultural differences make a huge difference!
Sure, but there's no proof yet that was the case here. That's just massive speculation based on anecdotes on your side. There's plenty of offshore devs that can run rings around western devs.
Offshoring and outsourcing are very different. It would also be very hard to talk about offshoring at a company claiming to provide services in 170 countries.
It's probably just the common US-centric bias that external development teams, particularly those overseas, may deliver subpar software quality. This notion is often veiled under seemingly intellectual critiques to avoid overt xenophobic rhetoric like "They're taking our jobs!".
Alternatively, there might be a general assumption that lower development costs equate to inferior quality, which is a flawed yet prevalent human bias.
“You get what you pay for” is still a reasonable metric, even if it is more a relative scale than an absolute one.
Don't we have those kinds of failures in almost every professional product? I've been working in the industry for over a decade and in every single company we had those bugs. The only difference was that none of those companies were developing kernel modules or whatever. Simple SaaS. And no, none of the bugs were outsourced (the companies I worked for hired only locals and people in the range of +-2h time zones).
I'd make that 98%. Outside of rounding errors in the margins, the remaining two percent is made up of logic bugs, configuration errors, bad defaults, and outright insecure design choices.
Disclosure: infosec for more than three decades.
They forgot to account for those edge cases
Heh, touché.
I feel vindicated but also a bit surprised that my gut feeling was this accurate.
Not really a surprise, to be honest. "Deserialisation" encapsulates most forms of injection attacks.
OWASP top-10 was dominated by those for a very long time. They have only recently been overtaken by authorization failures.
For the record, the top 25 common weaknesses for 2023 are listed at:
* https://cwe.mitre.org/top25/archive/2023/2023_top25_list.htm...
Deserialization of Untrusted Data (CWE-502) was number fifteen. Number one was Out-of-bounds Write (CWE-787), Use After Free (CWE-416) was number four.
CWEs that have been in every list since they started doing this (2019):
* https://cwe.mitre.org/top25/archive/2023/2023_stubborn_weakn...
# Top Stubborn Software Weaknesses (2019-2023)
Out-of-bounds Write
Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’)
Improper Neutralization of Special Elements used in an SQL Command (‘SQL Injection’)
Use After Free
Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection')
Improper Input Validation
Out-of-bounds Read
Improper Limitation of a Pathname to a Restricted Directory (‘Path Traversal’)
Cross-Site Request Forgery (CSRF)
NULL Pointer Dereference
Improper Authentication
Integer Overflow or Wraparound
Deserialization of Untrusted Data
Improper Restriction of Operations within Bounds of a Memory Buffer
Use of Hard-coded Credentials
Yup. Almost all of them are various flavor of fucking up a parser or misusing it (in particular, all the injection cases are typically caused by writing stupid code that glues strings together instead of proper parsing).
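The canonical example, since it keeps coming up: gluing strings versus binding parameters. A sketch using SQLite's C API (table and column names invented):

```c
#include <sqlite3.h>
#include <stdio.h>

/* String gluing: attacker-controlled `name` becomes part of the SQL text
 * itself, i.e. classic SQL injection. */
void find_user_glued(sqlite3 *db, const char *name)
{
    char sql[256];
    snprintf(sql, sizeof(sql),
             "SELECT id FROM users WHERE name = '%s';", name);  /* don't */
    sqlite3_exec(db, sql, NULL, NULL, NULL);
}

/* Bound parameters: the untrusted data is never re-parsed as code. */
void find_user_bound(sqlite3 *db, const char *name)
{
    sqlite3_stmt *stmt;
    if (sqlite3_prepare_v2(db, "SELECT id FROM users WHERE name = ?1;",
                           -1, &stmt, NULL) != SQLITE_OK)
        return;
    sqlite3_bind_text(stmt, 1, name, -1, SQLITE_TRANSIENT);
    while (sqlite3_step(stmt) == SQLITE_ROW)
        printf("id=%d\n", sqlite3_column_int(stmt, 0));
    sqlite3_finalize(stmt);
}
```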
That's not parsing, that's the inverse of parsing. It's taking untrusted data and injecting it into a string that will later be parsed into code without treating the data as untrusted and adapting accordingly. It's compiling, of a sort.
Parsing is the reverse—taking an untrusted string (or binary string) that is meant to be code and converting it into a data structure.
Both are the result of taking untrusted data and assuming it'll look like what you expect, but both are not parsing issues.
I can't decide what's more damning. The fact that there was effectively no error/failure handling or this:
If your content updates can break clients, they should not be able to bypass staging controls or policies.
The way I understand it, the policies the users can configure are about "agent versions". I don't think there's a setting for "content versions" you can toggle.
Maybe there isn't a switch that says "content version", but from the end user's perspective it is a new version. Whether it was a content change, or just a fix for a typo in the documentation (say), the change being pushed is different from what currently exists. And for the end user, the configuration implies that they have a chance to decide whether to accept any new change being pushed or not.
This is going to be what most customers did not realize. I'm sure Crowdstrike assured them that content updates were completely safe "it's not a change to the software" etc.
Well they know differently now.
So, I also have near zero cybersecurity expertise (I took an online intro course on cryptography out of curiosity) and no expertise in writing kernel modules actually, but why, if ever, would you parse an array of pointers... in a file... instead of any other way of serializing data that doesn't include hardcoded array offsets in an on-disk file...
Even ignoring this failure, which was catastrophic, this was a bad design asking to be exploited by criminals.
I'm curious, how else would you store direct memory offsets? No matter how you store/transmit them, eventually you're going to need those same offsets.
The problem wasn't storing raw memory offsets, it was not having some way to validate the data at runtime.
Performance, I assume. Right now it may look like the wrong tradeoff, but every day in between incidents like this we're instead complaining that software is slow.
Of course it doesn't have to be either/or; you can have fast + secure, but it costs a lot more to design, develop, maintain and validate. What you can't have is a "why don't they just" simple and obvious solution that makes it cheap without making it either less secure, less performant, or both.
Given all the other mishaps in this story, it is very well possible that the software is insecure (we know that), slow, and also still very expensive. There's a limit to how high you can push the triangle, but there's no bottom to how bad it can get.
Interesting observation. As a non-developer, what can one do to enhance coverage for these types of scenarios? Fuzz testing?
Fuzz testing absolutely should be used whenever you parse anything.
Yeah, even if you are only parsing "safe" inputs such as ones you created yourself. Other bugs and sometimes even truly random events can corrupt data.
People are target fixating too much. Sure, this parser crashed and caused the system to go down. But in an alternative universe they push a definition file that rejects every openat() or connect() syscall. Your system is now equally as dead, except it probably won't even have the grace to restart.
The whole concept of "we fuck with the system in kernel based on data downloaded from the internet" is just not very sound and safe.
It's not and that's the sad state of AV in Windows
Hmmm. Most common problems these days are certificate related I would have thought. Binary data transfers are pretty rare in an age of base64 json bloat
There are plenty of binary serialisation protocols out there, many proprietary - maybe you’ll stuff that base64’d in a json container for transit, but you’re still dealing with a binary decoder.
I was immediately willing to bet a hundred quid this was C/C++ code :)
Not that interesting a bet considering we know it's a Windows driver.
What's that, three pints in a pub inside the M25? :P
Completely agree with this sentiment though, we've known that handling of binary data in memory unsafe languages has been risky for yonks. At the very least, fuzzing should've been employed here to try and detect these sorts of issues. More fundamentally though, where was their QA? These "channel files" just went out of the door without any idea as to their validity? Was there no continuous integration check to just .. ensure they parsed with the same parser as was deployed to the endpoints? And why were the channel files not deployed gradually?
FWIW, before someone brings up JSON, GP's bet only makes sense when "binary" includes parsing text as well. In fact, most notorious software bugs are related to misuse of textual formats like SQL or JS.
Yes indeed. If you are doing this kind of job, reach for a parser generator framework and fuzz your program.
Also go read Parse Don’t Validate https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
Yep.
Looking at how this whole thing is pasted together, there's probably a regex engine in one of those sys files somewhere that was doing the "parsing"...
Next time you'd better be adding /s to your posts.
More or less. Binary parsers are the easiest place to find exploits because of how hard it is to do correctly. Bounds checks, overflow checks, pointer checks, etc. Especially when the data format is complicated.
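The overflow checks are the ones that bite even careful people, because the naive bounds check can itself wrap around. A small sketch (not from any real codebase):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Naive: `off + len` wraps on overflow, so a huge `len` sails past the check. */
bool in_bounds_buggy(size_t blob_size, uint32_t off, uint32_t len)
{
    return (uint32_t)(off + len) <= blob_size;   /* wraps, then "passes" */
}

/* Overflow-safe: rearrange so no addition can wrap. */
bool in_bounds_safe(size_t blob_size, uint32_t off, uint32_t len)
{
    return off <= blob_size && len <= blob_size - off;
}
```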
This. One year ago UK air traffic control collapsed due to an inability to properly parse a "faulty" flight plan: https://news.ycombinator.com/item?id=37461695
This problem has a promising solution, WUFFS, "a memory-safe programming language (and a standard library written in that language) for Wrangling Untrusted File Formats Safely."
HN discussion: https://news.ycombinator.com/item?id=40378433
HN discussion of Wuffs implementation of PNG parser: https://news.ycombinator.com/item?id=26714831
I wouldn't blame imperative programming.
Eg Rust is imperative, and pretty good at telling you off when you forgot a case in your switch.
By contrast, the variant of Scheme I used twenty years ago was functional, but didn't have checks for covering all cases. (And Haskell's GHC didn't have that check turned on by default a few years ago. Not sure if they changed that.)
Related talk:
28c3: The Science of Insecurity (2011)
https://www.youtube.com/watch?v=3kEfedtQVOY
I’d say that it is a bug by definition if your program ungracefully crashes when it’s passed malformed data at runtime.
"human programmers forget to account for edge cases"
Which is precisely the rationale which led to Standard Operating Procedures and Best Practices (much like any other Sector of business has developed).
I submit to you, respectfully, that a corporation shall never rise to a $75 Billion Market Cap without a bullet-proof adherence to such, and thus, this "event" should be properly characterized and viewed as a very suspicious anomaly, at the least.
https://news.ycombinator.com/item?id=41023539 fleshes out the proper context.