Due to the scale I think it's reasonable to state that in all likelihood many people have died because of this. Sure, it might be hard to attribute individual cases, but statistically I would expect a general increase in mortality across the affected population.
I used to work at MS and didn't like their 2:1 test-to-dev ratio, or their 0:1 ratio either, and I wish they had spent more effort on verification and improved processes instead of relying on testing - especially their current test-in-production approach. They got sloppy, and this was just a matter of time. And god I hate their forced updates; it's a huge hole in the threat model, basically letting in children who like to play with matches.
My important stuff is basically air-gapped. There is a gateway, but it'll only accept incoming secure sockets with a pinned certificate, and only a predefined in-house protocol on that socket. No other traffic is allowed. The thing is designed to degrade gracefully, with the idea that it'll keep working unattended for decades; the software should basically work forever as long as equivalent replacement hardware can be found.
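Not my actual gateway code, but a minimal sketch of the pinned-certificate idea in Python, assuming clients present a client certificate whose SHA-256 fingerprint is known in advance; the fingerprint, file names, and handle_inhouse_protocol are all placeholders:

```python
import hashlib
import socket
import ssl

# Placeholder values: the pinned SHA-256 fingerprint of the one client cert we trust,
# plus the gateway's own key pair. Nothing else is allowed to talk to us.
PINNED_FINGERPRINT = "ab12...ef"  # sha256 of the client's DER-encoded certificate
GATEWAY_CERT, GATEWAY_KEY = "gateway.crt", "gateway.key"

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(GATEWAY_CERT, GATEWAY_KEY)
ctx.verify_mode = ssl.CERT_REQUIRED          # the client must present a certificate
ctx.load_verify_locations("inhouse-ca.pem")  # issued by our own offline CA

with socket.create_server(("0.0.0.0", 8443)) as srv:
    while True:
        conn, _ = srv.accept()
        try:
            tls = ctx.wrap_socket(conn, server_side=True)
            peer_der = tls.getpeercert(binary_form=True)
            if hashlib.sha256(peer_der).hexdigest() != PINNED_FINGERPRINT:
                tls.close()                  # wrong cert: drop it, no fallback
                continue
            handle_inhouse_protocol(tls)     # placeholder for the only protocol the gateway speaks
        except ssl.SSLError:
            conn.close()                     # anything that isn't our TLS client gets dropped
```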
I don't see what this has much to do with MS. A bad proprietary kernel module can crash any OS.
I don’t know the specifics of this case, but formal verification of machine code is an option. Sure it’s hard and doesn’t scale well but if it’s required then vendors will learn to make smaller kernel modules.
If something cannot be formally verified at the machine-code level, there should be a controls-level verification where vendors demonstrate they have a process in place for achieving correctness by construction.
Driver devs can be quite sloppy and copy-paste bad code from the internet. Microsoft can detect specific instances of known copy-pasted code in the machine code and knows how to patch it; I know they did this for at least one common error. But if I were in the business of delivering an OS and wanted people to rely on it, formal verification at some level would be table stakes.
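I don't know what MS's actual tooling looks like, but flagging a known copy-pasted bug at the machine-code level is essentially a signature scan; here is a toy sketch in Python where the byte pattern and filename are made up:

```python
# Illustrative only: scan a driver binary for a known-bad instruction sequence.
# The byte pattern and the filename below are invented for the example.
KNOWN_BAD_SEQUENCE = bytes.fromhex("4885c07405")  # stand-in fingerprint of the flawed snippet

with open("vendor_driver.sys", "rb") as f:
    image = f.read()

offset = image.find(KNOWN_BAD_SEQUENCE)
if offset != -1:
    print(f"known copy-pasted bug at offset {offset:#x}; reject or patch before signing")
```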
I thought Microsoft did use formal verification for kernel-mode drivers and that this kind of crash was supposed to be impossible. Is it only for their first-party code?
No, I believe third-party driver developers must pass Hardware Lab Kit testing for their drivers to be properly signed. This testing includes a suite of Driver Verifier passes, but that is not formal verification in the mathematical sense of the term.
I wasn't privy to the extent it was used. If this was formally verified to be correct and still caused this problem, then that really would be something. I'm guessing, given the size and scope of an antivirus kernel module, that they may have had to make an exception but then didn't do enough controls checking.
MS could've leaned more towards user-space drivers, though. Apple has been going in that direction for a while, and I haven't seen much of that (if anything) coming from MS.
That would have prevented a bad driver from taking down a device.
Apple created their own filesystem to make this possible.
The system volume is signed by Apple. If the signature on boot doesn't match, it won't boot.
When the system is booted, it's in read-only mode, no way to write anything to it.
If you bork it, you can simply reinstall macOS in place, without any data/application loss at all.
Of course, if you're a tinkerer, you can disable both SIP and the signature validation, but that cannot be done from user space. You'll need to boot into recovery mode to achieve that.
I don't think there's anything in NTFS or ReFS that would allow for this approach, especially when you account for the wide variety of setups an NTFS partition might sit on. With MBR, you're just SOL instantly.
Apple hardware on the other hand has been EFI (GPT) only for at least 15 years.
Well, we all know where Microsoft is on security… even the government acknowledges it's terrible.
No other OS forces an auto-restart.
Well, not the OS per se, but macOS update mechanisms have an auto-restart path, and I imagine any Linux update that touches the kernel can be configured that way too. It's more the admin's decision than the OS's, but on all common systems auto-restart is part of the menu.
No restart was needed to cause this crash. As soon as Falcon downloads the updated .sys file ... BOOM.
An immutable OS can be set up to revert to the previous version if a change causes a boot failure. Or even a COW filesystem with snapshots when changes are applied. Hell, Microsoft's own "System Restore" capability could do this, if MS provided default-on support for creating system restore points automatically when system files are changed & restoring after boot failures.
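Not anything Windows ships today, just a sketch of the revert-on-boot-failure policy; snapshot_system, rollback_to, last_good_snapshot, and the counter file are all hypothetical helpers:

```python
import json
from pathlib import Path

STATE = Path("/var/boot-health.json")   # hypothetical persistent boot-attempt counter
MAX_FAILED_BOOTS = 3

def on_system_change():
    # Before any change to system files, take a read-only snapshot. A COW filesystem
    # makes this cheap; a "System Restore" point would be the Windows analogue.
    snapshot_system(label="pre-update")          # hypothetical helper

def early_boot():
    state = json.loads(STATE.read_text()) if STATE.exists() else {"failed": 0}
    if state["failed"] >= MAX_FAILED_BOOTS:
        rollback_to(last_good_snapshot())        # hypothetical helpers: revert, then retry
        state = {"failed": 0}
    else:
        state["failed"] += 1                     # assume failure until the boot completes
    STATE.write_text(json.dumps(state))

def boot_succeeded():
    STATE.write_text(json.dumps({"failed": 0}))  # clear the counter once fully up
```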
Right, an OS completely crashing like this is the fault of the OS and the problematic code.
An OS should be really resistant to this kind of thing.
What's funny to me is that in college we had our computer lab set up such that every computer could be quickly reverted to a good working state just by rebooting. Every boot was from a static known good image, and any changes made while the computer was on were just stored as an overlay on a separate disk. People installed all manner of software that crashed the machines, but they always came back up. To make any lasting changes to the machine you had to have a physical key. So with the right kind of paranoia you can build systems that are resilient to any harmful changes.
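That lab setup is more or less what Linux's overlayfs gives you today: boot from a read-only image and send every write to a scratch layer you can throw away. A rough sketch (the paths are made up) driving the real mount -t overlay options from Python:

```python
import subprocess

# A read-only "known good" image plus a throwaway scratch layer; wiping the scratch
# disk between boots is what gave those lab machines their amnesia.
LOWER  = "/mnt/golden-image"     # the static, known-good system image (read-only)
UPPER  = "/mnt/scratch/upper"    # where this session's writes actually land
WORK   = "/mnt/scratch/work"     # overlayfs bookkeeping directory
MERGED = "/mnt/root"             # what the running system sees

subprocess.run(
    ["mount", "-t", "overlay", "overlay",
     "-o", f"lowerdir={LOWER},upperdir={UPPER},workdir={WORK}", MERGED],
    check=True,
)
# To "revert", just discard UPPER and WORK and reboot; LOWER never changes.
```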
I blame Microsoft in the larger sense; they still allow kernel extensions for use cases that Apple has shown could be moved outside the kernel.
I love their forced updates, because if you know what you're doing you can disable them, and if you don't know what you're doing, well, you shouldn't be disabling updates to begin with. I think people forget how virus-infested and bug-addled Windows used to be before they enforced updates. People wouldn't update for years and then bitch about how bad Windows was, when obviously the issue wasn't Windows at that point.
If the user wants to boot an older, known-insecure, version so that they can continue taking 911 calls or scheduling surgeries... I say let 'em. Whether to exercise this capability should be a decision for each IT department, not imposed by Microsoft on to their whole swarm.
Microsoft totally lets them. If you use any Enterprise version of Windows, the company can disable updates, but not the user.
No, after the fact. Where's the prompt at boot-time which asks you if you want to load yesterday's known-good state, or today's recently-updated state?
It's missing because users are not to be trusted with such things, and that's a philosophy with harmful consequences.
We took 911 calls all night; I was up listening to the radio, waiting for my unit to be called. The problem was the dispatching software didn't work, so we used paper and pen. Glory days!!!!
Ignoring all of the other approaches to that problem, I wonder if this update will take the record for most damage done by a single virus/update. At some point the 'cure' might be worse than the disease. If it were up to me, I would be suggesting different cures.
At one company I used to work for, we had boring, air-gapped systems that just worked all the time, until one day the security team demanded that we install this endpoint security software. Usually they would fight tooth and nail to prevent devs from giving any in-house program any network access, but they didn't blink at giving internet access to those air-gapped systems because CrowdStrike agents need to talk to their mothership in AWS. It's all good, it's for better security!
It never caught any legit threat, but it constantly flagged our own code. Our devs talked to security every other week to explain why this new line of code was not a threat. It generated a lot of work, and the security team's headcount just exploded. The software checked a lot of security checkboxes, and our CISO can sleep better at night, so I guess at the end of the day it's all worth it.
When I worked in large enterprise it got to the point that if a piece of my app infrastructure started acting weird the blackbox security agents on the machines were the first thing I suspected. Can't tell you how many times they've blocked legit traffic or blown up a host by failing to install an update or logging it to death. Best part is when I would reach out to the teams responsible for the agents they would always blame us, saying we didn't update, or weren't managing logs etc. Mind you these agents were not installed or managed by us in any way, were supposed to auto update, and nothing else on the system outran the logrotate utility. Large enterprise IT security is all about checking boxes and generating paperwork and jobs. Most of the people I've interacted with on it have never even logged into a system or cloud console. By the end I took to openly calling them the compliance team instead of the security team.
I know I've lost tenders due to not using a pre-approved anti-virus vendor, which really does suck and has impeded the growth of my company, but since I'm responsible for the security, it helps me sleep at night. This morning I woke up to a bunch of emails and texts asking me if my systems had been impacted by this, and it was nice to be able to confidently write back that we're completely unaffected.
I day-dream about being able to use immutable unikernels running on hypervisors, so that even if something were to get past a gateway there would be no way to modify the system to work in a way that was not intended.
Air-gapping with a super locked-down gateway was already getting more popular precisely because of the threat surface created by forced updates, and after today I expect it to be even more popular. At the very least I'll be able to point to this instance when explaining the rationale behind the architecture, which could help in getting exemptions from the antivirus box-ticking exercise.
My windows machine notified me of the update, asked me to restart. I was busy, so I didn't. Then the news broke, then the update was rolled back.
It wasn't a Windows update. If you got a notification for an update, it wasn't the update that did this.
This is almost definitely on Crowdstrike.
There is a Windows Release Preview channel that exists for finding issues like this ahead of time.
To be fair - it is possible the conflicting OS update did not make it to that channel. It is also possible it is due to an embarrassing bug from MSFT (unknown as yet).
Until I hear that this is the case - I am pinning this on Crowdstrike. This should have been caught before prod.
Even if this is entirely due to Crowdstrike, I see it as Microsoft's failure to properly police their market.
There is the correctness-by-testing vs. correctness-by-construction dynamic, and in my view, given the scale of interactions between an OS and its kernel modules, trying to achieve correctness by testing is negligent. Even at the market scale Microsoft has, there are not enough Windows computers to preview-test every combination, especially when you take into account that people on the preview ring behave differently from those on the mainline, so many combinations simply won't appear in the preview.
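Rough back-of-the-envelope numbers (the module count is a made-up assumption) for why preview-ring coverage can't stand in for correctness by construction:

```python
# Toy model: suppose ~200 third-party kernel modules are in circulation and each
# machine either has a given module installed or not. The 200 is invented purely
# to show the shape of the problem.
modules = 200
combinations = 2 ** modules              # ~1.6e60 distinct configurations
windows_install_base = 1.5e9             # rough order of magnitude of Windows PCs
print(f"{combinations:.1e} configurations vs {windows_install_base:.1e} machines")
```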
I see it as Microsoft owning the Windows kernel-module space while allowing sloppiness by third parties and by themselves. I don't know the specifics, but I could easily believe that this is due to a bug from Microsoft. The problem with allowing such sloppiness is that the sloppy operators out-compete the responsible operators; the bad pushes out the good until only the bad remains. A sloppy developer can push more code and gets promoted, while the careful developer gets fired.
As a tester, I'm frustrated by how little support testing gets in this industry. You can't blame bad testing if it's impossible to get reasonable time and cooperation to do more than a perfunctory job.