We plan to work with the anti-malware ecosystem to take advantage of these integrated features to modernize their approach, helping to support and even increase security along with reliability.
Providing safe rollout guidance, best practices, and technologies to make it safer to perform updates to security products.
Reducing the need for kernel drivers to access important security data.
They are being as diplomatic as they can, but it's definitely a slap to CS. Read as "they don't know how to roll things out, they need guidance on basic QA practices, we'll happily teach them...". Then, they list a set of facilities running in user-mode to avoid needing to run as many things in kernel mode.
I would be interested to know what the water cooler discussion about CS was like inside Microsoft. Especially in teams that needed to respond to customers about "Your Windows OS is broken, our hospital patients are suffering...".
I can tell you they’re quite unhappy about it. Have a friend working there who frustratedly says it wasn’t their fault every time it comes up. Which is quite often and at every social occasion since.
but it's kind of their fault? they designed the api that way, they decided what can be done in userland and what must be done via kernel. they at least _allowed_ it to happen every time.
When a parking valet takes a car on a joy ride and crashes into a tree, we could blame the tree. We could blame the car owner for handing over the key. We could blame the auto manufacturer that didn't provide a "valet mode". We could blame the police for not detecting the joy ride before the crash.
All of these parties could do better (stupid tree!). But the real problem is the valet.
We can say that it is obvious that the electronics-heavy cars of today should anticipate rogue valets and build in protections. But we shouldn't let rogue valets off the hook for damages.
As a consumer, you could choose to only purchase cars that have "valet mode". So should we blame consumers who don't? If so, we should blame the airlines, hospitals, etc.--not Microsoft.
How about we prosecute valets unless they refuse to park cars that don't have "valet mode"?
You could also prosecute the establishment that keeps a valet with an abominable record on staff.
Microsoft took no steps to force-eject them from their ecosystem, despite their long history of issues.
Just to be clear within the analogy: are you expecting the auto manufacturers to "force-eject" any hotel on Park Ave that has a record of valet mishaps? Or did you mean individual cars should force-eject the valet?
If a Caesars Entertainment property in Macao has enough incidents, should GM update the firmware on their automobiles to force-eject valets at Caesars Entertainment properties in Las Vegas?
Now imagine that GM actually operates valet services in Macao and Las Vegas. Should they be allowed to force-eject valets from competing services?
I am not a Microsoft apologist. I think they should do better. I think Linux and FreeBSD should do better. I personally avoid Microsoft products. But I place more blame on people who use MS products than I do on MS. After all, I never intend to hand my beat up old Corolla over to a valet so why should I have to pay for a "valet mode" feature that Toyota is forced to build into all their cars? Isn't it reasonable that motorcycles, 18-passenger vans, and scooters don't need "valet mode"?
In my book, the auto manufacturer is lower on the list of culprits than the valet, "the establishment that keeps a valet with an abominable record on staff", and the vehicle owner. But some place like Car and Driver could definitely prioritize encouraging GM or Toyota to develop valet modes over berating owners; so I don't mind a place like HN shooting a few arrows at MS. Unless the general public follows their lead and lets bad guys off the hook by shifting too much focus to somebody lower on the list.
Not OP, but I think the analogy here is the hotel "force-ejecting" (firing) the valet with a history of doing joy rides. That seems very reasonable.
In the analogy, it seems Microsoft is a car manufacturer. The hotel is the company that bought software from CrowdStrike. The problem is that Microsoft should not control who has access to which APIs, that is a huge can of worms, and actually called anticompetitive by the EU from what I understand. At MS level, either they publish APIs or not. If published, anyone should be able to write software for them. This is especially bad if MS themselves also sell security software that uses the same APIs. It would literally mean MS deciding who is allowed to compete with their security software.
I think it works better (please allow me to change it) if Microsoft is the hotel. Crowdstrike is the restaurant inside the hotel. The restaurant is serving poisoned food to the guests, who assume it is a decent restaurant because it is in their hotel.
Also the restaurant has their own entrance without security and questionable people are entering regularly, and they are sneaking into the hotel rooms and stealing some items, breaking the elevator.
At the same time, the hotel is in a litigation process with the restaurants association, because in the past they did not allow any restaurant on their premises. The guests, naturally, do not care about this, since their valuables have been stolen, and they have food poisoning. The reputation of the hotel is tarnished.
I don't think this works since Microsoft isn't the hotel. The hotel in your example chooses which restaurants are inside, but Microsoft doesn't. In this example, Microsoft is the builder who built the hotel building for a 3rd party. That 3rd party decides which restaurants it wants to partner with, as well as any other rules about what goes on in the building.
If the builder came around and made changes to ban the 3rd party's restaurant partner, that would cause a ton of issues and maybe get the builder sued.
Microsoft can't decide what can and can't run on their platform - the most they can do is offer certification which can't catch everything, as we just saw with Crowdstrike since they decided to take a shortcut with how they ship updates. Microsoft also had to allow for equal API access so they don't get sued by the EU.
Operating system (hotel) decides which programs run in kernel mode (Crowdstrike) but ok. Let me address the other point.
Again the reasoning of allowing equal API access to avoid getting sued is a false dichotomy: Microsoft could choose to make an OS that would not need such mechanisms to be simply usable.
They could also remove their own crowdstrike-alike offering, so that it would not be considered anti-competitive. They could also choose not to operate in EU. Of course, that would lower their profits, which is the real motive here.
Once you sum it up the reasoning goes: hospitals/flights can stop working because a company cannot lower its profits, and said company is not to blame at all. It is clearly false, the rest is sophism, and back-bending arguments IMO.
This is the correct interpretation. I am surprised that people took it in different directions.
I'm expecting restaurant owners to fire bad valets.
Or in Microsoft's case, via regulatory, social, or software, prevent Crowdstrike from causing harm to their customers.
I'm aware it's a sticky regulatory situation, but CS has a history of these failings and the potential damage could be severe. Despite this, no effort (that I am aware of) was made by Microsoft to inform customers that Crowdstrike introduced potential risks, nor to inform regulators, nor to remove the APIs CS depends on.
I don't believe Microsoft is solely responsible, but I do believe that throwing all of the blame for the very real harm that was caused onto CS alone is missing a piece of the puzzle.
Last aside, every large corp has team(s) focused on risk. There's approximately zero chance they didn't discuss CS at some point. The only way this would not have happened is negligence.
Can Microsoft legally ban a competitor for perceived incompetence? I doubt it, particularly seeing how much competence is shown with Windows and MS Teams software.
Microsoft assigns driver levels to these guys etc. and allows them to load kernel mode components as protected etc. If they do not allow that, CS cannot cause such damage. Of course, as you pointed out, this will then turn into some lawsuit blaming MS for killing competitors, even if they do it to try and protect their customers.
wonderful world.
Back in 2006 Microsoft tried to keep 3rd party vendors out of their ecosystem. <https://arstechnica.com/information-technology/2006/10/7998/> As a result of a complaint to the EU Microsoft was required to let them have kernel access. <https://www.theregister.com/2024/07/22/windows_crowdstrike_k...>
Microsoft was required to let them have the same access their own software used. Which seems fair to me. Microsoft can remove those APIs entirely, they just can't restrict them.
I’m pretty sure antitrust law doesn’t allow Microsoft to go anywhere near that kind of action, even if they wanted to be more Apple-like.
Problem is that the establishment here is, well, the establishment. That is the state itself. Or at least one of them. Somehow MS is in a position where it will be prosecuted for any slight antitrust thing. Our system is set up to allow these actors in...
No, the operating system is supposed to provide secure access to hardware and isolate independent subsystems so they can't interfere with each other. That's its whole purpose for existing. The fact that people feel they need to deploy CS is a Microsoft failure. Windows is just not a secure OS.
You’re shifting practically the entirety of the blame to a company that at best was an accomplice to the issue.
I get that you hate Microsoft, but not everything is their fault and it’s disingenuous to pretend otherwise.
CS is also available and widely deployed on Mac and Linux. Is that a failure of Apple and all the distros? It literally took down Debian and Red Hat systems earlier this year, is that also not CS’s fault?
They don't need to deploy shit. Only reason it's deployed because it's a whole racket.
You could also choose to park the car yourself or plan for a secondary mode of transportation if something happened to your car.
Not the best analogy. The organization who deploys said software is responsible for the uptime of their systems. They didn't have to use CrowdStrike and if they do they should have a plan in the event of failure.
They didn’t have much of a choice - it is very hard to get adequate performance with real-time filesystem filtering without doing it in kernel mode. Not aware of any other mainstream OS which succeeds at that.
And they kind of had to provide this feature, since they’ve supported it since forever (antivirus vendors were already doing it back in the days of MS-DOS and Windows 3.x/9x/Me), and there is a lot of market demand for it. It is easy for Linux to say “no” when it never has had support for it (in official kernels)
But, as the blog post points out, it sounds like CrowdStrike is doing a lot of stuff in kernel mode that could be done in user mode instead - whether due to laziness or lack of investment or lack of sophistication of their product architects
Microsoft, in allowing third party code to be loaded into their kernel, is no different from other major OS kernels, such as Linux or Apple XNU.
Apple is (increasingly) the most restrictive about this, and a lot of people criticise them for it.
Even Linux imposes some restrictions, such as which kernel symbols to export (at all or as GPL-only), although of course, being open source, you can circumvent all restrictions by changing the code and recompiling
Mac and Linux run EDRs in userspace without an issue. No one here has an excuse or no choice.
Linux these days tends to use eBPF which isn't really in userspace per-se.
eBPF is like the Twilight Zone. I'm in kernel space but, I'm not.
eBPF is Linux denying the fact that it's turning into a microkernel and that Linus was wrong.
If you're right for 30 years in tech you're right, even if things eventually change.
The famous Tanenbaum-Torvalds debate happened all the way back in 1992. At the time, the most common microkernel was Mach, which had significant performance problems. NeXT/Apple solved them by transforming Mach into a monolithic kernel, making Mach (as XNU) one of the most popular kernels in the world today (powering iPhones, iPads, Macs, etc). But that doesn’t help Tanenbaum’s side of the argument. And I don’t believe his own Minix did much better than Mach did.
Whereas, from what I hear, L4 and its derivatives have solved this problem in a way that Mach/Minix/etc could not. Yet still, it makes me wonder, if L4 has really solved it, why aren’t we all running L4? L4 has had some success in embedded applications (such as mobile basebands, Apple Secure Enclave); but as a general purpose operating system has never really taken off.
from what I understand a huge number of computers run Minix, but only in the Intel Management Engine
Well, CrowdStrike crashed a kernel with it
Apparently that wasn't (entirely) CrowdStrike's fault: https://news.ycombinator.com/item?id=41030352
Whereas this Windows outage rather obviously was.
eBPF being able to crash the kernel is usually a sign of a kernel bug. And it sounds like in this case it was even a bug specific to Red Hat kernels, introduced by a Red Hat patch.
That said, even if they are triggering a Red Hat kernel bug, CrowdStrike should be testing their software adequately enough to pick up that issue before customers do – and it sounds like they haven't been
That was more of a kernel bug than a crowdstrike bug. However, it's clear that they are pushing what you can do in kernel space to the limits, which is not a great sign.
Isn't being able to crash anything with eBPF a bug in either the kernel or eBPF? As I understand it, it's supposed to prevent exactly that.
Can you re-read the list (source Wikipedia) in one of the comments in the tree? It had Debian and Red Hat issues listed on different dates.
You can't just let people do anything from userland, the performance would tank. As for restricting kernelland, EU competition regulators would not be happy if MS was the only one able to write anti virus software that runs in kernelland.
Isn't the point of userland that you can (try to) do anything from there?
It seems like MacOS and Linux provide substantially safer alternatives that are still performant?
I keep seeing people say this. Is there a basis for that assertion, or is that mere speculation? Again, hasn't MacOS already deprecated kexts?
There is basis for that assertion.
Via Google: https://www.techtarget.com/searchsecurity/news/450420491/Mic...
(Also via myself, as I was at MS when we wanted to make this change and the EU said no.)
Well Microsoft did not publicly commit to using the same APIs, and no privileged access, for its own antivirus products. That's why the EU said no way; not because kernel access was revoked.
Yes, but then of course Microsoft is being obligated to open part of kernelspace to competitors, which is arguably "OK" from a competitive regulation perspective, but that then places a special burden on competitors to maintain code hygiene given the potential for crashes. It makes CrowdStrike's negligence all the more unacceptable.
MacOS still keeps the kexts support around, even if the long term roadmap is to move everything into userspace.
What are the Linux alternatives you are talking about?
[flagged]
[flagged]
Please don't respond to a bad comment by breaking the site guidelines yourself. That only makes things worse.
(Your comment would be fine without that first bit.)
https://news.ycombinator.com/newsguidelines.html
There are ways around this that I've discussed elsewhere so I won't repeat them here.
However, think of it this way: Windows restarts, tries to load with new patch and crashes.
Question: why can't Windows be designed so that on crash it automatically restarts and loads the previous state sans patch?
Answer: Windows could be designed that way but it would require Microsoft to do many things it doesn't want to do. Some of which would require Microsoft to go back to the beginning and reengineer quarter-century or more old code from scratch, that means redesigning APIs and the underlying architecture from first principles.
Why doesn't Microsoft want to do this? It's obvious so I won't bother to spell it out.
Nevertheless, when the dust fully settles and someone outlines these alternative design strategies in great detail then it'll be obvious to everyone what a fragile stack of cards Windows has been constructed on.
Please don't post in the flamewar style to HN, such as you did here and downthread (https://news.ycombinator.com/item?id=41096774). It's not what this site is for, and destroys what it is for.
If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.
Your car _allows_ you to drive off a cliff. If you do so, it is your fault, not the fault of the car manufacturer.
Kind of weird that anyone is blaming Microsoft for any part of this, imo
Mmm… meaningless analogies are kind of meaningless?
More like:
If you install a security product that then prevents your car from starting; are they entirely blameless for letting you install it?
If you pull the hood up, tear off the “voids warranty” seal, ignore the “don’t open this” labels, crack the seals open and shove something into the engine… sure.
…but if you just slap a widget with the “vendor approved” sticker on your dash and it bricks your car; that’s a bit sucky right?
I do feel Microsoft is not entirely blameless in this.
It should be easier to recover from this kind of thing.
They should have been paying attention and made a fuss that one of the biggest security vendors has been doing this literally since they started.
I would bet money that until two weeks ago Microsoft was high-5ing them for best security practices.
It’s not “their fault” but they can’t just go “wasn’t us!”.
It was them.
It wasn’t macOS. It wasn’t *nix.
Suck it up. They should’ve done better.
Except Crowdstrike had 3 separate Linux incidents, including kernel panics, directly before this happened.
And at least one of them was actually a Red Hat kernel bug, where eBPF caused a kernel panic when it shouldn't be able to?
That is the problem: you feel.
Before Microsoft comes into the picture, the issue is CrowdStrike pushing updates without proper testing and selling a product on which customers cannot control the update schedule, and customers being so naive and not checking what the product they install on critical stuff does.
The big difference is that CS is not the user. In your analogy it's like your car allows you to drive off a cliff, and an (almost) essential part of your car (for example, the pedal) drives the car off a cliff.
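To make "pushing updates without proper testing" concrete, here is a minimal sketch of a staged (ring-based) rollout that halts when an early ring shows failures. It is an illustration only, not how CrowdStrike or Microsoft actually deploy anything; `staged_rollout`, the ring percentages, and the `is_healthy` callback are all hypothetical names chosen for the example.

```python
def staged_rollout(hosts, ring_percents, is_healthy, max_failure_rate=0.01):
    """Update hosts ring by ring; stop as soon as a ring looks unhealthy.

    ring_percents are cumulative percentages of the fleet, e.g. [1, 10, 100].
    is_healthy(host) is a hypothetical post-update health check.
    Returns (hosts_updated, completed).
    """
    updated = []
    start = 0
    for percent in ring_percents:
        # Ceiling division so that small fleets still get a non-empty first ring.
        end = (len(hosts) * percent + 99) // 100
        ring = hosts[start:end]
        if not ring:
            continue
        updated.extend(ring)  # the update is applied to this ring
        failures = sum(1 for host in ring if not is_healthy(host))
        if failures / len(ring) > max_failure_rate:
            return updated, False  # halt: the remaining hosts are never touched
        start = end
    return updated, True


fleet = [f"host{i}" for i in range(1000)]

# A bad update that breaks every host it touches stops after the first ring.
bad_updated, bad_completed = staged_rollout(fleet, [1, 10, 100], lambda host: False)

# A good update proceeds through all rings.
good_updated, good_completed = staged_rollout(fleet, [1, 10, 100], lambda host: True)
```

With a fleet of 1000 hosts, the bad update reaches only the 10 hosts in the first ring, which is the whole point of staging.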
It got there because a user or administrator approved and installed it. It didn't just appear there, Microsoft didn't install it there. The user ran it.
Right, so a slightly better analogy would be if you wanted to install a remote starter, but then you find out that they can only be installed into Fords, because other auto manufacturers (Apple, Linux in this case) believe that tampering with the critical path (the engine, kernel) is unsafe. It isn't Ford who's at fault for allowing you to run some random engine modification, it's that mod that is at fault.
If it's a custom after market part, how can you blame the car manufacturer and not the part maker?
i would have thought that in 2024 a bad driver update is something that windows would automatically roll back.
or at least provided some level of protection against crashes in third party kernel code.
No you can’t roll back bad driver updates in any OS, if you could then by definition they do not sit in the kernel space. You just want the security code to not run in kernel space, which is a decision MS could maybe make and become like Apple, though most security software would in that case rebel.
it depends on how bad. in Linux you can rmmod to get rid of the bad one if you haven't wedged it and fix your code, compile, and try again. I can't imagine that's actually different on windows if you know what you're doing. how do you think driver development happens?
drivers and kernel binaries are typically installed and maintained by user space programs that run with some sort of elevated privileges.
"kernel space" is just a runtime context, what gets loaded into there typically comes from ordinary (protected) files on the disk.
That doesn't make any sense.
The OS loads file A into the kernel. It crashes. It reboots. It decides not to load file A this time.
Wow, it's a rollback of kernel-space code.
Unless your argument is that you can't guarantee a rollback of every possible kernel driver, because it might have installed a rootkit while it had full control? Okay, cool, but this isn't a malware removal idea. It's an idea for normal drivers.
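The load, crash, reboot, skip sequence can be sketched roughly as below. This is a toy model under stated assumptions, not how Windows or any real boot loader works: `BootState`, `boot`, and the `load` callback are hypothetical, a raised exception stands in for a kernel crash, and the driver being skipped is assumed not to be required for booting.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


class KernelCrash(Exception):
    """Stands in for a crash during driver load."""


@dataclass
class BootState:
    """Hypothetical record that survives reboots (e.g. stored on disk)."""
    pending: Optional[str] = None            # driver being loaded when we last wrote state
    blocked: Set[str] = field(default_factory=set)


def boot(state: BootState, drivers: List[str], load: Callable[[str], None]) -> List[str]:
    # A driver still marked pending means the previous boot died while loading it.
    if state.pending is not None:
        state.blocked.add(state.pending)
        state.pending = None
    loaded = []
    for name in drivers:
        if name in state.blocked:
            continue                         # "decides not to load file A this time"
        state.pending = name                 # persisted before the risky step
        load(name)                           # may crash here
        state.pending = None
        loaded.append(name)
    return loaded


def load(name: str) -> None:
    if name == "file_a":
        raise KernelCrash(name)


state = BootState()
drivers = ["disk", "file_a", "net"]
try:
    boot(state, drivers, load)               # first boot crashes on file_a
except KernelCrash:
    pass                                     # the machine reboots
result = boot(state, drivers, load)          # second boot skips file_a
```

After the second call, `result` is `["disk", "net"]` and `file_a` sits in `state.blocked`. The design choice that matters is writing the pending marker before the load, so the evidence survives the crash.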
Good explanation about this point at 11:15 over at https://youtu.be/wAzEJxOo1ts?si=wGXDJZtUczcIui9F
I think if I understand the systems right Windows can roll back a bad driver update but the CS update wasn’t an update to the driver but instead updated a configuration file which CS updated outside of Windows Update. So from the Windows Update perspective the system started failing to boot with no changes to the system. Again though I don’t know if I totally understand what CS did and what capabilities Windows Update has.
It was not a driver update.
Microsoft tried to lock down kernel access in the Windows Vista era. Antivirus vendors went crying to the EU and they forced Microsoft to allow access to the kernel to third parties.
it's like a userland video driver - thousands of context switches per second, performance will dive...
An OS flexible enough where you can do something stupid enough to completely break it.
Basically iOS, which is so locked you can't even run apps not expressly approved by Apple.
Pick one. If I build a bike and you remove the brakes to save weight, don't get mad at me when you crash.
Honestly most of the conversations were about getting everyone back online.
I noticed this at work and in some other contexts last week. We weren't affected by this, but most of the people that brought this up, even technical people (other fields, not security or OS or anything like that), think that this was a Microsoft and Windows issue. They all seem surprised to hear that Microsoft wasn't the root cause of this, because no one knows or understands what Crowdstrike is or does.
Microsoft has a loooong history of botched (security) updates, so I'm not hopeful they can teach Crowdstrike much.
Do you happen to have a list of that "loooong history" of botched (security) updates?
I can only find a couple of examples after googling, which is a bit smaller than the "loooong history" you're talking about, so unless Microsoft is paying Google to delete results, maybe you're mistaken.
This is a company whose OS could not even be installed on a live network without getting rooted within a few minutes. Anybody who was paying attention knew that you didn't use any new Windows release until at least the first service pack had come out.
Granted that was a while back but painful memories die hard.
>This is a company whose OS could not even be installed on a live network without getting rooted within a few minutes.
That was Windows XP 20 years ago. Please bring arguments about modern Windows 11 security, which is the current up-to-date product they're selling and supporting, not scenarios that haven't happened in 20 years.
for a loooong history, you have to look in the past
Ah, well, if only things of the past were useful today, I'd still have hair, and probably millions made from the right investments, but unfortunately, it's what's happening today that actually matters.
So you asked for proof of a long history and are now surprised that the examples are all from the past?
How does that impact the present? If it's no longer as vulnerable today, why would I care about the past? The point is learning from mistakes and fixing them so that doesn't happen again.
If it doesn’t matter to you, why did you ask? Are you just trying to win an argument or are you being intellectually honest? Because you asked for proof of the long history someone claimed. You could have just said “the long history doesn’t matter because I only care about the current state”. That’s fine and valid, but don’t ask questions and then shift the goalposts if you don’t like the answers.
A "loooong history" needs to have a timespan of many years.
So yes it would start in the past, but it then has to continue for a long time.
Pointing out that a company was bad 20 years ago isn't enough. You need to show they were also bad 15 years ago, and 10 years ago, and 5 and/or 25 years ago.
So complaining that the only evidence was so far in the past is valid. The original goalposts were not reached. (Well, someone in another part of the thread eventually listed every google result for a windows update making anything crash, but that doesn't really establish that microsoft is "botching" updates at a level significantly above background noise, which I think was the original intent.)
Well someone posted examples from XP and someone else posted 4 botched updates in 2023, do you need a list for every year inbetween?
Was my implication of "every 5 years" not clear? But I already mentioned those links, they're pretty weak. I'm not calling an update that for a few people makes a handful of games crash "botched", when the original implication was quite juicy botching.
Also, if we're actually getting into this, the XP gripe had nothing to do with updates. That's moving the goalposts half a mile in the other direction.
??? You specifically asked for it! What are you doing.
GP is absolutely correct. You can't ask for examples of a long history of something, then dismiss examples from, you know, history.
Fair enough, but if those examples are irrelevant to modern times, what's the point of bringing them up? If we want to keep the discussion relevant to modern context then let's discuss modern history, not obsolete news from 20 years ago.
What is "modern history"?
A period of time where Microsoft has no mishaps, of course.
First thing that comes to mind is that Recall stuff from a month ago, they also release updates[0] that crash machines.
[0] https://www.tomsguide.com/news/windows-11-update-causing-blu...
Recall actually is a brilliant idea, and I dreamed of something like it for a long time, and so did plenty of people here. It's just not something you can trust a third-party business with, whether it's a fly-by-night startup or an international megacorporation known to be openly promiscuous with advertisers.
This is basically "take a screenshot every 30 seconds and compile it into a timelapse", but on steroids, and the same appeal, and arguments wrt. who gets to run it on whose machines, all apply.
If you keep your business and personal computing separate, Recall looks amazing.
The functionality does seem intriguing, but that doesn't change its security profile, which was poorly thought out and implemented.
Ignoring Windows Insider reports is bad. However, how many endpoints having issues (out of a billion+) is ‘acceptable’ after an update? We live in a news hype cycle so clearly even the one wrong failure will make it up somewhere.
However, without metrics that show BSoDs from patches (which MS will likely never share), it’s hard to see if things have improved or regressed. If they regressed, someone up in their leadership chain is hopefully following the constructive discussion here.
The company that let every db server have global admin creds and 0 logging on their cloud platform?
That didn't run their own enhanced visibility on their own cloud platform.
Vulnerabilities present in 2000 are showing up still in modern Windows versions.
https://www.csoonline.com/article/564499/3-leaked-nsa-exploi...
You have no idea the cruft and technical debt Windows has in order to maintain its backwards compatibility.
That's a bit disingenuous, though. That was, as 'Rinzler89 points out, some 20 years ago. Back then, any Linux distro would've definitely been a much safer option, because after installing you couldn't even connect it to the network, because it had no support for your cable modem or wireless card, and that's assuming you didn't fuck up your MBR with LiLo for the 20th time. Ask me how I know.
Both OS families have changed much since that time.
In 2002 I wasn't yet even out of middle school when I had a Linux distro running all key hardware components "just working". At that time at my school we were taught how to search the web, so I searched the web and looked up what hardware worked. Very simple. All I had to pitch to my parents was, "this system shares its code and encourages me to study it and learn code", which made clear to them what I was asking for wasn't just another video game console. Soon after I had a refurb laptop (fortunately not x86) and a curated WiFi card that ran Linux (and soon after, BSD) with everything "just working".
When I see someone complain about unsupported/unsupportable chips in comments on online forums, especially one dubbed "Hacker News", I am puzzled how I in my middle school years acted out a pattern that is objectively smarter* than what I read in such comments. I also happen to first-hand know I am for sure not the only one with this vantage point. Those who comment about unsupported/unsupportable chips as if it is somehow an open source kernel's fault might want to take a moment to consider how others, and how many others, are viewing such drivel. For every one of us who take the time to point this out, there are 10,000 of us experiencing utter contempt, like as if we just got an unexpected whiff of some hot garbage.
[*]And, I honestly don't think I'm even that smart.
you got lucky with the hardware. there was a bunch of wifi cards that wouldn't work in Linux because there were no drivers. and then ndiswrapper came along and let you use windows drivers in Linux. now that was a user unfriendly procedure of getting it working. some chipsets eventually got native drivers like ralink or b53 but getting things working was not easy!
There was absolutely zero luck involved. As I already wrote in the previous comment, I did something very simple. I sought out a WiFi card that already had Linux drivers and then purchased that WiFi card. I didn't have to "do anything" to get the WiFi card working.
Oh sweet, this laptop has a PCMCIA Wi-Fi card!
That'd be cool if one day I can get the laptop running on battery and not just on mains power.
Let me just set it up.
Wait a second, how do I wake up the screen again and get out of this hibernation stage ?
Why are all the fans stuck in 100% now ?
Errr, first let's see if I can get the trackpad working.
Oh please, if it were that tough then teenage me never would have managed it. 20 years ago, e.g. 2004 (I first installed it in 2001), installing Linux and getting networked was already user friendly. The only hitch I ever had was figuring out ndiswrapper, but my ethernet cards all worked "out of the box" and installers handled the bootloader without users even having to know what a bootloader was. It's not like 20 years ago was the 90s or something, and the dark days of Windows lasted well into the 00s.
Agree. I also remember those days when it was so hard to get Linux to just boot up and get your display working correctly. It was almost like a rite of passage. It was just proving grounds for how much of an expert you were and the number of hours you spent in front of the PC, just to get things working.
My point is, good and bad memories will always stand out.
There's only been a few really bad ones, but Microsoft botch Windows updates quite regularly.
>but Microsoft botch Windows updates quite regularly
OK, please show us the proof then. If it's indeed as regular as you claim, then it must be documented somewhere as a greppable list.
Tech blogs would have a field day getting traffic on their sites by keeping track of and documenting such regular mistakes if they existed.
Here's >100 of them in the past ~8 months:
https://www.manageengine.com/patch-management/resources/micr...
Where can I find a list for all OSes? I’d assume such a list would have known issues with X11 etc. I want to ensure it’s not a case of survivorship bias.
I don't think there is one... macOS doesn't have enough functionality-breaking updates to make a significant list, and Linux/BSD-based distros generally do cleanly segmented updates to individual apps and services rather than Microsoft's great big monolithic all-or-nothing OS update bundles that touch on dozens of services at the same time.
Here’s a quick 2 minute search on Google for each.
- https://www.macworld.com/article/671831/macos-wont-install-f...
- https://askubuntu.com/questions/1231849/how-to-fix-update-pr...
My own anecdote: When I got my M3 Pro in April and had to start afresh, it was stuck in a restart loop and had to take it to the Genius Bar; they asked me to answer ‘no’ to some question that I was answering differently. That was it. I have no idea on the root cause or why it was fixed this way. I don’t remember the exact screen where the answer was supposed to be different.
It's frequent enough that people pay money for AskWoody[1] to tell them when it's safe to patch or what patches to skip.
[1] https://www.askwoody.com/ms-defcon-system/
Quote, from the website:
"In general, I apply Windows Defender updates as soon as they’re available. Why? Microsoft hasn’t screwed up any of them too badly. You’re better off applying those updates than letting them slide for a week or two."
Yep, Microsoft does a good job with Windows Defender (antivirus) updates.
It's the other Windows Updates that they botch frequently enough to make people wary of patching immediately.
Anyone who worked in IT knows this, it is not something rare. Literally every month, for example one from last month:
https://www.techradar.com/computing/windows/windows-11-updat...
This is the main reason every IT professional I know disables auto updates of Windows and manually triggers updates after testing (hopefully) on multiple dummy machines on the network.
I personally remember booting to safe mode to remove Windows updates to rescue the computers more times than I can count.
Examples like that one I also found, but that's not really a "looooong list". If people can only show one single example as an argument it's kind of a moot point.
You'd experience at least 3-5 per year if you work in IT. There really is a long list but since it is not my argument, I won't list them after searching for an hour. The list starts early 2000s, not recent.
EDIT: Whatever, I will do the search for you since you cannot use google:
https://www.pcgamer.com/an-odd-bug-in-this-months-windows-10...
https://www.windowslatest.com/2023/10/22/windows-11-october-...
https://www.bleepingcomputer.com/news/microsoft/windows-10-e...
https://www.windowslatest.com/2023/02/09/microsoft-confirms-...
https://www.windowslatest.com/2023/07/16/windows-11-kb502818...
These are just the last quarter of 2023. There are over 2000 news items but I won't link them. Use the keywords Windows Update, Crash, and use the date option on Google to go before 2023.
All you could find were 4 examples in 2023? Hardly a long list, wouldn't you say?
I think my Android updates caused way more issues in one year, and that's running on immutable HW that's well known and understood by the manufacturer, so 4 issues per year for Windows doesn't sound too bad, even though I had zero in 2023.
https://en.wikipedia.org/wiki/Moving_the_goalposts
Well, from the news this morning:
https://www.forbes.com/sites/daveywinder/2024/07/27/microsof...
Experience is the best teacher
Attention to the teacher is not equal among learners. Trying to thoroughly assimilate the lesson is not everyone's move, challenging oneself with actual tests to ensure skill acquisition is rare, and going down the whole rabbit hole to figure out what unstated assumptions the teacher relies on, and understanding the limits of these suggestions, is a path only a few exceptional beings will follow.
Is MS doing it properly these days though?
If they are, then you could be right. :)
And they've learned a lot from it. For example, MS no longer universally deploys updates across the world, they have a slower rollout to avoid just such an incident.
Yeah, now one million users lose access to their computer instead of 100 million!
yes? that's 100x better! at the end of the day, internal testing just isn't going to catch every single permutation of customer configuration, so there's always a risk that something bad goes out. if you're that big, you'd start with 0.1% of the fleet instead of 1% of the fleet, so it's 100_000 before you get to 1_000_000, before going to 100%, but neither Apple nor Google have figured out a better way than that. It's industry standard at this point.
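To put rough numbers on the staged rollout described above, here is a minimal sketch. The fleet size (100 million, the figure used upthread), the stage fractions, and the helper names `bucket`, `in_stage`, and `devices_exposed` are all assumptions for illustration, not anything Microsoft, Apple, or Google have published.

```python
import hashlib

FLEET_SIZE = 100_000_000     # assumed fleet size (figure used upthread)
STAGES = [0.001, 0.01, 1.0]  # assumed stage fractions: 0.1%, 1%, 100%
BUCKETS = 100_000            # resolution of the rollout buckets


def bucket(device_id: str) -> int:
    """Map a device ID to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def in_stage(device_id: str, fraction: float) -> bool:
    """True if the device belongs to the cohort for this stage fraction.

    Because the bucket is stable, a device included in an early stage
    stays included in every later, larger stage.
    """
    return bucket(device_id) < round(fraction * BUCKETS)


def devices_exposed(fraction: float, fleet_size: int = FLEET_SIZE) -> int:
    """Approximate number of devices that receive the update at this stage."""
    return round(fleet_size * fraction)


if __name__ == "__main__":
    for fraction in STAGES:
        print(f"{fraction:.1%} of fleet -> {devices_exposed(fraction):,} devices")
```

Hashing the device ID rather than picking devices at random keeps cohorts stable across stages, so a bad update caught at the first stage never reaches the devices outside that first bucket range.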
Yes, quite the epitome of throwing stones from a glass house.
I must disagree with that take. Your last quoted sentence is in response to all the supposed self-proclaimed experts asking "why does it need kernel access"; the ones before that are to limit their own liability.
What I've heard from people in the industry is not this silly "oh no, crowdstrike is so incompetent" b.s. that is being spread on sites like HN and reddit but more of an empathic "it could have been us" sentiment. In this write up as well, Microsoft knows they have caused their share of outages, it is a technical write-up but in part, it is to cover their bases for government investigations and lawsuits that will arise from this incident.
And in part, they are also responsible for recovering from third-party driver errors and repeated boot failures caused by faulty drivers.
CrowdStrike blamed their test software, but in the same breath revealed that they haven't been using any canary deployments. The bug that caused all this was present in their kernel driver for a long time.
For being such a large cybersecurity player and deploying updates to 8.5 million devices, their quality control practices are embarrassingly lacking.
Every company I've ever been at rolls out updates slowly. Rolling out a change to 8.5 million computers at the same time seems ridiculous. Even the most cash-strapped start-ups with every incentive to cut corners tend to get staged roll-outs more or less right. It's crazy.
I had a fleet of only maybe 200 computers I updated remotely. I did canary staged roll outs.
not a software update!
Not relevant!
details are always relevant in a technical discussion. look at my other comments where i pointed out microsoft performing similar immediate av signature updates and causing chaos.
Some details are relevant, some are not.
I'm more than comfortable labelling parts of Microsoft as incompetent as well.
We can agree on that, but it is relevant because this isn't an unusual practice. Crowdstrike didn't ignore some pre-existing best practice. Lots of things need improving but facts and details matter when you talk about RCA. it isn't about blame but fixing the root cause.
When I managed ~15 developers’ Arch Linux workstations, I found it very beneficial to be the canary, and then roll out to a couple of the devs more capable of troubleshooting, and then the rest. I can always fix my own box.
8.5M all at once feels insane.
again, this is why I was snarky in my earlier post, this was not a software update. they should have used canary deployments still but in many cases prior to this incident, it was not acceptable to wait even a few hours because it can make the difference between companies getting ransomwared/hacked, so they focused on making the actual code/driver that interprets the channel file updates robust enough to handle real-time updates. Even if other players were doing canary deployments with behavioral detection updates, they're not the market leader, crowdstrike is for a reason.
Everyone that worked in an operational incident response role has blocked some indicator like an ip address or a domain. you don't do gradual roll outs for those either, and i've seen people cause outages by skipping a check or making a mistake. this is similar in many ways to that except it was for a named pipe. This could probably have waited for a canary deployment, but in general the class of content that is being deployed would be deployed right away, I'd be surprised if their practice is considered "bad" by any measure. I've seen Microsoft also deploy email quarantine signatures and defender updates that caused large scale impacts.
Here is a link of what Microsoft did earlier this year:
https://www.techradar.com/news/google-chrome-not-working-mic...
If they had canary deployments, that wouldn't have happened. I had rules that were causing chaos because of that. Now imagine if defender had a bug that caused it to crash because of a signature update. The impact would be magnitudes greater than what you saw with Crowdstrike. It's really frustrating to see the lack of technical critical thinking and arm-chair experts acting like they know what they're talking about.
Let's say the driver was "robust enough" to handle a broken channel file. How would that look exactly? Say you're responsible for writing the code which loads a new channel file. These channel files are critical; without them, your security critical product doesn't know how to do its job. The channel file parser returns a parse error. How should the driver respond? Surely you're not going to just silently disable your security critical product if someone puts a bad channel file in there?
Delete the file or mark it as corrupt so that the parser doesn't keep trying to read it, and send some telemetry back to CS to indicate there is a problem with one of the channel files. It doesn't seem very complicated at all. There are plenty of options in between "catastrophically crash the OS" and "silently disable the entire product".
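The "mark it as corrupt and report it" option could look roughly like the sketch below. Everything here is hypothetical: the `CHNL` header, the newline-separated rule format, the `.corrupt` suffix, and the `report` callback are stand-ins, since the real channel file format and CrowdStrike's telemetry path are not described in this thread. It is also user-mode Python, not driver code, so it only illustrates the control flow.

```python
from pathlib import Path
from typing import Callable, Optional

MAGIC = b"CHNL"  # hypothetical header, not the real channel file format


class ChannelParseError(Exception):
    """Raised when a channel file fails validation."""


def parse_channel_file(data: bytes) -> list[bytes]:
    """Hypothetical parser: a magic header followed by newline-separated rules."""
    if not data.startswith(MAGIC):
        raise ChannelParseError("bad magic header")
    rules = [line for line in data[len(MAGIC):].split(b"\n") if line]
    if not rules:
        raise ChannelParseError("no rules in file")
    return rules


def load_channel_file(
    path: Path, report: Callable[[str], None]
) -> Optional[list[bytes]]:
    """Load a channel file; on failure, quarantine it and report, never crash."""
    try:
        return parse_channel_file(path.read_bytes())
    except (ChannelParseError, OSError) as exc:
        # Rename the file so the next load does not retry the same bad input.
        try:
            path.replace(path.with_name(path.name + ".corrupt"))
        except OSError:
            pass  # the file may already be gone; still report the failure
        report(f"channel file {path.name} rejected: {exc}")
        return None  # caller keeps running on the last known-good rules
```

Returning `None` leaves the caller to decide the policy, for example keeping the previous known-good rule set loaded, which sits between crashing the OS and silently disabling the product.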
That seems pretty dangerous if that channel file included security critical configuration, which it presumably did
Hours... Wouldn't a 15 minute canary have found this problem about 14 minutes before it hit wider deployment?
Beyond crazy. I even have a small app that never makes it to production before being rolled out to internal and open testing first. And, even then, it's slowly rolled out to a percentage at each stage before being fully deployed. One would think a major company with kernel level access would do this at minimum.
Clearly incompetence to deploy from 0 to 8 million devices without any gradual rollout.
That goes even further, because apparently they were fully blind and didn't have crash metrics.
"Ok we push the update, and pray".
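For contrast with "push and pray", a canary gate driven by crash metrics might be sketched like this. The thresholds, the `CohortStats` fields, and `should_promote` are assumptions made up for illustration; nothing here reflects how CrowdStrike or any other vendor actually gates releases.

```python
from dataclasses import dataclass


@dataclass
class CohortStats:
    devices: int      # devices that received the update (or the baseline build)
    crashes: int      # devices that reported a crash during the soak window
    silent: int = 0   # devices that stopped checking in during the soak window

    @property
    def failure_rate(self) -> float:
        """Fraction of devices that crashed or went silent."""
        if self.devices == 0:
            return 0.0
        return (self.crashes + self.silent) / self.devices


def should_promote(
    canary: CohortStats,
    baseline: CohortStats,
    max_ratio: float = 2.0,    # assumed: tolerate up to 2x the baseline rate
    min_devices: int = 100,    # assumed: refuse to decide on a tiny sample
    abs_floor: float = 0.001,  # assumed: always allow up to 0.1% failures
) -> bool:
    """Decide whether the update may move to the next rollout stage."""
    if canary.devices < min_devices:
        return False
    allowed = max(baseline.failure_rate * max_ratio, abs_floor)
    return canary.failure_rate <= allowed
```

Counting silent devices as failures matters for this incident class: a machine stuck in a boot loop cannot send a crash report, so a missing heartbeat is the only signal the gate would get.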
I think it is past incompetence, and on into negligence. Given the stories we have heard here about emergency service failures it is likely that people died. When people die due to negligence isn't that usually criminal?
Who is negligent though? Crowdstrike, or the emergency services that are using an OS that requires third party endpoint security right out of the box in order to be safely used, or the company that makes and sells that OS?
Why not both?
Crowdstrike, for negligently not rolling out updates gradually.
And emergency services, if they don't have robust fallback procedures/systems for when their IT system goes down. I mean it's totally fine if regular doctor's visits get postponed, but 911 should never go down just because their computers are down. Just like aircraft have redundant systems, so too should 911.
(The company that makes and sells the OS -- I don't see any negligence there, in this case. If security software fundamentally requires running at the kernel level and Microsoft allows that, I don't see how Microsoft can be at fault.)
Yeah, I don’t see how one can blame Microsoft in this scenario. If you choose to run buggy kernel-level code, that’s on you, not the publisher of the kernel/OS. Especially when the code you’re running is a replacement for functionality already provided by the OS. It’s hard to argue that MS could be negligent for “not having a good enough AV/endpoint protection solution” or “allowing customers to run kernel-level code.”
It’s hard for people to understand that these massive ‘security’ enterprises are often connected by a large amount of bodies instead of competence.
Can't agree more, you found the right words.
https://www.techradar.com/news/google-chrome-not-working-mic... Not an unusual practice, and they were not the first AV company to cause outages. And again, it was not a software update; the buggy software was deployed after testing back in March. Details matter!
How about we let the lawyers figure out who had what liability, just like with the av/edr industry, we should know when the subject matter is outside our area of knowledge and expertise.
And this is how the lawsuits will start.
I shared with a sibling commenter:
https://www.techradar.com/news/google-chrome-not-working-mic...
Did Microsoft do a staged or canary roll out with that? This is not a software update, if you're making such comments then you're speaking about something outside of your field of expertise.
Their post-incident report [1] also stated that they intend to improve testing by "using testing types such as: local developer testing". One has to wonder what, if any, testing they were doing beforehand.
[1]: https://www.crowdstrike.com/blog/falcon-content-update-preli...
Well we know what the testing is, don’t we?
The update literally crashed the system it was used on.
There’s no way they couldn’t know that unless they never ran it. Right?
Is this one of those things that only happened to 10% of users? Because I haven’t seen that reported anywhere.
As far as I'm aware, it affected all systems using Crowdstrike.
Unless their developers had room temperature IQs or were actual psychopaths, I really wonder how they even managed to find developers who had the nerves to deploy to the whole world all at once like that. If it were me I'd be scared shitless, covered in sweat and probably shaking too hard to even type. Were CrowdStrike developers too stupid to even realize the magnitude of what they were doing? Or did they have cooler nerves than an open-heart surgeon? It's shocking to me that they could have done this so casually.
More likely they were following a playbook to the letter, and were therefore 100% sure of success.
Anyone in the industry could have a bug get through testing.
Some companies could have a severe and readily reproducible bug get through testing.
A few of those companies have a hand-rolled update mechanism, and can accidentally break their ability to roll back a bad release.
A few of those companies are in a position to push a release that breaks not only their own software, but the entire OS.
Very few companies in that position would roll out to 100% of client machines in a single worldwide deployment.
Microsoft should be sued, for literally having blood on their hands. There was an easily mitigated design flaw in Windows that would have greatly blunted the impact.
https://news.ycombinator.com/item?id=41095788
If "it could have been them", then I would like to read such professionals write exactly about how to avoid having a global outage like this again, rather than "showing empathy" with a corporation. Or do we just leave it up to luck, and if "it happens to them too" in a month or year, oopsies? What about which practices could be improved?
this isnt even the first time its happened. Crowdstrike has killed an OS every month for the past four months.
At this point they are a threat actor. if you havent kicked their amateur-hour software out of your infrastructure by now, chances are good senior management and engineering have at least considered it formally.
https://en.wikipedia.org/wiki/CrowdStrike#Severe_outage_inci...
That incident list is damning. Is senior leadership asleep at the wheel, or how can this many incidents possibly happen every 30 days for months on end? If leadership really cared, they'd make sure post-mortems and other best practices are in place to reduce the frequency.
Unfortunately, the executive disconnect isn't new. It's actually uncommon that they care about the reality for end users and customers (which is antithetical to my entire ethos, hence why I get paid the medium bucks). Why bother waking up and going to work every day unless you are contributing in some way to sustaining a better future for everyone? It's actually great for marketing and it's already going to be a tough 100+ years from today for our children, even with our collective care.
P.s. People can be so selfish, it kind of breaks my brain but not really. Have you seen the CO2 emissions visualization from NASA this week? It was a wakeup call for me.
'Tremendous' NASA Video Shows CO2 Spewing from US into Earth's Atmosphere https://www.newsweek.com/nasa-video-carbon-dioxide-co2-emiss...
It's concerning.. and caught no traction.. http://news.ycombinator.com/item?id=41064029
here’s a fun connection: https://x.com/anshelsag/status/1814426186933776846
“ For those who don't remember, in 2010, McAfee had a colossal glitch with Windows XP that took down a good part of the internet. The man who was McAfee's CTO at that time is now the CEO of Crowdstrike. The McAfee incident cost the company so much they ended up selling to Intel.”
so yeah, “leadership” (and that’s a loose term) doesn’t seem supremely concerned about much more than earnings
Not to worry, McAfee CTO was not actually in charge of technology
https://archive.is/20240724213623/https://www.barrons.com/ar...
The fish rots from the head.
Also what the fuck is a sales-facing CTO??
I'm suspicious of CrowdStrike now. If we rip the cover off would we find that it's little more than a reskin of McAfee?
Sometimes it’s good to take a little break after working for a company that ended up not representing your values.
I’m on #2 now and it’s been great. It’s like a breakup. “What was I thinking?”
Of course if it is representing your values and your values are purely mercenary, it’s really not going to change anything.
The Internet is able to transmit odors of rotting flesh????
Recently ordered an HP laptop for some light work (not my startup), and when placing the order said don't include McAfee, that "I don't trust them", all just from some odor!
CrowdStrike runs in kernel mode? No wonder there are problems; kernel mode sounds like more of a threat than a protection.
Sooooo, for my Web server(s), McAfee and CrowdStrike are issues I get to ignore. Problems avoided and time, money, energy saved!! Simple.
McAfee the company or the person? Because John McAfee was pretty out there...
https://www.businessinsider.com/john-mcafee-tweet-said-his-s...
Perhaps a symptom of the "Everyone is in sales" brain damage so pervasive in companies now.
Have seen region-specific Field CTO roles partner with GTM teams to co-sell with customers. Product and role domain expertise without the organizational technology responsibility.
I assume T stands for [Sales and Marketing] Technology. Which makes perfect sense because these are the core departments that the whole company is dependent on.
The product itself is a secondary cost-center, probably less important than even accounting.
Is that what happens when a company has so many Sales Engineers that they become a parallel department from regular Engineering?
Now that's interesting. I wonder why neither here nor there anybody mentions GK's name. Fear of litigation?
IMO somebody who managed to collapse the most important infrastructure on earth twice in as many decades - not a small feat, I have to admit - should be known by name to the general public, lest he'll get another chance at it.
I haven't seen any important infrastructure on earth collapse, neither in 2010 nor in 2024.
Tell that to the people whose surgeries were cancelled because of computer issues.
That was still not one of the most important pieces of infrastructure on earth.
And outages were not as global as news outlets made them out to be. CrowdStrike may have been ubiquitous in some countries, but almost absent in others. And still, CrowdStrike or Windows aren't global pieces of infrastructure.
I admit that was a bit of hyperbole. My point stands regardless.
Tech needs something like the FTC that can ban someone from working in that area after multiple demonstrations of glaring incompetence. Or evil misdirection of competence.
McAfee incident 2010 https://www.zdnet.com/article/defective-mcafee-update-causes...
Presumably it doesn't matter that much and isn't worth spending money/manpower on?
If the usefulness/quality of their software has no influence on their potential customers' decision-making process, why bother?
It would make much more sense to allocate any excess resources to the departments that do actually matter like sales and marketing.
Well, if they think any of the $20B of shareholder value lost recently has to do with the quality issues... Then perhaps they should reconsider. (Keep in mind market cap also represents their ability to raise capital in the future with more/less dilution.)
From your linked article:
Nice to see Wikipedia has devolved even further into a dumpster fire in that they are now citing random HN posts as authoritative sources of facts.
Wikipedia is not an individual actor or a hivemind, so there is no capital-T "They". It's a system of multiple people each acting on their own accord. For a developing news story like this, I find this type of sourcing acceptable, especially because it is cited as "some person on the internet claims", not as "it is true that".
If you disagree with this choice of source, you can flag this part as needing better sources. The simplest way to do so is to just leave a comment on the talk page.
Never assume malice where incompetence will suffice. I have worked on teams where we could not get basics like test or integration environments signed off for months, yet the managers expected us to go to production. Suffice to say production was also not signed off for half a year and we had to improvise. I wonder if something similar was at play at CS?
Never assume incompetence when greed will suffice.
Staffing problems?
Management often sees, “I have a dozen people on this.” When in fact the bus number was three, you laid one off, another quit and the third is sick or having life struggles.
"I have a dozen people on a dozen different things."
Or maybe CrowdStrike is dealing with the hardest threats and hence ends up having to roll out stuff rapidly against zero-days?
Not a CS fanboy, but just wanted to suggest an alternative to sheer incompetence
Yeah, but doesn't MS have to sign every kernel mode driver? They've allowed Crowdstrike's foot gun to continue to live in the kernel.
It didn't read as particularly diplomatic to me. In particular, this paragraph..
> It is possible today for security tools to balance security and reliability. For example, security vendors can use minimal sensors that run in kernel mode for data collection and enforcement limiting exposure to availability issues. The remainder of the key product functionality includes managing updates, parsing content, and other operations can occur isolated within user mode where recoverability is possible.
...was about as close to tetchy as a post like this would ever get. Basically they are saying "there was no good reason at all why CrowdStrike had to put so much code inside the actual kernel." And with the benefit of hindsight, it's a strong point.
Their business is corporate spyware to surveil employees, of course they'll use any tactic to make it work, that's the why. And their EULA states there is no liability for the company:
https://www.crowdstrike.com/terms-conditions/
Dirty policies on top of dirty practices.
>Their business is corporate spyware to surveil employees
What?! Anything you do on your corporate provided laptop is always gonna be logged by IT for security in every large company everywhere, that's news to nobody, but your company doesn't care that you use your corpo laptop to book your vacation, IT has better things to do than narc on you for that.
If your boss wants to actually spy on you they don't need Crowdstrike. There's other SW dedicated to that, depending on the laws in your jurisdiction, but that's not what Crowdstrike is for.
If you want complete privacy from your employer, just use your personal machine for your private activities instead of your work laptop, why is this so hard?
Speak for yourself. There are still companies who don't treat their employees like idiots and actually trust them. Let's not normalise pervasive surveillance.
>There are still companies who don't treat their employees like idiots and actually trust them.
Yeah sure, but how many of those are large non-tech companies?
You massively overestimate the tech competency of the average PC user if you think it's normal in most companies to not have security monitoring solutions in place over internet activity. In the latest phishing test our IT did, several users fell for the trap, despite it being a tech company. There's always gonna be someone careless one day and companies want insurance policies against that.
Having such solutions in place doesn't mean the company doesn't trust you, it's more like that old Russian proverb, "trust but verify", and for ticking security compliance boxes as an insurance policy.
Everyone makes mistakes, it's only human. So more like, speak for yourself, if you think your internet activity at work isn't logged anywhere.
I think there’s an inflection point where the company has grown so big it becomes impossible to trust every individual employee.
It won’t be about distrusting anyone in specific either, but something will go wrong for which you need to be monitoring every PC to find out what is going wrong.
Yep, there are better tools for spying, like Teramind and ActivTrak.
There are better tools for spying like Teramind and ActivTrak. Crowdstrike would make a bad spying tool. I guess there is remote CMD? And you can like, see all installed programs.
But so can SCCM/Intune from MS or another RMM like Datto that IT uses to manage PCs...
MS should have something like Project Zero for Windows applications and drivers. Any app on more than 1-5% of PCs should be tested and fuzzed and ... And the vendor then pressured into fixing the issues. Even if it is not technically their fault, it is definitely an optics problem for MS; half of the world refers to it as the Windows blue screen issue.
How would Microsoft apply pressure? Short of publicly shaming them what power do they have?
Umm. Give an x-day deadline and make it public afterwards, the way Project Zero works; threaten to take away the "Verified by MS" badge; or create a WhatsApp group of Fortune 500 CIOs and badmouth them in it.
Both of these have legal repercussions: Microsoft could very well be called a competitor of CS, so they cannot force them to do something without getting accused of abusing their market position; and a publicly traded company badmouthing another publicly traded company with an awfully complex web of mutual investments is a very bad idea in general.
It’s not that easy.
Raymond Chen: That Time We Bought EVERYTHING at Egghead.
https://youtu.be/6m_Im7J9Iaw?si=q8jLBefEdgm-PrrZ
Blogpost version: https://devblogs.microsoft.com/oldnewthing/20050824-11/?p=34...
People wouldn't need CS if Windows was better designed to begin with...
Care to elaborate?
How would a better designed Windows eliminate the business & compliance need for installing software like CS? And why hasn't that already happened?
I would think Microsoft and CS' customers have an incentive to not have such third party software on their system if possible.
Why are they being diplomatic, instead of plainly stating their contempt and revoking CS's driver/etc signing keys? Doing so would help to repair the reputational harm that CrowdStrike inflicted on Windows.
Are their lawyers telling them they can't impede CrowdStrike even though CrowdStrike is breaking Microsoft's product? They should do it anyway and dare CS to take it to court so they can publicly humiliate CS by dragging all the dirty details of their incompetence out.
People are free to install kernel modules. It shouldn’t be up to microsoft to stop them from doing so.
Microsoft tried to push back on vendors wanting kernel access in 2006 <https://arstechnica.com/information-technology/2006/10/7998/>
Microsoft has (somewhat correctly IMNSHO) pointed at the EU agreement that forced them to open the kernel up to third parties as being a factor in the CrowdStrike catastrophe. <https://www.theregister.com/2024/07/22/windows_crowdstrike_k...>
From the latter:
It's a little ironic they are taking the high ground on safe rollout practices when they had an Azure/365 outage caused by a bad config at the same time as the CS incident. Though to be fair, it only affected US central.