I appreciate that we’re finding the humour in this catastrophe, but what about the question of liability? I have seen a few stories on HN about the billions lost to this event, but so far not much in the way of lawsuits.
What is the situation? Are the licenses so ironclad that customers have no recourse? I could understand this in the case of consumers who might suffer minor inconvenience as their home PC is out of service for a few hours/days but it seems totally unacceptable for industries to accept this level of risk exposure.
This is one of the big reasons civil engineering is considered such a serious discipline. If a bridge collapses, there’s not only financial liability but the potential for criminal liability as well. Civil engineering students have it drilled into their heads that if they behave unethically or otherwise take unacceptable risks as an engineer they face jail time for it. Is there any path for software engineers to reach this level of accountability and norms of good practice?
Delta threatened to sue them for their $500M loss. CrowdStrike replied (publicly), pointing out that their contract limits CrowdStrike's liability to single-digit millions.
They then gave them a list of things they would seek in discovery, such as Delta's backup plans, failover plans, testing schedules and results, when their last backup recovery exercise was, etc.
Basically, they said, "if you sue us, we will dig so deep into your IT practices that it will be more embarrassing for you than us and show that you were at fault".
But CrowdStrike said this publicly. If they’d privately relayed it to Delta, it would have been genuine. By performatively relaying it, however, it seems they’re pre-managing optics around the expected suit.
It doesn't matter; it was 100% CrowdStrike's fault. I'm surprised it's still worth 60 billion dollars.
Yes and no.
CrowdStrike was the executioner of this epic fail for sure, but Delta's archaic infra practices made it even worse. Both the CrowdStrike and Microsoft CEOs reached out, only to be rebuffed by Delta's own. If I were the CEO, I'd accept any help I could get while I still had the benefit of public opinion.
/tin-foil-hat-on Flat-out refusal of help makes me think there are other skeletons in the closet that make Delta look even worse /tin-foil-hat-off
If you held the view that CrowdStrike and Microsoft were inherently to blame for the problem why would you trust them to meaningfully help? At best they're only capable of getting you right back into the same position that left you vulnerable to begin with.
Same reason an aircraft manufacturer gets involved in an NTSB investigation when there is an airplane crash. Just because they messed up one or more things (e.g. MCAS on the MAX) doesn't mean they can't provide expertise or additional resources to at least help with the problem.
Your take also casually disregards the fact that Delta took an extraordinarily long time to recover from the problem when the other companies recovered (albeit slowly). This is the point I'm getting at. It isn't that CS and MS aren't culpable for the outage; it's that DAL also contributed to the problem by not adequately investing in its infra.
Key difference here is that the NTSB is third party with force of law behind it. The victims in the crash – airlines and passengers – aren't rushing to the aircraft manufacturer to come fix things. Quite the opposite: the NTSB and FAA have the authority to quarantine a crash site and ensure nobody tampers with the evidence. Possible tampering with black boxes was an issue in the investigation of Air France Flight 296Q.
Being to blame is different from actively trying to sabotage you. Many companies will be re-evaluating their relationship after this problem, but doing that while your systems aren't functional seems counter-productive.
I’d reserve judgement. Delta may have been cautious about giving the arsonists a wider remit.
Using your analogy - if MS/CS are the arsonists, then Delta are the landlords unsafely storing ammonium nitrate in their own warehouse.
Their lack of response to MS/CS isn't coming from a place of reducing potential additional problems but trying to shield their own inadequacies while a potential lawsuit is brewing in the background.
https://www.reuters.com/technology/microsoft-blames-delta-it...
In this case, the fire was an accident, and the arsonists happen to be the expert firefighters, and they're very motivated to fix their mistake. They're still the experts in all stuff fire, whereas Delta is not.
"Arsonist" doesn't seem like the right word. It implies the act was intentional, and as far as I can tell there is no proof of that.
I think the more accurate description is that some firefighters were doing a controlled burn. The burn got out of control, and then you say you don't want the firefighters' help putting out the fire.
Part of the problem is assuming you can pay a contract to shift your liability completely away.
The assumption is not only perfectly valid, it's the very reason such contracts are signed in the first place! It's what companies want to buy, and it's what IT security companies exist to sell.
Yes, I know that's what everyone wants/thinks, but you actually can't do it. Because at the end of the day, you chose the vendor. So you are still liable for all of it.
Right, the risk structure presumably protects the vendor if just one customer sues, even if the amount of damages claimed is astronomical. Because vendors try to disclaim bet-the-company liability on a single contract.[1] The vendor's game is to make sure the rest of the customer base does not follow this example, because as noted in the linked article while vendors don't accept bet-the-company liability on each contract (or try not to), they do normally have some significant exposure measured in multiples of annual spend.
[1] https://www.gs2law.com/blog/current-trends-in-liability-limi...
Well, if MSFT knew how to write MSAs, CrowdStrike would have become the property of Microsoft.
It’s an argument that hits home at any bigcorp where the execs are entertaining the thought of suing CrowdStrike. Making it public once is a lot more effective than relaying it privately a hundred times. I expect most liability to come from abroad, where parts of the contract might be annulled for not being in line with local law. But even then I don't expect much. CrowdStrike delivered the service they promised; the rest is on the customer's IT. Hand over the keys and your car may be driven.
Maybe? Discovery is a core element of any lawsuit. It’s also a protected process: you can’t trawl through confidential stuff with the intent to make it public to damage the litigant.
If anything, I could see Delta pointing to this statement to restrict what CrowdStrike accesses and how [1]. (As well as with the judge when debating what gets redacted or sealed.)
[1] https://www.fjc.gov/sites/default/files/2012/ConfidentialDis...
Thank you. Nice read. Even given a protective order to keep discovery confidential, the ensuing discussion about the client's lacking IT policies that exacerbated this crisis would be public.
Most entertaining would be the discussion where CrowdStrike would argue that based on common IT-risk criteria, you should never hand over the keys to an unaudited party not practicing common IT-risk best practices and (thus) the liability is on the organization. Talk about CrowdStrike managing risks worldwide. They are doing it right now!
Seems fair. Delta didn't privately relay their intentions.
Or attempting to discourage it from becoming a pile-on.
Weirdly, we live in a society
It really seems funny that Crowdstrike’s defense is basically “you should have been better prepared for us to knock all of your systems offline.”
It’s probably true, but seems like an odd stance to take from a PR perspective or a “selling other clients in the future” perspective.
In the case of Delta, their outage was much longer than everyone else because they refused help from both Crowdstrike and Microsoft. So their defense is basically "the damages could have been mitigated if you'd listened to us".
Link?
Anyway I find it highly amusing that Delta is seeking damages from Microsoft even though Microsoft had nothing to do with it.
Delta's position is that Microsoft actively recommended and coordinated with CrowdStrike to the extent that they are co-responsible for the outcome. In a large enterprise like Delta, the vendors do work together in deployment and support. Yes, there's often a great deal of finger-pointing between vendors when something like this happens, but in general vendors so intimately linked have each other on speed-dial. It would not shock me to learn that Delta has email or chat threads involving CrowdStrike, Microsoft, and Delta employees working together during rollouts and upgrades, prior to this event.
As far as refusing help, why is that funny? If someone does something stupid and knocks you down, it's perfectly reasonable to distrust the help they offer, especially if that help requires giving them even more trust than what they've already burned.
Yeah it smacks of Experian offering you a year of "free identity theft protection" after having lost your personal data in a breach.
Changing vendors and choosing one that's more reliable is a perfectly sensible outcome of this situation once your systems are back up and you're no longer hemorrhaging money.
During an ongoing incident, when all of your operations are down, is not the time for it though. If you think there's even a 1% chance that the help can help, you should probably take it and fix your immediate problem. You can re-evaluate your decisions and vendor choices after that.
There are many articles about them refusing help, but here is one:
https://www.theverge.com/2024/8/6/24214371/microsoft-delta-l...
That's kind of typical of how much companies have been allowed to externalize costs. It's never about how the company at fault should have done better, rather it typically boils down to some variant of "the free markets provided you with a choice about who you trust and it was up to you to collect and evaluate all the information available to make your choices".
That’s kinda what AWS tells people when its services go down. If your backend can’t take a short outage without weeks of recovery, then it’s just a matter of time.
"Delta threatened to sue them for their $500M loss. CrowdStrike replied (publicly) pointing out that their contract limits CrowdStrike's liability to single-digit millions."
Delta's move seems like an attempt to assuage shareholders and help the C.E.O. save face.
Crowdstrike shouldn't be afraid of Delta. Crowdstrike should be afraid of the insurance companies that have to pay out to those businesses that have coverage that includes events like this.
Even if the payout to a company is $10,000, a big insurance company may have hundreds or thousands of similar payouts to make. The insurance companies won't just let that go; and they know exactly what to look for, how to find it, and have the people, lawyers, and time to make it happen.
Crowdstrike will get its day of reckoning. It won't be today. And it probably won't be public. But the insurance companies will make sure it comes, and it's going to hurt.
It could be as simple as a reinsurer refusing to renew coverage if a company uses CrowdStrike.
Availability (or not) of insurance coverage is surprisingly effective in enabling or disabling various commercial ventures.
The penny dropped for me whilst reading James Burke's Connections on the exceedingly-delayed introduction of the lateen-rigged sail to Europe, largely on the basis that the syndicates which underwrote (and insured) shipping voyages wouldn't provide financing and coverage to ships so rigged.
Far more recently we have notions of redlining for both mortgage lending and insurance coverage (title, mortgage, property, casualty) in inner-city housing and retail markets. Paul Baran, co-inventor of packet-based switching, writes of his parents' experience with this in Philadelphia:
"On the Future Computer Era: Modification of the American Character and the Role of the Engineer, or, A Little Caution in the Haste to Number" (1968)
<https://www.rand.org/pubs/papers/P3780.html> (footnote, p. 6).
Similarly, government insurance or guarantees (Medicare, SSI, flood insurance, nuclear power plants) have made high-risk prospects possible, or enabled effective services and markets, where laissez-faire approaches would break down.
I propose that similar approaches to issues such as privacy violation might be worth investigating. E.g., voiding any insurance policy over damages caused through the harmful use or unintended disclosure of private information. Much of the current surveillance-capitalism sector would instantly become toxic. The principal current barriers to this are that states themselves benefit from such surveillance, and of course the current industry is highly effective at lobbying for its continuance.
That’s interesting, because the TV episode states that insurers wanted the risk of piracy spread out over many smaller ships that would be lateen-rigged. I have one of the Connections books, so I’ll check to see if this is covered in it: https://youtu.be/1NqRbBvujHY?si=WfysDHPLhSJkGhzd
Interesting discrepancy, yes. I'm pretty sure of my recollection of the book.
It may be that the opportunity to diversify risk (over more smaller ships) overcame the reluctance to adopt new, untested and/or foreign technology.
It doesn’t explicitly say insurers, but it’s a pretty small logical leap from the wording (the timeframe is also c. 11th-12th century, so it could be before formal insurers).
Right.
The books and video scripts also differ amongst Burke's various series. I'll see if I can find a copy of the text to compare.
Which would be funny, since many companies are putting up with Crowdstrike to make insurers happy.
That’s odd. One is an internal process with no obligation to an external party; the other is a party specifically responsible for any repercussions of deviating from their own SDLC process[1], which they totally skipped themselves.
If I were Delta, I’d get other affected parties and together sue CrowdStrike and get all their dirty laundry out in the open.
[1] I haven’t checked but they used to list all their ISO certs, etc. Wonder if those get revoked for such glaring violations…
Delta has obligations to their passengers and similarly sidesteps screw-ups with similar contractual provisions. How much would Delta owe for not following similar IT practices? Do they now owe customers for their IT failings? Should customers now get to sue Delta for damages related to its poor IT recovery compared to other airlines?
Sure but that’d be something passengers could bring up in a suit against Delta, not someone like CS, who themselves obviously skipped their own internal SDLC and whatever other ISO certs they prominently advertised on their website.
I assume the argument is that if they can show negligence in their IT practices, then the $500 million in damages can't be all attributed to CrowdStrike's failure.
Crowdstrike's discovery process would greatly aid in passenger or general-public suits against Delta.
Civil suits focus in a large way on determining how much damage is each party’s fault. So CrowdStrike would be saying, “Of this $500M in damages, x% was from your own shitty practices, not from our mistake.” That’s why it’s all pertinent.
Correct. The legal term is “contributory negligence.”
I’m not sure anything else was material, given that the machines were bricked and clients' roll-out controls were bypassed by CrowdStrike. What client actions would have helped?
Surely someone is looking at a class action? People died. The contract can’t make that everyone else’s problem, can it?
If someone's life depends on a networked Windows (or any similar OS) machine you chose to run for that purpose, you are the criminal.
Indeed. But this is how hospitals run.
Sure it can. If every rock climbing company in the country decides that climbing ropes are too expensive and instead buys rope from the local hardware store, and that rope has a warning reading "not for use when life or valuable property is at risk", then it is 100% on those climbing companies when people die, because they were using a product in a situation that it was simply not suitable for.
The details, of course, depend on the contract and claims that Crowdstrike made. But, in the abstract, you are not responsible for making your product suitable for any use that anyone decides to use it for.
If a hospital wants to install software on their life-critical infrastructure, they are supposed to buy software that is suitable for life-critical infrastructure.
I'd LOVE to see CrowdStrike do this. The last time I dealt with the specifics of this sort of validation testing for security software was a decade ago, and from what I saw in the RCA, Delta can just keep pointing out that whatever they had worked until CrowdStrike failed to understand that the number 20 and the number 21 are not the same:
The new IPC Template Type defined 21 input parameter fields, but the integration code that invoked the Content Interpreter with Channel File 291’s Template Instances supplied only 20 input values to match against. This parameter count mismatch evaded multiple layers of build validation and testing, as it was not discovered during the sensor release testing process, the Template Type (using a test Template Instance) stress testing or the first several successful deployments of IPC Template Instances in the field.
This, combined with the lack of partitioned updates, makes me conclude they're missing table stakes WRT validation.
Wtf how do you not check for ‘quantity of arguments’ in QA testing?
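For what it's worth, the missing check is not exotic. Here's a minimal sketch of the kind of pre-dispatch parameter-count validation being described; the names and structure are invented for illustration and are not CrowdStrike's actual interfaces:

```python
# Hypothetical sketch of a pre-dispatch sanity check. The class and function
# names are invented for illustration and do not reflect CrowdStrike's internals.

class TemplateMismatch(Exception):
    pass

def validate_template_instance(declared_fields: int, supplied_values: list) -> None:
    """Reject a content update whose value count doesn't match the declared schema."""
    if len(supplied_values) != declared_fields:
        raise TemplateMismatch(
            f"expected {declared_fields} input values, got {len(supplied_values)}"
        )

# The reported failure mode: a Template Type declaring 21 fields,
# invoked with only 20 values.
try:
    validate_template_instance(21, ["value"] * 20)
except TemplateMismatch as e:
    print(f"reject update before shipping: {e}")
```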
They should be providing all that information regularly to auditors anyway. If they don’t have it handy, then their IT leadership should be replaced.
That’s not the way legal process works. CrowdStrike might be permitted to conduct discovery, but that won’t entitle them to share what they might find with the public, embarrassing or otherwise. Business records and other sensitive information relating to parties in civil matters are frequently sealed.
They might find out Delta does embarrassing things, like not testing out-of-bounds array access or doing global deployments without canarying.
The problem is that with civil engineering you're designing a physical product. Nothing is ever designed to its absolute limit, and everything is built with a healthy safety margin. You calculate a bridge to carry bumper-to-bumper freight traffic, during a hurricane, when an earthquake hits - and then add 20%. Not entirely sure about whether a beam can handle it? Just size it up! Suddenly it's a lot less critical for your calculations to be exactly accurate - if you're off by 0.5% it just doesn't matter. You made a typo on the design documents? The builder will ask for clarification if you're trying to fit a 150ft beam into a 15.0ft gap. This means a bridge collapse is pretty much guaranteed to be the result of gross negligence.
Contrast that to programming. A single "<" instead of "<=" could be the difference between totally fine and billions of dollars of damages. There isn't a single programmer on Earth who could write a 100% bug-free application of nontrivial complexity. Even the seL4 microkernel - whose whole unique selling point is the fact that it has a formal correctness proof - contains bugs! Compilers and proof checkers aren't going to complain if you ask them to do something which is obviously the wrong thing but technically possible. No sane person would accept essentially unlimited liability over even the smallest mistakes.
If we want software engineers to have accountability, we first have to find a way to separate innocent run-of-the-mill mistakes from gross negligence - and that's going to be extremely hard to formalize.
To add onto this, the Pwnie Awards also go to people who get attacked, which is something that e.g. civil engineers certainly don't get blamed for (i.e. if a terrorist blows up their bridge).
We would need a way to draw a liability line between an incident that involves a 3rd party attack, and one that doesn't, but things like SolarWinds even blur that line where there was blame on both sides. When does something become negligence, versus just the normal patching backlog that absolutely exists in every company?
And why are people aiming the gun already at software engineers, rather than management or Product Architects? SEs are the construction workers at the bridge site. Architects and Management are responsible for making, reviewing, and approving design choices. If they're trying to shift that responsibility to SEs by not doing e.g. SCA or code reviews, that's them trying to avoid liability.
Honestly, this reaction by the CEO is great for taking responsibility. Even if there's not legal liability, a lot of companies are still going to ditch CrowdStrike.
There's a really big difference though. In the physical world, an "attack" is always possible with enough physical force -- no matter how good of a lock you design, someone can still kick down the door, or cut through it, or blow it up. But with computer systems, assuming you don't have physical access, an attack is only possible as a result of a mistake on part of the programmers. Practically speaking, there's no difference between writing an out-of-bounds array access that BSoD's millions of computers, and writing an out-of-bounds array access that opens millions of computers to a zero-day RCE, and the company should not be shielded from blame for their mistake only in the latter case because there's an "attacker" to point fingers at.
Over the past few years of seeing constant security breaches, always as the result of gross negligence on the part of a company -- and seeing those companies get away scot free because they were just innocent "victims of a cyberattack", I've become convinced that the only way executives will care to invest in security is if vulnerabilities come with bankrupt-your-company levels of liability.
Right now, the costs of a catastrophic mistake are borne by the true victims -- the innocent customer who had their data leaked or their computer crashed. Those costs should be borne by the entity who made the mistake and had the power to avoid it by investing in code quality, validating their inputs, using memory-safe languages, testing and reviewing their code, etc.
Yes, we can't just all write bug-free code, and holding companies accountable won't just stop security vulnerabilities overnight. But there's a ton of room for improvement, and with how much we rely on computers for our daily lives now, I'd rather live in a world where corporate executives tell their teams "you need to write this software in Rust because we'll get a huge discount on our liability insurance." It won't be a perfect world, but it'd be a huge improvement over this insane wild west status quo we have right now.
Having such consequences would completely stop any innovation and put us into complete technological stagnation.
Which would of course result in many other and arguably much worse consequences for society.
Oh, it would do worse than that.
Every country in the world would see this as their big chance to overtake the US. Russia, China, you name it.
You would have to be an idiot to start a software company in the US. High regulation, high cost of living, high taxes, high salaries, personal liability, and a market controlled by monopolies who have the resources to comply.
They’ll leave. The entire world will be offering every incentive to leave. China would offer $50K bonuses to every engineer that emigrated the next day.
I'm confused. Why would they emigrate? You just said "high salaries"?
Moreover, China is hardly low regulation. You would get there and then not be able to check your email.
It's exactly the opposite.
In the physical world, you mostly only have to defend against small-time attackers. No bank in the world is safe from, say, an enemy army invading. The way that kind of safety gets handled is by the state itself - that's what the army is for.
In the digital world, you are constantly being attacked by the equivalent of a hundred armies, all the time. Hackers around the world, whether criminals or actual state-actors, are constantly trying to break into any system they can.
So yes, many breaches involve some kind of software issue, but it is impossible to never make any mistake. Just like no physical bank in the world would survive 1000s of teams trying to break in every single day.
I thought state actors prefer to buy over build. Do they really need to build a botnet out of your personal computer rather than just expanding their own datacenter?
Agreed on all counts.
This is why I think cyberattacks should be seen from the "victim"'s perspective as something more like a force of nature rather than a crime -- they're ubiquitous and constant, they come from all over the world, and no amount of law enforcement will completely prevent them. If you build a building that can't stand up to the rain or the wind, you're not an innocent victim of the weather, you failed to design a building for the conditions you knew would be there.
(I'm not saying that we shouldn't prosecute cyber crime, but that companies shouldn't be able to get out of liability by saying "it's the criminals' fault").
It's not possible to never make a mistake, no. But there's a huge spectrum between writing a SQL injection vulnerability and a complicated kernel use-after-free that becomes a zero-click RCE with an NSO-style exploit chain, and I'm much more sympathetic to the latter kind of mistake than the former.
The fact is that most exploits aren't very sophisticated -- someone used string interpolation to build an SQL query, or didn't do any bounds checking at all in their C program, or didn't update 3rd-party software on an internal server for 5 years. And for as long as these kinds of mistakes don't have consequences, there's no incentive for a company to adopt the kind of structural and procedural changes that minimize these risks.
In my ideal world, companies that follow good engineering practices, build systems that are secure by design, and get breached by a nation state actor in a "this could have happened to anyone" attack should be fine, whether through legislation or insurance. But when a company cheaps out on software and develops code in a rush, without attention to security, then they shouldn't get to socialize the costs of the inevitable breach.
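To make the string-interpolation point above concrete, here's a minimal sketch of the unsafe-versus-parameterized contrast; the table, column, and input are made up, and sqlite3 is used only as a convenient stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

user_input = "alice' OR '1'='1"  # attacker-controlled string

# Vulnerable: the input is interpolated directly into the SQL text,
# so the quote characters change the query's structure.
unsafe = conn.execute(
    f"SELECT email FROM users WHERE name = '{user_input}'"
).fetchall()   # returns every row

# Safe: a parameterized query treats the input purely as data.
safe = conn.execute(
    "SELECT email FROM users WHERE name = ?", (user_input,)
).fetchall()   # returns nothing

print(unsafe, safe)
```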
The trade is already a constant struggle with management over cutting corners and short term thinking. I’m not about to be blamed for that situation.
Do you think the situation for real engineers is different?
Yes. Because whilst the same pressures exist, there's a small number of engineers licensed to actually sign off on a project, and they're not going to jeopardise that license for you.
Sounds like a case of real consequences for engineers working out well.
Only if you ignore downsides like drastically increased costs for most civil engineering projects.
Look, I wasn't expecting anyone to thank me for my service when I went back to school for COBOL and saved all of your paychecks circa '97 - '99, but I'm not going to sit here and be compared to those bucket-toting girder jockeys.
To be clear, this incident was not due to an attack—CrowdStrike just shot themselves in the foot with a bad update.
True, but the reason CrowdStrike has code running in a manner that is capable of bringing down the system, and the reason they push out updates all the time, is because they are in general combating attackers.
If there were no attacks, you wouldn't need such defensive measures, meaning the likelihood of a mistake causing this kind of damage would be almost nothing.
The other side of it is this. By law, a licensed civil engineer must sign off on a civil engineering project. When doing so, the engineer takes personal legal liability. But the fact that the company needs an engineer to take responsibility means that if management tries to cut too many corners, the engineer can tell them to take a hike until they are willing to do it properly.
Both sides have to go together. You have to put authority and responsibility together. In the end, we won't get better software unless programmers are given both authority AND responsibility. Right now programmers are given neither. If one programmer says no, they are just fired for another one who will say yes. Management finds one-sided disclaimers of liability to be cheaper than security. And this is not likely to change any time soon.
Unfortunately the way that these things get changed is that politicians get involved. And let me tell you, whatever solution they come up with is going to be worse for everyone than what we have now. It won't be until several rounds of disaster that there's a chance of getting an actually workable solution.
Engineering uses repeatable processes that will ensure the final product works with a safety margin. There is no way to add a safety margin to code. Engineered solutions tend to have limited complexity or parts with limited complexity that can be evaluated on their own. No one can certify that a 1M+ line codebase is free from fatal flaws no matter what the test suite says.
This is, in my opinion, an incredibly naive take.
There are currently decades of safety margin in basically all running code on every major OS and device, at every level of execution and operation. Sandboxing, user separation, kernel/userland separation, code signing (of kernels, kernel extensions/modules/drivers, regular applications), MMUs, CPU runlevels, firewalls/NAT, passwords, cryptography, stack/etc protections built into compilers, memory-safe languages, hardware-backed trusted execution, virtualization/containerization, hell even things like code review, version control, static analysis fall under this. And countless more, and more being developed and designed constantly.
The “safety margin” is simply more complex from a classic engineering perspective and still being figured out, and it will never be as simple as “just make the code 5% more safe.” It will take decades, if not longer, to reach a point where any given piece of software could be considered “very safe” like you would any given bridge. But to say that “there is no way to add a safety margin to code” is oversimplifying the issue and akin to throwing your hands up in the air in defeat. That’s not a productive attitude to improve the overall safety of this profession (although it is unfortunately very common, and its commonality is part of the reason we’re in the mess we’re in right now). As the sibling comment says, no one (reasonable) is asking for perfection here, yet. “Good enough” right now generally means not making the same mistakes that have already been made hundreds/thousands/millions of times in the last 6 decades, and working to improve the state of the art gradually over time.
Exactly.
Part of the evaluation has to be whether the disaster was due to something that should have been preventable. If you're compromised by an APT, no liability. Much like a building is not supposed to stand up to dynamite. But if someone fat-fingered a configuration, you had no proper test environment as part of deployment, and hospitals and 911 systems went down because of it?
There is a legal term that should apply. That term is "criminal negligence". But that term can't apply for the simple reason that there is no generally accepted standard by which you could be considered negligent.
An Airbus A380 comprises about 4 million parts yet can be certified and operated within a safety margin.
Not that I think lines of code are equivalent to airplane parts, but we have to quantify complexity some way and you decided to use lines of code in your comment so I’m just continuing with that.
The reality is that we’re still just super early in the engineering discipline of software development. That shows up in poor abstractions (e.g. what is the correct way to measure software complexity), and it shows up in unwillingness of developers to submit themselves to standard abstractions and repeatable processes.
Everyone wants to write their own custom code at whatever level in the stack they think appropriate. This is equivalent to the days when every bridge or machine was hand-made with custom fasteners and locally sourced variable materials. Bridges and machines were less reliable back then too.
Every reliably engineered thing we can think of—bridges, airplanes, buildings, etc.—went through long periods of time when anyone could and would just slap one together in whatever innovative, fast, cheap way they wanted to try. Reliability was low, but so was accountability, and it was fast and fun. Software is largely still in that stage globally. I bet it won’t be like that forever though.
It seems to me if something is not safe and we can't make it reasonably safe, we shouldn't use it.
Except nobody is asking for perfection here. Every time these disasters happen, people reflexively respond to any hint of oversight with stuff like this. And yet, the cockups are always hilariously bad. It's not "oh, we found a 34-step buffer overflow that happens once every century," it's "we pushed an untested update to eight million computers, lol oops." If folks are afraid that we can't prevent THAT, then please tell me what software they've worked on so I can never use it ever.
This entire comment boils down to "we can't be held accountable because it's soooo hard you guys", which isn't even convincing to me as someone in the industry and certainly won't be to someone outside it.
What a shallow dismissal of a comment that doesn’t even claim that there shouldn’t be accountability.
His dismissal is absolutely right though. Programmers have gotten way too used to waving their hands at the public and saying "gosh, I know it's hard to understand, but this stuff is so hard". Well no, sorry, there's not a single <= in place of a < that couldn't have been caught in a unit test.
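As a toy illustration of that claim, here's roughly what such a boundary test looks like; the `in_range` helper and its off-by-one are hypothetical, not drawn from any real codebase:

```python
# Hypothetical helper with the classic off-by-one: `<` where `<=` was intended.
def in_range(value, low, high):
    """Intended contract: True when low <= value <= high (inclusive)."""
    return low <= value and value < high   # bug: silently excludes the upper bound

def test_in_range_includes_both_endpoints():
    assert in_range(0, 0, 10)
    assert in_range(10, 0, 10)   # fails against the buggy `<`, flagging the mistake

if __name__ == "__main__":
    try:
        test_in_range_includes_both_endpoints()
        print("boundary test passed")
    except AssertionError:
        print("boundary test caught the off-by-one")
```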
You're right, in the case that it was known to be a problem. There are lots of places where the "<= or <" decision can be made, some long before some guy opens a text editor; in those cases, the unit test might not catch anything because the spec is wrong!
A major difference between software development and engineering is that the requirements must be validated and accepted by the PE as part of the engineering process, and there are legal and cultural rails that exist to make that evaluation protected, and as part of that protection more independent--which I think everyone acknowledges is an imperfect independence, but it's a lot further along than software.
To fairly impute liability to a software professional, that software professional needs to be protected from safety-conscious but profit-harmful decisions. This points to some mixture of legislation (and international legislation at that), along with collective bargaining and unionization. Which are both fine approaches by me, but they also seem to cause a lot of agita from a lot of the same folks who want more software liability.
That's why you have three different, independent parties design everything important thrice, and compare the results. I'm serious. If you're not convinced this is necessary, just take a look at https://ghostwriteattack.com/riscvuzz.pdf.
(Your other suggestions are also necessary, and I don't think that would be sufficient.)
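For concreteness, a toy sketch of the "design it thrice and compare" idea; the three sort functions merely stand in for independently developed components, and real N-version systems vote over whole subsystems rather than a single call:

```python
# Toy majority-vote wrapper over three independently written implementations.

def sort_a(xs): return sorted(xs)

def sort_b(xs):                      # hand-rolled insertion sort
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

def sort_c(xs): return list(reversed(sorted(xs, reverse=True)))

def voted(xs):
    results = [f(list(xs)) for f in (sort_a, sort_b, sort_c)]
    for candidate in results:
        if sum(candidate == r for r in results) >= 2:   # majority agreement
            return candidate
    raise RuntimeError("no two implementations agree; fail safe")

print(voted([3, 1, 2]))   # [1, 2, 3]
```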
I think that's a great idea, and when I've been in a leadership role I've at least tried to have important things done at least twice. ;)
And you're right, I was pretty much just outlining what might be called "a good start".
I fail to see the difference between a misplaced operator and a misplaced bolt (think Hyatt walkway collapse), both of which could have catastrophic consequences. Do you think the CAD software they use to perform the calculations is allowed have bugs simply because it's software?
Maybe go back to entering code on punch cards if you're so fixated on the physical domain being the problem.
There's a reason we talk about the Hyatt walkway collapse but not the misplaced operator.
It could happen. People have been predicting it for years, and many think that it is only a matter of time. For a vision from 1982 of how it could happen, see: <https://books.google.com/books?id=6f8VqnZaPQwC&pg=PA167>
Consider the following scenario. We are living in 1997, and the world of office automation has finally arrived. Powerful computers that would have filled a room in 1980 now fit neatly in the bottom of drawer of every executive’s desk, which is nothing more than heavy glass plate covering an array of keyboards, screens, and color displays.
— The Network Revolution: Confessions of a Computer Scientist; Jacques Vallee, 1982
This is all true. But we _do_ have known best practices that reduce the impact of bugs.
Even the most trivial staged rollout would have caught this issue. And we're not talking about multi-week testing; even a few hours of soak time would have been fine. Failure to do that rises to the level of gross negligence.
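As a rough sketch of what even the most trivial staged rollout could look like (the ring sizes, deploy hook, and health check are all placeholder assumptions, not anyone's actual pipeline):

```python
import time

# Hypothetical staged rollout: push to progressively larger rings and stop
# at the first sign of trouble. `deploy_to` and `healthy` are placeholders
# for whatever fleet-management and telemetry hooks a real system has.

ROLLOUT_RINGS = [
    ("internal", 0.001),   # dogfood machines
    ("canary",   0.01),    # ~1% of the fleet
    ("early",    0.10),
    ("broad",    1.00),
]

def deploy_to(ring_name, fraction, update):
    print(f"deploying {update} to {ring_name} ({fraction:.1%} of fleet)")

def healthy(ring_name):
    # In reality: crash rates, boot loops, heartbeat loss over a soak window.
    return True

def staged_rollout(update, soak_seconds=1):
    for ring_name, fraction in ROLLOUT_RINGS:
        deploy_to(ring_name, fraction, update)
        time.sleep(soak_seconds)          # let telemetry accumulate
        if not healthy(ring_name):
            print(f"halting rollout of {update}: {ring_name} ring unhealthy")
            return False
    return True

staged_rollout("channel-file-291")
```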
True but they are under time pressure to add definitions for emerging vulnerabilities.
That may have been true a couple hundred years ago. It's not been true for a couple decades now, because budget became a constraint even more important than physics, and believe it or not, you will have to justify every dollar that goes into your safety margin. That's where the accuracy of modern techniques matter: the more accurate your calculations (and the more consistent inputs and processes builders employ), the less material you can use to get even closer to the designed safety margin. Accidentally making a bridge too safe means setting money on fire, and we can't have that.
That's the curse of progress. Better tools and techniques should allow to get more value - efficiency, safety, utility - for the same effort. Unfortunately, economic pressure makes companies opt for getting same or less[0] value for less effort. Civil engineering suffers from this just as much as software engineering does.
--
[0] - Eventually asymptotically approaching the minimum legal quality standard.
There's a quote I've seen various versions of: anyone can build a bridge that is safe. It takes an engineer to build a bridge that is just barely safe.
This is less complicated than you think.
Civil engineering rules, safety margins and procedures have been established through the years as people died from their absence. The practice of civil engineering is arguably millennia old.
Software is too new to have the same lessons learned and enacted into law.
The problem isn’t that software doesn’t have the kind of practices and procedures that would prevent these kinds of errors, (see the space shuttle code for example), it is that we haven’t formalized their application into law, and the “terms of service” that protects software makers has so far prevented legal case law from ensuring liability if you don’t use them.
Software engineering, compared to other engineering disciplines, has had a massive effect on the world in an incredibly short amount of time.
Did they take basic precautions like staged releases, code reviews, integration tests?
If not, then it's literally the engineering equivalent of gross negligence, and they deserve to be sued into oblivion.
Do people actually believe it when a company says something caused billions of dollars of damage? Unless you can quantify that, much like law enforcement and articulable suspicion, it's pretty useless as a metric. If you can pull something out of your ass, what does it matter?
I like the analogy. What would the equivalent of "adding safety margins" be for a piece of critical code? Building three of them with different technologies and making sure they all return the same results?
Doctors, engineers, and lawyers aren't infinitely accountable to their equivalent of bugs. Structures still fail, patients die, and lawyers lose cases despite the reality of the crime.
But they're liable when they fuck up beyond what their industry decides is acceptable. If Crowdstrike really wasn't testing the final build of their configuration files at all, then yeah -- that's obviously negligent given the potential impact and lack of customer ability to do staged rollouts. But if a software company has a bug that wasn't caught because they can't solve the halting problem, then no professional review board should fault the license holder.
I think we just (oh god -- no sentence with a just is actually that easy) need to actually look at other professional licenses to learn how their processes work. Because they've managed to incorporate humans analyzing situations where you can't have perfect information into a real process.
But I don't think any of this will happen while software is still making absolute shit loads of money.
If a bridge collapses people die. To my knowledge, nobody died or was put in mortal peril as a result of the Crowdstrike debacle.
The deaths, if any, were probably indirect, e.g. ambulances not turning up in time due to pen-and-paper fallbacks.
With all the hospitals hit by the outage, I would be surprised if the number of patients who died is zero.
Sure. Did this happen?
Why were the “emergency management downtime procedures” insufficient [1]?
[1] https://www.healthcaredive.com/news/crowdstrike-outage-hits-...
If they were equally good as the non-emergency procedures, why wouldn't we use them all the time?
Because they’re more expensive. They’re not all “equally good”; they’re good enough to keep people alive. (You repurpose resources from elective and billing procedures, et cetera.)
This betrays a lack of understanding.
What resources are you repurposing from elective procedures exactly? Your patient load hasn’t changed, and day surgical instruments and supplies are from the same pool. There’s no “well this pile of equipment is only for elective procedures”.
I’m not even sure what “billing procedures” you’d repurpose (especially in your context of “keeping people alive”).
The outage didn’t change any of these things either.
At Mount Sinai, billing staff were redirected to watch newborn babies. Apparently the electronic doors stopped working during the outage.
Never said that it did. I just don't think your idea of emergency downtime procedures at a hospital are what they are. There's paper and offline charting, most meds can be retrieved similarly, and so on. I heard a claim (from someone here) that an ER was unable to do CPR due to the outage, which could not be remotely true. Crash carts are available and are specifically set up to not require anything else but a combination. Drugs, IV/IO access, etc.
That sounds like something I would have imagined security doing. To be clear, what they most likely meant here is in the sense of "avoiding abduction of a newborn", not any kind of access to observe and oversee neonates.
I would expect them to be good enough to prevent "obvious" deaths-from-failed-procedures, but deliver a slightly lower quality of care, so that if out of 100 very seriously ill people 50 survived during normal operation, this would turn into e.g. 49.
All of this without the person obviously dying due to the alternative procedures - just e.g. the doctor saw the patient less often and didn't notice some condition as early as they would have under normal procedures.
Would you consider this assumption to be wrong? (I am a layperson, not familiar with how hospitals work except from being a patient.)
"In Alaska, both non-emergency and 911 calls went unanswered at multiple dispatch centers for seven hours.
Some personnel were shifted to the centers that were still up and running to help with their increased load of calls, while others switched to analog phone systems, Austin McDaniel, state public safety department spokesperson, told USA TODAY in an email. McDaniel said they had a plan in place, but the situation was "certainly unique.”
Agencies in at least seven states reported temporary outages, including the St. Louis County Sheriff's Office, the Faribault Police Department in Minnesota, and 911 systems in New Hampshire, Fulton County, Indiana, and Middletown, Ohio. Reports of 911 outages across the country peaked at more than 100 on Friday just before 3 a.m., according to Downdetector.
In Noble County, Indiana, about 30 miles northwest of Fort Wayne, 911 dispatchers were forced to jot down notes by hand when the system went down in the early morning hours, according to Gabe Creech, the county's emergency management director."
https://eu.usatoday.com/story/news/nation/2024/07/19/crowdst...
I mean, even if dispatch could handle it in some sense, it was certainly a problem that might have increased average time-to-scene for ambulances or firefighters. I haven't seen any report of a direct death.
Exactly. Contrast that with a bridge collapse. It isn’t a mystery or statistical exercise to deduce who died and why.
There have been numerous bridge collapses without casualties. Naturally, if one company could suddenly collapse 80% of Earth's bridges, direct deaths would be assured. It's great that there isn't such a company, for some reason!
In how many of those cases were criminal charges brought? (It’s not zero. But it’s more limited.)
Probably because our incredibly inefficient, burdened, and splintered healthcare system barely functions as is, and they do not have the time nor the resources to pause and put in place an emergency downtime operating protocol that works as well as their 15-year-old Windows cobweb.
You just responded to an article about the implementation of emergency downtime protocols by speculating, baselessly, that such protocols cannot possibly exist because your mental model of our healthcare system prohibits it. Ironically, all within the context of why software development doesn’t hold itself to the rigors of engineering.
I have argued for years that every business should have an analogue operations guide, tested every once in a while like a fire drill, down to pre-printed carbon-copy paper forms. A Lights Out, Phones Off business continuity plan would have helped American Airlines too.
Because emergency downtime isn't supposed to be local and global at the same time. Don't worry, your startup will not eat those risks, but neither will those customers stay once insurance rewrites the guidelines. All that can happen has already happened; it's just the consequences propagating now. There's nothing we can do with simple blame-shifting tactics.
I am positive that people in hospitals died as a direct result of this incident.
Do you have clinical or hospital administration experience? A source with evidence, even circumstantial?
Yes
You managed a hospital and failed to implement emergency downtime procedures? (Because that is actually criminal.) Or do you have a source?
Are you the orangutan doctor from futurama?
The commenter said they did not believe hospitals “have the time nor resources to pause and put in place an emergency downtime operating protocol” [1]. That is a reasonable guess. It’s not something one would expect from someone with “clinical or hospital administration experience.”
It’s a glib response, but so is “yes” to a request for attribution.
[1] https://news.ycombinator.com/item?id=41217683
Apropos of anything else, “emergency downtime procedures” do not guarantee the same level of care as normal operations. I’ve worked in and out of hospitals as a critical care paramedic for years.
Agreed. It’s also plausible someone had a heart attack due to the stress of flight cancellations. Do we have any evidence of either?
The difference between a bridge collapsing and everything we’re discussing is there isn’t much of a discussion around who died and why.
Deft goalpost shifting, nice.
Hospitals were affected too, I don't think it's that far fetched to think some people died, or at least some could not have been saved due to this incident.
Absent evidence I’d say it is.
Hospitals have emergency downtime procedures [1]. From what I can tell, the outage was stressful, not deadly.
[1] https://www.npr.org/2024/07/21/nx-s1-5046700/the-crowdstrike...
Apply additional stress to a sufficiently large system that human lives depend on, and someone, somewhere will die.
Sure. Who did?
When a bridge collapses, this isn’t a tough problem. We don’t need to reason from first principles to derive the dead bodies. That’s the difference.
Hospitals and doctor’s offices were paralyzed by the outage. Transplant organs are often delivered by couriers on commercial flights. Many pharmacies were unable to fulfill prescriptions.
It wasn’t just vacation travelers that were affected by Crowdstrike’s incompetence.
Small reminder that the law already has a way of deciding liability for damages, and you don't have to directly drop a bridge on someone to get in trouble.
Heck, no.
Civil engineering doesn’t change. Gravity is a constant. Physics are constants. If Rome wrote an engineering manual, it would still be quite valid today.
Imagine if we had standardized software engineering in 2003. Do you think the mandatory training about how to make safe ActiveX controls is going to save you? Do you think the mandatory training on how to safely embed a Java applet will protect your bank?
Software is too diverse, too inconsistent, and too rapidly changing to have any chance at standardization. Maybe in several decades when WHATWG hasn’t passed a single new spec into the browser.
(Edit: Also, it’s a fool’s errand, as there are literally hundreds of billions of lines of code running in production at this very moment. If you wrote an onerous engineering spec; there would not be enough programmers or lawyers on earth to rewrite and verify it all, even if given decades. This would result in Google, Apple, etc. basically getting grandfathered in while startups get the burden of following the rules - rules that China, India, and other countries happily won’t be enforcing.)
I'd imagine we wouldn't have ActiveX controls in the first place.
Wishful thinking - the IRS is still running on COBOL; our nuclear weapons until a few years ago on Windows 95. The NYC subway still has a lot of OS/2.
Standardization does not stop bad engineering. Those who think it does have not witnessed the catastrophe a bad standard can cause. Go download and implement the Microsoft Office OOXML standard - it’s freely available, ISO approved, 6000 pages, and an abomination that not even Google claims to have correctly implemented.
You're making some points for me. You are assuming COBOL, Windows 95, or OS/2 are bad because they're old. Such assumptions are the antithesis of "engineering."
It sounds as if you're saying that these were bad things because they were always bad. And maybe they were. But we might never have any software at all if we only had good software.
I'm not saying they're bad because I don't know.
Apologies. I misread your intentions.
Old technology isn’t necessarily bad in itself. It’s well documented and understood.
Where it’s bad is when the equipment to run that software is no longer manufactured. You can’t get a new computer to run Windows 95. Not even in the military. Your only option is to virtualize, adding a huge possible failure mode that was never considered previously.
Where it’s bad is when changes are needed to adapt to modern environments, and nobody’s quite sure about what they are doing anymore. There’s no test suite, never was, the documentation is full of ancient and confusing terminology, mistakes are made.
And on and on…
This is so wrong
Most suspension bridges were built without a theoretical model, because we didn't have one yet. Theory caught up much later.
Innovation often happens in the absence of theory.
That's not true, even for the first suspension bridge ever built (in the early 1800s), but it is true for example that many useful and impressive aircraft were built before the development of a physical theory of flight.
Galloping Gertie (the Tacoma Narrows Bridge) is an example in America.
Your definition of theory only fits if you scope it so narrowly that it's useless to the problem space, because the point is that theory didn't entirely cover that space. And bridges did collapse because of that.
But lack of theory didn't mean lack of rigorous testing. Gertie was built based on theory. Many other bridges were based on test results... and did fine.
You've retreated from, "built without a theoretical model, because didn't have one yet," way back to, "theory didn't entirely cover that space." This is commendable.
I'm going to go out on a limb a little and assert that not a single bridge was built out of steel or iron in the last 200 years in the US or the UK without a static analysis of the compressive and tensile forces on all the members or (in the case of bridges with many hundreds of small members) at least the dozen largest members or assemblies.
"Civil engineering doesn’t change. Gravity is a constant. Physics are constants."
Physics may be a constant, but materials and methods are not. There is a reason why ISO/IEC/ICC/ASTM/ANSI/ASME/ASHRAE/DIN/IEEE/etc standards have specific dates associated with them.
"If Rome wrote an engineering manual, it would still be quite valid today." Considering many engineering standards from a few years ago are no longer valid, this is almost certainly not true.
We have some ancient engineering manuals. A book I read, most likely Brotherhood of Kings, remarked that Mesopotamian engineering manuals are primarily concerned with how many bricks will be required for a given structure.
The manuals are valid today, I guess, but useless. We prefer pipelines to brick aqueducts. Our fortresses are made of different materials and need to defend us from different things.
That’s only a formality; reality did not change, and neither did the fact that those standards would still work, even if slightly inferior.
There is nothing saying that allowing for some standardization means we have to be stuck at 2003 levels of the state of the art. And actually, yes, many engineering disciplines do change: civil engineering brings in new construction techniques, methods for non-destructive testing, improvements to materials, and on and on, but it doesn't do so in such a free-for-all manner like the coked-up software industry does. It's a proper engineering discipline because there's control: testing the best way to do things and then rolling it out.
If we (meaning software 'engineers' and I tepidly include myself in that group) had half the self control in introducing insanity like the 10000th new javascript framework to read and write to a database like the 'proper' disciplines do, maybe it would be better because there's less churn. Why does it have to move so fast? Software is diverse and inconsistent and rapidly changing because 'the industry' (coked-out developers chasing the next big hit to their resume to level up) says it should. I just don't agree that we need that amount of change to do things that amount to mutating some data. If the techniques didn't grow beyond what was cool in 2007, or they were held there until the next thing could be evaluated and trained, but the knowledge and process around them did, perhaps we'd be in a better position. I know I certainly wouldn't mind maintaining something that was created in the last decade of the previous millennium knowing it was built with some sort of self-control and discipline in mind, and that the people working on it with me had the same mindset as well.
Simple - if you restrict the software industry, the US loses to China or any other country that doesn’t give a damn. And unless you censor the internet, there’s absolutely no way to prevent illicit software from crossing the border.
Would a business get in trouble for using it? Sure. But if all the businesses in your country are at a competitive disadvantage because the competition is so much brighter elsewhere, and that “sloppy constructed” software is allowing international competition to have greater productivity and efficiency, your country is hosed. Under your own theory, imagine if the US was stuck with ~2007 technology while China was in 2024. The tradeoff would be horrific - like, Taiwan might not exist right now, horrific.
Regulating software right now would kill the US competitive advantage. It narrows every year - that would do it overnight. The US right now literally cannot afford to regulate software. The EU, which can afford it, is already watching talent leaving.
There’s also the problem of the hundreds of billions of lines of code written before regulations running in production at this very moment. There are not enough programmers on earth that could rewrite it all to spec, even if they had decades. Does Google just get a free grandfathered-in pass then, but startups don’t?
In Canada, we have software and computer engineering programs accredited by the same entity (CEAB) that does civil engineering.
My program is rather out of date (Java Server Pages, VHDL), but the school can't lower the quality of its programs. Generally, the standard learning requirements aren't about specific technologies but about principles, like learning OOP or whatever else. The CEAB audits student work from all schools in Canada to make sure it meets their requirements.
The culture itself is probably the most important part of the engineering major. They don't round up. If you fail, you fail. And I had a course in 3rd year with a 30% fail rate. Everything's mandatory, so you just have to try again over the summer.
A lot of people drop out because they can't handle the pressure. But the people that stay understand that they can't skip out on stuff they aren't good at.
I've got an ABET accredited Computer Engineering degree from a US school. The only thing it got me in interviews was questions about why not CS.
I did not follow the path to becoming a licensed Professional Engineer, because (a) there was no apparent benefit, and (b) to my knowledge none of my colleagues were PEs, so I don't know how I would have gotten the necessary work certification.
Maybe there's corners of software where it's useful to become licensed, but not mine.
I hope you realize that "sowwy, there's too much code :3" will not fly with whatever government decides to regulate software after the next major cock-up. We can either grow up and set our own terms, or we can have them forced on us by bureaucrats whose last computer was an Apple II. Choose.
Bull - regulators can’t change reality.
The fact that China is X number of years behind us is easily demonstrable.
The amount of code running in the US, Y, is relatively easy to estimate by asking around.
Proving that the time it would take to modify Y lines of code to comply with any given law would exceed X years, thus putting us behind China, is also fairly easy, even if the exact amount of time is not.
Even our Apple II-era regulators know that going beyond that much effort (call it Z) is suicidal politically, economically, technologically, you name it. They might not understand tech, but they know it’s everywhere, and falling behind is not an option.
On that note, stop stereotyping our legislators. They have smartphones, younger aides, and many of the oldest ones are retiring this cycle.
“How to conduct water efficiently: first, collect a whole bunch of lead. Then construct pipes from said lead…”
Physics will still be the same when your faulty software tells an airplane to dive.
Gravity is not constant; it varies by location and by height.
Bubble sort, however, is always bubble sort. A similarly large portion of what engineers do in software is constant.
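To make that point concrete, here is a minimal bubble sort in Python (the function and example are mine, purely illustrative, not anything from the thread). The algorithm has been the same for well over half a century; only the fashions around it churn.

```python
def bubble_sort(items):
    """Sort a list in place by repeatedly swapping adjacent out-of-order pairs.

    The logic is unchanged across decades: O(n^2) comparisons, O(1) extra
    space, and it is stable.
    """
    n = len(items)
    for i in range(n):
        swapped = False
        # After pass i, the last i elements are already in their final positions.
        for j in range(n - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:  # nothing moved this pass, so the list is sorted
            break
    return items


print(bubble_sort([5, 1, 4, 2, 8]))  # [1, 2, 4, 5, 8]
```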
There is no reason that software couldn't be treated with the same care and respect. The only reason we don't is because the industry resists that sort of change. They want to move fast and break things while still calling themselves "engineers." Almost none of this resembles engineering.
— Edsger Wybe Dijkstra, 1988. (EWD1036)
I'm ok with that. I don't want to keep everyone out except just those who happen to have just the right mind set. Programming is about developing software for people, and the more viewpoints are in the room, the better.
Some pieces are more important than others. Those are the bits that need to be carefully regulated, as if they were bridges. But not everything we build has lives on the line.
If that means we don't get to call ourselves "engineers", I'm good with that. We work with bits, not atoms, and we can develop our own new way of handling that.
Neither do I. Neither did Dijkstra. EWD1036, “On the cruelty of really teaching computing science”, is about education reform, to enable those who don't "happen to have just the right mind set" to fully participate in actual, effective programming.
I suspect this particular title-exaggeration is fueling this particular fire.
Going forward, I believe we need to be aware that software-controlled machinery grew out of two disparate disciplines; it presently lacks the holistic thinking that long-integrated industries have.
I prefer to call it "computer programming." If the title is good enough for Ken Thompson or Don Knuth then it's good enough for me.
I’m a software engineer, with a degree, and SWE does have the same ethical principles and the same engineering process, from problem definition and requirements all the way to the development lifecycle, testing, deployment, incident management, etc. None of it includes sprints and story points.

Suffice it to say most SWEs are not being hired to do actual engineering, because the industry can’t get over the fact that just because you can update and release SW instantly doesn’t mean you should.
The lack of certification means this training isn’t reinforced to the degree it is in engineering.
Right. If the coding industry mimics the construction industry, we wind up with one position called engineer that assumes most of the liability.
The other 99.99....% of software engineers will get different titles.
All of this ignores the individuals who are most responsible for these catastrophes.
Investors and executives deliver relentless and effective pressure toward practices that maximize their profits - at the expense of all else.
They purposefully create and nurture a single point of failure and are massively rewarded for the harm that causes (while the consequences are suffered by everyone else). Thanks to the pass they reliably get, we get their style of leadership degrading every industry it can.
If their sign off is required, this could work. The question is whether it’s worth it, and if it is, in which contexts.
Civil engineers' liability is tied to standards set by government agencies/departments and industry consortia.
Standards would have to be created in software engineering - along with the associated gov & industry bodies. In civil engineering, those things grew during/from many decades of need.
To be fair, software and technology is so magically transformative that even with warranty disclaimers like “this software comes with no warranty, and we limit our liability to the purchase price”, every company in the world still lines up to buy it. Because for them it’s effectively magic, that they cannot replicate themselves.
No individual software developer, nor corporation, is foolish enough to claim their software is free of bugs, that’s why they put the risk on the customer and the customer still signs on the dotted line and accepts the risks anyway. After all, it’s still way more profitable to have the potentially-faulty software than needing an army of clerks with pen and paper instead.
Most software has to be this way or it would be exorbitantly expensive. That’s the bargain between software developers and the companies that buy the software. The customer accepts the risks and gets to pocket the large profits that the software brings (because of the software’s low cost), because that’s better than the software developer balking at the liability, no software being written at all, and an army of staff at every airport writing out boarding passes by hand. Only a few kinds of software aren’t this way - for example, the software in aircraft or nuclear power plants. That software is correspondingly extremely expensive. Most customers that can, choose to accept the risks so they can have the larger profits.
A lot of companies have insurance against events that cause them to lose sources of income. Whether that's farmers with crop insurance or big-box retailers with insurance for catastrophic damage to their big box, I would assume there's something for an infrastructure collapse bringing sales to $0 for the duration.
Even if everyone that was affected sued ClownStrike for 100% of their losses, it's not like ClownStrike has the revenue to cover those losses. So even if you're a fan of shutting them down, nobody recovers anything close to actual losses.
So what would you actually propose? Bug free code is pretty much impossible. Some risk is accepted by the user. Do you seriously think that software should be absolutely 100% bug free before being able to be used? How do you prove that? Of course, the follow up would be how clean is your code that you feel that's even achievable?
>Bug free code is pretty much impossible. Some risk is accepted by the user.
This wasn't your average SW bug, it was gross negligence on behalf of Crowdstrike, who seem not to have heard of testing SW on actual systems or of canary deployments. Big difference.

Yeah, SW bugs happen all the time, but you have to show you took some steps to prevent them, while some dev at Crowdstrike just said "whatever, it works on my machine" and pushed directly to all customer production systems on a Friday. That's the definition of gross negligence: they didn't have any processes in place to prevent something like this - not even a basic staged rollout like the one sketched below.
That's like a surgeon not bothering to sterilize his hands and then saying "oh well, hospital infections happen all the time".
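For readers unfamiliar with the term, here is a rough sketch of what a canary/staged rollout looks like. Everything below (ring names, fractions, thresholds, function names) is hypothetical and illustrative; this is not CrowdStrike's tooling, only the general shape of the safeguard being described.

```python
import time

# Hypothetical rollout rings: each stage receives the update only if the
# previous, smaller stage stayed healthy. Names and fractions are made up.
RINGS = [
    ("internal test fleet", 0.001),
    ("canary customers", 0.01),
    ("early adopters", 0.10),
    ("everyone", 1.00),
]

BAKE_TIME_SECONDS = 1  # in reality this would be hours or days


def deploy(update, fraction):
    """Stub: push `update` to `fraction` of the fleet."""
    print(f"deploying {update!r} to {fraction:.1%} of machines")


def observed_crash_rate(fraction):
    """Stub for real telemetry: the share of updated machines that crashed.

    A real implementation would query crash reports; here we pretend the
    update is healthy so the demo walks through every ring.
    """
    return 0.0


def rollback(update, fraction):
    """Stub: withdraw `update` from the machines that already received it."""
    print(f"rolling back {update!r} from {fraction:.1%} of machines")


def staged_rollout(update, crash_threshold=0.001):
    """Widen the rollout ring by ring, halting at the first sign of trouble."""
    for name, fraction in RINGS:
        deploy(update, fraction)
        time.sleep(BAKE_TIME_SECONDS)  # let the update "bake" before judging it
        if observed_crash_rate(fraction) >= crash_threshold:
            rollback(update, fraction)
            raise RuntimeError(f"update halted at ring {name!r}")


if __name__ == "__main__":
    staged_rollout("content update")
```

The point of the pattern is that a catastrophically bad update takes down a handful of test machines rather than every customer at once.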
The bug was egregious.
Using regexp (edit: in the kernel). (Wtf. It's a bloody language.) And not sanitizing the usage. Then using it differently in production than in testing. And boom.
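As I understand the public post-incident write-ups, the content interpreter had been built and tested against inputs with a fixed number of fields, and the new content referenced a field the input didn't actually carry, with nothing validating the count before use. The Python below is only a hypothetical illustration of that class of bug, not CrowdStrike's code; their sensor is a native kernel driver, where the equivalent out-of-bounds read crashes the machine instead of raising an exception.

```python
import re


def evaluate_rule(pattern: str, field_index: int, event_fields: list[str]) -> bool:
    """Apply a regex `pattern` to one field of a telemetry event.

    Nothing here validates that `field_index` is in range -- that is the bug.
    In user-space Python this raises IndexError; in a kernel driver the
    analogous out-of-bounds read takes the whole machine down.
    """
    return re.search(pattern, event_fields[field_index]) is not None


# What testing exercised: rules that only touch fields 0..19 of a 20-field event.
event = [f"value{i}" for i in range(20)]
assert evaluate_rule(r"value1\d", 19, event)

# What ships later (hypothetically): a rule referencing field 20,
# which the event never carries.
try:
    evaluate_rule(r".*", 20, event)
except IndexError as exc:
    print(f"boom: {exc}")
```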
There's people, and there's companies.
This company ought to be nuked.
Genuinely, what good does that do?
It’s all well and good to write dramatic, meaningless comments on social networks like Hacker News, but if your desire had actual consequences, can you honestly say that “nuking the company” would be a net positive?
You should look up Arthur Andersen.
Is keeping CrowdStrike around a net positive?
And hospitals and doctors have malpractice insurance. They also go through investigations, though theirs is a brotherhood in which it is difficult to get other doctors to testify against one of their own. There are also stories of people writing "The other leg" in Sharpie on their good leg, because of moronic mistakes like removing the left appendage instead of the right. So even doctors are not above negligence. We just have things in place for when they are. Why you think ClownStrike is above that is bewildering.
At the end of the day, mistakes happen. It's not like they have denied they were at fault. So I'm really not sure what you're actually wanting.
>It's not like they have denied they were at fault. So I'm really not sure what you're actually wanting.
Paying for their mistake. In money. Admitting their mistake is one thing; paying for it is another.
If your doctor made a mistake due to his negligence that costs you, wouldn't you want compensation instead of just a hollow apology?
Want and receive are two entirely different things. If someone did something against me in malice, damn straight I want ________. If someone makes a mistake, owns up to it, and changes in ways that keep them from making the same mistake again, then that's exactly the opportunity I'd hope someone would allow me if the roles were reversed. This particular company's mistake just happened to be so widespread, because of their popularity, that it seems egregious, but other outages have lasted longer and drawn far less attention. Was it an inconvenience? Yes. Was it a silly mistake in hindsight? Yes. Was it fixable? Yes. Was it malevolent? Nope. Should you lose your job for making this mistake?
Bug-free code is impossible. Code free of stupid, negligent bugs, however, is very much doable. You just can't hire anyone who happens to be able to fog a mirror to write it.
If you think this was written by a moron rather than being a breakdown in procedures, then I'd think you'd be one that barely fogs a mirror. This is no different from the multiple times that AWS us-east-1 has gone down and taken a large portion of the internet with it when they've pushed changes. Do you think AWS is hiring moronic mirror-foggers to cause havoc, or are these just examples of how, even within a bureaucratic structure like AWS's, it is still possible to sidestep the best-laid plans?
There is recourse, just not for normal people, as you eluded to. Companies are and will be continuing to sue crowdstrike, and based on the papers that crowdstrike has posted, the impacted companies are extremely likely to be successful. It seems overwhelmingly likely that the companies are going to be able to convince a judge/jury/arbiter that crowdstrike acted grossly negligent and very plainly caused both direct losses and indirect reputational harm to the companies.
I’m not sure crowdstrike will even fight it, to be honest. I would assume most of this is going to be settled out of court and we will see crowdstrike crumble in the coming years.
To my knowledge only Delta is suing and CrowdStrike is kicking and screaming about it [1].
[1] https://www.cnn.com/2024/08/05/business/crowdstrike-fires-ba...
It’s a really bad look for crowdstrike to be going down this route. Then again, I don’t think many companies are going to be adopting crowdstrike in the coming years, so I suppose their only option is to defend their stock value at any cost while the company recoils
NB: alluded, not eluded.
<https://dict.org/bin/Dict?Form=Dict2&Database=gcide&Query=Al...>
<https://dict.org/bin/Dict?Form=Dict2&Database=gcide&Query=el...>
I completely agree. When I've negotiated contracts for my workplace and we explicitly write into the contract that the vendor is responsible for XYZ, my understanding (confirmed by legal, multiple times) is that if XYZ goes wrong, they are liable up to the amount in the SLA; however, that isn't a cap on liability in extenuating circumstances.
If this all gets brushed away, it significantly devalues the "well we pay $VENDOR to manage our user data, it's on them if they store it incorrectly" proposition, which would absolutely cause us to renegotiate.
You aren’t showing us the specific language that you’re referring to, nor do we know what a typical CrowdStrike contract looks like. You could be talking about apples and oranges here. I’ve seen both.
I was pretty sure that someone was going to "ackshually" me here, and here we are. The specific wording doesn't matter.
I've negotiated dozens of these contracts, and the value-add of a vendor managing the data is liability. If they aren't liable for data mismanagement, then their managed service is only worth the infra costs plus a haircut on top, and we'll renegotiate with this in mind.
Civil engineering mostly requires you to have a government-verified certificate and to work in the country your infrastructure will be deployed in.
Software engineering doesn't, and that makes criminal prosecutions that much harder. There's no path to making it happen.
Financial liability for the company in question? Sure, that's probably doable. "Piercing the corporate veil" and punishing the executives who signed off on it? Harder but not impossible. Punishing the engineer who wrote that code, and who lives in a country with no such laws? Won't happen.
It's a relatively small (and sharply defined) pool of people who can be called a civil engineer.
Are we saying we want to segment software engineering (from coding) - the same way civil engineering is segmented from construction?
Otherwise we're talking about placing specialist liability upon a non-specialist group. This seems unethical.
I would find it more useful if liability here were attributed to the need to purchase such draconian tools: the certifications that require it and the C-levels who approve it. We would be better off for it.
Oh Christ. Just drop it. This is, by all accounts, a legitimate security function of a product targeted at company-owned endpoints.
Please don’t devolve this conversation into you being upset about not getting admin rights on your work computer or whatever this is about.
Any (esp. larger) org would be criminally negligent to eschew using something like CrowdStrike in order to capitulate to some nerd that thinks that they have ownership over their work equipment.
Yes, time. Civil engineering has thousands of years of history. Software engineering is much newer, the foundations of our craft are still in flux. There have been, at least in my country, legislative proposals for licensure of system analysts, electronic computer programmers, data processing machine operators, and typists(!) since the late 1970s; these laws, if approved, would have set back the progress of software development in my country for several decades (for instance, one proposal would make "manipulation and operation of electronic processing devices or machines, including terminals (digital or visual)" exclusive to those licensed as "data processing machine operator").
Sounds to me like it just would've made a lot of money for whatever entities give out the licenses.
On the other hand, I've read speculation on here that some countries are short on entrepreneurs entirely due to the difficulty of incorporating a small business, so maybe.
Probably in a decade or so after the AI crash. I have yet to see anything that comes close to “liability” for the digital realm.
US governments and businesses get hacked/infiltrated all the time by foreign adversaries yet we do not declare war. Maybe something happens in the dark or back channels. But we never know.
Software (controls) engineers at VW went to jail during the emissions scandal, and engineers at GM were held liable for the ignition switch issue (not mostly a software problem, but still). I expect we'll eventually see some engineers and low-level managers thrown under the bus at Boeing. It definitely happens, just not as frequently as it could.

That said, I definitely prefer Amazon's response to the AWS us-east-1 outage back in 2016: the engineer wasn't blamed, despite the relatively simple screw-up, but the processes and procedures were fixed so that it hasn't happened again in the last 8 years. Crowdstrike is a little bit gray in that regard. People should have known how bad the practice of zero testing on config updates was, but then again, I've seen some investigation suggesting that the initial description wasn't fully accurate, so I'm waiting for the final after-action report before I really pass judgement.
Potentially controversial stance here, but most software engineers are not engineers. They study computer science, which doesn't include coursework on engineering ethics among other things. I would say that by design they are less prepared to make ethical decisions and take conservative approaches.
Imagine if civil engineers had EULAs for their products. "This bridge has no warranty, implied or otherwise. Cross this bridge AT YOUR OWN RISK. This bridge shall not be used for anything safety critical etc."
It hasn't been that long? It may simply be that there hasn't yet been sufficient time to gather evidence and commence lawsuits.
I’m interested in what those who suffered outages as a result of crowdstrike told their insurers with respect to “QA’ing production changes”
It’d be interesting to see if anyone tries to claim the outage as some sort of insurance event only to lose out because they let Crowdstrike roll updates into a highly regulated environment without testing
On this website you are asking a population that would be responsible for this, so you will likely only get answers about how hard this is to solve, how it's not software engineers' fault, how we need to understand that software engineering is not civil engineering and be careful with this analogy, and how it's not our fault! Don't blame us when things go wrong, but also, give us all the money when things go right.
This is not the place for this question is what I’m saying.
Engineering safety culture is built on piles of bodies and suffering unfortunately. I suspect in software the price of failure is mostly low enough that this motivation will never develop.
How many bridges, would you say, does the average civil engineering firm deliver each year, each on only 1 day notice, in response to a surprise change in requirements due to a newly developed adversarial attack?
Crowdstrike does this constantly.
You could demand the same level of assurance from software, but in exchange, you don't get to fly, because the capacity won't be there
They totally deserved it.
Those who think running a third-party, closed-source Windows kernel driver (which parses files distributed over the Internet in real time) is good for security must also accept the consequences.

I'm sick of these so-called security consultants who insist on checklist items like installing a proprietary, closed-source binary-blob Linux kernel module into a system that otherwise consists mostly of free software (except for hardware drivers) and then think they've done their job, and of the executives who pay a lot of money to these idiotic so-called security consultants.
One of the biggest and most used pieces of software (the Linux kernel) comes with zero warranties. It can fail, and no one would be liable. Are we fine with that? Is the CS case different because it costs money? From a user's perspective, we don't want software failing in the middle of an airplane landing, so whether the software comes from CS or GitHub is of lesser importance.
https://zlk.com/pslra-1/crowdstrike-lawsuit-submission-form?...
https://www.sauderschelkopf.com/wp-content/uploads/2024/07/A...
Even on Hacker News, there was agreement that CrowdStrike screwed up, but then people also blamed IT staff, Microsoft (even after realizing it was a CrowdStrike issue), and the EU/regulators.
I imagine responsibility of each entity would need far more clarification than it does now.
If you want to define liability, there needs to be a clear line saying who is responsible for what. That doesn’t currently exist in software.
There's also the question of how people respond to risk.
Consider how sesame-allergen labeling rules led to most bread having sesame deliberately put into it: rather than certify their lines as sesame-free, many bakers found it cheaper to add sesame and declare it as an ingredient. Industry responded by guaranteeing contamination.
Crowdstrike and endpoint security firms might respond by saying that only Windows and Mac devices can be secured. Or Microsoft may say that only their solution can provide the requisite security.