I don't know about others, but I can't help but smile when I read the detailed series of events in aviation postmortems. To be able to zero in on what turned out to be a single faulty part and then trace the entire provenance and environment that led to that defective part entering service speaks to the robustness of the industry. I say that sincerely, since mistakes are going to happen, and in my view robustness has less to do with the number of mistakes than with how one responds to them.
Being an SRE at a FAANG and generally spending a lot of my life dealing with reliability, I am consistently in awe of the aviation industry. I can only hope (and make my small contribution so) that the software/tech industry can one day be its equal in this regard.
And finally, the biggest of kudos to the writer, Kyra Dempsey. What an approachable article, despite being (necessarily) heavy on the engineering content.
As a former Boeing engineer, other industries can learn a great deal from how airplanes are designed. The Fukushima and Deepwater Horizon disasters were both "zipper" failures that showed little thought was given to "when X fails, then what?"
Note I wrote when X fails, not if X fails. It's a different way of thinking.
As an engineer I think a lot about tradeoffs of cost vs other criteria. There is little I can learn from the nuclear or aviation industry, as the cost structure is so completely different. I’m very happy that the costs of safety in aviation are so widely accepted, but I understand that few people are willing to pay similar costs for other things like, say, cars.
Cars might not be the best example, since human lives are at stake, as in aviation. Unless you work on Tesla's Autopilot, it seems. But yes, backups and restores are often good enough.
Any substantiation for "Unless you work on Tesla's Autopilot, it seems"?
I mean you're implying that there are more accidents with autopilot than without it, right? Seems like quite the claim...
Tesla people always try to reduce any critique to some metric on deaths per x.
The fact is, there’s a lot of history and best practice around building safety critical systems that Tesla doesn’t follow.
Additionally, even with the practices they do follow, they call a consumer-facing product that isn’t really an autopilot “Autopilot”, while focusing their outbound comms on a beta product that is more like an autopilot but isn’t available to them.
I agree with most of this but the naming of "autopilot" seems fine. Nobody expects commercial aircraft to fly on autopilot without a pilot's supervision, the same _should_ be true of Tesla vehicles (especially considering their tendency to jump into the wrong lane and phantom brake on the highway etc.)
What matters is what the user of the system thinks because that’s where confusion can be dangerous.
A plane pilot knows very well what the limits of the autopilot are and what the passenger believes is irrelevant.
Conversely if too many/most car “autopilot” users believe it does more than what it really does then it’s dangerous.
In electrical engineering 600V is still “low voltage”. Any engineer in the field knows that so that’s fine right? But if someone sells “low voltage” electric toothbrush or hand warmer no normal person will think “it’s 600V, it will probably kill me”. When you sell something, what your target audience takes away from your advertisement matters. If they’re clearly confused and you aren’t clearing it up after so many years then “confusion” and misleading advertising are part of your sales strategy.
No, I'm implying that the autopilot code has not been as thoroughly tested as it should have been.
Example: https://www.theguardian.com/technology/2023/nov/22/tesla-aut...
Considering Tesla was willing to do unsafe things in visible ways (e.g., the feature that ran stop signs), I have no trust that they are maintaining safety in the less visible ways.
As it turns out (and as much as we wouldn’t want them to) human lives are still subject to cost/benefit analysis.
An airliner is a lot of lives, a lot of money, a lot of fuel, and a lot of energy. Which is why a lot has been invested in training, procedure, and safety systems.
Cars operate in an environment which is in most ways a lot more forgiving; they’re controlled by (on average) low-training, low-skill, non-redundant crews; they’re much more at risk of “enemy action”; the material stresses are in a different realm; and they’re much, much more sensitive to price pressure.
Hell, the difference is already visible within aviation alone: crop dusters and other small planes are a lot less regulated along every axis than airliners are.
I wouldn't say it's simply cost-benefit analysis. It's also scale of accidents.
A whole lot more people die in car accidents, yet there are few reports about them on national news, so fewer people care. Meanwhile, each time there is an aviation disaster, hundreds of people die and it's all over the news for weeks. Similarly with train accidents and nuclear accidents: there were only two very large ones, but they still haunt the field to this day, while (for example) the deaths from solar installations, from people falling off roofs, are mostly ignored.
Large accidents have to be avoided, a lot of small ones are more acceptable.
But that is cost/benefit analysis. When any accident can kill hundreds and do millions to billions in damage besides (to say nothing of the image damage to both the sector and the specific brand), the benefit of trying to prevent every accident is significant, so acceptable costs are commensurate.
I think it goes beyond what you'd expect just from the increased scale putting more lives at risk. Compare our regulatory system for buses and cars, two transportation options that are probably as close as possible to differing only in scale. Buses are ~65x less deadly than cars, and yet we still respond to the occasional shocking bus accident by trying to make them safer.
Which is actually counterproductive! This makes it harder to compete as a bus service, bus lines shut down, and more people drive. I wrote more about this at https://www.jefftk.com/p/make-buses-dangerous and https://www.jefftk.com/p/in-light-of-crashes-we-should-not-m...
We're making a niche B2B application, and this is very much it for us as well.
Our customers are in a cutthroat market with low margins. We can't spend a ton on pre-analysis, redundancies and so on.
Instead we've focused on reducing the impact of failures.
We've made it trivial to switch to an older build in case the new one has an issue. Thus if they hit a bug they can almost always work around it by going to an older build.
This of course requires us to be careful about database changes, but that's relatively easy.
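To make "careful about database changes" concrete, here is a minimal sketch of the expand/contract idea that keeps an older build usable as a rollback target. It's a hedged illustration, not our actual code; all names are invented:

    # Expand: the new build ships a migration that only ADDS, e.g.
    #   ALTER TABLE orders ADD COLUMN priority INTEGER;  -- nullable
    # The old build never reads the new column, so switching back is safe.

    DEFAULT_PRIORITY = 0

    def read_priority(row: dict) -> int:
        # Both builds tolerate the column being absent or NULL, so
        # reverting to the older build never breaks reads or writes.
        value = row.get("priority")
        return DEFAULT_PRIORITY if value is None else value

    # Contract: only once no customer can roll back past the new build
    # do we make the column NOT NULL or drop anything.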
You cannot. AI, though, can be cheap enough to produce that. I wonder what happens if you take a B2B application and have AI rewrite it to nuclear-industry/aviation standards in a separate repo. Then, on fixes/rewrites, the engineers take the "safety-aware repository" as inspiration.
You’ve missed the point. Those standards don’t relate at all to writing code, they relate to process, procedure and due diligence - i.e. governance. Those all cost a lot in terms of man hours.
Exactly. Even without learning from those groups, there's a ton of stuff we know we could do to improve the reliability of our product. It's just that it would take way too much development time and our customers wouldn't want to pay for it.
It's like buying a thermometer from Home Depot vs a highly accurate, calibrated lab thermometer. Sometimes you just don't need that quality and it's a waste paying for it.
Yeah, it costs. That, and the fact that people will accept shite software, makes high quality a fight software companies can avoid. Rationally, therefore, they do.
What you're describing is almost exactly the opposite of what LLMs are good for. Quickly getting a draft of something roughly like what you want without having to look a bunch of stuff up? Great, go wild. Writing something to a very high standard, with careful attention to specs and possible failure cases, and meticulous following of rules? Antithetical to the way cutting-edge AI works.
Have you tried using an LLM to write code to any kind of standard? I recently spent two hours trying to get GPT-4 to build a fiddly regex and ultimately found a better solution on Stack Overflow. In my experiments it also produced lackluster concurrent code.
I don't think that's the right way to reason about it.
I find that I can learn a ton from those industries, and as a software engineer I have the added advantage of being able to come up with zero-cost (or low cost), self-documenting abstractions, testing patterns, and ergonomic interfaces that improve the safety of my software.
In software, a lot of safety is embodied in how you structure your interfaces and tests. The biggest cost is your time, but there are economies of scale everywhere. It really pays to think through your interfaces and test plan and systems behavior, and that's where lessons from these other industries can be applied.
So yeah, if you think of these lessons as "do tons of manual QA", you'll run into trouble resourcing it. But you can also think of them as "build systems that continuously self-test, produce telemetry, fail gracefully in legible ways and have multiple redundancies".
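As a toy illustration of that last framing (a sketch under invented names, not a prescription), here is a routine that self-tests an invariant on every run, emits telemetry, degrades to a redundant path, and fails legibly:

    import logging
    import time

    log = logging.getLogger("billing")

    class LegibleFailure(Exception):
        """Raised with enough context that the postmortem writes itself."""

    def reconcile(batch, primary, fallback):
        started = time.monotonic()
        try:
            result = primary(batch)
            # Continuous self-test: a cheap invariant checked on every
            # production run, not only in CI.
            if result != sum(batch):
                raise LegibleFailure(
                    f"invariant broken: {result!r} != {sum(batch)!r}")
            return result
        except LegibleFailure:
            raise
        except Exception as exc:
            # Redundancy: degrade to the fallback path, loudly.
            log.warning("primary reconcile failed (%s); using fallback", exc)
            return fallback(batch)
        finally:
            # Telemetry: every call leaves a trace, success or failure.
            log.info("reconcile took %.3fs", time.monotonic() - started)

None of this needed extra headcount; it's structure, not manual QA.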
I agree in principle, but I don't think industries should be looking at current-day Boeing's engineering practices, except as an example of how a proud company's culture can rot from the inside out, with fatal consequences.
Reminder that this article was about an aircraft built by Airbus.
(Airbus is not Boeing.)
How are aeroplanes designed differently at Boeing vs Airbus? What's the secret sauce?
At this point the secret sauce is that EASA isn’t tolerating the same degree of certification fuckery and laxity from Airbus, and that Airbus generally seems to have their act together.
Like, what’s the secret sauce of Nvidia vs Radeon, or AMD vs Intel? Reliable execution, seemingly - and this is an environment where failures are supposed to be contained to very specific rates at given levels of severity.
The FAA has gotten into a mode where they let Boeing sign off on their own deviations from the rules. The engine changes forced the introduction of the nose-pusher-down system (MCAS), which really should have required training, but Boeing didn't want that, because the whole point of the weird engine arrangement was ostensible "airframe compatibility" despite the changes in flight characteristics. And Boeing has become so large (like Intel) that they don’t have to care anymore: they know there’s no chance of actual regulatory consequences, nor can EASA kick them out without causing a diplomatic incident and massively disrupting air travel. So they are no longer rigorous, and we simply have to deal with Boeing’s “meltdown”.
And yes, they should be doing better, but in the abstract, certification processes always need to deal with “uncooperative” participants who may want to conceal derogatory information or pencil-whip certifications. You need to build processes that don’t let that happen, but nowadays there’s so much of a revolving door that they can just get away with it. None of this would have happened under, say, the process for certifying personnel for classified work; it is fundamentally a problem of a corrupted and ineffective certification process.
This decline in certification led to an inevitable decline in quality. When companies figure out it’s a paper tiger then there’s no reason to spend the money to do good engineering.
The FAA’s processes are both too strict and too lax - we have moved into the regulatory capture phase where they purely serve the interests of the industry giants who are already established and consolidated, and they now serve primarily to exclude any competitors rather than ensure consistent quality of engineering.
The specifics are less interesting than that high-level problem - some form of engineering malfeasance was obviously going to result from regulatory capture eventually, and the specific form is less important than the forces that produced it. And that regulatory capture problem exists across basically the whole American system. Why do we have forced arbitration on everything, why are our trains dumping poison into our towns? Because from 1980-2020 we basically handed control of legislative policy over to corporate interests and then allowed a massive degree of consolidation. Not that Airbus is small, but EASA isn’t captured to the extent of most American bureaus.
Same way Samsung phones are not Huawei phones? Or BMWs aren't Lexus?
A pilot once explained to me:
Boeing planes (before MCAS): we have detected a problem with your engines, would you like to shut down?
Airbus planes: we have detected a problem with your engines, we have shut them down for you.
I think Boeing has had some difficulties. They have also had some undeniable successes. The 777 and 787 programs have no in-service passenger fatalities attributable to engineering errors to date. That's a monumental achievement.
The 787 has no hull losses at all, right? And it’s been flying for 10 years now.
*crickets* ... "Let's just randomise which sensor we use during boot, that ought to do it!"
Epic fail indeed, costing many lives.
"AoA sensor" - Angle of Attack sensor.
And the reference is presumably to the 737 MAX accidents. https://www.afacwa.org/the_inside_story_of_mcas_seattle_time...
let's just build a system that pushes the nose down under those conditions, have it accept potentially unreliable AoA data, and not tell pilots about it!
When I worked in an industrial context, some coding tasks would seem trivial to today's Joe Random software dev, but we had to be constantly thinking about failure modes: from degraded modes that would keep a plant 100% operative 100% of the time in spite of some component being down, to driving a 10 m high oven that can split airborne water molecules from mere ambient humidity into hydrogen, whose buildup could be dangerously explosive if certain parameters were not kept in check. That means the code/system has to have a number of contingency plans. "Sane default" suddenly has a very tangible meaning.
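For a feel of what a "sane default" looks like in code, here is a minimal sketch (thresholds, names, and the control law are all invented; nothing like the real plant logic):

    # Fall back to a safe default whenever a reading is missing,
    # stale, or implausible.

    SAFE_VENT_RATE = 1.0       # fail toward venting, never toward buildup
    MAX_READING_AGE_S = 2.0

    def choose_vent_rate(reading, now):
        """reading is (value, timestamp) from a humidity sensor, or None."""
        if reading is None:
            return SAFE_VENT_RATE              # sensor down: safe default
        value, timestamp = reading
        if now - timestamp > MAX_READING_AGE_S:
            return SAFE_VENT_RATE              # stale data: safe default
        if not 0.0 <= value <= 100.0:
            return SAFE_VENT_RATE              # implausible: safe default
        return min(4.0, 0.04 * value)          # normal control law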
This to me is the biggest difference between writing code for the software industry vs. an industrial industry.
Software is all about the happy path ("move fast and break things") because the consequences typically range from a minor inconvenience to a major financial loss.
Industrial control is all about sad paths ("what happens if someone drives a forklift into your favorite junction box during the most critical, exothermic phase of some reaction") because the consequences usually start at a major financial loss and top out in "Modern Marvels - Engineering Disasters" territory.
What's fascinating about airplane design for me is not the huge technical complexity, but rather, the way it is designed such that a lot of its subsystems are serviceable by technicians so quickly and reliably, not just in a fully controlled environment like a maintenance hangar, but right on the tarmac, waiting for takeoff.
In the context of disasters that happened due to software failures (e.g. Ariane 5 [1]), one of my professors used to tell us that software doesn't break at some point in time; it is broken from the beginning.
I like the idea of thinking 'when' instead of 'if', but the verdict should be even harsher when it comes to software engineering, because software has this rare material at its disposal which doesn't degrade over time.
[1] https://en.wikipedia.org/wiki/Ariane_5#Notable_launches
I think many of us are so used to working with software, with its constant need for adaptation and modification in order to meet an ever growing list of integration requirements, that we forget the benefits of working with a finalized spec with known constants like melting points, air pressure, and gravity.
Airliners face constantly changing specifications. No two airliners are built the same.
Do you mean no two individual planes? Like two 767s made a month apart, do you mean they literally would have different requirements?
I think they mean that airplanes are made in different versions, catered to a particular airline. Also, planes are constantly updated.
Two 767s made a few months apart will have initial differences, like two different versions of the Java 8 SDK.
Neat little detail of the world Wikipedia once told me: the -00 suffix of classic Boeing planes, dropped in 2016, was substituted with a Boeing-assigned customer code on registration documents. E.g. a Pan Am 777-300 would have been a 777-321, an Air Berlin Jetfoil would have been a 929-16J, and so on.
1: https://en.wikipedia.org/wiki/List_of_Boeing_customer_codes
Yes. There are constant changes to the design to improve reliability, performance, and fix problems, and the airlines change their requirements constantly.
I think they meant a 737-400 is different from a 737-500 is different from a 787 and an Airbus A320 and an MD-80 and…
Every single model is somewhat bespoke. There’s common components but each ends up having its own special problems in a way I assume different car models in a common platform (or two small SUVs from competing manufacturers) just don’t.
Completely agree - I think it can go one of two ways. Software is more malleable than airplanes are, and the airplane way also comes with downsides (like how much time and effort it takes to bring a new plane to market).
The article talks about a piece of software that partially failed, when they needed to calculate the braking distance for the overweight aircraft.
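Just to give a sense of why that calculation matters, here is a back-of-envelope kinematics sketch (emphatically not the performance software the article describes; numbers invented):

    # Constant-deceleration stopping distance: d = v^2 / (2a).
    def stopping_distance_m(touchdown_speed_mps, decel_mps2):
        return touchdown_speed_mps ** 2 / (2 * decel_mps2)

    # Overweight means a higher touchdown speed; damaged systems mean
    # weaker braking. Both push the required distance up fast:
    print(stopping_distance_m(77.0, 1.8))   # ~1647 m
    print(stopping_distance_m(85.0, 1.5))   # ~2408 m

The real tool accounts for far more (wind, slope, runway condition, inoperative systems), which is exactly why a partial failure of it was a problem.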
I was just thinking of this metaphor today.
Try drawing the software monstrosity you work on / with as an airplane. 100 wings sticking out all different directions, covered with instruments and fins, totally asymmetrical and 5 miles long. Propellers, jets, balloons, helicopter blades.
Yep, it flies.
When it crashes, just take off again.
If 200 people died after a db instance crashed, software would be equal in that regard.
Case in point: software that deals with medical stuff is treated somewhat more like aviation.
Also, aviation and software aren't orthogonal. E.g., the article mentioned that part of the reason the pilot was able to sustain a very narrow velocity window between stall and overrunning the runway was because of the A380's fly by wire system.
Yep. Insulin pumps can kill their owner and the software updates need to be FDA approved:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4773959/
Likewise, in "aviation" when the entertainment system completely fails in a 4 hour flight, there is most like no post mortem at all. They turn it off/on again just like most of us.
This is true in a lot of industries. Unless there’s 7+ figure costs or significant human losses, there’s usually not an exhaustive investigation to conclusively point to the exact cause and chain of events.
Some people who think this is ideal for any sort of software tech sound like they would also want a 3-hour post mortem, with whoever designed the room, after slightly stubbing a toe.
It took hundreds of subject experts from ten organizations in seven countries almost three years to reach that conclusion.
Here at HN we want a post mortem for a cloud failure in a matter of hours.
Something similar that struck me: in late February 2022, Russia invaded Ukraine.
And then, I saw an endless stream of aggrieved comments from people who were personally outraged that the outcome, whatever it might be, hadn't been finalized yet at the late, late date of... late February.
I'll go one further - I've yet to finish writing a postmortem on one incident before the next one happens. I also have my doubts that folks wanting a PM in O(hours) actually care about its contents/findings/remediations - it's just a tick box in the process of day-to-day ops.
I work at a mid-tier FAANG; our post mortems have an SLA in the 7-14 day range. Nobody seriously wants a full PM in hours.
They may want a mitigation or RCA in hours, but even AWS gives us NDA-restricted PMs in >24 hours.
Apples to oranges
And to be able to reconstruct the chain of events after the components in question have exploded and been scattered throughout south-east Asia is incredible.
My impression was that the defective part was still inside the engine when it landed.
Makes it even more impressive: the parts that were actually implicated in the explosion itself (and scattered from the aircraft) were not defective, so the investigation had to go through parts which did not seem to have exploded in order to track down the defect.
Or at least, I assume the turbine parts weren’t defective, although given what seems to be quite a happy-go-lucky approach to manufacturing defects in Hucknall, maybe my assumption is not made on solid grounds…
Probably a reference to other incidents. Shout out to the NTSB for fighting off alligators while investigating this crash... https://en.wikipedia.org/wiki/ValuJet_Flight_592
Aviation is great because the industry learns so much after incidents and accidents. There is a culture of trying to improve, rather than merely seeking culprits.
However, I have been told by an insider that supply chain integrity is an underappreciated issue. Someone has been caught selling fake plane parts through an elaborate scheme, and there are other suspicious suppliers, which is a bit unsettling:
"Safran confirmed the fraudulent documentation, launching an investigation that found thousands of parts across at least 126 CFM56 engines were sold without a legitimate airworthiness certificate."
https://www.businessinsider.com/scammer-fooled-us-airlines-b...
I suspect this is precisely what is happening in Russian civil aviation now. No legit parts supplied, so there will be a lot of fake/problematic parts imported through black channels.
Admiral Cloudberg has covered a case where counterfeit or EOL-but-with-new-paperwork components were involved in a crash.
https://admiralcloudberg.medium.com/riven-by-deceit-the-cras...
I agree, and I also enjoy the attitude. While in my profession the postmortem's goal is finding someone to blame, here the attitude is towards preventing it from happening again, no matter what. Or at least that's how I feel.
Your profession? Or do you mean your company? Unless it's some very specific profession I don't know about, that would usually imply the company is dysfunctional.
It must have something to do with the number of mistakes, otherwise it's all a waste of time!
It's all well and good responding to mistakes as thoroughly as possible, but if it's not reducing the number of mistakes, what's it all for?
Not really. Imagine two systems with the same number of mistakes. (Here the mistakes can be either bugs or operator mistakes.)
One is designed such that every mistake brings the whole system down for a day with millions of dollars of lost revenue each time.
The other is designed such that when a mistake happens it is caught early, and when it is not caught it only impacts some limited parts of the system and recovering from the mistake is fast and reliable.
They both have the same number of mistakes, yet one of the two systems is vastly more reliable.
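A trivial sketch of that difference, using an operator typo in a config file as the mistake (all names invented):

    # Design 1: a typo'd key in the config file silently becomes 0
    # workers; the mistake surfaces hours later as a stalled queue
    # and a day-long outage.
    def workers_fragile(config):
        return int(config.get("max_workers", 0))

    # Design 2: the same mistake is caught at startup with a legible
    # message, and rolling back the config is the whole recovery.
    def workers_robust(config):
        value = config.get("max_workers")
        if not isinstance(value, int) or not 1 <= value <= 512:
            raise SystemExit(f"refusing to start: max_workers={value!r}")
        return value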
For reducing their impact.
Aerospace things have to be like this or they just wouldn’t work at all. There are just too many points of failure and redundancy is capped by physics. When there’s a million things which if they went wrong could cause catastrophic failure, you have to be really good at learning how to not make mistakes.
Not exactly. The idea is not to never make mistakes; it's whatcha gonna do about X when (not if) it fails.
The Checklist Manifesto (2009) is a great short book that shows how using simple checklists would help immensely in many different industries, especially in medicine (the author is a surgeon).
Checklists of course are not the same as detailed post-mortems but they belong to the same way of thinking. And they would cost pretty much nothing to implement.
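To the "cost pretty much nothing" point: a checklist gate can literally be a few lines of Python (items invented for illustration):

    CHECKLIST = [
        "database migration is backward compatible",
        "rollback build verified",
        "on-call engineer notified",
    ]

    def run_checklist():
        for item in CHECKLIST:
            answer = input(f"[ ] {item} - done? (y/n) ").strip().lower()
            if answer != "y":
                raise SystemExit(f"aborting: '{item}' not confirmed")
        print("Checklist complete; proceed.")

    if __name__ == "__main__":
        run_checklist()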
Also CRM (crew resource management): it's very important to have a culture where underlings feel they can speak up when something doesn't look right -- or when a checklist item is overlooked, for that matter.
Yes, but they do have one critical failure mode: that the checklist failed to account for something (or that an expected reaction to a step being performed didn’t occur).
I was a submarine nuclear reactor operator, and one of my Commanding Officers once ordered that we stop using checklists during routine operations for precisely this reason. Instead, we had to fully read and parse the source documentation for every step. Before that, while we of course had the source documents open, they served as more of a backstop.
His argument – which I to some extent agree with – was that by reading the source documentation every time, we would better engage our critical thinking and assess plant conditions, rather than skimming a simplified version. To be clear, the checklists had been generated and approved by our Engineering Officer, but they were still simplifications.
I’d love to be an engineer with unlimited time budget to worry about “when, not if, X happens” (to quote a sibling comment).
But people don’t tend to die when we mess up, so we don’t get that budget.
There's a slight difference between the kind of damage a malfunctioning airplane causes and a button on an e-commerce shop rendering improperly in one of the browsers. My point is that the level of investment in reliability and process should be proportional to the potential damage of an incident.
This kind of makes sense, but it is only possible because of public pressure/interest. Many people are irrationally emotional about flying (fear, excitement, etc.), which is why articles and documentaries like this post are so popular.
On a side note, that's also why there's all the nonsense security theater at airports.
Hard agree. Civil & mechanical engineering have a culture and history of blameless analysis of failure. Software engineering could learn from them.
See the excellent To Engineer Is Human, on just this topic of analyzed failures in civil engineering.
A colleague of mine came from a major aviation design company before joining tech and said they were in a state of culture shock at how critical systems were designed and monitored here. Even granting that a billing system has no hard real-time requirements, they were surprised at just how lax tech design patterns tended to be.