I don't know about others, but I can't help but smile when I read the detailed series of events in aviation postmortems. To be able to zero in on what turned out to be a single faulty part and then trace the entire provenance and environment that led to that defective part entering service speaks to the robustness of the industry. I say that sincerely, since mistakes are going to happen, and in my view robustness has less to do with the number of mistakes than with how one responds to them.
Being an SRE at a FAANG and generally spending a lot of my life dealing with reliability, I am consistently in awe of the aviation industry. I can only hope (and make my small contribution so) that the software/tech industry can one day be its equal in this regard.
And finally, the biggest of kudos to the writer, Kyra Dempsey. What an approachable article, despite being (necessarily) heavy on the engineering content.
As a former Boeing engineer, other industries can learn a great deal from how airplanes are designed. The Fukushima and Deepwater Horizon disasters were both "zipper" failures that showed little thought was given to "when X fails, then what?"
Note I wrote when X fails, not if X fails. It's a different way of thinking.
As an engineer I think a lot about tradeoffs of cost vs other criteria. There is little I can learn from the nuclear or aviation industry, as the cost structure is so completely different. I’m very happy that the costs of safety in aviation are so widely accepted, but I understand that few people are willing to pay similar costs for other things like, say, cars.
Cars might not be the best example, since human lives are at stake, as in aviation. Unless you work on Tesla's Autopilot, it seems. But yes, backups and restores are often good enough.
Any substantiation for "Unless you work on Tesla's Autopilot, it seems"?
I mean you're implying that there are more accidents with autopilot than without it, right? Seems like quite the claim...
Tesla people always try to reduce any critique to some metric on deaths per x.
The fact is, there’s a lot of history and best practice around building safety critical systems that Tesla doesn’t follow.
Additionally, even with the practices they do follow, they call a consumer-facing product that isn’t really an autopilot “Autopilot”, while focusing their outbound comms on a beta product that is more like an autopilot but isn’t available to them.
I agree with most of this but the naming of "autopilot" seems fine. Nobody expects commercial aircraft to fly on autopilot without a pilot's supervision, the same _should_ be true of Tesla vehicles (especially considering their tendency to jump into the wrong lane and phantom brake on the highway etc.)
What matters is what the user of the system thinks because that’s where confusion can be dangerous.
A plane pilot knows very well what the limits of the autopilot are and what the passenger believes is irrelevant.
Conversely if too many/most car “autopilot” users believe it does more than what it really does then it’s dangerous.
In electrical engineering 600V is still “low voltage”. Any engineer in the field knows that so that’s fine right? But if someone sells “low voltage” electric toothbrush or hand warmer no normal person will think “it’s 600V, it will probably kill me”. When you sell something, what your target audience takes away from your advertisement matters. If they’re clearly confused and you aren’t clearing it up after so many years then “confusion” and misleading advertising are part of your sales strategy.
No, I'm implying that the autopilot code has not been as thoroughly tested as it should have been.
Example: https://www.theguardian.com/technology/2023/nov/22/tesla-aut...
Considering Tesla was willing to do unsafe things in visible ways (e.g., the feature that ran stop signs), I have no trust that they are maintaining safety in the less visible ways.
As it turns out (and as much as we wouldn’t want them to) human lives are still subject to cost/benefit analysis.
An airliner is a lot of lives, a lot of money, a lot of fuel, and a lot of energy. Which is why a lot has been invested in training, procedure, and safety systems.
Cars operate in an environment which is in most ways a lot more forgiving; they’re controlled by (on average) low-training, low-skill, non-redundant crews; they’re much more at risk of “enemy action”; the material stresses are in a different realm; and they’re much, much more sensitive to price pressure.
Hell, the difference is already visible within aviation alone: crop dusters and other small planes are a lot less regulated along every axis than airliners are.
I wouldn't say it's simply cost-benefit analysis. It's also scale of accidents.
A whole lot more people die in car accidents, yet there are few reports about them on national news, so fewer people care. Meanwhile, each time there is an aviation disaster, hundreds of people die and it's all over the news for weeks. Similarly with train accidents and nuclear accidents: there were only two very large ones, but they still haunt the field to this day, while (for example) the deaths from solar installations, from people falling off roofs, are mostly ignored.
Large accidents have to be avoided, a lot of small ones are more acceptable.
But that is cost/benefit analysis. When any accident can kill hundreds and do millions to billions in damage besides (to say nothing of the image damage to both the sector and the specific brand), the benefit of trying to prevent every accident is significant, so acceptable costs are commensurate.
I think it goes beyond what you'd expect just from the increased scale putting more lives at risk. Compare our regulatory system for buses and cars, two transportation options that are probably as close as possible to differing only in scale. Buses are ~65x less deadly than cars, and yet we still respond to the occasional shocking bus accident by trying to make them safer.
Which is actually counterproductive! This makes it harder to compete as a bus service, bus lines shut down, and more people drive. I wrote more about this at https://www.jefftk.com/p/make-buses-dangerous and https://www.jefftk.com/p/in-light-of-crashes-we-should-not-m...
We're making a niche B2B application, and this is very much it for us as well.
Our customers are in a cutthroat market with low margins. We can't spend a ton on pre-analysis, redundancies and so on.
Instead we've focused on reducing the impact of failures.
We've made it trivial to switch to an older build in case the new one has an issue. Thus if they hit a bug they can almost always work around it by going to an older build.
This of course requires us to be careful about database changes, but that's relatively easy.
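To make "careful about database changes" concrete, here is a minimal sketch of the expand/contract idea that keeps an older build usable as a rollback target. It's a hedged illustration, not our actual code; all names are invented:

    # Expand: the new build ships a migration that only ADDS, e.g.
    #   ALTER TABLE orders ADD COLUMN priority INTEGER;  -- nullable
    # The old build never reads the new column, so switching back is safe.

    DEFAULT_PRIORITY = 0

    def read_priority(row: dict) -> int:
        # Both builds tolerate the column being absent or NULL, so
        # reverting to the older build never breaks reads or writes.
        value = row.get("priority")
        return DEFAULT_PRIORITY if value is None else value

    # Contract: only once no customer can roll back past the new build
    # do we make the column NOT NULL or drop anything.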
You cannot. AI, though, can be cheap enough to produce that. I wonder what happens if you take a B2B application and have AI rewrite it to nuclear-industry/aviation standards in a separate repo. Then, on fixes/rewrites, the engineers take the "safety-aware repository" as inspiration.
You’ve missed the point. Those standards don’t relate at all to writing code, they relate to process, procedure and due diligence - i.e. governance. Those all cost a lot in terms of man hours.
Exactly. Even without learning from those groups, there's a ton of stuff we know we could do to improve the reliability of our product. It's just that it would take way too much development time and our customers wouldn't want to pay for it.
It's like buying a thermometer from Home Depot vs a highly accurate, calibrated lab thermometer. Sometimes you just don't need that quality and it's a waste paying for it.
Yeah, it costs. That, and the fact that people will accept shite software, makes high quality a fight software companies can avoid. Rationally, therefore, they do.
What you're describing is almost exactly the opposite of what LLMs are good for. Quickly getting a draft of something roughly like what you want without having to look a bunch of stuff up? Great, go wild. Writing something to a very high standard, with careful attention to specs and possible failure cases, and meticulous following of rules? Antithetical to the way cutting-edge AI works.
Have you tried using an LLM to write code to any kind of standard? I recently spent two hours trying to get GPT-4 to build a fiddly regex and ultimately found a better solution on Stack Overflow. In my experiments it also produced lackluster concurrent code.
I don't think that's the right way to reason about it.
I find that I can learn a ton from those industries, and as a software engineer I have the added advantage of being able to come up with zero-cost (or low cost), self-documenting abstractions, testing patterns, and ergonomic interfaces that improve the safety of my software.
In software, a lot of safety is embodied in how you structure your interfaces and tests. The biggest cost is your time, but there are economies of scale everywhere. It really pays to think through your interfaces and test plan and systems behavior, and that's where lessons from these other industries can be applied.
So yeah, if you think of these lessons as "do tons of manual QA", you'll run into trouble resourcing it. But you can also think of them as "build systems that continuously self-test, produce telemetry, fail gracefully in legible ways and have multiple redundancies".
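As a toy illustration of that last framing (a sketch under invented names, not a prescription), here is a routine that self-tests an invariant on every run, emits telemetry, degrades to a redundant path, and fails legibly:

    import logging
    import time

    log = logging.getLogger("billing")

    class LegibleFailure(Exception):
        """Raised with enough context that the postmortem writes itself."""

    def reconcile(batch, primary, fallback):
        started = time.monotonic()
        try:
            result = primary(batch)
            # Continuous self-test: a cheap invariant checked on every
            # production run, not only in CI.
            if result != sum(batch):
                raise LegibleFailure(
                    f"invariant broken: {result!r} != {sum(batch)!r}")
            return result
        except LegibleFailure:
            raise
        except Exception as exc:
            # Redundancy: degrade to the fallback path, loudly.
            log.warning("primary reconcile failed (%s); using fallback", exc)
            return fallback(batch)
        finally:
            # Telemetry: every call leaves a trace, success or failure.
            log.info("reconcile took %.3fs", time.monotonic() - started)

None of this needed extra headcount; it's structure, not manual QA.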
I agree in principle, but I don't think industries should be looking at current-day Boeing's engineering practices, except as an example of how a proud company's culture can rot from the inside out, with fatal consequences.
Reminder that this article was about an aircraft built by Airbus.
(Airbus is not Boeing.)
How are aeroplanes designed differently at Boeing vs Airbus? What's the secret sauce?
At this point the secret sauce is that EASA isn’t tolerating the same degree of certification fuckery and laxity from Airbus, and that Airbus generally seems to have their act together.
Like, what’s the secret sauce of Nvidia vs Radeon, or AMD vs Intel? Reliable execution, seemingly - and this is an environment where failures are supposed to be contained to very specific rates at given levels of severity.
The FAA has gotten into a mode where they let Boeing sign off on their own deviations from the rules. The engine changes forced the introduction of the nose-pusher-down system (MCAS), which really should have required training, but Boeing didn't want that, because the whole point of the weird engine arrangement was ostensible "airframe compatibility" despite the changes in flight characteristics. And Boeing has become so large (like Intel) that they don’t have to care anymore: they know there’s no chance of actual regulatory consequences, nor can EASA kick them out without causing a diplomatic incident and massively disrupting air travel. So they are no longer rigorous, and we simply have to deal with Boeing’s “meltdown”.
And yes, they should be doing better, but in the abstract, certification processes always need to deal with “uncooperative” participants who may want to conceal derogatory information or pencil-whip certifications. You need to build processes that don’t let that happen, but nowadays there’s so much of a revolving door that they can just get away with it. None of this would have happened under, say, the process for certifying personnel for classified work; it is fundamentally a problem of a corrupted and ineffective certification process.
This decline in certification led to an inevitable decline in quality. When companies figure out it’s a paper tiger then there’s no reason to spend the money to do good engineering.
The FAA’s processes are both too strict and too lax - we have moved into the regulatory capture phase where they purely serve the interests of the industry giants who are already established and consolidated, and they now serve primarily to exclude any competitors rather than ensure consistent quality of engineering.
The specifics are less interesting than that high-level problem - some form of engineering malfeasance was obviously going to result from regulatory capture eventually, and the specific form is less important than the forces that produced it. And that regulatory capture problem exists across basically the whole American system. Why do we have forced arbitration on everything, why are our trains dumping poison into our towns? Because from 1980-2020 we basically handed control of legislative policy over to corporate interests and then allowed a massive degree of consolidation. Not that Airbus is small, but EASA isn’t captured to the extent of most American bureaus.
Same way Samsung phones are not Huawei phones? Or BMWs aren't Lexus?
A pilot once explained to me:
Boeing planes (before MCAS): we have detected a problem with your engines, would you like to shut down?
Airbus planes: we have detected a problem with your engines, we have shut them down for you.
I think Boeing has had some difficulties. They have also had some undeniable successes. The 777 and 787 programs have no in-service passenger fatalities attributable to engineering errors to date. That's a monumental achievement.
The 787 has no hull losses at all, right? And it’s been flying for 10 years now.
*crickets* ... "Let's just randomise which sensor we use during boot, that ought to do it!"
Epic fail indeed, costing many lives.
"AoA sensor" - Angle of Attack sensor.
And the reference is presumably to the 737 MAX accidents. https://www.afacwa.org/the_inside_story_of_mcas_seattle_time...
let's just build a system that pushes the nose down under those conditions, have it accept potentially unreliable AoA data, and not tell pilots about it!
When I worked in an industrial context, some coding tasks would seem trivial to today's Joe Random software dev, but we had to be constantly thinking about failure modes: from degraded modes that would keep a plant 100% operative 100% of the time in spite of some component being down, to driving a 10 m high oven that can split airborne water molecules from mere ambient humidity into hydrogen, whose buildup could be dangerously explosive if certain parameters were not kept in check. That means the code/system has to have a number of contingency plans. "Sane default" suddenly has a very tangible meaning.
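For a feel of what a "sane default" looks like in code, here is a minimal sketch (thresholds, names, and the control law are all invented; nothing like the real plant logic):

    # Fall back to a safe default whenever a reading is missing,
    # stale, or implausible.

    SAFE_VENT_RATE = 1.0       # fail toward venting, never toward buildup
    MAX_READING_AGE_S = 2.0

    def choose_vent_rate(reading, now):
        """reading is (value, timestamp) from a humidity sensor, or None."""
        if reading is None:
            return SAFE_VENT_RATE              # sensor down: safe default
        value, timestamp = reading
        if now - timestamp > MAX_READING_AGE_S:
            return SAFE_VENT_RATE              # stale data: safe default
        if not 0.0 <= value <= 100.0:
            return SAFE_VENT_RATE              # implausible: safe default
        return min(4.0, 0.04 * value)          # normal control law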
This to me is the biggest difference between writing code for the software industry vs. an industrial industry.
Software is all about the happy path ("move fast and break things") because the consequences typically range from a minor inconvenience to a major financial loss.
Industrial control is all about sad paths ("what happens if someone drives a forklift into your favorite junction box during the most critical, exothermic phase of some reaction") because the consequences usually start at a major financial loss and top out in "Modern Marvels - Engineering Disasters" territory.
What's fascinating about airplane design for me is not the huge technical complexity, but rather, the way it is designed such that a lot of its subsystems are serviceable by technicians so quickly and reliably, not just in a fully controlled environment like a maintenance hangar, but right on the tarmac, waiting for takeoff.
In the context of disasters that happened due to software failures (e.g. Ariane 5 [1]), one of my professors used to tell us that software doesn't break at some point in time; it is broken from the beginning.
I like the idea of thinking 'when' instead of 'if', but the verdict should be even harsher when it comes to software engineering, because software has this rare material at its disposal which doesn't degrade over time.
[1] https://en.wikipedia.org/wiki/Ariane_5#Notable_launches
I think many of us are so used to working with software, with its constant need for adaptation and modification in order to meet an ever growing list of integration requirements, that we forget the benefits of working with a finalized spec with known constants like melting points, air pressure, and gravity.
Airliners face constantly changing specifications. No two airliners are built the same.
Do you mean no two individual planes? Like two 767s made a month apart, do you mean they literally would have different requirements?
I think they mean that airplanes are made in different versions, catered to a particular airline. Also, planes are constantly updated.
Two 767s made a few months apart will have initial differences, like two different versions of the Java 8 SDK.
Neat little detail of the world Wikipedia once told me: the -00 suffix of classic Boeing planes, dropped in 2016, was substituted with a Boeing-assigned customer code on registration documents. E.g. a Pan Am 777-300 would have been a 777-321, an Air Berlin Jetfoil would have been a 929-16J, and so on.
1: https://en.wikipedia.org/wiki/List_of_Boeing_customer_codes
Yes. There are constant changes to the design to improve reliability, performance, and fix problems, and the airlines change their requirements constantly.
I think they meant a 737-400 is different from a 737-500 is different from a 787 and an Airbus A320 and an MD-80 and…
Every single model is somewhat bespoke. There’s common components but each ends up having its own special problems in a way I assume different car models in a common platform (or two small SUVs from competing manufacturers) just don’t.
Completely agree - I think it can go one of two ways. Software is more malleable than airplanes are, and the airplane way also comes with downsides (like how much time and effort it takes to bring a new plane to market).
The article talks about a piece of software that partially failed, when they needed to calculate the braking distance for the overweight aircraft.
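Just to give a sense of why that calculation matters, here is a back-of-envelope kinematics sketch (emphatically not the performance software the article describes; numbers invented):

    # Constant-deceleration stopping distance: d = v^2 / (2a).
    def stopping_distance_m(touchdown_speed_mps, decel_mps2):
        return touchdown_speed_mps ** 2 / (2 * decel_mps2)

    # Overweight means a higher touchdown speed; damaged systems mean
    # weaker braking. Both push the required distance up fast:
    print(stopping_distance_m(77.0, 1.8))   # ~1647 m
    print(stopping_distance_m(85.0, 1.5))   # ~2408 m

The real tool accounts for far more (wind, slope, runway condition, inoperative systems), which is exactly why a partial failure of it was a problem.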
I was just thinking of this metaphor today.
Try drawing the software monstrosity you work on / with as an airplane. 100 wings sticking out all different directions, covered with instruments and fins, totally asymmetrical and 5 miles long. Propellers, jets, balloons, helicopter blades.
Yep, it flies.
When it crashes, just take off again.
If 200 people died after a db instance crashed, software would be equal in that regard.
Case in point: software that deals with medical stuff is treated somewhat more like aviation.
Also, aviation and software aren't orthogonal. E.g., the article mentioned that part of the reason the pilot was able to sustain a very narrow velocity window between stall and overrunning the runway was because of the A380's fly by wire system.
Yep. Insulin pumps can kill their owner and the software updates need to be FDA approved:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4773959/
Likewise, in "aviation" when the entertainment system completely fails in a 4 hour flight, there is most like no post mortem at all. They turn it off/on again just like most of us.
This is true in a lot of industries. Unless there’s 7+ figure costs or significant human losses, there’s usually not an exhaustive investigation to conclusively point to the exact cause and chain of events.
Some people who think this is ideal for any sort of software tech sound like they would also want a 3-hour post mortem, with whoever designed the room, after slightly stubbing a toe.
It took hundreds of subject experts from ten organizations in seven countries almost three years to reach that conclusion.
Here at HN we want a post mortem for a cloud failure in a matter of hours.
Something similar that struck me: in late February 2022, Russia invaded Ukraine.
And then, I saw an endless stream of aggrieved comments from people who were personally outraged that the outcome, whatever it might be, hadn't been finalized yet at the late, late date of... late February.
I'll go one further - I've yet to finish writing a postmortem on one incident before the next one happens. I also have my doubts that folks wanting a PM in O(hours) actually care about its contents/findings/remediations - it's just a tick box in the process of day-to-day ops.
I work at a mid-tier FAANG; our post mortems have an SLA in the 7-14 day range. Nobody seriously wants a full PM in hours.
They may want a mitigation or RCA in hours, but even AWS gives us NDA-restricted PMs in >24 hours.
Apples to oranges
And to be able to reconstruct the chain of events after the components in question have exploded and been scattered throughout south-east Asia is incredible.
My impression was that the defective part was still inside the engine when it landed.
Makes it even more impressive: the parts that were actually implicated in the explosion itself (and scattered from the aircraft) were not defective, so the investigation had to go through parts which did not seem to have exploded in order to track down the defect.
Or at least, I assume the turbine parts weren’t defective, although given what seems to be quite a happy-go-lucky approach to manufacturing defects in Hucknall, maybe my assumption is not made on solid grounds…
Probably a reference to other incidents. Shout out to the NTSB for fighting off alligators while investigating this crash... https://en.wikipedia.org/wiki/ValuJet_Flight_592
Aviation is great because the industry learns so much after incidents and accidents. There is a culture of trying to improve, rather than merely seeking culprits.
However, I have been told by an insider that supply chain integrity is an underappreciated issue. Someone has been caught selling fake plane parts through an elaborate scheme, and there are other suspicious suppliers, which is a bit unsettling:
"Safran confirmed the fraudulent documentation, launching an investigation that found thousands of parts across at least 126 CFM56 engines were sold without a legitimate airworthiness certificate."
https://www.businessinsider.com/scammer-fooled-us-airlines-b...
I suspect this is precisely what is happening in Russian civil aviation now. No legit parts supplied, so there will be a lot of fake/problematic parts imported through black channels.
Admiral Cloudberg has covered a case where counterfeit or EOL-but-with-new-paperwork components were involved in a crash.
https://admiralcloudberg.medium.com/riven-by-deceit-the-cras...
I agree, and I also enjoy the attitude. While in my profession the postmortem's goal is finding someone to blame, here the attitude is towards preventing it from happening again, no matter what. Or at least that's how I feel.
Your profession? Or do you mean your company? Unless it's some very specific profession I don't know about, that would usually imply the company is dysfunctional.
It must have something to do with the number of mistakes, otherwise it's all a waste of time!
It's all well and good responding to mistakes as thoroughly as possible, but if it's not reducing the number of mistakes, what's it all for?
Not really. Imagine two systems with the same number of mistakes. (Here the mistakes can be either bugs or operator mistakes.)
One is designed such that every mistake brings the whole system down for a day with millions of dollars of lost revenue each time.
The other is designed such that when a mistake happens it is caught early, and when it is not caught it only impacts some limited parts of the system and recovering from the mistake is fast and reliable.
They both have the same number of mistakes, yet one of the two systems is vastly more reliable.
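A trivial sketch of that difference, using an operator typo in a config file as the mistake (all names invented):

    # Design 1: a typo'd key in the config file silently becomes 0
    # workers; the mistake surfaces hours later as a stalled queue
    # and a day-long outage.
    def workers_fragile(config):
        return int(config.get("max_workers", 0))

    # Design 2: the same mistake is caught at startup with a legible
    # message, and rolling back the config is the whole recovery.
    def workers_robust(config):
        value = config.get("max_workers")
        if not isinstance(value, int) or not 1 <= value <= 512:
            raise SystemExit(f"refusing to start: max_workers={value!r}")
        return value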
For reducing their impact.
Aerospace things have to be like this or they just wouldn’t work at all. There are just too many points of failure and redundancy is capped by physics. When there’s a million things which if they went wrong could cause catastrophic failure, you have to be really good at learning how to not make mistakes.
Not exactly. The idea is not to never make mistakes; it's whatcha gonna do about X when (not if) it fails.
The Checklist Manifesto (2009) is a great short book that shows how using simple checklists would help immensely in many different industries, especially in medicine (the author is a surgeon).
Checklists of course are not the same as detailed post-mortems but they belong to the same way of thinking. And they would cost pretty much nothing to implement.
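To the "cost pretty much nothing" point: a checklist gate can literally be a few lines of Python (items invented for illustration):

    CHECKLIST = [
        "database migration is backward compatible",
        "rollback build verified",
        "on-call engineer notified",
    ]

    def run_checklist():
        for item in CHECKLIST:
            answer = input(f"[ ] {item} - done? (y/n) ").strip().lower()
            if answer != "y":
                raise SystemExit(f"aborting: '{item}' not confirmed")
        print("Checklist complete; proceed.")

    if __name__ == "__main__":
        run_checklist()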
Also CRM (crew resource management): it's very important to have a culture where underlings feel they can speak up when something doesn't look right -- or when a checklist item is overlooked, for that matter.
Yes, but they do have one critical failure mode: that the checklist failed to account for something (or that an expected reaction to a step being performed didn’t occur).
I was a submarine nuclear reactor operator, and one of my Commanding Officers once ordered that we stop using checklists during routine operations for precisely this reason. Instead, we had to fully read and parse the source documentation for every step. Before that, while we of course had the source documents open, they served as more of a backstop.
His argument – which I to some extent agree with – was that by reading the source documentation every time, we would better engage our critical thinking and assess plant conditions, rather than skimming a simplified version. To be clear, the checklists had been generated and approved by our Engineering Officer, but they were still simplifications.
I’d love to be an engineer with unlimited time budget to worry about “when, not if, X happens” (to quote a sibling comment).
But people don’t tend to die when we mess up, so we don’t get that budget.
There's a slight difference between the kind of damage a malfunctioning airplane causes and a button on an e-commerce shop rendering improperly in one of the browsers. My point is that the level of investment in reliability and process should be proportional to the potential damage of an incident.
This kind of makes sense, but it is only possible because of public pressure/interest. Many people are irrationally emotional about flying (fear, excitement, etc.), which is why articles and documentaries like this post are so popular.
On a side note, that's also why there's all the nonsense security theater at airports.
Hard agree. Civil & mechanical engineering have a culture and history of blameless analysis of failure. Software engineering could learn from them.
See the excellent To Engineer Is Human, on just this topic of analyzed failures in civil engineering.
A colleague of mine came from a major aviation design company before joining tech and said they were in a state of culture shock at how critical systems were designed and monitored here. Even granting that a billing system has no hard real-time requirements, they were surprised at just how lax tech design patterns tended to be.