In my current workplace (BigCo), we know exactly what's wrong with our alert system. We get alerts that we can't shut off, because they (legitimately) represent customer downtime, and whose root cause we either can't identify (lack of observability infrastructure) or can't fix (the fix is non-trivial and management won't prioritize).
Running on-call well is a culture problem. You need management to prioritize observability (you can't fix what you can't show as being broken), then you need management to build a no-broken-windows culture (feature development stops if anything is broken).
Technical tools cannot fix culture problems!
edit: management not talking to engineers, or being aware of problems and deciding not to prioritize fixing them, are both culture problems. The way you fix culture problems, as someone who is not in management, is to either turn your brain off and accept that life is imperfect (i.e. fix yourself instead of the root cause), or to find a different job (i.e. if the culture problem is so bad that it's leading to burnout). In any event, cultural problems cannot be solved with technical tools.
Or maybe page your managers, so that they can escalate the situation. They will be more aligned on solving the cultural problems if they get woken up too.
I have half-jokingly suggested that an out-of-hours page should cost the company $10k to incentivise actually fixing problems rather than releasing broken products. But I haven't thought of a way of getting around the perverse incentive to create bugs in order to get the $10k.
The cost is the cost of paying you to fix outages on overtime pay instead of working on the product.
Overtime pay? Is this common for oncalls? Never gotten it myself, every time I ask they reply with "just take the time back on another day" as if my time is fungible. Weekend time is worth far more to me than weekday time
Some companies do on-call bonuses, overtime pay for on-call incidents, or other schemes.
In my experience, it’s not a net win. They’ve budgeted the same amount for compensation either way, so you’re probably getting lower base comp if they’re allocating some of it for on-call.
It also creates an atmosphere where on-call becomes more normalized, because you’re getting paid extra to do it. Some people, usually young single people, will try to milk the overtime for as much as they can, dragging out the hours spent doing on-call work because every extra hour spent on the problem makes their paycheck bigger.
One company I worked for introduced a trivial ($100 or something) gift card bonus for closing a certain number of bug tickets.
The number of people who started pushing code with subtle bugs so they could create a ticket for it, fix their own bug, and get closer to that $100 gift card was shocking to me.
I can’t imagine the chaos that would occur if something came with a $10K bonus attached. Some people will bend over backward to get even tiny rewards. Dangling a $10K reward would get the wheels turning in their heads immediately.
Or maybe page your managers, such that they can fire you
Then... problem solved!
Yeah, the best managers I worked with used to be on the same on-call rotation, so they would also get paged every time. That helped build empathy and visibility into the situation.
Wouldn't the manager of one team be part of every shift in such a setup?
I work on a team which runs hyper critical infra on all production machines at BigCo and have the same experience as you.
The problem is not the alerts (the alerts actually are catching real problems). The problem is the following:
1. The team is understaffed, so sometimes spending a few days root causing an alert is not prioritized.
2. When alerts are root caused, sometimes the work to fix the root cause is not prioritized.
3. A culture on the team which allows alerts to go untriaged due to desensitization.
Our headcount got reduced by ~40% and — surprise surprise — reliability and on-call got much worse. Senior leadership has made the decision that the cost cuts are worth the decreased reliability so nothing is going to change.
The job market is rough so people put up with this for now.
When describing infrastructure, words matter. When you describe something as “hyper critical infrastructure” it implies that tens to thousands of human beings will die within seconds of failure of said “hyper critical” infrastructure. The way the rest of your comment is worded implies that’s not what you’re actually describing and makes the words “hyper critical infrastructure” irresponsible for you to use.
I don’t mean to imply there is some kind of failure magnitude competition, I just want to reinforce that software “engineering” already has a huge problem with abject neglect of the lessons that other sign-and-stamp engineering fields have already learned and acted on. We code slingers are not in uncharted territory; we just need to learn from our predecessors and peers who build literal bridges and towers, and force management to treat our field in the same way.
Words matter, but so does context. You weren’t confused by the words here, so why assume others would be?
Hyper critical means that if it stops working, potentially billions of dollars are lost for the employees and shareholders. Given that the FAA values a human life at $9 million, that actually fits your arbitrary criteria of what I am allowed to call my job.
It sounds like you forgot to make an SLO? If an alert is not actionable because it's impossible to resolve, even though it has customer impact, then it should be an SLO, not an alert.
SLO as in “service level objective”? How does defining an SLO stop the existence of alerts?
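A minimal sketch of one common way to read the SLO suggestion, assuming the SLO is a success-rate target over some window: instead of paging on every failure, you track how fast the error budget is being burned and page only when the burn rate crosses a threshold. The names `burn_rate` and `should_page`, the 99.9% target, and the 14.4 threshold are illustrative assumptions on my part, not anything stated in this thread.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Rate at which the error budget is consumed.

    1.0 means errors arrive exactly as fast as the SLO allows;
    values above 1.0 mean the budget will run out early.
    """
    if total == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target  # allowed failure fraction
    observed_error_rate = failed / total
    return observed_error_rate / error_budget


def should_page(failed: int, total: int,
                slo_target: float = 0.999,    # assumed target
                threshold: float = 14.4) -> bool:  # assumed threshold
    """Page a human only when the budget is burning fast."""
    return burn_rate(failed, total, slo_target) >= threshold


# 10 failures in 10,000 requests is within a 99.9% SLO: no page.
print(should_page(failed=10, total=10_000))
# 200 failures in 10,000 requests burns the budget ~20x too fast: page.
print(should_page(failed=200, total=10_000))
```

The alerts don't vanish under this scheme. Failures that stay inside the budget get reported and reviewed in working hours rather than waking someone up, which is only an improvement if somebody actually acts on the report.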
We as an industry need engineering management types to realize that we cannot prioritize the roadmap to the complete detriment of reliability.
If your org claims to be "customer obsessed", then reframe your alerts in terms of their impact on customers. Don't say "elevated 502 errors"; say "customers encountered errors X times."
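As a rough sketch of that reframing, assuming you can get at failed requests tagged with a customer identifier: the function name `customer_impact_summary` and the `(customer_id, status_code)` record shape are hypothetical, so adapt them to whatever your logs actually contain.

```python
def customer_impact_summary(failed_requests, window_minutes: int) -> str:
    """Turn raw failed requests into a customer-facing sentence.

    failed_requests: iterable of (customer_id, status_code) tuples,
    one per failed request (hypothetical record shape).
    """
    failures = list(failed_requests)
    # Count distinct customers, not just raw error volume.
    customers = {customer_id for customer_id, _status in failures}
    return (f"{len(customers)} customers hit errors {len(failures)} times "
            f"in the last {window_minutes} minutes")


failures = [("acme", 502), ("acme", 502), ("globex", 502)]
print(customer_impact_summary(failures, window_minutes=15))
```

The same data that drives "elevated 502 errors" drives this message. Only the framing changes.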
Start putting together conference bridges for "P1 customer outages" and have someone who is responsible for calling the developers, PMs, scrum masters, managers, etc. on the team and getting them all on at 1 AM to fix it.
That’s true. But technical tools can help you highlight culture problems so that they’re easier to discuss and fix. It’s been a minute since I’ve had to process exactly the kind of on-call/alert problem we’re discussing here, but this does feel like the kind of tool that would help sell the kinds of management/culture changes necessary to really improve things, if not fix all of them.
Switching tools, or adopting new (unproven) ones doesn't address or fix the communication issue.
The existing tools mentioned can show the metrics. Management needs an education - and that is part of the engineering job.
Isn’t that bizarre? In all my years as an engineer I can count the number of managers that went to learn about engineering by themselves, on one hand.
It’s literally their job, but somehow they feel they can do it without understanding it.
I don't think it is bizarre. I see lots of MBAs running things. They don't have the engineering background, they have the "resources management" background.
I think the engineer brings the numbers to management so they can decide on a course.
I prefer the situation where the CTO has no MBA and worked their way up - but that is uncommon IME.
So, in many orgs, the engineer puts their comms hat on and presents a solid case.
The engineer who can communicate well and show the metrics is typically the one who can get promoted to the decision-maker role. First from the bottom up, then as a great leader.
That part is fine. What I do not understand is why there is so little interest in learning what makes engineering different from running a widgets factory.
“Tell me why it won’t work” is a fine question, but it’d be nice if I didn’t have to force all their education on them.
E.g. how many managers ignore that oft-repeated adage that 9 women cannot have a baby in a month, and just spam more people on a project in the hope it’ll go faster?
I was lucky enough to join a company where management does this. The managers were made to do this by experienced engineers who explained to them in no uncertain terms that stuff was broken and nothing was being shipped until things stopped being broken. Unless you have good managers this won’t happen without a fight and it’s a fight I think we as engineers need to take.
Some managers in other teams played the “oh it’s not super high impact it’s not prioritized” game, and those teams now own a bunch of broken stuff and make very slow progress because their developers are tiptoeing around broken glass, and end up building even more broken stuff because nothing they own is robust. Those managers played themselves.
Communication with management is bidirectional, sometimes they need a lot of persuasion.
Sounds like managing up, i.e. doing IC workload and the manager's job. Hard pass.
If you'd rather be miserable at work instead of content at work, that's a choice.
I tried that approach with a colleague and it just got more and more heated and frustrating. At the same time we were getting heat for reliability. I ended up quitting. Since then I heard from a colleague that they made some staff redundant, on a team that was already underwater.
I doubt very much that my experience was unique. In my new position we have the same problems with reliability but I don’t get involved in the political side of trying to argue about it, just turn up and do my 9-5. I’m a lot less stressed now!
I completely agree that technical tools cannot fix culture problems.
However, one of the things that I noticed in my previous companies was that my management chain wasn't even aware that the problem was this bad.
We also wanted to add better reporting (like the alert analytics) so that people have more visibility into the state of alerts + on-call load on engineers.
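To make that concrete, here is a small sketch of the kind of report I mean, assuming you can export pages as `(engineer, alert_name, timestamp)` tuples. The function names, the record shape, and the 09:00 to 18:00 weekday definition of working hours are all assumptions for illustration.

```python
from collections import Counter
from datetime import datetime


def is_out_of_hours(ts: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """True on weekends or outside start_hour..end_hour on weekdays."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def on_call_load(pages):
    """Summarise pages given as (engineer, alert_name, timestamp) tuples."""
    per_engineer, out_of_hours, per_alert = Counter(), Counter(), Counter()
    for engineer, alert_name, ts in pages:
        per_engineer[engineer] += 1
        per_alert[alert_name] += 1
        if is_out_of_hours(ts):
            out_of_hours[engineer] += 1
    return {
        "pages_per_engineer": dict(per_engineer),
        "out_of_hours_per_engineer": dict(out_of_hours),
        "noisiest_alerts": per_alert.most_common(3),
    }


pages = [
    ("ana", "disk_full", datetime(2024, 1, 6, 3, 0)),    # Saturday night
    ("ana", "disk_full", datetime(2024, 1, 2, 14, 0)),   # Tuesday afternoon
    ("raj", "api_5xx", datetime(2024, 1, 2, 2, 30)),     # Tuesday, 02:30
]
print(on_call_load(pages))
```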
What strategies have worked well for you when it comes to management prioritizing these problems?
Show them the costs! Wasted time, wasted resources, wasted money. Show the waste and come with the plan to reduce the waste. Alerts, on-calls and tests are all waste reduction.
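A back-of-the-envelope sketch of what that could look like, assuming you know roughly how many pages you get and how long each one takes. The function name and every number below are made up for illustration, and this only counts direct engineer time, not lost sleep, context switching, or customer impact.

```python
def monthly_alert_cost(pages_per_month: int,
                       avg_minutes_per_page: float,
                       loaded_hourly_rate: float) -> float:
    """Direct engineer-time cost of handling pages for one month."""
    hours_spent = pages_per_month * avg_minutes_per_page / 60
    return hours_spent * loaded_hourly_rate


# Hypothetical inputs: 40 pages a month, 45 minutes each, $100/hour loaded cost.
cost = monthly_alert_cost(pages_per_month=40,
                          avg_minutes_per_page=45,
                          loaded_hourly_rate=100)
print(f"${cost:,.0f} per month")  # 30 hours of engineer time
```

Put that number next to the estimated cost of fixing the root cause and the prioritization conversation becomes a lot more concrete.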
"We're paying down our technical debt"
Isn't that a cultural problem?
Obviously, the best way to get management's attention is to start a stop and frisk customer engagement plan.