
Show HN: I built an open-source tool to make on-call suck less

solatic
32 replies
22h16m

In my current workplace (BigCo), we know exactly what's wrong with our alert system. We get alerts that we can't shut off, because they (legitimately) represent customer downtime, and whose root cause we either can't identify (lack of observability infrastructure) or can't fix (the fix is non-trivial and management won't prioritize).

Running on-call well is a culture problem. You need management to prioritize observability (you can't fix what you can't show as being broken), then you need management to build a no-broken-windows culture (feature development stops if anything is broken).

Technical tools cannot fix culture problems!

edit: management not talking to engineers, or being aware of problems and deciding not to prioritize fixing them, are both culture problems. The way you fix culture problems, as someone who is not in management, is to either turn your brain off and accept that life is imperfect (i.e. fix yourself instead of the root cause), or to find a different job (i.e. if the culture problem is so bad that it's leading to burnout). In any event, cultural problems cannot be solved with technical tools.

whazor
9 replies
21h45m

Or maybe page your managers, so that they can escalate the situation. They will be more aligned on solving the cultural problems if they get woken up too.

anitil
4 replies
19h8m

I have half-jokingly suggested that an out-of-hours page should cost the company $10k to incentivise actually fixing problems rather than releasing broken products. But I haven't thought of a way of getting around the perverse incentive to create bugs in order to get the $10k

lupire
2 replies
17h46m

The cost is the cost of paying you to fix outages on overtime pay instead of working on the product.

hackeman300
1 replies
17h29m

Overtime pay? Is this common for on-call? Never gotten it myself; every time I ask they reply with "just take the time back on another day", as if my time is fungible. Weekend time is worth far more to me than weekday time.

Aurornis
0 replies
17h12m

Some companies do on-call bonuses, overtime pay for on-call incidents, or other schemes.

In my experience, it’s not a net win. They’ve budgeted the same amount for compensation either way, so you’re probably getting lower base comp if they’re allocating some of it for on-call.

It also creates an atmosphere where on-call becomes more normalized, because you’re getting paid extra to do it. Some people, usually young single people, will try to milk the overtime for as much as they can, dragging out the hours spent doing on-call work because every extra hour spent on the problem makes their paycheck bigger.

Aurornis
0 replies
17h15m

One company I worked for introduced a trivial ($100 or something) gift card bonus for closing a certain number of bug tickets.

The number of people who started pushing code with subtle bugs so they could create a ticket for it, fix their own bug, and get closer to that $100 gift card was shocking to me.

I can’t imagine the chaos that would occur if something came with a $10K bonus attached. Some people will bend over backward to get even tiny rewards. Dangling a $10K reward would get the wheels turning in their heads immediately.

blitzar
1 replies
20h31m

Or maybe page your managers, such that they can fire you

jobtemp
0 replies
20h23m

Then... problem solved!

aray07
1 replies
21h36m

Yeah, the best managers I worked with used to be on the same on-call rotation, so they would also get paged every time. That helped build empathy and visibility into the situation.

Too
0 replies
11h58m

Wouldn't the manager of one team be part of every shift in such a setup?

__turbobrew__
8 replies
21h56m

I work on a team which runs hyper critical infra on all production machines at BigCo and have the same experience as you.

The problem is not the alerts — the alerts actually are catching real problems — the problem is the following:

1. The team is understaffed so sometimes spending a few days root causing an alert is not prioritized.

2. When alerts are root caused sometimes the work to fix the root cause is not prioritized.

3. A culture on the team which allows alerts to go untriaged due to desensitization.

Our headcount got reduced by ~40% and — surprise surprise — reliability and on-call got much worse. Senior leadership has made the decision that the cost cuts are worth the decreased reliability so nothing is going to change.

The job market is rough so people put up with this for now.

callalex
2 replies
12h27m

When describing infrastructure, words matter. When you describe something as “hyper critical infrastructure” it implies that tens to thousands of human beings will die within seconds of failure of said “hyper critical” infrastructure. The way the rest of your comment is worded implies that’s not what you’re actually describing and makes the words “hyper critical infrastructure” irresponsible for you to use.

I don’t mean to imply there is some kind of failure magnitude competition, I just want to reinforce that software “engineering” already has a huge problem with abject neglect of the lessons that other sign-and-stamp engineering fields have already learned and fixed. We code slingers are not in uncharted territory, we just need to learn from our predecessors and peers that build literal bridges and towers and force management to treat our field in the same way.

dambi0
0 replies
12h3m

Words matter, but so does context. You weren’t confused by the words here, so why assume others would be?

__turbobrew__
0 replies
2h23m

Hyper critical means that if it stops working, potentially billions of dollars are lost for the employees and shareholders. Given that the FAA values a human life at $9 million, that actually fits your arbitrary criteria for what I am allowed to call my job.

lupire
1 replies
17h52m

It sounds like you forgot to make an SLO? If an alert is not actionable because it's impossible to resolve, even though it has customer impact, then it should be an SLO, not an alert.

lazyasciiart
0 replies
10h30m

SLO as in “service level objective”? How does defining an SLO stop the existence of alerts?

kbar13
1 replies
18h26m

we as an industry need to have engineering management types realize that we cannot prioritize roadmap to the complete detriment of reliability

cdchn
0 replies
17h50m

If your org claims to be "customer obsessed" then reframe your alerts in terms of their impact on customers. Don't say "elevated 502 errors", say "customers encountered errors X times."

mike_d
0 replies
17h33m

Start putting together conference bridges for "P1 customer outages" and have someone who is responsible for calling the developers, PMs, scrum masters, managers, etc. on the team and getting them all on at 1 AM to fix it.

hoistbypetard
4 replies
22h11m

That’s true. But technical tools can help you highlight culture problems so that they’re easier to discuss and fix. It’s been a minute since I’ve had to process exactly the kind of on-call/alert problem we’re discussing here, but this does feel like the kind of tool that would help sell the kinds of management/culture changes necessary to really improve things, if not fix all of them.

djbusby
3 replies
20h1m

Switching tools, or adopting new (unproven) ones doesn't address or fix the communication issue.

The existing tools mentioned can show the metrics. Management needs an education - and that is part of the engineering job.

Aeolun
2 replies
19h51m

Management needs an education - and that is part of the engineering job

Isn’t that bizarre? In all my years as an engineer, I can count on one hand the number of managers who went to learn about engineering by themselves.

It’s literally their job, but somehow they feel they can do it without understanding it.

djbusby
1 replies
19h13m

I don't think it is bizarre. I see lots of MBAs running things. They don't have the engineering background, they have the "resources management" background.

I think engineer brings the numbers to management to decide course.

I prefer the situation where the CTO has no MBA and worked their way up - but that is uncommon IME.

So, in many orgs, the engineer puts their comms hat on and presents a solid case.

The engineer who can communicate well and show the metrics is typically the one who can get promoted to the decision-maker role: first from the bottom up, then as a great leader.

Aeolun
0 replies
15h35m

I don't think it is bizarre. I see lots of MBAs running things. They don't have the engineering background, they have the "resources management" background.

That part is fine. What I do not understand is why there is so little interest in learning what makes engineering different from running a widgets factory.

“Tell me why it won’t work” is a fine question, but it’d be nice if I didn’t have to force all their education on them.

E.g. how many managers ignore that oft-repeated adage that nine women cannot have a baby in a month, and just throw more people at a project in the hope it’ll go faster.

efxhoy
3 replies
20h8m

Running on-call well is a culture problem. You need management to prioritize observability (you can't fix what you can't show as being broken), then you need management to build a no-broken-windows culture (feature development stops if anything is broken).

I was lucky enough to join a company where management does this. The managers were made to do this by experienced engineers who explained to them in no uncertain terms that stuff was broken and nothing was being shipped until things stopped being broken. Unless you have good managers this won’t happen without a fight and it’s a fight I think we as engineers need to take.

Some managers in other teams played the “oh it’s not super high impact it’s not prioritized” game, and those teams now own a bunch of broken stuff and make very slow progress because their developers are tiptoeing around broken glass, and end up building even more broken stuff because nothing they own is robust. Those managers played themselves.

Communication with management is bidirectional, sometimes they need a lot of persuasion.

paulryanrogers
1 replies
17h59m

Communication with management is bidirectional, sometimes they need a lot of persuasion.

Sounds like managing up, i.e. doing IC workload and the manager's job. Hard pass.

lupire
0 replies
17h50m

If you'd rather be miserable at work instead of content at work, that's a choice.

physicsguy
0 replies
11h13m

I tried that approach with a colleague and it just got more and more heated and frustrating. At the same time we were getting heat for reliability. I ended up quitting. Since then I heard from a colleague that they made some staff redundant, on a team that was already underwater.

I doubt very much that my experience was unique. In my new position we have the same problems with reliability but I don’t get involved in the political side of trying to argue about it, just turn up and do my 9-5. I’m a lot less stressed now!

aray07
2 replies
22h13m

I completely agree that technical tools cannot fix culture problems.

However, one of the things that I noticed in my previous companies was that my management chain wasn't even aware that the problem was this bad.

We also wanted to add better reporting (like the alert analytics) so that people have more visibility into the state of alerts + on-call load on engineers.

What strategies have worked well for you when it comes to management prioritizing these problems?

djbusby
0 replies
19h58m

Show them the costs! Wasted time, wasted resources, wasted money. Show the waste and come with the plan to reduce the waste. Alerts, on-calls and tests are all waste reduction.
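Even a back-of-the-envelope number lands well. A rough sketch of the arithmetic (every number below is invented; plug in your own):

```python
# Back-of-the-envelope cost of noisy alerts. All numbers are made up; substitute your own.
pages_per_month = 120
noise_fraction = 0.7           # share of pages that turn out to be non-actionable
minutes_per_page = 25          # triage plus context-switch cost, per page
engineers_interrupted = 1.5    # average number of people pulled in per page
loaded_hourly_rate = 120       # fully loaded cost per engineer-hour, in dollars

wasted_hours = pages_per_month * noise_fraction * (minutes_per_page / 60) * engineers_interrupted
wasted_dollars = wasted_hours * loaded_hourly_rate
print(f"~{wasted_hours:.0f} engineer-hours and ~${wasted_dollars:,.0f} per month spent on noise")
```

That is the kind of number a manager can weigh against a feature.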

"We're paying down our technical debt"

dennis_jeeves2
0 replies
21h26m

However, one of the things that I noticed in my previous companies was that my management chain wasn't even aware that the problem was this bad.

Isn't that a cultural problem?

cyanydeez
0 replies
22h3m

Obviously, the best way to get management's attention is to start a stop and frisk customer engagement plan.

dclowd9901
23 replies
21h53m

It reduces alert fatigue by classifying alerts as actionable or noisy and providing contextual information for handling alerts.

grimace face

I might be missing context here, but this kind of problem speaks more to a company’s inability to create useful observability, or worse, their lack of conviction around solving noisy alerts (which upon investigation might not even be “just” noise)! Your product is welcome and we can certainly use more competition in this space, but this aspect of it is basically enabling bad cultural practices and I wouldn’t highlight it as a main selling point.

gklitz
3 replies
10h34m

enabling bad cultural practices

I strongly disagree. There is nothing culturally bad in a system issuing an error if there is an error. Sometimes systems issue errors that are considered noise by supporters because they are not actionable, but forcing a system to not issue an error just because your support team cannot directly take action on it is an extremely odd leakage of team responsibilities and bound to have unintended consequences. Imagine a developer telling management that they didn’t implement error checking on some edge case because the support team told them they didn’t have documentation about how to take action, for instance. The appropriate response there would be “why on earth are you asking support for permission to add error messages for a known error?”. On the other hand, if a support team is drowning in noisy error messages they need tooling to make it easy to distinguish between those and other messages that need to be reviewed or have action taken.

remus
0 replies
10h30m

There is nothing culturally bad in a system issuing an error if there is an error.

That's true, but if the error says "PANIC! EVERYTHING IS DOWN" when it's not true, then it's asking for an action that's outsized to the problem. Error messages are fine, but they just need to be classified and responded to correctly, and noisy alerts are typically the ones that are misclassified and demanding attention they (probably) don't deserve.

madeofpalk
0 replies
6h46m

The context here is alerts triggering on-call.

If the error is not-actionable, why wake someone up in the middle of the night because of it?

I don't think anyone is rejecting the observability of these errors, but just that there's no point in having it alert/wake someone up unnecessarily.

everforward
0 replies
4h4m

Then it either shouldn't be an alert (and instead part of some kind of summary report or some such) or the devs need to take on call. It is an exercise in frustration for everyone to route the page to ops just to make ops call dev; that means dev still has to have an oncall rotation, they might as well just take the page directly.

The unintended consequence of forcing alerts down ops' throat is that they gradually care less about pages, because there's a very good chance that each one is unactionable. I've worked at places that do this, and I've seen it happen first-hand more than once.

It starts with frustration and ops being less helpful to devs, and ends in a jaded acceptance where ops people start telling each other "just close it and see if it happens a second or third time, that alert never means anything". At that point, the system may as well not emit the errors, because no one is looking at the alerts anyway.

DanHulton
3 replies
16h54m

I agree with your intent and desire, but the fact is that this problem _keeps happening,_ and we can't fix it by advocating "well, just do alerts better." There's a lot of cultural inertia at a lot of places that leads to creating too many, too low-signal alerts, and fixing that across an entire company -- or, hell, industry -- is a magnificently tall order.

However, installing a tool to specifically rein in those garbage noisy alerts is a potentially easy, significant win for the time and mental health of on-call engineers.

I mean, it sounds like you can then afterwards go in and identify the alerts that are just noise, and having that data means you can take action. Maybe contact the teams that are writing the noisiest alerts, or prepare some data-driven engineering standards for the company, whatever. But that still falls into "fix the culture", which is famously hard to do by fiat.

steve1977
2 replies
14h4m

If you work at a shitty place, focus your energy on leaving the shitty place.

karamanolev
0 replies
11h4m

The shittiness of a place is defined by a lot more, and more important, attributes than alert hygiene. Culture, pay, location, industry, leadership. If one leaves companies for (relatively) minor things like that, there are basically no companies left to work for.

jgraettinger1
0 replies
13h49m

Or, it can be self-evident to all that today's work can deliver more benefit if focused elsewhere, instead of root-causing fickle alert flakes. Customers buy products and services, not alert hygiene.

williamdclt
2 replies
2h28m

which upon investigation might not even be “just” noise

My company (like so many) is struggling a bit with the culture around noisy alarms. Not only is noise tolerated, but when someone closes an alarm because it's "known to be noise" and I prod them, it turns out that there is a very real impact on the user; it's just that nobody bothered to look into it. The alarm rings, the on-call engineer hopes that it closes itself soon enough, it does, so they consider it "false positive and noisy" even though there was impact on the user during those few minutes.

The only way to fight that is a zero-tolerance culture on alarms, which means no false positive is ever tolerated: fix it.

stackskipton
0 replies
2h13m

Ops person here, fixing it is generally impossible as most alarms are due to code and development teams don't give a shit. "It's just one false alarm occasionally, what's the big deal?" or "We are no longer working on that product." or "I have 14 features to do, go away ops and deal."

If we completely shut off the alarm, the one time it's actually an issue I'll get dinged for shutting down monitoring.

So yeah, I mark its priority as low so it doesn't wake me up in PagerDuty and move on.

bongodongobob
0 replies
53m

A lot of times ops can't fix it (false positive); it's another team's responsibility. And if that other team can't or won't fix it, ops is screwed with constant false positives. At that point, it's in ops' best interest to ignore it, let the actual positive wreak havoc, and point at the other team. If they don't have the capability of putting pressure on that team, sometimes a fire is the only way to do it. I'm not saying this is a good idea, but bureaucracy is going to bureaucracy. You have to make the other team feel your pain.

unethical_ban
2 replies
15h21m

If the tool works properly, then the value proposition is:

Sure, you could spend months trying to fix or tune out every useless alarm type, or try to hack your alert manager/email inbox with filters for things you know will get fixed in a few months - OR, you can use this tool that can quickly classify things as important or not.

callalex
1 replies
12h41m

If something that triggers a pager takes months to resolve you already work at an organization that is so ossified it will be unwilling to adopt a random bandaid startup product like this.

unethical_ban
0 replies
12h22m

I'm sure that's true in many places. The number of large companies that deal with this kind of thing (understaffed teams operating hundreds or thousands of devices) is quite high. Some can pull off shadow IT or exceptions for free software.

stitched2gethr
1 replies
16h47m

If I may try to counter this point.

You are right, but there's ideal state and there's the real world. When on call most of my time is spent trying to make on call better. Reducing noise, providing more context when the alert and logs are lacking, and of course fixing the real issues that alerts have identified. That said there is a period of time in between receiving non-actionable alerts and classifying them as such, and more context without using brain power is always welcome. I think I'll give it a shot.

aray07
0 replies
15h1m

Thanks for the feedback. Yeah, ideally, teams would go back and fix their misconfigured alerts. Unfortunately, on-call ends and people forget. The aim was both to provide context when an alert comes up and to provide a report at the end so everybody has context into the state of alerts.

renger6002
1 replies
13h41m

Agreed. This doesn't make the problem better, it's a bandaid solution that can make the problem worse by allowing you to ignore it for longer.

Iterating on your alarms is super informative about the underlying product. It'll point to how you might improve your KPI measurements, or find bugs you didn't know were there.

CPLX
0 replies
9h21m

Bandaids are a valuable and useful product used billions of times around the globe every year.

zbentley
0 replies
1h29m

Missing from the context of sibling replies is the (in my experience as an SRE, quite large) category of alerts that are mostly-but-not-entirely noise and high-effort-duration to properly improve.

Consider a not-that-hypothetical example: "host computer is unreachable" alerts that page oncall when they arrive for members of a fleet of critical database servers or replicas.

The alerts have proven their usefulness (they tend to arrive several minutes before application-level error spikes when a database is e.g. so overloaded the monitoring agent can't function or a replica is gone so changelogs are overflowing) ... when they're genuine. However, they're mostly not genuine: alerting agents crash and automatic-restart-service init scripts bug out or give up; per-database-owner customizations in hosts' available file descriptor numbers are propagated incorrectly to non-database services and prevent the alerting agent from running, databases that serve infrequent-but-critical on-demand reporting loads are subjected to tens-of-minutes-long load spikes during which the host is doing what it's supposed to but so pegged that the alerting agent won't work, and so on.

What do you do with those alerts?

"Just fix the problems causing the false positives!" Fine, but that takes a lot of time and coordinated effort, even if the oncall folks are empowered to prioritize the work getting done (which is far from a given at many companies, for reasons both good and bad): auto-restart-agent scripts can be replaced with better scripts (time, effort, debugging of some hokey bash that needs to run on a wide variety of environments) or systemd (time, effort, maintenance windows, and approval/retraining to update ancient linux distributions running critical databases). File descriptor/per-database tunings can be unified and continually audited for/invalidated before configs are pushed (developer effort, coordination with teams writing configs). Reporting databases can be upsized (money, maintenance windows) or the database processes can be moved into a cgroup to leave some resources to spare (effort, distro upgrades, maintenance windows).

That's going to take a while, if it ever happens to completion.

Meanwhile, this "host unreachable" alert is useless 90% of the time and very useful (as in: it can be leveraged to prevent downtime for customers entirely) the remaining 10% of the time.

Like, sure, some of those issues are stupid. But none are hypothetical, all are younger than 5y, and I bet this kind of struggle is common and representative even at companies who are invested in operations and operations staff.

That's not an "inability to create useful observability", that's a genuinely hard problem resulting in noisy, spurious alerts that, depending on the rate-of-change/regulatory space of the company, might persist for months or years. What's more, alert management is an ongoing process. Even if one family of noisy alerts is addressed, another one will emerge as new behaviors and technologies are adopted.

I guess this is all to say that I don't think tools like Opslane (which I have not used) are "enabling bad cultural practices". Organizations that don't give a shit about operations will continue to suck at operations no matter what tools they use. But products like Opslane are valuable even (especially?) in capable, operations-focused organizations as well.

remus
0 replies
10h33m

To give a more charitable take, in practice it's easy for noisy alerts to creep in and a tool like this could be a good nudge to dig a bit deeper on that alert and why it's gone from useful to noise.

devjab
0 replies
12h51m

As someone who has worked in non-tech startups transitioning into enterprise, or in enterprise organisations, for decades… I can tell you that none of these companies or organisations are capable of meaningful interactions with operations. Disclaimer: I haven’t worked in operations, but I’ve been in relatively close proximity to them work-wise, for natural reasons, using their infrastructure and sometimes making their tools or helping them with things like automation and PowerShell.

Anyway, in a lot of places management sees IT as a necessary evil. Like a service center similar to HR, but less popular, because management genuinely doesn’t understand it and most IT departments lack HR’s political shrewdness and communication abilities. At the same time it’s not uncommon for users to be unable to tell support whether they’re on an Android or iOS device (yes, I’m serious). Sometimes employees won’t even differentiate between their professional and personal IT issues on their work devices. Which means that sometimes they’ll raise hell over things that might not warrant full alert systems for on-site support.

What might be challenging here is that you’ll still need someone to actually use the author’s tool correctly. Though that is probably going to be a lot easier than any sort of change management of an organisation’s relationship with IT.

aray07
0 replies
21h48m

Yeah, that's fair feedback. The main aim was to reduce the alert fatigue for on-call engineers and provide a way to get insight into the alerts at the end of the on-call shift.

This way there is data to make a case that certain alerts are noisy (for various reasons) and we should strive to reduce the time spent dealing with these alerts. Fixing some of them might be as easy as deleting them but for others might need dedicated time working on them.

Too
0 replies
11h7m

100x this. Garbage in = Garbage out.

In a similar mindset, I've seen attempts to "fix" flaky test suites by retrying failing tests 5 times until they pass. What happens: you just set a new baseline of shit allowed. This allows even more noise to enter the system, and you have to rerun the tests even more or need increasingly advanced tools to filter out the noise.

Once the new baseline is anchored, you become dependent on the filter. Now every tool that interacts with the metrics needs to be aware of the additional filter, and there may be more than just the Slack messages. Should your dashboard show the raw or the filtered metrics?

Devils advocate: consider an alert with 99/100 false positives. The LLM may be good at classifying it as noisy but will it do a better job than a human to react to the 1 true positive? Maybe, but at the same time it allows more such noise to accumulate in the system, in effect a net negative. It's better to remove such an alert instead. Even if the numbers were turned around in favor, that's a lot of complexity added.

The additional context this product provides may of course still be useful, and I applaud the effort. This product space does have a lot of potential for growth and addresses a real pain for operators. Be careful with using it as a substitute for proper alert hygiene and culture.

lars_francke
9 replies
21h58m

Shameless question, tangentially related to the topic.

We are based in Europe and have the problem that some of us sometimes just forget we're on call or are afraid that we'll miss OpsGenie notifications.

We're desperately looking for a hardware solution. I'd like something similar to the pagers of the past, but at least here in Germany they don't really seem to exist anymore. Ideally I'd have a Bluetooth dongle that alerts me on configurable notifications from my phone. Carrying this dongle for the week would be a physical reminder that I'm on call.

Does anyone know anything?

michaelt
2 replies
21h45m

A candy bar cell phone, paid for by your employer and handed to whoever is on call. People who don't want it can just forward it to their phone.

lars_francke
0 replies
8h33m

Yup, this is something we've thought about as well. We're a remote company so everyone would need to get their own but this option is definitely on the table.

crawfishphase
0 replies
20h37m

In this case, a satellite-enabled candybar. The disaster recovery policy and budget should be applicable here. Make sure it's able to share xG and a satellite tunnel for maximum value. Ensure the reporting system is satellite-enabled also. Added points if it's sending alerts to your handy BYOD. Disaster recovery is a big deal in 2024. All sorts of factors make satellite redundancy valuable in today's reality: coworkers on a hike or a boat, random 0-day stuff, and war can cut your normal internet. I have experienced all of these, only in the last 4 years, and more than once on each topic. Train your users to destroy it in case of war, as it's trackable by military tech. Put a sticker on it. Check out StackExchange for questions like this though?

linuxdude314
2 replies
18h30m

This really sounds like a _you_ problem and not something you need hardware to fix.

You can already enable silence/focus time bypass modes for apps like PagerDuty and such…

If you can’t develop some sense of responsibility to check if you’re on-call, frankly you have no business being in an on-call role.

No hardware will make you or your engineers more diligent. The only reason pagers made more sense than phones is/was because of protocol reasons NOT because it’s some separate device.

lars_francke
0 replies
10h2m

Yup, this is absolutely a "me" problem. Life just gets in the way. I'm on call, the kids ask to go to the pool, I leave my phone in the locker forgetting that I'm on-call. I visit friends, forget to bring my laptop etc.

I realize it's a "me" problem and therefore I'm looking for a solution. Others in the company have the same problem. That said: This is my very own company and I have a great sense of responsibility but I also have a shit-ton of other things in my head and I'm not the only one with this issue here.

The silence etc. bypass doesn't always work (I commented in another thread).

agent13
0 replies
14h1m

Wow, stackoverflow vibes here.

jobtemp
2 replies
20h5m

There are phone apps that can pierce through all silent or DND settings. Get one of those. If the same app could buzz at less than 50% battery to remind you to charge, that would help. Also, the same app could request confirmation of on-call status so they don't forget. If they don't confirm, someone else gets the shift.

Geezus_42
1 replies
16h39m

OpsGenie has that. We use it at my job. I'm not sure what problem OP is having. The phone call, text message, and app alert from OpsGenie are more than enough. The notification configuration is extremely flexible and each user can customize it as needed. From a user perspective, I don't know what else you could want.

I have no affiliation with OpsGenie outside of using at work.

lars_francke
0 replies
10h9m

We have OnePlus users with issues: https://support.atlassian.com/opsgenie-android/docs/troubles...

And we have one user with another brand that also locks down the notification/alert settings and kills apps in an attempt "to save battery", which can't be controlled.

aflag
8 replies
7h16m

It feels to me that using an LLM to classify alerts as noisy is just adding risk instead of fixing the root cause of the problem. If an alert is known to be noisy and has appeared on Slack before (which is how the LLM would figure out it's a noisy alert), then just remove the alert? Otherwise, how will the LLM know it's noise? Either it will correctly annoy you or hallucinate a reason to decide that the alert is just noise.

aray07
5 replies
6h33m

Yeah, that's the goal of adding the context and the report - to hopefully bring awareness to the team that this alert should be removed.

My rationale for flagging the alert was to help prioritization for the on-call (let's say there are multiple alerts going off at the same time).

ozim
4 replies
5h12m

That’s a people problem and you cannot fix people problems with tech. If no one cares to do the good job of managing alerts putting AI in front of it will not change that.

jrochkind1
2 replies
4h9m

An AI could help bring to your attention alerts that need managing. I like it better for this than for helping someone decide, in the moment of receiving an alert, whether or not to pay attention to it.

ozim
1 replies
3h49m

If someone ignores alerts they will keep ignoring them, but now you have automated part of the ignoring with "AI", and the human at the end will still ignore alerts the same.

Writing it out makes me laugh because it's like something from a Douglas Adams story: an Automated Ignoring System along with the Infinite Improbability Drive.

jrochkind1
0 replies
50m

I was thinking of simply pointing out which kinds of alerts need to be tuned to be less noisy.

digging
0 replies
3h26m

you cannot fix people problems with tech

For very specific values of "people problem", "fix", and "tech". In reality, a more true (and relevant) assertion is "appropriate tools can make virtually any problem more tractable."

For example, it takes an annoyed engineer to notice that the same flaky alert keeps going off and is noise. Then it takes non-trivial skill on their part to communicate the need to disable that alert. They will meet non-trivial resistance, because disabling alerts is dangerous. However, if the tools they are using say "This is a noisy alert, it hasn't been useful for 6 months," disabling that alert becomes more of a best practice for the organization.
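As a rough sketch of the kind of heuristic such a tool could apply, assuming you can export alert history with a name, a timestamp, and whether anyone acted on it (the field names below are invented, not any vendor's schema):

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical alert history export: one record per firing.
# Field names are invented; map them to whatever your alerting tool can export.
history = [
    {"name": "disk_almost_full", "fired_at": datetime(2024, 7, 1, 3, 2),  "acted_on": False},
    {"name": "disk_almost_full", "fired_at": datetime(2024, 7, 9, 4, 15), "acted_on": False},
    {"name": "api_5xx_spike",    "fired_at": datetime(2024, 7, 10, 1, 0), "acted_on": True},
]

def noisy_alerts(history, min_fires=5, max_useful_ratio=0.05, stale_days=180):
    """Flag alerts that fire often but almost never lead to action."""
    stats = defaultdict(lambda: {"fires": 0, "useful": 0, "last_useful": None})
    for event in history:
        s = stats[event["name"]]
        s["fires"] += 1
        if event["acted_on"]:
            s["useful"] += 1
            if s["last_useful"] is None or event["fired_at"] > s["last_useful"]:
                s["last_useful"] = event["fired_at"]

    cutoff = datetime.now() - timedelta(days=stale_days)
    flagged = []
    for name, s in stats.items():
        useful_ratio = s["useful"] / s["fires"]
        stale = s["last_useful"] is None or s["last_useful"] < cutoff
        if s["fires"] >= min_fires and useful_ratio <= max_useful_ratio and stale:
            flagged.append((name, s["fires"], useful_ratio))
    return flagged

for name, fires, ratio in noisy_alerts(history, min_fires=2):
    print(f"{name}: fired {fires} times, {ratio:.0%} useful -- candidate for tuning or removal")
```

Once the flagging is automatic and backed by data, "let's disable this one" stops being one engineer's opinion.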

jofer
0 replies
3h40m

There is a lot to be said for "smoke test" metrics. Things you expect to have frequent false positives, but are sometimes early indicators of larger problems or indicators of where to look deeper if something else goes sideways. They're not things that should wake you up in the middle of the night, but they're a damn valuable tool to quickly figure out what's actually wrong when a "real" alert triggers.

Many of these lend themselves well to dashboards instead of alerts, but not everything is "dashboardable". Sometimes it's good to have a set of low-priority alerts that are treated differently than others.

E.g. "we're not receiving any data/requests". Sometimes that's just a lull in activity. Maybe a holiday. Sometimes it's because everything _else_ is broken and nothing is getting in (e.g. DNS issues).

With that said, I do think that classification should be made manually and not automatically.

tryauuum
7 replies
23h21m

Every time I see notifications in Slack / Telegram it makes me depressed. Text messengers were not designed for this. If you get the "something is wrong" alert, it becomes part of the history; it won't re-alert you if the problem is still present. And if you have more than one type of alert, it will be lost in the history.

I guess alerts to messengers are OK as long as it's only a couple of manually created ones, and there should be a graphical dashboard to learn about the rest of the problems.

stackskipton
1 replies
22h25m

Why? We send alerts to Slack and Pagerduty. Slack is to help everyone who might be working, PagerDuty alerts the persons who are actually in charge of working on it.

Aeolun
0 replies
19h45m

Yeah, I think it’s convenient. We use email, but for the same thing. If I inadvertently break something, I’ll have an email in my inbox 5 minutes later.

aray07
1 replies
23h16m

Yeah, I agree that Slack is not the best medium for alerts. I think the reason it has somewhat become the default in teams is that it makes it easy to collaborate while debugging. I don't know a good way to substitute for that and still share information.

What strategies have you seen work well?

tryauuum
0 replies
5h7m

I might have been lucky, most of my companies were big enough to have a dedicated person to watch the dashboard 24/7. And a human will mostly know when it's a good idea to escalate and wake up the rest of the team

I have no idea what's the best setup for small companies

spike021
0 replies
19h19m

I don't think Slack (or similar) should be a primary alert mechanism.

But, if alarms are configured in a clean way, ideally your team is getting some warnings and such there and then if there's an alert that needs to actually page, it sends that to PagerDuty or whatever platform you use along with another message to Slack.

northrup
0 replies
22h17m

THIS. Whispering into a slack channel off hours isn’t a way to get on-call support help nor is dropping alerts in one. If it’s a critical issue I’m going to need a page of some kind. Either from something like PagerDuty or directly wired up SMS messaging.

al_borland
0 replies
19h11m

I would expect anything notifying via Slack or text would have an accompanying incident ticket in the system of record.

We had a rule in my team (before a management change that blew it all to shit) that we don’t use email or messaging for monitoring. Everything goes into the SOR. Once it’s in the SOR, if people want emails, texts, or whatever to let them know there is work to do, that’s up to the team. Others would make dashboards… lots of options once it’s in the system, and nothing gets lost.

For example, I went from a team that looked at tickets all day to one that mainly worked on user stories in Jira. Because no one was looking at the incidents in the SOR, things were getting missed. I wrote something to check for incident tickets assigned to our team every hour, and it would post them in our team chat so people knew there was work to do. Then once per day, it would post everything still unassigned, so if something was lost on that hourly post, it would annoy everyone once per day until it was assigned/resolved. It worked out decently well. If there was a lot of stuff, it would post a message to have someone actually login to the SOR and look at all our tickets. I would sometimes use the standup to assign stuff out and get some attention on it, if things were getting bad.
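For anyone who wants to do the same, the shape of that script is small. A minimal sketch, assuming your SOR and chat tool both expose simple HTTP APIs (the endpoints and field names below are placeholders, not any real product's API):

```python
import os
import requests

# Placeholder endpoints and field names; swap in your SOR's API and your chat webhook.
SOR_URL = "https://sor.example.com/api/incidents?team=my-team&state=unassigned"
SOR_TOKEN = os.environ.get("SOR_TOKEN", "")
CHAT_WEBHOOK = "https://chat.example.com/hooks/abc123"

def post_unassigned_incidents():
    resp = requests.get(SOR_URL, headers={"Authorization": f"Bearer {SOR_TOKEN}"}, timeout=10)
    resp.raise_for_status()
    incidents = resp.json()  # assumed shape: a list of {"id": ..., "summary": ...}
    if not incidents:
        return
    lines = [f"{len(incidents)} unassigned incident(s) for the team:"]
    lines += [f"- {i['id']}: {i['summary']}" for i in incidents]
    # One message per run instead of one per ticket, so it nags without flooding the channel.
    requests.post(CHAT_WEBHOOK, json={"text": "\n".join(lines)}, timeout=10)

if __name__ == "__main__":
    post_unassigned_incidents()  # scheduled hourly via cron or similar
```

The once-a-day "everything still unassigned" nag is the same query with a different filter and schedule.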

RadiozRadioz
7 replies
22h19m

Slack-native since that has become the de-facto tool for on-call engineers.

In your particular organization. Slack is one of many instant messaging platforms. Tightly coupling your tool to Slack instead of making it platform agnostic immediately restricts where it can be used.

Other comment threads are already discussing the broader issues with using IM for this job, so I won't go into it here.

Regardless, well done for making something.

FooBarWidget
3 replies
22h10m

Try Netherlands. We're Microsoft land over here. Pretty much everyone is on Azure and Teams. It's mostly startups and hip small companies that use Slack.

satyamkapoor
1 replies
21h51m

Startups, hip small companies, tech product-based companies. Most non-tech-product-based companies or enterprise banks in NL are on Teams.

Aeolun
0 replies
19h35m

I really feel like the world would be a better place if it was illegal to bundle Teams like this…

darkstar_16
0 replies
13h41m

Denmark is the same. Only the smaller startups use Slack. Everyone else is on Teams.

october8140
1 replies
15h41m

As Slack is not end-to-end encrypted, we (and I imagine many other companies) cannot use it.

progbits
0 replies
2h23m

Slack is also extremely unreliable with notification delivery.

aray07
0 replies
22h17m

Thanks for the feedback. We wanted to get something out quickly and we had experience working with Slack, so it made sense for us to start there.

However, the design is pretty flexible and we don't want to tie ourselves to a single platform either.

Flop7331
5 replies
18h16m

Is this for missile defense systems or something? What's possibly so important that you need to be woken up for it?

kgeist
3 replies
8h35m

The system goes down or degrades in some other way at night, and important customers in a different timezone get angry, threatening to leave? (Happened to us a few times.)

But I wouldn't use an LLM for it, due to hallucinations.

Flop7331
2 replies
6h4m

What's so important that your customers in other timezones feel like waking you up? If they're ready to walk that fast, you can't trust them not to ditch you for an alternative as soon as they find one.

kgeist
1 replies
5h26m

Well, I guess it depends on the business. I forgot to mention that we're B2B. For example, suppose a large food chain or a major bank has an important exam scheduled for their employees on a specific day. If our platform has a blocking bug, no one can proceed (some may be sitting in the class) because the developers are too busy sleeping. Some of our clients are also airplane pilot certification authorities, which have stricter requirements. When there's an alert, you never know if it affects small clients or large clients too.

You don't have to be a missile defense system to require a stable system where devs respond quickly...

Flop7331
0 replies
5h7m

But those are the kinds of scenarios where I imagine the sun still came up if comparable disruptions occurred prior to our current era of constant connectivity. We're too invested in the myth that our special problem can't wait half a day.

fragmede
0 replies
15h35m

On the Internet, with the war in Ukraine, that's entirely possible!

sanj001
4 replies
20h33m

Using LLMs to classify noisy alerts is a really clever approach to tackling alert fatigue! Are you fine tuning your own model to differentiate between actionable and noisy alerts?

I'm also working on an open source incident management platform called Incidental (https://github.com/incidentalhq/incidental), slightly orthogonal to what you're doing, and it's great to see others addressing these on-call challenges.

Our tech stacks are quite similar too - I'm also using Python 3, FastAPI!

jobtemp
0 replies
20h21m

Why not use statistics? I've been reading about XmR charts recently on Commoncog. That might help, for example.
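For example, a rough sketch of XmR (individuals and moving range) limits applied to daily alert counts; anything outside the natural process limits is a signal worth investigating, the rest is routine variation (the counts below are made up):

```python
def xmr_limits(values):
    """Compute XmR (individuals & moving range) natural process limits."""
    x_bar = sum(values) / len(values)
    moving_ranges = [abs(a - b) for a, b in zip(values[1:], values[:-1])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    upper = x_bar + 2.66 * mr_bar           # standard XmR constant
    lower = max(0.0, x_bar - 2.66 * mr_bar)  # alert counts can't go below zero
    return lower, upper

# Daily counts for one alert rule (made-up numbers).
daily_counts = [4, 6, 5, 7, 5, 6, 4, 5, 19, 6, 5, 7]
lower, upper = xmr_limits(daily_counts)
signals = [(day, c) for day, c in enumerate(daily_counts) if c > upper or c < lower]
print(f"limits: {lower:.1f}..{upper:.1f}, signals: {signals}")
```

No model to babysit, and the chart itself is easy to show to management.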

david1542
0 replies
4h11m

I'm curious about incidental :) how are you going to compete with other, well established IM tools like rootly, incident.io, firehydrant.com?

aray07
0 replies
20h7m

Thanks for the feedback! I saw the incidental launch on HN and have been following your journey!

Jolter
0 replies
10h53m

I wouldn’t say it’s particularly clever. It’s a fairly obvious idea to anyone who has worked with alerting through IMs. What it is, is /difficult/, because you really really need to avoid false positives. Probably lots of hard work involved. So kudos for making this work (if it works)!

ravedave5
4 replies
2h36m

The goal for oncall should be to NEVER get called. If someone gets called when they are oncall their #1 task the next day is to make sure that call never happens again. That means either fixing a false alarm or tracking down the root cause of the call. Eventually you get to a state where being called is by far the exception instead of the norm.

henryfjordan
2 replies
2h24m

Depending on the stakes this is a pretty dangerous attitude. The goal for oncall is to keep the website working, and if you're tuning for "never get paged" then you'll necessarily miss an incident eventually.

cdchn
1 replies
1h51m

If you make your goal as high availability as possible, and you only get paged on outages, then your goal should be to never get paged.

You should be building resilient architectures, not being on firewatch duty.

henryfjordan
0 replies
0m

This is a classic developer vs business incentives misalignment.

Developers don't want to ever be paged because they don't want to be bothered, but the business might be perfectly happy to pay you to be on firewatch duty.

Consider a “low traffic” alert: how can you tell the difference between a slow period at 3am on a holiday and a true outage? You can't without someone getting up and testing whether the site is still up. (Maybe you can automate that check, but there are always edge cases you can't automate.)

OP seemed to suggest it's better to disable the alarm than to just suffer the false alarm every now and then. I doubt very much that the people paying you for the on-call service would agree though.

fnimick
0 replies
2h25m

I wish everyone shared your philosophy! I once worked at a company where it was expected to get 10+ pages per day, and worse, a configuration error by a customer success team would trigger an engineering page because the error handling didn't distinguish between a config problem and an actual system issue. It was insane.

Jolter
3 replies
10h45m

Telecoms solved this problem fifteen years ago when they started automating Fault Management (google it).

Granted, neural networks were not generally applicable to this problem at the time, but this whole idea seems like the same problem being solved again.

Telecoms and IT used to supervise their networks using Alarms, in either a Network Management System (NMS) or something more ad-hoc like Nagios. There, you got structured alarms over a network, like SNMP traps, that got stored as records in a database. It’s fairly easy to program filters using simple counting or more complex heuristics against a database.

Now, for some reason, alerting has shifted to Slack. Naturally since the data is now unstructured text, the solution involves an LLM! You build complexity into the filtering solution because you have an alarm infrastructure that’s too simple.

samcat116
1 replies
2h25m

The alerts being sent to Slack are normally from one of those alert databases (such as Prometheus and AlertManager). Slack isn't the source of truth for them, just a notification channel.

Jolter
0 replies
2h6m

Oh, Prometheus is good for metrics but it doesn’t hold alarms in the Fault Management sense, though. It only keeps the metrics and thresholds, checks for threshold violations, and then alerts via some mechanism.

If it were an alarm database, an operator would be able to 1. Acknowledge the alarm 2. Manually clear an alarm that was issued in error.

Without those mechanisms, alarm handling becomes really difficult for an ops team, because now all you have is either a string of emails or a chat log.
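To make the contrast concrete, the fault-management model is a small state machine per alarm rather than a stream of notifications. A minimal sketch (not any particular NMS's data model):

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional

class AlarmState(Enum):
    ACTIVE = "active"
    ACKNOWLEDGED = "acknowledged"
    CLEARED = "cleared"

@dataclass
class Alarm:
    """A fault-management style alarm record, not a fire-and-forget notification."""
    source: str                     # managed object that raised it, e.g. a host or interface
    condition: str                  # e.g. "hostUnreachable"
    severity: str
    raised_at: datetime = field(default_factory=datetime.now)
    state: AlarmState = AlarmState.ACTIVE
    acked_by: Optional[str] = None
    clear_reason: Optional[str] = None

    def acknowledge(self, operator: str) -> None:
        # An operator takes ownership: the alarm stays in the active list but stops re-notifying.
        self.state = AlarmState.ACKNOWLEDGED
        self.acked_by = operator

    def clear(self, reason: str) -> None:
        # Either the fault condition ended, or an operator clears an alarm raised in error.
        self.state = AlarmState.CLEARED
        self.clear_reason = reason

alarm = Alarm(source="db-replica-3", condition="hostUnreachable", severity="major")
alarm.acknowledge("oncall-alice")
alarm.clear("raised in error: monitoring agent restarted")
```

Chat messages and emails have none of that state, which is why filtering them turns into an AI problem.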

drivers99
0 replies
1h27m

The Wikipedia page for fault management has a "see also" for alarm management, which looks extremely relevant as well.

throw156754228
2 replies
12h21m

I don't want to be relying on another flaky LLM for anything mission critical like this.

Just fix the original problem, don't layer an LLM into it.

aray07
0 replies
6h30m

I agree - fixing the original problem is the main motivation.

We wanted to provide that awareness because a lot of teams aren't fully aware of how bad the problem might be (on-calls change weekly, there might be a bunch of other issues).

amelius
0 replies
5h34m

A central service might be in a better position to classify messages compared to lots of individual agents.

snihalani
2 replies
19h35m

can you build a cheaper datadog instead?

mikeshi42
0 replies
1h38m

we've leveraged Clickhouse/S3 to build a cost effective alternative to Datadog at https://hyperdx.io (OSS, so you can self-host as well if you'd like)

david1542
0 replies
4h14m

You have plenty of options. Some of them are open source: https://signoz.io/ https://coroot.com/

Did you search for tools that are cheaper than DD?

maximinus_thrax
2 replies
22h12m

Nice work, I always appreciate the contribution to the OSS ecosystem.

That said, I like that you're saying it out loud with this. Slack and other similar comm tooling has always been advertised as a productivity booster due to its 'async' nature. Nobody actually believes this anymore, and coupling it with on-call notifications really closes the lid on that thing.

aray07
1 replies
22h7m

Yeah, unfortunately, I don't think these messaging tools are async. During oncall, I used to pretty much live on Slack. Incidents were on slack, customer tickets on slack, debugging on slack...

maximinus_thrax
0 replies
21h51m

That is correct, they are not. My former workplace had Pagerduty integrated with Slack, so I get it...

theodpHN
1 replies
21h51m

What you've come up with looks helpful (and may have other applications as someone else noted), but you know what also makes on-call suck less? Getting paid for it, in $ and/or generous comp time. :-)

https://betterstack.com/community/guides/incident-management...

Also helpful is having management that is responsive to bad on-call situations and recognizes when capable, full-time around-the-clock staffing is really needed. It seems too few well-paid tech VPs understand what a 7-Eleven management trainee does, i.e., you shouldn't rely on 1st shift workers to handle all the problems that pop up on 2nd and 3rd shift!

Aeolun
0 replies
19h33m

I guess 7-Eleven management trainees know that their company is just as replaceable to their employees as their employees are to them.

racka
1 replies
22h28m

Really cool!

Anyone know of a similar alert UI for data/business alarms (e.g. installs dropping WoW, crashes spiking DoD, etc.)?

Something that feeds off Snowflake/BigQuery, but with a similarly nice UI so that you can quickly see false positives and silence them.

The tools I’ve used so far (mostly in-house built) have all ended in a spammy slack channel that no one ever checks anymore.

protocolture
1 replies
19h34m

I feel like this would be a great tool for people who have had a much better experience of On Call than I have had.

I once worked for a string of businesses that would just send everything to on call unless engineers threatened to quit. Promised automated late night customer sign ups? Haven't actually invested in the website so that it can do that? Just make the on call engineer do it. Too lazy to hire off shore L1 technical support? Just send residential internet support calls to the On Call engineer! Sell a service that doesn't work in the rain? Just send the on call guy to site every time it rains so he can reconfirm yes, the service sucks. Basic usability questions that could have been resolved during business hours? Does your contract say 24/7 support? Damn, guess that's going to On Call.

Shit even in contracting gigs where I have agreed to be "On Call" for severity 1 emergencies, small business owners will send you things like service turn ups or slow speed issues.

nprateem
0 replies
2h38m

That's why it's always at least double time for call outs

jedberg
1 replies
2h18m

People do not understand the value of classifying alerts as useful after the fact.

At Netflix we built a feature into our alert systems that added a simple button at the top of every alert that said, "Was this alert useful?". Then we would send the alert owners reports about what percent of people found their alert useful.

It really let us narrow in on which alerts were most useful so that others could subscribe to them, and which were noise, so they could be tuned or shut off.
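The reporting side is almost trivial once the votes exist. Roughly (a sketch, not the actual implementation):

```python
from collections import defaultdict

# One entry per button click: (alert_name, was_useful).
votes = [
    ("disk_almost_full", False), ("disk_almost_full", False), ("disk_almost_full", True),
    ("api_latency_high", True),  ("api_latency_high", True),
]

def usefulness_report(votes):
    tally = defaultdict(lambda: [0, 0])  # alert -> [useful votes, total votes]
    for alert, useful in votes:
        tally[alert][1] += 1
        if useful:
            tally[alert][0] += 1
    # Least useful alerts first, so owners see what to tune or shut off.
    for alert, (useful, total) in sorted(tally.items(), key=lambda kv: kv[1][0] / kv[1][1]):
        print(f"{alert}: {useful}/{total} responders found it useful ({useful / total:.0%})")

usefulness_report(votes)
```

The hard part was never the aggregation, it was getting the button in front of people at the moment they handled the alert.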

That one button alone made a huge difference in people's happiness with being on call.

tracker1
0 replies
58m

I worked on a small team that covered a relatively big site where there were so many alerts it was simply hard to track... They were all sent over email to a group list and most would just delete.

I spent about 3 months, each day trying to triage into buckets based on activity and dealing with whatever was causing the most alerts each day. Some came down to just stamping out classes of 4xx errors that should never have been in the email/alert system to begin with. Others came down to indexes to reduce load/locking/contention on some db tables. Others still were much harder to dig into.

Will say, at the end of the 3 months there was only a trickle of emails a day, and the notifications were taken much more seriously once they were no longer so overwhelming as to be ignored altogether.

edit: This was just the first thing I did each day was deal with one problem, then moving to new feature work... It wasn't assigned, as the company would always prioritize new feature work, it was just something I did for my own sanity.

deepfriedbits
1 replies
20h41m

Nice job and congratulations on building this! It looks like your copy is missing a word in the first paragraph:

Opslane is a tool that helps (make) the on-call experience less stressful.

aray07
0 replies
20h33m

derp, thanks for catching. It has been fixed!

Arch-TK
1 replies
9h54m

We could stop normalising "on-call" instead.

sir_eliah
0 replies
3h41m

Could you please elaborate?

voidUpdate
0 replies
11h33m

Filtering whether a notification is important or not through an LLM, when getting it wrong could cause big issues, is mildly concerning to me...

topaztee
0 replies
8h28m

Co-founder of merlinn here: https://merlinn.co | https://github.com/merlinn-co/merlinn We're also building a tool in the same space, with the option of choosing your own model (private LLMs), and we're open source with a multitude of integrations.

Good to see more options in this space, especially open source. I think de-noising is a good feature, given that alert fatigue is one of the recurring complaints of on-callers.

throwaway984393
0 replies
12h16m

Don't send an alert at all unless it is actionable. Yes, I get it, you want alerts for everything. Do you have a runbook that can explain to a complete novice what is going on and how to fix the problem? No? Then don't alert on it.

The only way to make on-call less stressful is to do the boring work of preparing for incidents, and the boring work of cleaning up after incidents. No magic software will do it for you.

nprateem
0 replies
2h41m

Almost all alerting issues can be fixed by putting managers on call too (who then have to attend the fix too).

It suddenly becomes a much higher priority to get alerting in order.

mads_quist
0 replies
10h58m

Founder of All Quiet here: https://allquiet.app.

We're building a tool in the same space but opted out of using LLMs. We've received a lot of positive feedback from our users who explicitly didn't want critical alerts to be dependent on a possibly opaque LLM. While I understand that some teams might choose to go this route, I agree with some commentators here that AI can help with symptoms but doesn't address the root cause, which is often poor observability and processes.

lmeyerov
0 replies
22h53m

Big fan of this direction. The architecture resonates! The baselining is interesting; I'm curious how you think about that, esp. for bootstrapping initially + ongoing.

We are working on a variant being used more by investigative teams than IT ops - so think IR, fraud, misinfo, etc - which has similarities but also domain differences. If of interest to someone with an operational infosec background (hunt, IR, secops) , and esp US-based, the Louie.AI team is hiring an SE + principal here.

c0mbonat0r
0 replies
3h6m

If this is an open-source project, how are you planning to make it a sustainable business? Also, why the choice of Apache 2.0?

Terretta
0 replies
2h43m

Note that according to Stack Overflow's dev survey, more devs use Teams than Slack; over 50% were in Teams. (The stat was called popularity but really should have been prevalence, since a related stat showed devs hated Teams even more than they hated Slack.) Teams has APIs too, and with Microsoft Graph working you can do a lot more than just Teams for them.

More importantly, and not mentioned by StackOverflow, those devs are among the 85% of businesses using M365, meaning they have "Sign in with Microsoft" and are on teams that will pay. The rest have Google and/or Github.

This means despite being a high value hacking target (accounts and passwords of people who operate infrastructure, like the person owned from Snowflake last quarter) you don't have to store passwords therefore can't end up on Have I Been Pwned.

T1tt
0 replies
9h20m

is this only on the frontpage because this is an HN company?

T1tt
0 replies
9h24m

how can you prove it works and doesn't hallucinate? do you have any actual users that have installed it and found it useful?

LunarFrost88
0 replies
1d22h

Really cool!

EGreg
0 replies
20h10m

One of the “no-bullshit” positions I have arrived at over the years is that “real-time is a gimmick”.

You don’t need that Times Square ad, only 8-10 people will look up. If you just want the footage of your conspicuous consumption, you can easily photoshop it for decades already.

Similarly, chat causes anxiety and lack of productivity. Threaded forums like HN are better. Having a system to prevent problems and the rare emergency is better than having everyone glued to their phones 24/7. And frankly, threads keep information better localized AND give people a chance to THINK about the response and iterate before posting in a hurry. When producers of content take their time, this creates efficiencies for EVERY INTERACTION WITH that content later, and effects downstream. (eg my caps lock gaffe above, I wont go back and fix it, will jjst keesp typing 111!1!!!)

Anyway people, so now we come to today’s culture. Growing up I had people call and wish happy birthday. Then they posted it on FB. Then FB automated the wishes so you just press a button. Then people automated the thanks by pressing likes. And you can probably make a bot to automate that. What once was a thoughtful gesture has become commoditized with bots talking to bots.

Similar things occurred with resumes and job applications etc.

So I say, you want to know my feedback? Add an AI agent that replies back with basic assurances and questions to whoever “summoned you”, have the AI fill out a form, and send you that. The equivalent of front-line call center workers asking “Have you tried turning it on and off again” and “I understand it doesn’t work, but how can we replicate it.”

That repetitive stuff should be done by AI, building up an FAQ Knowledge Base for bozos, and it should only bother you if it comes across a novel problem it hasn’t solved yet, like an emergency because, say, there’s a Windows BSOD spreading and systems don’t boot up. Make the AI do triage and tell the difference.

7bit
0 replies
9m

* Alert volume: The number of alerts kept increasing over time. It was hard to maintain existing alerts. This would lead to a lot of noisy and unactionable alerts. I have lost count of the number of times I got woken up by alert that auto-resolved 5 minutes later.

I don't understand this. Either the issue is important and requires immediate human action -- or the issue can potentially resolve itself and should only ever send an alert if it doesn't after a set grace period.

The way you're trying to resolve this (with increasing alert volumes) is the worst approach to both of the above, and improves nothing.
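The grace-period version is cheap to build if your alerting stack doesn't already support it (Prometheus alerting rules express the same idea declaratively with a "for" duration). A rough sketch:

```python
import time

def watch(check_condition, page_oncall, grace_period=300, poll_interval=30):
    """Page a human only if the bad condition persists for the whole grace period.

    check_condition: callable returning True while something is wrong.
    page_oncall: callable that actually wakes someone up.
    """
    bad_since = None
    paged = False
    while True:
        if check_condition():
            if bad_since is None:
                bad_since = time.monotonic()
            if not paged and time.monotonic() - bad_since >= grace_period:
                page_oncall("condition has persisted past the grace period")
                paged = True   # don't re-page every poll; reset only after recovery
        else:
            bad_since = None
            paged = False      # it resolved itself inside the grace period: nobody was woken up
        time.sleep(poll_interval)
```

Anything that routinely clears itself inside the grace period never needed a human in the first place.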