I was one of the users who went and reported this issue on Discord. I love Kagi, but I was a bit disappointed to see that their status page showed everything was up and running. That made me a bit uneasy, and it suggests the status page is not given priority during incidents that are affecting real users. I hope that in the future the status page is updated accurately.
In the past, services I heavily rely on (e.g. GitHub) have updated their status pages immediately, which lets me rest assured that people are aware of the issue and that it's not a problem with my devices. When this happened with Kagi, I was looking up which nearby grocery stores were open, since we were getting snow later that day, so it felt like a letdown because I had to go to Google for it.
I will continue using Kagi because 99.9% of the time it has been better than Google, but I hope the authors of the post-mortem mean it when they say they'll be moving their status page to a different service/platform.
And thanks again Zac for being transparent and writing this up. This is part of good engineering!
As an engineer on call, I have been in this conversation so many times:
"Hey, should we go red?" "I don't know, are we sure it's an outage, or just a metrics issue?" "How many users are affected again?" "I can check, but I'm trying to read stack traces right now." "Look, can we just report the issue?" "Not sure which services to list in the outage"
...and so on. Basically, putting anything up on the status page is a conversation, and the conversation consumes engineer time and attention, and that's more time before the incident is resolved. You have to balance communication and actually fixing the damn thing, and it's not always clear what the right balance is.
If you have enough people, you can have a Technical Incident Manager handle the comms and you can throw additional engineers at the communications side of it, but that's not always possible. (Some systems are niche, underdocumented, underinstrumented, etc.)
My personal preference? Throw up a big vague "we're investigating a possible problem" at the first sign of trouble, and then fill in details (or retract it) at leisure. But none of the companies I've worked at like that idea, so... [shrug]
Connect your status page to actual metrics and decide a threshold for downtime. Boom, you're done.
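For illustration only, a rough sketch of what that could look like, assuming a Prometheus-style metrics API and a status page with a simple HTTP endpoint (all URLs, queries, and thresholds below are made-up placeholders, not any particular vendor's real API):

    import requests

    # Placeholder endpoints -- substitute your real metrics and status-page APIs.
    PROMETHEUS_URL = "https://prometheus.internal/api/v1/query"
    STATUS_PAGE_URL = "https://status.example.com/api/components/search"
    ERROR_RATE_THRESHOLD = 0.05  # "degraded" above 5% of requests returning 5xx

    def current_error_rate() -> float:
        """Fraction of requests returning 5xx over the last 5 minutes."""
        query = (
            'sum(rate(http_requests_total{status=~"5.."}[5m]))'
            ' / sum(rate(http_requests_total[5m]))'
        )
        resp = requests.get(PROMETHEUS_URL, params={"query": query}, timeout=10)
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        return float(result[0]["value"][1]) if result else 0.0

    def update_status() -> None:
        status = "degraded" if current_error_rate() > ERROR_RATE_THRESHOLD else "operational"
        # Hypothetical status-page call; real providers' APIs differ.
        requests.put(STATUS_PAGE_URL, json={"status": status}, timeout=10)

    if __name__ == "__main__":
        update_status()  # run this from cron or whatever scheduler you already have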
Does anyone serious do this?
That’s an honest question, from a pretty experienced SRE.
In a world of unicorns and rainbows, absolutely. In the real world, it's as you probably already know: it's not that easy in a complex enough system.
Quick counter-example for GP: what if the 500 spike is due to a spike in malformed requests from a single (maybe malicious) user?
A malformed request should not lead to a 500; it should be handled and validated.
Well, in the real world it might. It should trigger a bug report and a code fix, but not an incident. Now, all of a sudden, to decide this you need more complex and/or more specific queries in your monitoring system (or a good ML-based alerting system), so complexity is already going up.
You need to validate your inputs and return 4xx
Yeah, and you also shall not write bugs in your code. The real world has bugs, even trivial ones.
If your service is returning 5xx, that is the definition of a server error; of course that is degraded service. Instead we have pointless dashboards that are still green an hour after everything broke.
Returning 4xx on a client error isn't hard and is usually handled largely by your framework of choice.
Your argument is a strawman
That's... super not true. Malformed requests with gibberish (or, more likely, hacker/pentest-generated) headers will easily cause e.g. Django to return 5xx.
That's just the example I'm familiar with, but cursory searching indicates reports of similar failures emitted by core framework or standard middleware code for Rails, Next.js, and Spring.
If input validation is not present in your framework of choice then the framework clearly has problems.
If you do not validate your inputs properly, I am not sure what you are doing when you have a user-facing application of this size. Validating inputs is the lowest-hanging fruit for preventing hacking threats.
It's usually handled by the framework, though you may have to write some code. I'd expect my SaaS provider to write code so that I know whether their service is available or not.
Query input validation is nearly a solved problem. If you don't validate and 500s are returned as a result, I would argue that is an incident.
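For what it's worth, a minimal sketch of the kind of validation being argued for here, using Flask purely as an example (the route, the length limit, and the run_search backend are all invented for illustration):

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    MAX_QUERY_LENGTH = 512  # arbitrary illustrative limit

    def run_search(query: str) -> list[str]:
        # Stand-in for the real search backend.
        return [f"result for {query!r}"]

    @app.route("/search")
    def search():
        query = request.args.get("q", "")
        # Reject obviously malformed input with a 4xx instead of letting it
        # blow up deeper in the stack as a 5xx.
        if not query or len(query) > MAX_QUERY_LENGTH:
            return jsonify(error="invalid query parameter 'q'"), 400
        try:
            results = run_search(query)
        except Exception:
            # A genuine server-side failure: this one really is degraded service.
            return jsonify(error="internal error"), 500
        return jsonify(results=results)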
True, however it also doesn’t impact other users and doesn’t justify reporting an incident on the status page.
https://www.buildkitestatus.com/
Stage 1: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.
Problems: Delayed or missed updates. Customers complain that you're not being honest about outages.
Stage 2: Status is automatically set based on the outcome of some monitoring check or functional test.
Problems: Any issue with the system that performs the "up or not?" source of truth test can result in a status change regardless of whether an actual problem exists. "Override automatic status updates" becomes one of the first steps performed during incident response, turning this into "status is manually set, but with extra steps". Customers complain that you're not being honest about outages and latency still sucks.
Stage 3: Status is automatically set based on a consensus of results from tests run from multiple points scattered across the public internet.
Problems: You now have a network of remote nodes to maintain yourself or pay someone else to maintain. The more reliable you want this monitoring to be, the more you need to spend. The cost justification discussions in an enterprise get harder as that cost rises. Meanwhile, many customers continue to say you're not being honest because they can't tell the difference between a local issue and an actual outage. Some customers might notice better alignment between the status page and their experience, but they're content, so they have little motivation to reach out and thank you for the honesty.
Eventually, the monitoring service gets axed because we can just manually update the status page after all.
Stage 4: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.
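To make stage 3 above a bit more concrete, here's a toy sketch of quorum-based probing; the probe URLs and quorum size are invented, and a real setup would also have to monitor the probes themselves:

    import concurrent.futures
    import requests

    # Hypothetical vantage points; in stage 3 these are remote nodes you maintain.
    PROBE_TARGETS = [
        "https://probe-us-east.example.com/check",
        "https://probe-eu-west.example.com/check",
        "https://probe-ap-south.example.com/check",
    ]
    QUORUM = 2  # how many probes must report failure before the page goes red

    def probe_reports_failure(url: str) -> bool:
        """Ask one vantage point whether the service looks down from there."""
        try:
            resp = requests.get(url, timeout=5)
            return resp.status_code != 200
        except requests.RequestException:
            # Couldn't reach the probe itself. Counting this as "not a failure"
            # is exactly the kind of judgement call the stage 3 problems describe.
            return False

    def service_is_down() -> bool:
        with concurrent.futures.ThreadPoolExecutor() as pool:
            failures = sum(pool.map(probe_reports_failure, PROBE_TARGETS))
        return failures >= QUORUM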
And that is before 'going red' has ties to performance metrics with SLA impacts ...
Which then means not going yellow or red technically constitutes fraud.
Not necessarily. The situation can be genuinely unclear to the point where it is a judgement call, and then it becomes a matter of how to weigh the consequences.
If you're asking how many users are affected, and your service is listed as green...
What if the answer is 0.00001%?
Still seems like a yellow to me.
I think your bit at the end is the most important.
ANY communication is better than no communication. "Everything is fine, it must be you" is the worst feeling in these cases, especially if your business is reliant on said service and you can't figure out why you are borked (e.g. the GitHub ones).
Your point highlights the need to think about what's actually being designed here.
"Everything is fine" is different from "nothing has been reported." A green light is misleading; there should be no green at all, because green really means unknown. There should just be nothing, with a note saying there's nothing, and that's not the same as a green light.
Once an ISP support person insisted that I drive down to the shop and buy a phone handset so I could confirm presence of a dial tone on a line that my vdsl modem had line sync on before they’d tell me their upstream provider had an outage. I was… unimpressed.
Better for the consumer, although not necessarily better for the provider if they have an SLA.
This is exactly why those status pages are almost always a lie. Either they need to be fully automated without some middle manager hemming and hawing, or they shouldn’t be there at all. From a customer’s perspective, I’ve been burned so many times on those status pages that I ignore them completely. I just assume they’re a lie. So I’ll contact support straight away - the very thing these status pages were intended to mitigate.
Meh - no status page is perfectly in sync with reality, even if it's updated automatically. There's always lag, and IRL there are often partial outages.
Therefore, one should treat status pages conservatively as "definitely an outage" rather than "maybe an outage."
The simple fix is to have a “last update: date time”
Or you can build a team to automate everything and force everyone and everything into a rigid update frequency, which becomes a metric that applies to everyone and the bane of your whole engineering organization's existence.
IMHO, any significant growth in 500s (that's what I was getting during the outage) warrants a mention on the status page. I've seen a lot of stuff, so if I see an acknowledged outage, I'll just wait for people to do their jobs. Stuff happens. If I see an unacknowledged one, I get worried that the people who need to know don't, and that undermines my confidence in the whole setup. I'd never complain if the status page says there may be a problem but I don't see one. I will complain in the opposite case.
I'm only replying to the praise here - I too, although I haven't fully switched, had a very enticing moment with Kagi when it returned a result that couldn't be found on any page of Google's results. That really sold me on Kagi, and I've been going back and forth with some queries, but I have to say that between LLMs, Perplexity, and Google often answering my queries right on the search page, I just don't have that many queries left for Kagi.
If Kagi would somehow merge with Perplexity, now that would be something.
Kagi does offer AI features in their higher subscription tier, including summarization, research assistance, and a couple of others. Plus, I think they have basically a frontend for GPT-4 that uses their search engine for browsing, and they just added vision support to it today.
I don't subscribe to those features or any AI tool yet; I'm just pointing out there could be a version of Kagi that is able to replace your ChatGPT sub and save you money.
Is it as good as Perplexity, though? I use ChatGPT for different purposes; I just thought that if Kagi allied with Perplexity and benefited from its index (I'm not sure what Perplexity uses), it could get really good. I've only recently tried Perplexity and I get more use out of it than I would with Kagi; it doesn't just do summarization, but I haven't seen what Kagi does with research assistance.
It's been a while since I've used Perplexity, but I've been finding the Kagi Assistant super useful. I'm on the ultimate plan, so I get access to the `Expert` assistant. It's been pretty great.
https://help.kagi.com/kagi/ai/assistant.html
It's worth noting that the status page software they use doesn't update automatically.
https://github.com/cstate/cstate
I guess a status page that doesn't auto-update is good for PR, but it's not very useful to show... you know... the status.
Yeah I thought that was weird. An auto-updating page is worth the constant pings to the infra IMHO.
Microsoft is notorious for their lax status page updates...
Is there anyone who isn't?
I envy your experiences with other services. I've never seen any service's status page show downtime when or even soon after I start experiencing it. Often they simply never show it at all.
Also, there have been other times in the past when GitHub has not updated its status page immediately.