I've never seen an outage this big. Even the homepage doesn't load. We've had recurrent issues with Actions not running, but this seems a lot bigger.
The status page says all is well, though: https://www.githubstatus.com/. Hilarious.
A reminder of how centralized and dependent the whole industry has become on GH, which is ironic, considering that git itself is designed to be decentralized.
Good opportunity to think about mirroring your repos somewhere else like Gitea or Gitlab.
They're already mirrored on my hard drive. That's how git works.
GitHub is more than a remote host for git repositories. It's become one of the major CDNs for software distribution. GitHub Pages hosts a majority of the static sites that developers use. You won't be able to use Cargo, Nix, Scoop, and other package managers right now because their registries have a critical dependency hosted on GitHub.
This is not to mention all the projects that rely on Github for project management, devops, community and support desk.
GitHub is also very international; I doubt even isolated netizens like those in China are shielded from this outage. I imagine very, very few software shops are unscathed by this. The whole affair is very on brand for 21st-century software, which is to say pitiful.
We installed a private GitLab instance on our own servers exactly out of fear that Github might suddenly alter the deal or just cease operations. Pretty happy with our decision so far.
Do you mean you switched to self-managed GitLab, or you have a self-managed GitLab that you keep around as a backup plan?
Actually both. Our internal closed source projects are only in our GitLab. The open-source stuff is both on GitHub and our GitLab. Since our GitLab instance isn't public we only use the issue tracker on GitHub for public stuff.
Another bonus is that we don't pay Microsoft.
- Champion a hard to use VCS which to its credit is distributed
- Make everyone dependent on all the centralized features of your software to use Git[1][2]
- Now you have a de facto centralized, hard to use VCS with thousands of SO questions like “my code won’t commit to the GitHub”
- Every time you go down a hundreds-of-comments post is posted on HN
How to get bought for a ton of cash by a tech mega corporation.
[1] Of course an exaggeration. Everyone can use it in a distributed way or mirror. The problem occurs when you’re on a team and everyone else doesn’t know how to.
[2] I’m pretty sure that even the contributors to the Git project rely on the GitHub CI since they can’t run all tests locally.
The key difference is being able to mirror communication channels. While you can continue to work fine with your local repo, the only way to share those changes is via another forge, or by sending patches through some other channel. Having another forge ready to distribute code is generally more practical.
These things could happen anywhere, though. GitLab also `rm -rf`ed a database once, remember?
The odds of all services `rm -rf`ing at the same time are pretty small, to be honest. The point is to have your work in multiple places, so that you're not reliant on a single service.
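The mirroring part is cheap to set up. A minimal sketch, assuming you want one `git push` to fan out to two hosts (the GitHub/GitLab URLs below are placeholders, not real repos):

```shell
# Sketch: one remote ("origin") with two push URLs, so a single `git push`
# updates both hosts. The remote URLs are placeholders for the example.
set -e
rm -rf /tmp/mirror-push && mkdir -p /tmp/mirror-push && cd /tmp/mirror-push
git init -q
git remote add origin git@github.com:example/project.git
git remote set-url --add --push origin git@github.com:example/project.git
git remote set-url --add --push origin git@gitlab.com:example/project.git
git remote get-url --push --all origin   # lists both push targets
```

Fetches still come from the first URL; only pushes fan out to both.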
I'm taking this opportunity to randomly shout out Gitea! I've self-hosted Gitea for 5 or 6 years, and it has been bulletproof.
It's not the whole industry, just some imprudent sections of the industry.
great point
I’m kind of surprised that gitlab doesn’t have a larger market share given that you can run it air gapped on-prem without too much fuss.
https://www.githubstatus.com/ reports no problems, but it's clearly down for a lot of people (including me).
It is kinda amazing how consistently status pages show everything fine during a total outage. It's not that hard to connect a status page to end-to-end monitoring statistics...
There is always going to be SOME delay between the outage and the status page, although 5 minutes is probably enough time that it should be updated.
after several minutes the status page is still showing all is fine.
For a service like GH, anything more than 30 secs is unacceptable
That is very unrealistic. Infrastructure monitoring at that scale won't even be collecting metrics at that interval.
And simple HTTP monitoring would be too flappy for a public status page.
What monitoring tools are you using? I know a ton that can do 30 seconds or less at scale. In fact, I'm pretty sure all the big players can do that.
It's simply too soon for the status page to report the anomaly, is my guess. It's been down for 4 minutes.
4 minutes is a long time for something that could have been an automated check.
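For illustration, the kind of automated check being discussed can be sketched in a few lines of shell. The URL, retry count, and threshold here are invented for the example, and real monitoring at GitHub's scale is far more involved; requiring repeated failures is one crude way to address the flappiness concern raised above:

```shell
# Naive end-to-end check: only report DOWN after repeated failures, so a
# single dropped request doesn't flap a public status page.
# URL, attempt count, and threshold are illustrative, not GitHub's real setup.
check() {
  fails=0
  for attempt in 1 2 3; do
    # curl prints "000" for http_code when the connection itself fails
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")
    [ "$code" = "200" ] || fails=$((fails + 1))
  done
  if [ "$fails" -ge 2 ]; then echo "$1 DOWN"; else echo "$1 OK"; fi
}
check https://www.githubstatus.com/
```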
For the record, the status page eventually got updated - around 7 minutes after this submission was created.
Once in the past I did actually have an incident where the site went down so hard that the tool that we used to update the status page didn't work. We did move it to a totally external and independent service after that. The first service we used was more flaky than our actual site was, so it kept showing the site down when it wasn't. So then we moved to another one, etc. Job security. :)
From my experience this requires a few steps to happen first:
- an incident be declared internally to github
- support / incident team submits a new status page entry (with details on service(s) impact(ed))
- incident is worked on internally
- incident fixed
- page updated
- retro posted
Even AWS now seems to have some automation for their various services per region. But it doesn't automatically show issues, because a problem could be limited to one customer, or to a subset of customers in region foo in AZ bar, on service version zed vs zed - 1. So they chose not to display issues for subsets.
I do agree it would be nice to have logins for the status page and then get detailed metrics based on customerid or userid. Someone start a company to compete with statuspage.
They say you shouldn't host status pages on the same infrastructure that it is monitoring, but in a way that makes it much more accurate and responsive in outages!
It went down literally 3 minutes ago (I was in the middle of writing a PR comment), let's see if their cron job kicks in and reports the issue.
it's starting to show now, about 10 minutes after the issue started
It's showing a few incidents now. Some things are still green though that don't seem to be working.
The timing is pretty uncanny. I just deployed a github page and had a DNS issue because I configured it wrong. I hit "check again" and github went down.
Hope I don't appear in the incident report.
Wait. You use github pages for something or actually work on it?
I use it for something.
I had a github page that was public, but it was made private and the DNS config was removed. Fast forward to today. I made the private repo public again and forced a deploy of the page without making a new commit. It said the DNS config was incomplete, so I tweaked it and hit "check again" and github went down.
Probably unrelated, but the timing was spooky.
Your domain isn't `null.example.com` or something, is it?
Sorry for the offtopicness - would you mind emailing hn@ycombinator.com so I can check in with you about a couple things regarding https://news.ycombinator.com/item?id=41221186?
So it was you who crashed GitHub?
Bad bitbasher bad! :catbonk:
Perhaps this is a repeat of the Fastly incident, where a customer's Varnish cache configuration triggered an issue in their systems (I think that's a rough summary; I don't remember the details).
So, you're both responsible and not responsible at the same time :)
Hope I don't appear in the incident report.
Appearing in an incident report with your HN username could be pretty funny...
This will all clear up when it finishes checking your DNS configuration I bet.
Fwiw, GitHub Pages is down too. The hosted Pages sites are down.
Love that HN is a better status page for dev services than most companies can manage to provide. Knew I'd find it here but on the front page within 3 minutes is impressive.
I guess when GitHub goes down it is somehow strangely tolerated, as it has been for years even after the acquisition, and it goes down more often than Twitter. When the latter encounters a speed bump, just like the 'interview' with Trump, it's global news because a Mr Elon Musk owns it.
Both seem to be doing too much all at once. But really it is worse with GitHub if this is what Microsoft stewardship amounts to: incidents every single week, month after month, guaranteed, for years.
Anyways. #hugops for the GitHub team.
What makes you say it’s “somehow strangely tolerated” when GitHub goes down?
What’s the point of bringing up twitter? It is strange to seek victimhood for a petulant billionaire. Of course, it is worse with GitHub because GitHub actually provides useful functionality.
What makes you say it’s “somehow strangely tolerated” when GitHub goes down?
The folks complaining about something at GitHub going down are the same people who stay and are willing to tolerate the regular incidents and chaos on the site.
Not only have the GitHub incidents been happening for years, it has gotten worse: there is now an incident every month.
Of course, it is worse with GitHub because GitHub actually provides useful functionality.
That isn’t an excuse for tolerating regular downtime on a site with over 100M active users, especially one running under Microsoft’s stewardship; they should know better.
Any other site with that many users and with a horrendous record of downtime like Github would be rightfully branded as unreliable. No excuses.
HN needs to publish its secrets on how it so rarely goes down!
Based on what I've read in the past, I believe the secret is simplicity. Simplicity scales.
Except that the weird HN algo just saw 187 upvotes in 15 minutes, and dropped this thread to the second page...
it knows we're all in a voting ring called Github Users
Reminds me of a repository I once found when searching for Prometheus exporters.
It did this but with Twitter: it monitors the latest tweets for a custom word combo and raises a server alert when one is found. I found it hilarious. Will post the source once GitHub is back up.
I see more and more people using GitHub less, in favor of other git solutions. I am afraid to think what to do when GitHub is down for hours (do I need to learn mailing lists?).
Another reason is that MS may be entering a phase where it asks you to pay to use GitHub even just for reads (rate limiting).
I recently looked into using Git in a decentralized way. It's actually pretty easy!
When you would usually create a PR, you use `git format-patch` to create a patch file and send that to whoever is going to merge it.
They create a branch and use `git am` to apply the patch to it, review the changes, and merge it to main.
It is nice that git supports multiple remotes, though. It feels good to know that `git push` might not work for my project right now, but I know `git push srht` will get the code off of my laptop.
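The patch flow described above can be rehearsed entirely locally. A sketch, where the repo paths, identities, and the commit itself are all invented for the demo:

```shell
# Local rehearsal of the format-patch / am flow: one repo plays the
# contributor ("author"), a clone plays the maintainer ("reviewer").
set -e
rm -rf /tmp/patch-demo && mkdir -p /tmp/patch-demo && cd /tmp/patch-demo

git init -q author && git -C author symbolic-ref HEAD refs/heads/main
git -C author -c user.name=Author -c user.email=a@example.com \
  commit -q --allow-empty -m "base"
git clone -q author reviewer    # the "maintainer" side

# Author commits a change and exports it as a mailable patch file
echo "fix" > author/fix.txt
git -C author add fix.txt
git -C author -c user.name=Author -c user.email=a@example.com \
  commit -q -m "add fix"
git -C author format-patch -1 -o /tmp/patch-demo HEAD

# Reviewer applies it on a branch, reviews, and merges to main
git -C reviewer checkout -q -b review
git -C reviewer -c user.name=Reviewer -c user.email=r@example.com \
  am -q /tmp/patch-demo/0001-add-fix.patch
git -C reviewer checkout -q main
git -C reviewer merge -q --ff-only review
```

`git am` preserves the original authorship, so the reviewer's history credits the contributor just as a merged PR would.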
I used to work at a company with very draconian policies. Whenever I needed to update some code on a public GitHub repository, I would just push to a remote that was a flash drive. Plug it in my machine at home, pull from that remote, push to origin.
I also had to set up a bidirectional mirror back when bandwidth to some countries was restricted. We would push and pull as normal, and a job would keep our mainline in sync.
It is sad that most organizations forget that git is distributed by nature. We often get requests to set up VPNs and all sorts of craziness, when a simple push to a bare mirror would suffice. You don't even need anything running, other than SSH.
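The bare-mirror setup really is that small. A sketch, with a local path standing in for the `user@host:path` SSH address so it runs anywhere:

```shell
# A bare repo is a full git remote: no daemon, no web UI, just files.
# In real use the mirror would live on a host reached over SSH,
# e.g. user@host:/srv/git/project.git; a local path stands in here.
set -e
rm -rf /tmp/bare-mirror /tmp/work
git init -q --bare /tmp/bare-mirror

git init -q /tmp/work && git -C /tmp/work symbolic-ref HEAD refs/heads/main
git -C /tmp/work -c user.name=Dev -c user.email=dev@example.com \
  commit -q --allow-empty -m "initial"

git -C /tmp/work remote add mirror /tmp/bare-mirror
git -C /tmp/work push -q mirror main
git --git-dir=/tmp/bare-mirror log --oneline main
```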
Draconian policies...but not security ones? Why were USB drives not blocked?
I recently looked into using Git in a decentralized way. It's actually pretty easy!
Well, that's how it was designed to work! The whole point of Git is that it's a distributed version control system, and doesn't need to rely on a centralized source of truth.
emailing patches is fairly easy.
The real reason not to use github anyway though is that it's terrible (the basic "github model" for doing code review was basically made up on the back of a napkin IMO)
Git without github is pretty much the same as with it. It's just PRs that are different.
Investigating - We are investigating reports of degraded availability for Actions, Pages and Pull Requests Aug 14, 2024 - 23:11 UTC
Update - Pages is experiencing degraded availability. We are continuing to investigate. Aug 14, 2024 - 23:12 UTC
Update - Copilot is experiencing degraded availability. We are continuing to investigate. Aug 14, 2024 - 23:13 UTC
Update - We are investigating reports of issues with GitHub.com and GitHub API. We will continue to keep users updated on progress towards mitigation. Aug 14, 2024 - 23:16 UTC
EDIT: The reply link is no longer available.
Update - Packages is experiencing degraded availability. We are continuing to investigate. Aug 14, 2024 - 23:18 UTC
The reply link is now available?
Update - Issues is experiencing degraded availability. We are continuing to investigate. Aug 14, 2024 - 23:19 UTC
Update - Git Operations is experiencing degraded availability. We are continuing to investigate. Aug 14, 2024 - 23:19 UTC
EDIT: The reply link is no longer available again.
The reply link is now available again?
Everything is red now. Nearly lunch time in New Zealand.
RIP to all those who host their websites on GitHub pages :(
Anybody who publishes an app on the Google Play store and hosts their privacy policy on Github pages may have their app taken down because Google's bots won't be able to verify it exists.
That happened to me a while back with an app listing that was almost 10 years old because the server I was hosting the policy on went down. Ironically, I switched it to Github pages so it wouldn't happen again.
I have my client's app policy on GitHub. I have to check if anything happened to it. The websites are working fine
First time in a while that Pages goes down along with GitHub itself; it's usually separate from the main site.
I checked, and my pages-based site was down, but it is back up now.
Could it have been brought down intentionally? Related to this?
https://www.bleepingcomputer.com/news/security/github-action...
How would customer credentials being leaked be part of an outage of this size ?
If its enough of a security issue they could have pulled the site while its fixed/cleaned
Because there are worse things than being down; if the front page got hacked and is spewing gore or CSAM or PII or creds, for example.
Seems like it was a config change that caused it. They reverted it really quickly.
given that it seems like the entire thing is busted, can anyone explain how the unicorn page is being served?
They probably have a reverse proxy in front of all their http endpoints and that is still up and able to show the unicorn if the backends aren't responsive.
The static content on the error page might also be on the Akamai or Cloudflare side.
makes sense, thanks.
the images on the page are all just base64 encoded right into the html
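That inlining trick is easy to reproduce. A sketch, where the "image" bytes are a stand-in, not GitHub's actual unicorn:

```shell
# Inline an image as a base64 data URI so the error page renders with
# zero follow-up requests. The "image" bytes here are a stand-in.
printf 'not-a-real-png' > /tmp/pixel.png
b64=$(base64 < /tmp/pixel.png | tr -d '\n')
printf '<img src="data:image/png;base64,%s">\n' "$b64"
```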
https://github.blog/news-insights/the-library/unicorn/
Unicorn has a slightly different architecture.
Instead of the nginx => haproxy => mongrel cluster setup
you end up with something like: nginx => shared socket => unicorn worker pools
When the Unicorn master starts, it loads our app into memory. As soon as it’s ready to serve requests it forks 16 workers. Those workers then select() on the socket, only serving requests they’re capable of handling. In this way the kernel handles the load balancing for us.
amazing, thanks!
I swear even my VSCode intellisense is broken now... Rip to a real one.
Yep, very strange. You can disconnect from WiFi to get it to work. VS Code probably pings GitHub/Microsoft before every operation.
You can also disable telemetry and that seems to work too. Settings -> search for "telemetry" and select "off" from the telemetry dropdown.
I love how the same people who try to drag me towards using Git are the only people who seem to have serious problems working on their code when a website goes down.
Git is not the same thing as github. It's designed to be decentralized, even if it isn't getting used that way atm
I am quite familiar with the basic functionality of Git. However, I am always amused by how it works in practice.
Status page, as usual, all green -> https://www.githubstatus.com/
would be better if it's down too :D
It's yellow/red now.
for everyone complaining about the status page - status pages are normally operated by hand by design, and will rarely reflect things in real-time.
give the poor github ops folks a second to get things moving.
Most status page products integrate with monitoring tools like Datadog[1]; a team as large as GitHub's would have it automated.
You ideally do not want to be making a decision on whether to update a status page during the first few minutes of an incident; bean counters inevitably tend to get involved to delay or not declare downtime if there is a manual process.
It is more likely that the threshold is kept a bit higher than a couple of minutes to reduce the false-positive rate, not because of manual updates.
[1] https://www.atlassian.com/software/statuspage/integrations
Nah, _most_ status pages are hand updated to avoid false positives, and to avoid alerting customers when they otherwise would not have noticed. Very, very few organizations go out of their way to _tell_ customers they failed to meet their SLA proactively. GitHub's SLA remedy clause even stipulates that the customer is responsible for tracking availability, which GitHub will then work to confirm.
Me: I think I'll update nixos*
Nix: barfs voluminous errors I've never seen before
Me: whaaaat the farrrrk
* nixos updates are pulled from a github repo
Yeah, need more caches and backup git links (including local clones).
Also they had IPFS attempts, but not finished.
They're pulled from our CDN by default. Only if you use experimental flakes is GitHub in the loop. And even if GitHub isn't down you can't pull nixpkgs more than twice per hour without running into rate limits and get your IP banned. Don't rely on GitHub for critical infrastructure.
I'm wondering why this isn't on the front page? It has a lot of points in 23 minutes.
HN has a strange philosophy built into its ranking algorithm that an item with a large number of comments early on should be de-ranked because the conversation is likely to be of poor quality.
I wonder how much it would've taken to keep github running without Microsoft buying them and running them into the ground like this.
I wonder whether MS plans, long term, to keep both GH and Azure DevOps (the source code management part) ...
It's 00:16, just about to go to bed, I ran `git push` and it's not working. Check Github, says it's down, I think it's only me, maybe I'm blocked, Github can't be down. Come here to check and it's down for everyone, such a relief.
Really it was a relief. Same case for me. Now I have no energy to push my code. Tomorrow maybe
@dang https://www.githubstatus.com/incidents/kz4khcgdsfdv is probably a better link for this submission now
I should have looked for this before posting the same comment. Upvoted :)
A coworker and I just had to use `git format-patch` and `git am` to exchange work. Git is super cool!
`git bundle` is another option for this (I'm not trying to imply it's preferable)
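A sketch of the `git bundle` route, run entirely locally (the paths and the commit are invented for the example):

```shell
# `git bundle` packs commits into a single file you can mail or copy around;
# the receiver can clone or fetch from it like a read-only remote.
set -e
rm -rf /tmp/bundle-demo && mkdir -p /tmp/bundle-demo && cd /tmp/bundle-demo

git init -q src && git -C src symbolic-ref HEAD refs/heads/main
git -C src -c user.name=Demo -c user.email=d@example.com \
  commit -q --allow-empty -m "work to share"
git -C src bundle create ../work.bundle main HEAD

git clone -q work.bundle dst    # the bundle behaves like a remote
git -C dst log --oneline
```

Unlike a patch, a bundle carries real commit objects, so hashes survive the transfer intact.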
Would this explain why "npm install next-sanity" doesn't work properly, or am I hitting a user error?
could be if the package is hosted on github
I am happy that my project is pushed also to codeberg.
And has a website so anyone could just ask me if something went wrong on github's side and I can send them a complete copy. Decentralised version control is nice!
This is a pretty good place to check. The lag is pretty minimal traditionally.
At the time of posting everything is broken.
Not sure what you mean by minimal lag. The status page showed all green for at least 10min while everyone got unicorns.
There goes Pages, there goes the CDN for release artifacts, there goes any package manager hosting repositories on GitHub. Is this outage contained to GitHub, or is it an Azure outage?
It looks to be contained to just GitHub; the Azure status page shows no outages at this time.
Down in Australia as well
Down under?
Feels bad to have one's job interrupted. Looking on the bright side this is the excuse I needed to check out Radicle...
checked out radicle: doesn't do windows
Aug 14, 2024 - 23:29 UTC Update - We are experiencing interruptions in multiple public GitHub services. We suspect the impact is due to a database infrastructure related change that we are working on rolling back.
Feels like Github is down more often than Gitlab
I sure wish this had happened before I logged off from work for the day...
"Why isn't this project done yet?"
"Didn't you hear? GitHub is down!"
and I get to go out for a long lunch
Down in the Dominican Republic as well, was just trying to commit and end my day
Down for me as well. Thought my SSH agent was broken.
I've never seen such a serious outage before. Even GitHub Pages hosted sites aren't accessible.
Yep
It's all down according to https://www.githubstatus.com/
Update - Issues is experiencing degraded availability. We are continuing to investigate.
Aug 14, 2024 - 23:19 UTC
Update - Git Operations is experiencing degraded availability. We are continuing to investigate.
Aug 14, 2024 - 23:19 UTC
Update - Packages is experiencing degraded availability. We are continuing to investigate.
Aug 14, 2024 - 23:18 UTC
Update - Copilot is experiencing degraded availability. We are continuing to investigate.
Aug 14, 2024 - 23:13 UTC
Update - Pages is experiencing degraded availability. We are continuing to investigate.
Aug 14, 2024 - 23:12 UTC
Wtf, thought it was me alone!
Welp, that’s as good a time as any to call it a day!
Good luck to the devs and dev-analogues involved in getting the ship righted.
This is my first time seeing the angry unicorn! Hopefully it’ll be gone soon :(
This will have a fun post mortem
Wonder if they’ve had worse uptime after moving to Azure
Yes, github is not working right now
Seems to be up again. I also wonder what it was.
"We are experiencing interruptions in multiple public GitHub services. We suspect the impact is due to a database infrastructure related change that we are working on rolling back." Hope it is back up soon.
Yep, angry unicorn. If the copilot debacle wasn't reason enough to make people migrate or diversify the code repo efforts with, let's say, GitLab, this should.
And so goes all your packages, private repositories, pages, AI intern copilot bot and Github Actions; and soon your AI models once you host them there - all being unavailable and going down with GitHub.
Time to consider self-hosting like the old days instead of this weekly chaos at GitHub.
down in phx az
it crashed the second i opened a github repo for an old plugin for blockbench
Sure looks like it!
and... we're back, at least in Japan region
down for me
Yet again, this shows how useless GitHub status page is.
"We suspect the impact is due to a database infrastructure related change that we are working on rolling back."
@dang maybe the link could now be updated with this one https://www.githubstatus.com/incidents/kz4khcgdsfdv
Can't wait for the writeup! So many services down at once... Something very interesting must have happened
Wonder how much of this is to blame on copilot generated code lol
Maybe it's fixed already? Works for me.
Latest update at 23:29 UTC says: "We are experiencing interruptions in multiple public GitHub services. We suspect the impact is due to a database infrastructure related change that we are working on rolling back."
and I was in the middle of committing a hotfix D:, I had to push the image directly to the registry D:
This reminds me that for some reason I am logged into my gaming machine's windows store with my GitHub account thanks to the bizarre way that microsoft do auth.
Even GitHub-hosted Pages are down — https://prql-lang.org/ is also a unicorn
just wanted to do one last final commit :D good timing
All the AI-native developers are twiddling their thumbs because Copilot is out of office.
Imagine everyone having their site hosted on gh pages. Now imagine artifacts for system update on github. Hello NixOS filesystem!
Time to go out and see people
Back online for me
Yes it went down about 5 mins ago, I got the angry unicorn. Since then the status page is increasingly red.
Seeing it all kind of went sideways at the same time, my money is on the typical load balancer config rollout snafu.
"As part of a routine configuration deploym..." [splat]
Down Detector link:
The status page should have a button "Report Outage".
Wonder if this is related to the big cyberattack on Iran earlier today https://www.jpost.com/breaking-news/article-814715
They manually ran a patch query on the distributed production database and forgot to use a transaction
curious if their layoffs last year had the intended impact
The macho unicorn is kinda dope though.
Seems like sites based on GH Pages were down, but are back up (i.e. the Rust blog).
It feels so wrong that there are so many blogs and websites that are based on GH Pages and they all died at once…
Seems like they’re back up though. Or at least the Rust blog is back up.
On X, @githubstatus seems to be getting regular updates / automated messages around impact.
Seems to be back online.
They should have used Kardinal: https://github.com/kurtosis-tech/kardinal
Nothing useful on the status page: https://www.githubstatus.com/
The mobile app on iOS is a 503 with
```
Received a 503 error. Data returned as a String was: <!DOCTYPE html> <!--
Hello future GitHubber! I bet you're here to remove those nasty inline styles, DRY up these templates and make 'em nice and re-usable, right?
Please, don't. https://github.co...
```
That's where it's cut off on my screen.
Curious what the link is :)
I like to think, someone did.
I think it may be global, but at least it is down in US-East (for sure).
Who's the Bozo Doofus maintainer? https://yhbt.net/unicorn/LATEST. I love that we can still see Unicorn in action. I rarely had problems with it back in the day.
down! unicorn!
fatal: unable to access: 502
cli, web, and iOS app :-/
Thank goodness for HN status reports.
Sure is.
Services that explicitly needed the API were also down, and it wasn't pretty. For example: Minecraft Mod packs that rely on SerializationIsBad all went kerplunk! I'm sure a lot of people were scratching their heads yesterday wondering why they couldn't do anything for a time.
What made me laugh though were the "X is functioning normally" messages immediately followed by "X is degraded, continuing to monitor", then right back to "normal" again, all in the same 30-second timespan... made me giggle.
Cause seems to be database related per most recent update (23:29 UTC):
We are experiencing interruptions in multiple public GitHub services. We suspect the impact is due to a database infrastructure related change that we are working on rolling back. Aug 14, 2024 - 23:29 UTC
Status page showing a complete outage now
time to go to bed then. Wasn't getting any useful work done any more anyways...
In the last 5 minutes too, wow.
And other services like copilot...
[error] [auth] Response content-type is text/html; charset=utf-8 (status=503)
Yes: https://github.com/psf/black/ is a 502
back up
Yes, everything is down https://www.githubstatus.com/
This is weird. I've been using Github all night (in France) and didn't notice anything was wrong. Was the outage in North America?
back to life
totally down -.- Cannot access from Hong Kong OMG...
So that's why I haven't heard back on my applications
Things seem to be ack'ed: ``` Investigating - We are investigating reports of degraded availability for Actions, Pages and Pull Requests Aug 14, 2024 - 23:11 UTC ```
GH Ops team be like
Senior: Ah, found it! Let's just roll back one revision on the db.
Newguy: Let me fix this! `kubectl rollout undo ... --to-revision=1`
Newguy: Ok, started rollback to revision one!
Senior: Uh-oh..
Down in Vancouver, Canada
00:30 UTC Resolved https://www.githubstatus.com/incidents/kz4khcgdsfdv
What'll it be this time? DNS or BGP?
Interesting that CoPilot is down as well. I would have assumed it was really only part of GitHub as a branding/marketing thing.
Github mirrors?
...and we're back
I wonder why the status page doesn't just ping github.com for a 200. That seems easy to do.
delaying SLA
This is at least a multi-million dollar payout (if they admit to it).
All GitHub Pages say
Seems slightly unprofessional for a massive company like GitHub/Microsoft.
I disagree. This hurts no one, and not everything needs to be sanitized and painted over with bland corporatespeak.
I don't think they were asking for corporate speak. But at least I would find a plain technical error message like "cannot contact file server" much more respectable than something like "unicorns are hugging our servers uwu".
This “ironic” and “humorous” style of errors and UI captions is the actual new corporate speak. I’d prefer dumb error messages rather than some shit someone over the ocean thinks is smart and humorous. And it’s not funny at all when it’s a global outage impacting my business and my $$$.
It's closer to the truth than you usually get. They're having a bad day; that's completely true. It's the start of my day, but I guess this is the middle of the night for them. There's no such thing as unicorns, but that just highlights the metaphorical nature of the remaining claim: getting unicorns under control means solving their problems. Normally "professional" corporate speak means avoiding saying anything whose meaning is plain on its face and disconfirmable, while avoiding the implication that the company is run and operated by humans. This is a model. (Obviously they came up with the message in advance, which just goes to show that someone in the company is well-rounded enough to know that if it is displayed, they're having a bad day.)
GitHub is (was?) a Rails application, so it was probably originally running behind Unicorn [0], if it isn’t still. So the unicorns are (were) real.
[0] https://en.wikipedia.org/wiki/Unicorn_(web_server)
At the moment all GitHub services seem to be restored, yet the GitHub status page indicates that the problem is still ongoing. I don't think it's related to the SLA, but rather to the monitoring, which is not live; there are a few minutes of delay.
From where? They don't have only one load balancer, so you'd still have the problem of the page showing green when it's not loading for some folks.
At GitHub's scale, why wouldn't they put a ping monitor on every continent at least?
Then you could show the status per continent.
Where on the continent? GitHub is undoubtedly doing blackbox testing internally and has multiple such monitors, but that's not going to capture every customer's route to them, leading to the same problem: customers experience GitHub being down despite monitoring saying it's mostly up. Thus the impasse. Even doing whitebox testing, where you know the internals and can thus place sensors intelligently, even just for ingress, you're still at the mercy of the Internet.
If a sensor that's basically in the same datacenter says you're up, but the route into the datacenter is down, then what? Multiply this by the complexity of the whole site, and monitoring it all with 100% fidelity is impossible. That's not to say it isn't worth trying; there's a team at GitHub that works on monitoring. But beyond the motivation of keeping the SLA up, as a customer, unless you notice it's down, is it really down? In a globally distributed system, downtime, except for catastrophic downtime like this, is hard to define on a whole-site basis for all customers.
I don't think anybody asked for 100% fidelity. We are talking about a complete outage that affected at least North America and Europe. If the status page shows green in such a case, its fidelity is around 50%. People expect better from GitHub.
This is impossible regardless of how godlike the design is... Nobody is asking for 100% fidelity.
That would be self-defeating given that it's a Rails app.
To be fair, I really couldn't care less if the homepage is loading or not.
So long as I can fetch/commit to my repos, pretty much everything else is of secondary, tertiary, or no real importance to me.
(At work, I do indeed have systems running that monitor 200 statuses from client project homepages, almost all of which show better than 99.999% uptimes. And they are practically useless. Most of them also monitor "canary" API requests, which I strive to keep at 99.99% but don't always manage to achieve 99.9%, which is the very best and most expensive SLA we'll commit to.)
I have to wonder how a company at the scale of GitHub can be so bad at keeping track of their status.
Now 4 out of 10 services are marked as "Incident", yet most of the others are also completely dead.
It's because of the way most companies build their status dashboards. There are usually at least two dashboards: an internal one and an external one. The internal dashboard is the actual monitoring dashboard, hooked up to the other monitoring data sources. The external status dashboard is just for customer communication. Only after the outage/degradation is confirmed internally does the external dashboard get updated, to avoid flaky monitors and alerts. The published status also affects SLAs, so changing it needs multiple levels of approval, which is why there are delays.
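A minimal sketch of that two-tier flow (all names here are hypothetical, not GitHub's actual tooling): raw monitoring flips the internal view immediately, while the external page only moves once every required role signs off.

```python
from dataclasses import dataclass, field

@dataclass
class StatusBoard:
    """Two-tier status: internal reflects raw monitor data immediately;
    external only changes after all required approvals."""
    internal: str = "operational"
    external: str = "operational"
    approvals: set = field(default_factory=set)
    required: frozenset = frozenset({"oncall", "comms"})

    def record_alert(self, severity: str) -> None:
        # Raw monitoring signal: internal dashboard updates at once.
        self.internal = severity

    def approve(self, role: str) -> None:
        self.approvals.add(role)

    def publish(self) -> None:
        # The external page only catches up once every required role
        # has signed off, which is exactly why it lags the outage.
        if self.required <= self.approvals:
            self.external = self.internal

board = StatusBoard()
board.record_alert("major_outage")
board.publish()
print(board.external)   # still "operational": no approvals yet
board.approve("oncall")
board.approve("comms")
board.publish()
print(board.external)   # now "major_outage"
```

The gap between `internal` and `external` in this toy model is the delay customers see on real status pages.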
This defeats the purpose of a status dashboard and makes it effectively useless in practice most of the time from a consumer's point of view.
From a business perspective, I think given the choice to lie a little bit or be brutally honest with your customers, lying a bit is almost always the correct choice.
My ideal would be regulations making it necessary that downtime metrics be reported, with at most a 10 to 30 minute delay, as a "suspected reliability issue".
If your reliability metrics have lots of false positives, that's on you and you'll have to write down some reason why those false positives exist every time.
Then that company could decide for itself whether to update manually with "not a reliability issue because X".
This lets consumers avoid being gaslit, and businesses don't technically have to call it downtime.
Liability is their primary concern
This is intentional. It's mostly a matter of discussing how to communicate it publicly and when to flip the switch to start the SLA timer. Also coordinating incident response during a huge outage is always challenging.
That it may be, but there's no excuse.
Declare an incident first, investigate later.
Cheating SLAs by delaying the incident is a good way to erode trust within and without.
If that were the best way to deal with it, why is literally no one doing it this way, and what does that tell you?
Because it involves admitting that you messed up, which companies are often disincentivized to do.
False positives?
I get the angry unicorn page "No server is currently available to service your request. Sorry about that. Please try refreshing and contact us if the problem persists. Contact Support — GitHub Status — @githubstatus" with that last link going to https://x.com/githubstatus showing "GitHub Status Oct 22, 2018 Everything operating normally."
The era of Twitter/X status pages needs to come to an end given how unusable it is if you aren't logged in.
Making logins required to view twitter was the ultimate bed shitting move. The whole point of twitter was to be a broadcast medium. Tweets were viewable without following or logging in. There is a huge vacuum in that space now.
For most (social media) platforms, really. Management believes it will force users to sign up, but in reality the platform just becomes less relevant because of that limitation. Not even talking about search crawlers.
An all around stupid decision. That said, if management is that shitty, the platform probably won't be attractive for long anyway.
Facebook/Instagram were successful despite that to a degree, but this decision probably still did a lot of damage to their relevancy and user numbers.
FB/IG/Whatsapp have half of humanity logging into their services once per month, so I'm not sure how much better they could be doing if they didn't have a login wall.
Meanwhile, Twitter (with no login wall) never broke 500mn. Like, personally I totally take your point about status updates but I'd have used my Twitter account a lot more if I'd needed to log in to see the content.
I think this is because logged-out Twitter now shows top Tweets of all time from a user, rather than most recent Tweets.
Good reason why companies shouldn't be using Twitter/X for status updates anymore!
Thank you! I was wondering why all I could see was useless content there!
Use https://xcancel.com/ (eg https://xcancel.com/githubstatus)
Used to work ops at AWS. I don't know if it's still the case but it required VERY HIGH management approval to actually flip any lights on their "status page" (likely it was referenced in some way for SLAs and refunding customers).
That is an excellent illustration of Goodhart's law: we're going to have this awesome status page, but since updating it would let clients notice the system is down, we're going to put up a lot of barriers to putting the actual status on that page.
Also probably a class action suit lurking somewhere in there eventually.
FWIW, our self-hosted Gitea instance has not had a single second of unplanned downtime in five years we've been running it. And there wasn't much _planned_ downtime because it's really easy to upgrade (pull a new image and recreate the container — takes out the instance for maybe 15 seconds late at night), and full backups are handled live thanks to zfs.
Migration to a new host takes another 15 seconds thanks to both zfs and containers.
I don't know how many GitHub downtime reports I've seen during that time, we're probably into high dozens by now.
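The upgrade and migration flow described above can be sketched roughly like this (dataset names, paths, and the compose setup are assumptions for illustration, not the actual configuration):

```shell
#!/bin/sh
set -eu

# Snapshot the Gitea data first so a bad upgrade can be rolled back
# instantly (the zfs dataset name is hypothetical).
zfs snapshot tank/gitea@pre-upgrade-"$(date +%Y%m%d)"

# Upgrade: pull the new image and recreate the container. The instance
# is only down for the seconds the container takes to restart.
docker compose -f /srv/gitea/docker-compose.yml pull
docker compose -f /srv/gitea/docker-compose.yml up -d

# Migration to a new host is the same idea: replicate the dataset,
# then bring up the same compose file there, e.g.:
#   zfs send tank/gitea@pre-upgrade-YYYYMMDD | ssh newhost zfs recv tank/gitea
```

Live backups fall out of the same mechanism: zfs snapshots are atomic and can be taken while the container is running.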
I've been running Gitea on my homelab for a few months now. It's fantastic. It's like a snapshot of a point in time when GitHub was actually good, before it got enshittified by all of the social and AI nonsense.
I've been moving most of my projects off of GitHub and into Gitea, and will continue to do so.
Looks like we have a full house outage at GitHub with everything down. Much worse than the so-called Twitter / X recent speed-bump that was screeched at and quickly forgotten.
I don't think GitHub has recovered from the monthly incidents that keep occurring. Quite frankly, the expectation now is that something at GitHub will go down every month, which shows how unreliable the service is, and this has been going on for years.
I guess this 4-year-old prediction post about self-hosting and not going all in on GitHub really aged well after all [0]
[0] https://news.ycombinator.com/item?id=22868406
The statute of limitations for HN comment predictions is 3 years.
Wow, the status page only just now started reporting issues, and it still doesn't seem to communicate the scale of the issue.
People use this page for guidance. I guess now we know how much it can be trusted.
It's used to ease their comms; it's not a real-time status board wired to their monitoring.
Status page updates with "degraded availability". lol
They are flipping the switches now, status page just changed.
I remember a time when systems would boast about their "five nines" uptime. It was before anything "cloud" appeared.
Twitter now has:
We are experiencing interruptions in multiple public GitHub services. We suspect the impact is due to a database infrastructure related change that we are working on rolling back.
https://x.com/githubstatus/status/1823864449494569023
Github seems to be coming back up:
https://downdetector.com/status/github/