This is a handy list.
4:07pm The package install has failed as it can't resolve the repositories. Something is wrong with the /etc/apt configuration…
Cloud definitely has downsides and isn't a fit for all scenarios, but in my experience it's great for situations like this. Instead of messing around trying to repair it, simply kill the machine or take it out of the pool. Get a new one. The new machine and app will likely come up clean. Incident resolves. Dig into the machine off the hot path.
Unfortunately, no one has the time to do that (or let someone do it) after the problem is "solved", so over time the "rebuild from scratch" approach just results in a loss of actual troubleshooting skills and acquired knowledge: the software equivalent of a "parts swapper" in the physical world.
The end state of a culture that embraces restart/reboot/clear-cache instead of real diagnosis and troubleshooting is a cohort of junior devs who just delete their git repo and reclone instead of figuring out what a detached HEAD is.
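For what it's worth, escaping a detached HEAD without recloning is usually one or two commands; a minimal sketch, with a made-up branch name:

    git status                 # shows "HEAD detached at <sha>"
    git switch -c rescue-work  # keep any commits made while detached on a new branch
    git switch -               # or, if nothing was committed, just jump back to the previous branch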
I don't really fault the junior dev who does that. They are just following the "I don't understand something, so just start over" paradigm set by seniors.
To be fair, with git specifically, it's a good idea to at least clone for backup before things like major merges. There are lots of horror stories of people losing work to git workflow issues, and I'd rather be ridiculed as an idiot who is afraid of "his tools" (as if I have anything like a choice when using git) and won't learn them properly than lose work to the belief that this thing behaves in a way which can actually be learned and followed safely.
A special case of this is git rebase, after which you "can" access the original history in some obscure way until it's garbage-collected; or you could clone the repo before the rebase, and then you can access the original history straightforwardly and decide when to garbage-collect it by deleting that repo.
Git is a lot less scary when you understand the reflog; commit or stash your local changes and then you can rebase without fear of losing anything. (As a bonus tip, place “mybranch.bak” branches as pointers to your pre-rebase commit sha to avoid having to dig around in the reflog at all.)
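Concretely, that tip looks something like this (mybranch and main are just placeholders):

    git branch mybranch.bak         # park a pointer at the current tip before rebasing
    git rebase main
    git reset --hard mybranch.bak   # if the rebase went sideways, jump straight back
    # or, without the backup branch, the branch's own reflog still has the old tip:
    git reflog show mybranch        # the pre-rebase tip shows up as mybranch@{1}
    git reset --hard mybranch@{1}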
I would never ridicule anyone for your approach, just gently encourage them to spend a few mins to grok the ‘git reflog’ command.
Then submodules enter the picture. I'm comfortable with reflog, but I haven't fully grokked submodules yet, so it's easier to just reclone.
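If it helps, the reclone can usually be avoided here too; a rough sketch, assuming you just want the submodules back at whatever state the superproject records:

    git submodule sync --recursive                         # re-read submodule URLs from .gitmodules
    git submodule update --init --recursive --force        # pin every submodule to the recorded commit
    git submodule foreach --recursive git checkout -- .    # drop stray edits inside the submodules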
If you’re not super comfortable with Git, before rebasing, simply:
- Commit any pending changes.
- Make a git tag at your current head (any name is fine, even gibberish).
If anything "goes wrong" you can roll back by simply doing a hard reset to the tagged commit (see the sketch below).
Once done, delete the tag.
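Something like this, with made-up branch and tag names (main and pre-rebase-backup are just placeholders):

    git commit -am "WIP before rebase"   # or: git stash
    git tag pre-rebase-backup            # cheap bookmark at the current HEAD
    git rebase main
    # if it goes wrong:
    git rebase --abort                   # only if the rebase is still in progress
    git reset --hard pre-rebase-backup   # back to exactly where you started
    # once you're happy with the result:
    git tag -d pre-rebase-backup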
Making a full "backup clone" is a complete waste of time and disk space.
Isn't the whole purpose of Git version control? In other words, to prevent work loss from merges and/or updates? Or maybe I'm confusing GitHub with Git? PS: I want to set up a server for a couple of domain names I recently acquired. It has been many years, so I'm not exactly sure this is even practical anymore. Way back when, I used a distribution based off of CentOS called SME Server. Is it still commonplace to use an all-in-one distribution like that, or is it better to just install my preferred flavour of Linux and each package separately?
Honestly, there's a certain cost-benefit analysis here. In both instances (rebooting and recloning), it's a pretty fast action with a high chance of success. How much longer does it take to find the real, permanent solution? For that matter, how long does it take to even dig into the problem and familiarize yourself with its background? For a business, sometimes it's just more cost-effective to accept that you don't really know what the problem is and won't figure it out in less time than the cop-out takes. Personally, I'm all in favor of actually figuring out the issue too, I just don't believe it to be appropriate in every situation.
There is a short-term calculus and a long-term calculus. Restarting usually wins in the short term. But if you double down on that strategy too much, your engineering team, and its culture writ large, will tilt increasingly towards technological mysticism.
It’s not either / or.
If you have proper observability in place then you can do your diagnosis without affecting your customers.
Plus, a successful diagnosis is also the kind of work that can have the most dramatic effect on your customers.
In a positive way.
If it's happening so rarely that killing is a viable solution, then there's no reason to troubleshoot it to begin with. If it's happening often enough to warrant troubleshooting, then your concerns are addressed.
That might work in some scenarios. If you're a "newer" company where each application is deployed onto individual nodes, you can do this.
But consider the case of older companies, where it was more common to deploy several systems, often complex ones, onto the same node. There, you will cause outages for systems x, y and z too. Maybe some of them are inter-dependent? You have to weigh the consequences and risks carefully in any situation before rebooting.
At least as I read it, this contains the assumption that that's not how you deploy your applications.
Here's a real-life example. We have a KVM server that has its storage on Ceph. It looks like KVM doesn't work well with Ceph, especially when MD (Linux software RAID) is involved: if a VM is powered off instead of shut down in an orderly way, something bad happens to the MD metadata, and when the VM is turned on again, one MD replica can be missing. This happens infrequently, and I've never been in a situation where two replicas died at the same time (which would prevent a VM from booting), but it's obviously possible.
So... more generally, the idea of replacing VMs is rather naive when it comes to storage. Replacement incurs penalties, such as RAID rebuilds. RAIDs don't have the promised resiliency during a rebuild. And, in general, rebuilds are costly because they move a lot of data and wear the hardware considerably. Worse yet, if during the rebuild you hit the same problem that caused you to start the rebuild in the first place, the whole system is a write-off.
In other words, it's a bad idea to fix problems without diagnosing them first if you want your system to be reliable. In extreme cases this can start a domino effect, where the replacement compounds the problem, and, if you're running on rented hardware, it can also be very financially damaging: there have been stories of load-balancing setups that kept spawning more and more servers to try to mitigate the problem, where the problem was, e.g., a configuration that got copied to each newly spawned server.
Y'all don't do post-mortem investigations / action items?
I get the desire to troubleshoot, but priority 0 is making the system functional for users again; literally everything else can wait. I once had to deal with an outage that required us to kill all our app servers every 20 minutes (staggered, of course) because of a memory leak, while the leak was being investigated.
What numbers went into this calculation, to get such an extreme result as concluding that getting it up again is always the first priority?
When I've tried to estimate the cost and benefit, I've been surprised to reach the opposite conclusion multiple times. We ended up essentially in the situation of "Yeah, sure, you can reproduce the outage in production. Learn as much as you possibly can and restore service after an hour."
This is in fact the reason I prefer to keep some margin in the SLO budget -- it makes it easier to allow troubleshooting an outage in the hot path, and it frontloads some of that difficult decision.
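For a rough sense of scale (illustrative numbers; the 99.9% target below is an assumption, not something from the thread): a 99.9% monthly SLO only buys about 43 minutes of downtime, so an hour of deliberate troubleshooting in production really does need that pre-existing margin.

    # back-of-the-envelope error budget for a 99.9% SLO over a 30-day month
    awk 'BEGIN { slo = 0.999; minutes = 30 * 24 * 60; printf "%.1f minutes/month\n", (1 - slo) * minutes }'
    # prints: 43.2 minutes/month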
Usually depends on the impact. If it's one of many instances behind a load balancer and was easily fixed with no obvious causes, then we move on. If it happens again, we have a known short-term fix and now we have a justified reason to devote man-hours to investigating and doing a post-mortem.
I was at a place where we had "worker" machines that would handle incoming data with fluctuating volume. If the queues got too long we would automatically spin up new worker instances and when it came time to spin down we would kill the older ones first.
You can probably see where this is going. The workers had some problem where they would bog down if left running too long. Causing the queues to back up and indirectly causing themselves to eventually be culled.
Never did figure out why they would bog down. We just ran herky jerky like this for a few years till I left. Might still be doing it for all I know.
So you just automatically replace the instances after a certain amount of runtime and your problem is gone.
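In its crudest form that's just a bounded-lifetime wrapper; a sketch, assuming a hypothetical ./worker binary and an arbitrary 6-hour cap (systemd's RuntimeMaxSec= or your orchestrator's max-lifetime setting does the same thing more cleanly):

    #!/bin/sh
    # recycle the worker on a fixed schedule so slow leaks never get a chance to accumulate
    while true; do
        timeout 6h ./worker        # ./worker is a stand-in for the real process
        echo "worker exited or hit its runtime cap; restarting" >&2
        sleep 5
    done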
Yeah, fixing a problem without understanding it has some disadvantages. It works sometimes, but the "with understanding" strategy works much more often.
Is this really a prevailing attitude now? Who cares what happened, as long as we can paper over it with some other maneuver/resources? For me it's both intellectually rewarding and skill-building to figure out what caused the problem in the first place.
I mean, I hear plenty of managers with this attitude. But I really expect better on a forum called hacker news.
One could argue that most devs these days are parts swappers with all the packages floating around.
"4:10pm the new machine still has the same performance issue"
4:20pm Turns out it was DNS
That made me laugh, thank you. Of course, it is not DNS. DNS has become the new cabling. DNS is not especially complicated, but neither is cabling. Yet during the dot-com years and those that followed, cabling caused so many of the problems that we got used to checking the cabling first. It only took a few more years to realize that it is not always the cabling; actually, failures are normally distributed.
Is it wrong to check DNS first? No, but please realize that DNS misconfiguration is not more common than other SNAFUs.
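It's cheap to rule in or out either way; a quick first pass, assuming dig is available (example.com stands in for the name that's actually failing):

    dig +short example.com              # does the configured resolver answer at all?
    dig +short example.com @1.1.1.1     # does an independent resolver agree? a mismatch points at DNS/config
    cat /etc/resolv.conf                # which resolver is the box actually using?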
Sure, but more often than not, especially in cloud scenarios, you just get a machine that is having a bad day, and it's quicker to just eject it, let the rest of the infra pick up the slack, and then debug from there. Additionally, if you've axed a machine and still have the same issue, you know it's not a machine problem, so either go look at your networking layer or at whatever configs you're using to boot your machines from…
... so the nice thing about the cloud is that you can work around cloud-specific issues?
That's actually amazing, a reproducible problem is a 90% solved problem!
Killing the machine might destroy evidence. It might be the case that you have everything logged externally, but more often than not something is missing.
Take it out of the pool then.
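Concretely, on AWS for example (the ARN and IDs below are placeholders), pulling a box out of rotation while keeping the evidence around looks roughly like:

    # stop sending it traffic, but leave the instance running for inspection
    aws elbv2 deregister-targets \
        --target-group-arn arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/app/PLACEHOLDER \
        --targets Id=i-0123456789abcdef0
    # optionally snapshot its volume so the state survives even if someone recycles it later
    aws ec2 create-snapshot \
        --volume-id vol-0123456789abcdef0 \
        --description "incident evidence - do not delete"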
You're describing one of the benefits of virtualised cattle, not necessarily or exclusively 'cloud'.