Linux Crisis Tools

FridgeSeal
32 replies
16h29m

This is a handy list.

4:07pm The package install has failed as it can't resolve the repositories. Something is wrong with the /etc/apt configuration…

Cloud definitely has downsides, and isn’t a fit for all scenarios but in my experience it’s great for situations like this. Instead of messing around trying to repair it, simply kill the machine, or take it out of the pool. Get a new one. New machine and app likely comes up clean. Incident resolves. Dig into machine off the hot path.

userbinator
21 replies
14h35m

Dig into machine off the hot path.

Unfortunately, no one has the time to do that (or let someone do it) after the problem is "solved", so over time the "rebuild from scratch" approach just results in a loss of actual troubleshooting skills and acquired knowledge --- the software equivalent of a "parts swapper" in the physical world.

patrick451
9 replies
14h13m

The end state of a culture that embraces restart/reboot/clear-cache instead of real diagnoses and troubleshooting is a cohort of junior devs who just delete their git repo and reclone instead of figuring out what a detached HEAD is.

I don't really fault the junior dev who does that. They are just following the "I don't understand something, so just start over" paradigm set by seniors.

yosefk
4 replies
13h47m

To be fair, with git, specifically, it's a good idea to at least clone for backup before things like major merges. There are lots of horror stories from people losing work to git workflow issues and I'd rather be ridiculed as an idiot who is afraid of "his tools" (as if I have anything like a choice when using git) and won't learn them properly than lose work thanks to a belief that this thing behaves in a way which can actually be learned and followed safely.

A special case of this is git rebase after which you "can" access the original history in some obscure way until it's garbage-collected; or you could clone the repo before the merge and then you can access the original history straightforwardly and you decide when to garbage-collect it by deleting that repo.

theptip
1 replies
12h50m

Git is a lot less scary when you understand the reflog; commit or stash your local changes and then you can rebase without fear of losing anything. (As a bonus tip, place “mybranch.bak” branches as pointers to your pre-rebase commit sha to avoid having to dig around in the reflog at all.)

I would never ridicule anyone for your approach, just gently encourage them to spend a few mins to grok the ‘git reflog’ command.
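
For illustration, a minimal sketch (the backup branch name and rebase target are just placeholders):

    git branch mybranch.bak        # bookmark the pre-rebase sha, per the tip above
    git rebase origin/main         # the risky operation
    git reflog                     # every HEAD movement is recorded here for a while
    git reset --hard mybranch.bak  # roll back via the bookmark (or HEAD@{N} from the reflog)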

ansgri
0 replies
8h58m

Then submodules enter the picture. I’m comfortable with reflog, but haven’t fully grokked submodules yet, easier to reclone.

jonathanlydall
1 replies
2h47m

If you’re not super comfortable with Git, before rebasing, simply:

- Commit any pending changes.

- Make a git tag at your current head (any name is fine, even gibberish).

If anything "goes wrong" you can roll back by simply doing a hard reset to the tagged commit.

Once done, delete the tag.
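
A sketch of that flow (tag name and rebase target are arbitrary):

    git commit -am "wip"                  # commit any pending changes
    git tag pre-rebase-backup             # bookmark the current HEAD
    git rebase origin/main
    git reset --hard pre-rebase-backup    # only if something went wrong
    git tag -d pre-rebase-backup          # clean up once you're happy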

Making a complete “backup clone” is a complete waste of time and disk space.

ekimehtor
0 replies
2h21m

Isn't the whole purpose of Git version control? In other words, to prevent work loss from merges and/or updates? Maybe I'm confusing GitHub with Git? PS: I want to set up a server for a couple of domain names I recently acquired; it has been many years, so I'm not exactly sure if this is even practical anymore. Way back when, I used a distribution based off of CentOS called SME Server. Is it still commonplace to use an all-in-one distribution like that? Or is it better to just install my preferred flavour of Linux and each package separately?

zettabomb
1 replies
13h39m

Honestly, there's a certain cost-benefit analysis here. In both instances (rebooting and recloning), it's a pretty fast action with high chances of success. How much longer does it take to find the real, permanent solution? For that matter, how long does it take to even dig into the problem and familiarize yourself with its background? For a business, sometimes it's just more cost effective to accept that you don't really know what the problem is and won't figure it out in less time than it takes to cop-out. Personally, I'm all in favor of actually figuring out the issue too, I just don't believe it to be appropriate in every situation.

patrick451
0 replies
13h15m

There is a short-term calculus and a long-term calculus. Restarting usually wins the short-term calculus. But if you double down on that strategy too much, your engineering team, and your culture writ large, will tilt increasingly towards a technological mysticism.

hnlmorg
1 replies
7h33m

It’s not either / or.

If you have proper observability in place then you can do your diagnosis without affecting your customers.

fuzzfactor
0 replies
55m

diagnosis without affecting your customers.

Plus, at the same time, a successful diagnosis is also the kind that can have the most dramatic effect on your customers.

In a positive way.

Salgat
3 replies
13h0m

If it's happening so rarely that killing is a viable solution, then there's no reason to troubleshoot it to begin with. If it's happening often enough to warrant troubleshooting, then your concerns are addressed.

whirlwin
1 replies
12h37m

That might work in some scenarios. If you're a "newer" company where each application is deployed onto individual nodes, you can do this.

But consider the case of older companies, where it was more common to deploy several systems, often complex ones, onto the same node. Killing the node will also cause outages for systems x, y and z. Maybe some of them are inter-dependent? You have to weigh the consequences and risks carefully in any situation before rebooting.

bschne
0 replies
11h33m

Cloud definitely has downsides, and isn’t a fit for all scenarios but in my experience it’s great for situations like this.

At least as I read it, this contains the assumption that that‘s not how you deploy your applications

crabbone
0 replies
4h28m

Here's a real-life example. We have a KVM server that has its storage on Ceph. It looks like KVM doesn't work well with Ceph, especially when MD is involved, so if a VM is powered off instead of shut down in an orderly way, something bad happens to the MD metadata, and when the VM is turned on again, one MD replica can be missing. This happens infrequently, and I've never been in a situation where two replicas died at the same time (which would prevent a VM from booting), but it's obviously possible.

So... more generally, your idea of replacing VMs is rather naive when it comes to storage. Replacement incurs penalties, such as RAID rebuilds. RAIDs don't have the promised resiliency during a rebuild. And, in general, rebuilds are costly because they move a lot of data and wear the hardware a lot. Worse yet, if during the rebuild you hit the same problem that caused you to start the rebuild in the first place, the whole system is a write-off.

In other words, it's a bad idea to fix problems without diagnosing them first if you want your system to be reliable. In extreme cases, this may start a domino effect, where the replacement compounds the problem, and, if running on rented hardware, it may also be very financially damaging: there were stories about systems not coping with load balancing and spawning more and more servers to try and mitigate the problem, where the problem was, e.g., a configuration that was copied to the newly spawned servers.

Spivak
2 replies
13h37m

Y'all don't do post-mortem investigations / action items?

I get the desire to troubleshoot but priority 0 is make the system functional for users again, literally everything else can wait. I once had to deal with an outage that required we kill all our app servers every 20 minutes (staggered of course) because of a memory leak while it was being investigated.

kqr
0 replies
7h22m

I get the desire to troubleshoot but priority 0 is make the system functional for users again, literally everything else can wait.

What numbers went into this calculation, to get such an extreme result as concluding that getting it up again is always the first priority?

When I tried to estimate the cost and benefit, I have been surprised to make the opposite conclusion multiple times. We ended up essentially in the situation of "Yeah, sure, you can reproduce the outage in production. Learn as much as you possibly can and restore service after an hour."

This is in fact the reason I prefer to keep some margin in the SLO budget -- it makes it easier to allow troubleshooting an outage in the hot path, and it frontloads some of that difficult decision.

Salgat
0 replies
12h58m

Usually depends on the impact. If it's one of many instances behind a load balancer and was easily fixed with no obvious causes, then we move on. If it happens again, we have a known short-term fix and now we have a justified reason to devote man-hours to investigating and doing a post-mortem.

Fatnino
2 replies
10h32m

I was at a place where we had "worker" machines that would handle incoming data with fluctuating volume. If the queues got too long we would automatically spin up new worker instances and when it came time to spin down we would kill the older ones first.

You can probably see where this is going. The workers had some problem where they would bog down if left running too long. Causing the queues to back up and indirectly causing themselves to eventually be culled.

Never did figure out why they would bog down. We just ran herky jerky like this for a few years till I left. Might still be doing it for all I know.

aequitas
1 replies
5h9m

The workers had some problem where they would bog down if left running too long.

So you just automatically replace the instances after a certain amount of runtime and your problem is gone.

lucianbr
0 replies
4h33m

Yeah, fixing a problem without understanding it has some disadvantages. It works sometimes, but the "with understanding" strategy works much more often.

Is this really a prevailing attitude now? Who cares what happened, as long as we can paper over it with some other maneuver/resources? For me it's both intellectually rewarding and skill-building to figure out what caused the problem in the first place.

I mean, I hear plenty of managers with this attitude. But I really expect better on a forum called hacker news.

eversincenpm
0 replies
3h48m

One could argue that most devs these days are parts swappers with all the packages floating around.

throw5323446
6 replies
16h22m

Instead of messing around trying to repair it, simply kill the machine, or take it out of the pool. Get a new one.

"4:10pm the new machine still has the same performance issue"

jandrese
2 replies
15h45m

4:20pm Turns out it was DNS

Propelloni
1 replies
14h3m

That made me laugh. Thank you. Of course, it is not DNS. DNS has become the new cabling. DNS is not especially complicated, but neither is cabling. Yet during the dot-com years and after, cabling caused a lot of the problems, so we got used to checking the cabling first. But it only took a few more years to realize that it is not always the cabling; actually, failures are normally distributed.

Is it wrong to check DNS first? No, but please realize that DNS misconfiguration is not more common than other SNAFUs.

ninkendo
0 replies
6h53m

    It’s not DNS
    There’s no way it’s DNS
    It was DNS

FridgeSeal
1 replies
16h8m

Sure, but more often than not - esp in cloud scenarios - you just get a machine that is having a bad day and it's quicker to just eject it, let the rest of the infra pick up the slack, and then debug from there. Additionally, if you've axed a machine and still get the same issue, you know it's not a machine issue, so go look at your networking layer or whatever configs you're using to boot your machines from…

tjoff
0 replies
10h1m

esp in cloud scenarios

... so the nice thing about the cloud is that you can work around cloud-specific issues?

SerCe
0 replies
14h29m

That's actually amazing, a reproducible problem is a 90% solved problem!

KingOfCoders
1 replies
12h10m

Killing the machine might destroy evidence. You might have everything logged externally, but most often there is something missing.

monkpit
0 replies
11h45m

Take it out of the pool then.

Jedd
0 replies
15h48m

You're describing one of the benefits of virtualised cattle, not necessarily or exclusively 'cloud'.

rr808
9 replies
15h6m

You guys get root access? I have to raise a ticket for a sysadmin to do anything.

zer00eyz
7 replies
14h38m

I am a consultant now so it's a new company every few months.

There are groups of people you always make nice with.

* Security people. The kind with poorly fitting blazers who let you into the building. Learn these people's names; Starbucks cards are your friends.

* Cleaning people. Be nice, be polite, and again, learn names. Your area will be spotless. It's worth staying late every now and again just to get to know these folks.

* Accounting: Make some friends here. Get coffee, go to lunch, talk to them about non-work shit, ask about their job, show interest. If you pick the right ones they are gonna grab you when layoffs are coming or corp money is flowing (i.e., times to hit your boss up for extra money).

* IT. The folks who hand out laptops and manage email. Be nice to these people. Watch how quickly they rip bullshit off your computer or waive some security nonsense. Be first in line for every upgrade possible.

* Sysadmins. These are the most important ones. Not just because "root" but because a good SA knows how to code but never says it out loud. A good sysadmin will tell you which dark corners have the bodies and whether it's just a closet or a whole fucking cemetery. If you learn to build into their platform (hint: for them, containers are how they isolate your shitty software in most cases) then you're going to get a LOT more leeway. This is the one group of people who will ask you for favors, and you should do them.

rkachowski
3 replies
8h58m

Starbucks cards are your friends

like, how? are you straight up bribing people with coffee for security favors? or is it like, "hey man, thanks for helping me out I'd like to buy you a coffee but I'm busy with secret consulting stuff - here's a gift card"

Is this something that only works for short lived external consultant interactions?

zer00eyz
0 replies
3h25m

You learn people's names, you say hi every day, you treat them like humans. You bring them coffee on occasion if it's early in the morning... or you ask them if they want something if you're going for that post-lunch pick-me-up yourself.

By the end of a 4-5 week run you will know all the security people in a building. If I go to lunch and forget my badge, they will let me back in, no questions asked. This is something I used to do as staff, and still do to this day.

sevagh
0 replies
4h7m

You do that right after walking into the manager's office and getting a job with a firm handshake. Then you go outside and buy a hotdog for 15 cents and a detached house in San Francisco for $15,000 USD.

awithrow
0 replies
2h20m

Just give one as a gift occasionally. The holiday season is great for this. On your way in, "Merry Christmas, Frank!" and hand one out. Or even just because: "Keep up the good work, here you go."

It's not about bribing to get a specific favor. It's about getting on good terms. Having people like you is a good thing: it makes their job a bit better and can make their day a little brighter. Win-win.

ozim
1 replies
7h28m

Much easier to be nice by default ;D

phyzome
0 replies
3h37m

Agreed. Although being "extra nice" -- going out of your way to learn about people, eat with them, etc. -- does take extra time, so you can't do that with everyone.

Propelloni
0 replies
13h31m

So true. If you want to know anything about an office, ask the sysadmins. Double-plus on being nice to the facility managers, cleaning people and security. Not only do they do a thankless job but they are often the most useful and resourceful people around if you need something taken care of. They know how to get shit done.

Propelloni
0 replies
13h39m

Err, sure. I used to run IT ops (SYS, SRE, and SEC in this context). This article is directed at people who run apps on IT-provided infrastructure. But if you have interactions like the one in the example, your org has failed at the org level; this is not a tech problem. We used to have very clear and very trustworthy lines of communication, and people wouldn't be on chat, they would be on the phone (or today on Teams or whatever) with dev, ops, security, and compliance. Actually, we had at least a liaison on every team, but most often dev ran the apps on ops-provided resources. Compliance green-lighted the setup and SRE was a dev job. A lot of problems really go away if you do devops in this sense.

mmh0000
8 replies
16h17m

I was surprised that `strace` wasn't on that list. That's usually one of my first go-to tools. It's so great, especially when programs return useless or wrong error messages.
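
For example, to see the real syscall failures behind a vague error message (the PID and program name are placeholders):

    strace -f -tt -e trace=%file -p 1234        # attach to a running process, follow forks
    strace -f -e trace=openat,connect ./myprog  # or launch the program under strace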

vram22
1 replies
10h0m

And I wonder if anyone still uses sar and family.

I didn't, but a boss or two of mine did.

tremon
0 replies
3h16m

I still use sar occasionally, but never as a troubleshooting tool. It's more for performance analysis than crisis mode.

kristjansson
1 replies
2h51m

TIL fuser. Thanks!

vram22
0 replies
6m

Welcome :)

slacka
1 replies
15h56m

Why don't you recommend atop? When a system is unresponsive, I want a high-level tool that immediately shows which subsystem is under heavy load. It should show CPU, memory, disk, and network usage. The other tools you listed are great once you know what the cause is.

brendangregg
0 replies
15h39m

My preference is tools that give a rolling output, as it lets you capture the time-based pattern and share it with others, including in JIRA tickets and SRE chatrooms, whereas the top(1)-style tools generally clear the screen. atop by default also sets up logging and runs a couple of daemons in systemd, so it's more than just a handy tool when needed; it's now adding itself to the operating table. (I think I did at least one blog post about performance monitoring agents causing performance issues.) Just something to consider.

I've recommended atop in the past for catching short-lived processes because it uses process accounting, although the newer bpf tools provide more detail.

zer00eyz
5 replies
14h46m

The only thing I would add is nmap.

Network connectivity issues aren't always apparent in some apps.
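
For example, a quick reachability check (host and ports are placeholders):

    nmap -Pn -p 443,5432 db.internal.example    # open, closed or filtered?
    nc -vz db.internal.example 5432             # single-port TCP check, if your netcat supports -z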

sneak
4 replies
12h17m

screen/tmux, byobu, pv, rsync and of course vim.

vram22
3 replies
10h5m

dd, echo * as a poor man's ls if ls is accidentally deleted, busybox, cpio, fsck and fsdb.

Used all of these and more, in Unix, not just Linux crisis situations.

sneak
2 replies
10h2m

Those are already there. We are talking about diagnostic and recovery tools that should be installed by policy, in advance, so that they are already in place to aid in emergencies.

vram22
1 replies
9h58m

Okay, my mistake. But busybox is not always already there, right? Installed, I mean? Not at a box right now.

bostik
0 replies
6h10m

Busybox has one big downside: the tools it provides tend to have a rather ... limited set of options available. The easy stuff you can do in a standard shell might not be supported.

reilly3000
4 replies
16h20m

In such a crisis if installing tools is impossible, you can run many utils via Docker, such as:

Build a container with a one-liner:

docker build -t tcpdump - <<EOF
FROM ubuntu
RUN apt-get update && apt-get install -y tcpdump
CMD tcpdump -i eth0
EOF

Run attached to the host network:

docker run -dP --net=host moremagic/docker-netstat

Run system tools attached to read host processes:

for sysstat_tool in iostat sar vmstat mpstat pidstat; do
  alias "sysstat-${sysstat_tool}=docker run --rm -it -v /proc:/proc --privileged --net host --pid host ghcr.io/krishjainx/sysstat-docker:main /usr/bin/${sysstat_tool}"
done
unset -v sysstat_tool

Sure, yum install is preferred, but so long as docker is available this is a viable alternative if you can manage the extra mapping needed. It probably wouldn’t work with a rootless/podman setup.

blueflow
1 replies
16h18m

Is there a situation where apt can't download and install packages but docker can fetch new containers?

apt libs borked or something?

Smar
0 replies
15h10m

I would just decompress the .deb in such a case. As a last resort, even a .rpm might work.

Of course handling dependencies by hand is annoying, but depending on the situation it might be faster anyway.
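
A rough sketch of that, assuming you can get the package file onto the box somehow (filenames are placeholders):

    dpkg-deb -x sysstat_<version>_amd64.deb /tmp/sysstat   # unpack without touching dpkg's database
    /tmp/sysstat/usr/bin/iostat 1
    rpm2cpio sysstat-<version>.rpm | cpio -idmv            # rpm equivalent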

xyst
0 replies
14h34m

Unless you are in an air-gapped situation. Good luck pulling the “Ubuntu” image!

supriyo-biswas
0 replies
13h57m

On that note, I'd largely prefer it if `busybox` contained more of these tools; it'd be very helpful to have a ~1MB file that I can upload to a server and run there.

SamuelAdams
4 replies
16h26m

Would these tools still be useful in a cloud environment, such as EC2?

Most dev teams I work with are actively reducing their actual managed servers and replacing them with either Lambda or Docker images running in K8s. I wonder if these tools are still useful for containers and serverless?

yla92
0 replies
16h13m

It's still useful in EC2 (or any other VM-based environments) and Docker containers, as long as you can install the necessary packages (if they are not installed by default). Because after all, there are "servers" underneath, even for the serverless apps, I suppose.

It's definitely harder for apps running in Lambda because we may not have access to the underlying OS. In such cases, I kind of fall back to using application-level observability tools like Pyroscope (https://pyroscope.io). It doesn't work for all cases and has some overhead/setup, but it's still better than flying blind and more useful than the cloud provider's provided metrics.

ranger207
0 replies
16h5m

IME there's always that one service that wasn't ever migrated to containers or lambdas and is off running on an EC2 instance somewhere, and nobody knows about it because it never breaks, but then the one time AWS schedules an instance retirement for it...

mdekkers
0 replies
16h18m

Most dev teams I work with are actively reducing their actual managed server and replace it with either Lambda, or docker images running in K8.

There are plenty of services that don’t fit on k8s or Lambda. Not all pegs fit in those holes.

cpuguy83
0 replies
4h31m

Containers are just processes running on the host where the process has a different view of the world from the "host". The host can see all and do all.
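
For example, inspecting a container from the host with host-side tools (the container name is a placeholder):

    PID=$(docker inspect --format '{{.State.Pid}}' mycontainer)
    nsenter -t "$PID" -n ss -tlpn      # the container's listening sockets, via the host's ss
    ls /proc/"$PID"/root/etc           # peek into the container's filesystem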

js4ever
3 replies
7h34m

Let's add ncdu to the list; it's super useful for finding what is taking up all the disk space.

kqr
2 replies
7h32m

I keep forgetting about ncdu thanks to my old habit of du -ms * | sort -n. What is it I'm missing?

IshKebab
1 replies
5h29m

Lots. ncdu is a fully interactive file browser that also lets you delete files and directories without a rescan.

natebc
0 replies
3h32m

ncdu will also store results in an output file you can pull back and do analysis on. I've found this feature useful in some contexts.

pjmlp
2 replies
10h43m

The list is great, but only for classical server workloads.

Usually not even a shell is available in modern Kubernetes deployments that take a security-first approach, with chiseled containers.

And by creating a debugging image, not only is the execution environment being changed, but deploying it might require disabling security policies that do image scans.

cpuguy83
1 replies
4h34m

You don't need to have these tools in the container to troubleshoot the workload in a container.
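
For example, on Kubernetes with ephemeral debug containers enabled (pod, target container and image names are placeholders):

    kubectl debug -it mypod --image=nicolaka/netshoot --target=app -- bash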

pjmlp
0 replies
2h47m

You would be surprised, especially if developers didn't care about telemetry.

logifail
2 replies
11h32m

Doesn't one increase a system's attack surface area/privilege escalation risk by pre-installing tools such as these?

citrin_ru
0 replies
6h35m

How do you see an escalation using one of the tools listed in the article (unless a binary has the suid bit, which you shouldn't set if you're worried about security)? Many of these tools just provide convenient access to /proc - if an attacker needs something there, they can read/write /proc directly. Though in the case of eBPF, disabling kernel support would reduce the attack surface, and if it's disabled in the kernel, the user-mode tools are useless.

c0l0
0 replies
7h9m

Usually (not by design, but by circumstance), if someone gains RCE on your systems, they can also find a way to bring the tools they need to do whatever they originally set out to do. It's the old "I don't want to have a compiler installed on my system, that's dangerous, unnecessary software!"-trope driven to a new extreme. Unless the executables installed are a means to somehow escalate privileges (via setuid, file-based capabilities, a too-open sudo policy, ...), having them installed might be a convenience for a successful attacker - but very rarely the singular inflection point at which their attempted attack became a successful one.

The times I've been locked in an ill-equipped container image that was stripped bare by some "security" crapware and/or guidelines and that made debugging a problem MUCH harder than it should have been vastly outnumber the times where I've had to deal with a security incident because someone had coreutils (or w/e) "unnecessarily" installed. (The latter tally is at zero, for the record.)

devsda
2 replies
14h19m

Not all servers are containerized, but a significant number are and they present their own challenges.

Unfortunately, many such tools in Docker images will be flagged by automated security scanning tools in the "unnecessary tools that can aid an attacker in observing and modifying system behavior" category. Some of those (like having gdb) are valid concerns, but many are not.

To avoid that, we keep some of these tools in a separate volume as (preferably) static binaries, or compile & install them with the mount path as the install prefix (for config files & libs). If there's a need to debug, we ask operations to mount the volume temporarily as read-only.

Another challenge: if a debug tool requires enabling a certain kernel feature, there are often questions/concerns about how that affects other containers running on the same host.

Too
1 replies
10h45m

A better way is to build a second image including the debug tools and a root user, then start it with the prod container's PID namespace and network namespace attached.

Starting a second container is usually a good idea anyway, since you need to add a lot of extra flags like the SYS_PTRACE capability, user 0 and --privileged for debuggers to work.

This way you don't need to restart the prod container either, potentially losing reproduction evidence.

Remembering how to do all this in an emergency may not be entirely obvious. Make sure to try it first and write down the steps in your run books.
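
A rough sketch of the Docker flavour of this (container and image names are placeholders):

    docker run -it --rm \
      --pid=container:prod-app \
      --network=container:prod-app \
      --cap-add=SYS_PTRACE \
      my-debug-image bash
    # ps, ss, strace -p etc. in this shell now see prod-app's processes and sockets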

devsda
0 replies
1h1m

A better way is to build a second image including the debug tools and a root-user.

That was our initial idea. But management and QA are paranoid enough that they consider these a new set of images that requires running the complete test suite again, even when they are built on top of certified images. Nobody is willing to test twice, so we had to settle for this middle ground.

randomgiy3142
1 replies
15h46m

I use zfsbootmenu with hrmpf (https://github.com/leahneukirchen/hrmpf). You can see the list of packages here (https://github.com/leahneukirchen/hrmpf/blob/master/hrmpf.pa...). I usually build images based off this so they're all there; otherwise you'll need to ssh into zfsbootmenu and load the 2 GB separate distro. This is for a home server, though if I had a startup I'd probably set up a "cloud setup" and throw a bunch of servers somewhere. A lot of times, for internal projects and even non-production client research, having your own cluster is a lot cheaper and easier than paying for a cloud provider. It also gets around cases where you can't run k8s and need bare metal. I'd advised some clients on this setup, with contingencies in case of catastrophic failure and, more importantly, testing those contingencies - but this is more so you don't have developers doing nothing, not to prevent overnight outages. A lot cheaper than cloud solutions for non-critical projects, and while larger companies will look at the numbers closely, if something happened and devs couldn't work for an hour, the advantage of a startup is that devs will find a way to be productive locally, or you simply have them take the afternoon off (neither has happened).

I imagine the problems described happen on big-iron type hardware clusters that are extremely expensive and where spare capacity isn't possible. I might be wrong, but especially with (sigh) AI setups with extremely expensive $30k GPUs and crazy bandwidth between planes you buy from IBM for crazy prices (a hardware vendor on the line so quickly was a hint), you're way past the commodity server cloud model. I have no idea what could go wrong with such equipment, where nearly every piece of hardware is close to custom built, but I'm glad I don't have to deal with that. The debugging on such rare hardware, which only a few huge pharma or research companies use, has to come down to really strange things.

semi-extrinsic
0 replies
7h49m

On compute clusters there are quite a few "exotic" things that can go wrong. The workload orchestration is typically SLURM, which can throw errors and has a million config options to get lost in.

Then you have storage, often tiered in three levels - job-temporary scratch storage on each node, a distributed fast storage with a few weeks retention only, and an external permanent storage attached somehow. Relatively often the middle layer here, which is Lustre or something similar, can throw a fit.

Then you have the interconnect, which can be anything from super flakey to rock solid. I've seen fifteen year old setups be rock solid, and in one extreme example a brand new system that was so unstable, all the IB cards were shipped back to Mellanox and replaced under warranty with a previous generation model. This type of thing usually follows something like a Weibull distribution, where wrinkles are ironed out over time and the IB drivers become more robust for a particular HW model.

Then you have the general hardware and drivers on each node. Typically there is extensive performance testing to establish the best compiler flags etc., as well as how to distribute the work most optimally for a given workload. Failures on this level are easier in the sense that it typically just affects a couple of nodes which you can take offline and fix while the rest keep running.

pstuart
1 replies
16h31m

Sounds like it's time to create a crisis-essential package group a la build-essential.
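
Until someone packages that, a rough Debian/Ubuntu starting point (package names vary by distro and release, so treat it as a sketch rather than the article's exact list):

    sudo apt-get install -y procps util-linux sysstat iproute2 tcpdump \
      strace ltrace ethtool numactl linux-tools-common bpfcc-tools bpftrace trace-cmd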

yjftsjthsd-h
0 replies
14h49m

I have in the past created a package list in ansible/salt/chef/... called devops_tools or whatever to make sure we had all the tools installed ahead of time.

washadjeffmad
0 replies
4h57m

And I haven't needed to use it in fifteen years. Over the past four or five years, I've ported what I can to a *BSD, for sanity reasons.

ur-whale
0 replies
10h23m

Can't imagine handling a Linux crisis without ssh

[EDIT]: typo

sirwitti
0 replies
11h8m

Related to that, I recently learned about safe-rm which lets you configure files and directories that can't be deleted.

This probably would have prevented a stressful incident 3 weeks ago.

sargun
0 replies
13h14m

When I was at Netflix, Brendan and his team made sure that we had a fair set of debugging tools installed everywhere (bpftrace, bcc, working perf)

These were a lifesaver multiple times.

prydt
0 replies
15h45m

Love the list and the eBPF tools look super helpful.

michaelhoffman
0 replies
3h44m

When would you need to use rdmsr and wrmsr in a crisis?

kureikain
0 replies
10h22m

I don't see nmap, netstat, and nc being mentioned. They have saved me so many times as well.

kunley
0 replies
16h30m

Brendan Gregg, as always, with a down-to-earth approach. Love the war room example.

josephcsible
0 replies
15h11m

and...permission errors. What!? I'm root, this makes no sense.

This is one of the reasons why I fight back as hard as I can against any "security" measures that restrict what root can do.

donio
0 replies
13h55m

I always cover such tools when I interview people for SRE-type positions. Not so much about which specific commands the candidate can recall (although it always impresses when somebody teaches me about a new tool) but what's possible, what sort of tools are available and how you use them: that you can capture and analyze network traffic, syscalls, execution profiles and examine OS and hardware state.

anthk
0 replies
7h21m

tmux, statically linked (musl) busybox with everything, lsof, ltrace/strace and a few more. Under OpenBSD this is not an issue as you have systat and friends in base.

SuperHeavy256
0 replies
16h3m

So basically busybox?