This is a handy list.
4:07pm The package install has failed as it can't resolve the repositories. Something is wrong with the /etc/apt configuration…
Cloud definitely has downsides and isn't a fit for all scenarios, but in my experience it's great for situations like this. Instead of messing around trying to repair it, simply kill the machine or take it out of the pool. Get a new one. The new machine and app will likely come up clean. Incident resolves. Dig into the machine off the hot path.
Unfortunately, no one has the time to do that (or let someone do it) after the problem is "solved", so over time the "rebuild from scratch" approach just results in a loss of actual troubleshooting skills and acquired knowledge: the software equivalent of a "parts swapper" in the physical world.
The end state of a culture that embraces restart/reboot/clear-cache instead of real diagnosis and troubleshooting is a cohort of junior devs who just delete their git repo and reclone instead of figuring out what a detached HEAD is.
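For what it's worth, escaping a detached HEAD without recloning is usually one or two commands; a minimal sketch, with a made-up branch name:

    git status                 # shows "HEAD detached at <sha>"
    git switch -c rescue-work  # keep any commits made while detached on a new branch
    git switch -               # or, if nothing was committed, just jump back to the previous branch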
I don't really fault the junior dev who does that. They are just following the "I don't understand something, so just start over" paradigm set by seniors.
To be fair, with git specifically, it's a good idea to at least clone for backup before things like major merges. There are lots of horror stories of people losing work to git workflow issues, and I'd rather be ridiculed as an idiot who is afraid of "his tools" (as if I have anything like a choice when using git) and won't learn them properly than lose work to the belief that this thing behaves in a way which can actually be learned and followed safely.
A special case of this is git rebase, after which you "can" access the original history in some obscure way until it's garbage-collected; or you could clone the repo before the rebase, and then you can access the original history straightforwardly and decide when to garbage-collect it by deleting that repo.
Git is a lot less scary when you understand the reflog; commit or stash your local changes and then you can rebase without fear of losing anything. (As a bonus tip, place “mybranch.bak” branches as pointers to your pre-rebase commit sha to avoid having to dig around in the reflog at all.)
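Concretely, that tip looks something like this (mybranch and main are just placeholders):

    git branch mybranch.bak         # park a pointer at the current tip before rebasing
    git rebase main
    git reset --hard mybranch.bak   # if the rebase went sideways, jump straight back
    # or, without the backup branch, the branch's own reflog still has the old tip:
    git reflog show mybranch        # the pre-rebase tip shows up as mybranch@{1}
    git reset --hard mybranch@{1}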
I would never ridicule anyone for your approach, just gently encourage them to spend a few mins to grok the ‘git reflog’ command.
Then submodules enter the picture. I'm comfortable with reflog, but I haven't fully grokked submodules yet, so it's easier to just reclone.
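If it helps, the reclone can usually be avoided here too; a rough sketch, assuming you just want the submodules back at whatever state the superproject records:

    git submodule sync --recursive                         # re-read submodule URLs from .gitmodules
    git submodule update --init --recursive --force        # pin every submodule to the recorded commit
    git submodule foreach --recursive git checkout -- .    # drop stray edits inside the submodules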
If you’re not super comfortable with Git, before rebasing, simply:
- Commit any pending changes.
- Make a git tag at your current head (any name is fine, even gibberish).
If anything "goes wrong" you can roll back by simply doing a hard reset to the tagged commit (see the sketch below).
Once done, delete the tag.
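Something like this, with made-up branch and tag names (main and pre-rebase-backup are just placeholders):

    git commit -am "WIP before rebase"   # or: git stash
    git tag pre-rebase-backup            # cheap bookmark at the current HEAD
    git rebase main
    # if it goes wrong:
    git rebase --abort                   # only if the rebase is still in progress
    git reset --hard pre-rebase-backup   # back to exactly where you started
    # once you're happy with the result:
    git tag -d pre-rebase-backup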
Making a full "backup clone" is a complete waste of time and disk space.
Isn't the whole purpose of Git version control? In other words, to prevent work loss from merges and/or updates? Or maybe I'm confusing GitHub with Git? PS: I want to set up a server for a couple of domain names I recently acquired. It has been many years, so I'm not exactly sure this is even practical anymore. Way back when, I used a distribution based off of CentOS called SME Server. Is it still commonplace to use an all-in-one distribution like that, or is it better to just install my preferred flavour of Linux and each package separately?
Honestly, there's a certain cost-benefit analysis here. In both instances (rebooting and recloning), it's a pretty fast action with a high chance of success. How much longer does it take to find the real, permanent solution? For that matter, how long does it take to even dig into the problem and familiarize yourself with its background? For a business, sometimes it's just more cost-effective to accept that you don't really know what the problem is and won't figure it out in less time than the cop-out takes. Personally, I'm all in favor of actually figuring out the issue too, I just don't believe it to be appropriate in every situation.
There is a short-term calculus and a long-term calculus. Restarting usually wins in the short term. But if you double down on that strategy too much, your engineering team, and its culture writ large, will tilt increasingly towards technological mysticism.
It’s not either / or.
If you have proper observability in place then you can do your diagnosis without affecting your customers.
Plus, a successful diagnosis is also the kind of work that can have the most dramatic effect on your customers.
In a positive way.
If it's happening so rarely that killing is a viable solution, then there's no reason to troubleshoot it to begin with. If it's happening often enough to warrant troubleshooting, then your concerns are addressed.
That might work in some scenarios. If you're a "newer" company where each application is deployed onto individual nodes, you can do this.
But consider the case of older companies, where it was more common to deploy several systems, often complex ones, onto the same node. There, you will cause outages for systems x, y and z too. Maybe some of them are inter-dependent? You have to weigh the consequences and risks carefully in any situation before rebooting.
At least as I read it, this contains the assumption that that's not how you deploy your applications.
Here's a real-life example. We have a KVM server that has its storage on Ceph. It looks like KVM doesn't work well with Ceph, especially when MD (Linux software RAID) is involved: if a VM is powered off instead of shut down in an orderly way, something bad happens to the MD metadata, and when the VM is turned on again, one MD replica can be missing. This happens infrequently, and I've never been in a situation where two replicas died at the same time (which would prevent a VM from booting), but it's obviously possible.
So... more generally, the idea of replacing VMs is rather naive when it comes to storage. Replacement incurs penalties, such as RAID rebuilds. RAIDs don't have the promised resiliency during a rebuild. And, in general, rebuilds are costly because they move a lot of data and wear the hardware considerably. Worse yet, if during the rebuild you hit the same problem that caused you to start the rebuild in the first place, the whole system is a write-off.
In other words, it's a bad idea to fix problems without diagnosing them first if you want your system to be reliable. In extreme cases this can start a domino effect, where the replacement compounds the problem, and, if you're running on rented hardware, it can also be very financially damaging: there have been stories of load-balancing setups that kept spawning more and more servers to try to mitigate the problem, where the problem was, e.g., a configuration that got copied to each newly spawned server.
Y'all don't do post-mortem investigations / action items?
I get the desire to troubleshoot, but priority 0 is making the system functional for users again; literally everything else can wait. I once had to deal with an outage that required us to kill all our app servers every 20 minutes (staggered, of course) because of a memory leak, while the leak was being investigated.
What numbers went into this calculation, to get such an extreme result as concluding that getting it up again is always the first priority?
When I've tried to estimate the cost and benefit, I've been surprised to reach the opposite conclusion multiple times. We ended up essentially in the situation of "Yeah, sure, you can reproduce the outage in production. Learn as much as you possibly can and restore service after an hour."
This is in fact the reason I prefer to keep some margin in the SLO budget -- it makes it easier to allow troubleshooting an outage in the hot path, and it frontloads some of that difficult decision.
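For a rough sense of scale (illustrative numbers; the 99.9% target below is an assumption, not something from the thread): a 99.9% monthly SLO only buys about 43 minutes of downtime, so an hour of deliberate troubleshooting in production really does need that pre-existing margin.

    # back-of-the-envelope error budget for a 99.9% SLO over a 30-day month
    awk 'BEGIN { slo = 0.999; minutes = 30 * 24 * 60; printf "%.1f minutes/month\n", (1 - slo) * minutes }'
    # prints: 43.2 minutes/month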
Usually depends on the impact. If it's one of many instances behind a load balancer and was easily fixed with no obvious causes, then we move on. If it happens again, we have a known short-term fix and now we have a justified reason to devote man-hours to investigating and doing a post-mortem.
I was at a place where we had "worker" machines that would handle incoming data with fluctuating volume. If the queues got too long we would automatically spin up new worker instances and when it came time to spin down we would kill the older ones first.
You can probably see where this is going. The workers had some problem where they would bog down if left running too long. Causing the queues to back up and indirectly causing themselves to eventually be culled.
Never did figure out why they would bog down. We just ran herky jerky like this for a few years till I left. Might still be doing it for all I know.
So you just automatically replace the instances after a certain amount of runtime and your problem is gone.
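In its crudest form that's just a bounded-lifetime wrapper; a sketch, assuming a hypothetical ./worker binary and an arbitrary 6-hour cap (systemd's RuntimeMaxSec= or your orchestrator's max-lifetime setting does the same thing more cleanly):

    #!/bin/sh
    # recycle the worker on a fixed schedule so slow leaks never get a chance to accumulate
    while true; do
        timeout 6h ./worker        # ./worker is a stand-in for the real process
        echo "worker exited or hit its runtime cap; restarting" >&2
        sleep 5
    done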
Yeah, fixing a problem without understanding it has some disadvantages. It works sometimes, but the "with understanding" strategy works much more often.
Is this really a prevailing attitude now? Who cares what happened, as long as we can paper over it with some other maneuver/resources? For me it's both intellectually rewarding and skill-building to figure out what caused the problem in the first place.
I mean, I hear plenty of managers with this attitude. But I really expect better on a forum called hacker news.
One could argue that most devs these days are parts swappers with all the packages floating around.
"4:10pm the new machine still has the same performance issue"
4:20pm Turns out it was DNS
That made me laugh, thank you. Of course, it is not DNS. DNS has become the new cabling. DNS is not especially complicated, but neither is cabling. Yet during the dot-com years and those that followed, cabling caused so many of the problems that we got used to checking the cabling first. It only took a few more years to realize that it is not always the cabling; actually, failures are normally distributed.
Is it wrong to check DNS first? No, but please realize that DNS misconfiguration is not more common than other SNAFUs.
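It's cheap to rule in or out either way; a quick first pass, assuming dig is available (example.com stands in for the name that's actually failing):

    dig +short example.com              # does the configured resolver answer at all?
    dig +short example.com @1.1.1.1     # does an independent resolver agree? a mismatch points at DNS/config
    cat /etc/resolv.conf                # which resolver is the box actually using?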
Sure, but more often than not, especially in cloud scenarios, you just get a machine that is having a bad day, and it's quicker to just eject it, let the rest of the infra pick up the slack, and then debug from there. Additionally, if you've axed a machine and still have the same issue, you know it's not a machine problem, so either go look at your networking layer or at whatever configs you're using to boot your machines from…
... so the nice thing about the cloud is that you can work around cloud-specific issues?
That's actually amazing, a reproducible problem is a 90% solved problem!
Killing the machine might destroy evidence. It might be the case that you have everything logged externally, but more often than not something is missing.
Take it out of the pool then.
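Concretely, on AWS for example (the ARN and IDs below are placeholders), pulling a box out of rotation while keeping the evidence around looks roughly like:

    # stop sending it traffic, but leave the instance running for inspection
    aws elbv2 deregister-targets \
        --target-group-arn arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/app/PLACEHOLDER \
        --targets Id=i-0123456789abcdef0
    # optionally snapshot its volume so the state survives even if someone recycles it later
    aws ec2 create-snapshot \
        --volume-id vol-0123456789abcdef0 \
        --description "incident evidence - do not delete"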
You're describing one of the benefits of virtualised cattle, not necessarily or exclusively 'cloud'.