Show HN: Xcapture-BPF – like Linux top, but with Xray vision

jamesy0ung
19 replies
20h46m

I’ve never used eBPF; does anyone have some good resources for learning it?

mgaunard
8 replies
20h36m

It lets you hook into various points in the kernel; ultimately you need to learn how the Linux kernel is structured to make the most of it.

Unlike a module, it can only really read data, not modify data structures, so it's nice for things like tracing kernel events.

The XDP subsystem in particular is designed to let you apply filters to network data before it makes it to the network stack, but it still doesn't give you the same level of control or performance as DPDK, since you still need the data to go to the kernel.
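
For a concrete taste, here's a minimal BCC/XDP sketch (my rough illustration, assuming BCC is installed, root privileges, and a placeholder interface name) that drops ICMP in the driver before the kernel stack ever builds an skbuff:

    #!/usr/bin/env python3
    # Minimal XDP sketch: filter packets before they reach the kernel network stack.
    # Assumptions: BCC installed, run as root, "eth0" is a placeholder interface.
    import time
    from bcc import BPF

    prog = r"""
    #define KBUILD_MODNAME "xdp_filter_sketch"
    #include <uapi/linux/bpf.h>
    #include <linux/in.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>

    int xdp_drop_icmp(struct xdp_md *ctx) {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_PASS;

        // Drop ICMP right here; everything else continues into the kernel stack.
        return ip->protocol == IPPROTO_ICMP ? XDP_DROP : XDP_PASS;
    }
    """

    device = "eth0"                                # placeholder interface name
    b = BPF(text=prog)
    fn = b.load_func("xdp_drop_icmp", BPF.XDP)
    b.attach_xdp(device, fn, 0)
    print("Dropping ICMP on %s, Ctrl-C to detach" % device)
    try:
        time.sleep(1e9)
    except KeyboardInterrupt:
        pass
    finally:
        b.remove_xdp(device, 0)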

tanelpoder
4 replies
20h29m

Yep (the 0x.tools author here). If you look at my code, you'll see that I'm not a good developer :-) But I have a decent understanding of Linux kernel flow and kernel/app interaction dynamics, thanks to many years of troubleshooting large (Oracle) database workloads. So I knew exactly what I wanted to measure and how; I just had to learn the eBPF parts. That's why I picked BCC instead of libbpf, as I was somewhat familiar with it already, but a fully dynamic and "self-updating" libbpf loading approach is the goal for v3 (help appreciated!)

tptacek
1 replies
18h18m

I was going to ask "why BCC" (BCC is super clunky) but you're way ahead of us. This is great work, thanks for posting it.

tanelpoder
0 replies
18h1m

Yeah, I already see limitations. The latest one came up yesterday when I installed earlier Ubuntu versions to see how far back this can go - even Ubuntu 22.04 didn't work out of the box; I ended up with a BCC/kernel header mismatch issue [1], although the kernel itself supported everything needed. A workaround was to download & compile the latest BCC yourself, but I don't want to go there, as the customers/systems I work on wouldn't go there anyway.

But libbpf with CO-RE should solve these issues, as I understand it: as long as the kernel supports what you need, the CO-RE binary will work.

This raises another issue for me, though: it's easier (not easy, but easier) for enterprises to download and run a single Python file plus a single C source file (with <500 lines of code to review) than a compiled CO-RE binary. But my long-term plan/hope is that I (we) get the Red Hats and AWSes of this world to just provide the eventual mature release as a standard package.

[1] https://github.com/iovisor/bcc/issues/3993

mgaunard
1 replies
20h16m

Myself, I've only built simple things, like tracing sched_switch events for certain threads and killing the process if they happen (specifically designed as a safety net for pinned threads).
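
A rough sketch of that idea (not the actual code; it assumes BCC, root, a kernel new enough to have the bpf_send_signal() helper, and the TID below is a placeholder):

    #!/usr/bin/env python3
    # Sketch: SIGKILL a supposedly-pinned thread the moment it is switched out.
    # Assumptions: BCC installed, root, kernel >= 5.3 (bpf_send_signal helper),
    # TARGET_TID is a placeholder for the pinned thread's TID.
    import time
    from bcc import BPF

    TARGET_TID = 12345   # hypothetical pinned thread

    prog = r"""
    TRACEPOINT_PROBE(sched, sched_switch) {
        // sched_switch fires in the context of the task being switched out,
        // so bpf_send_signal() hits exactly the thread that just lost the CPU.
        if (args->prev_pid == TARGET_TID)
            bpf_send_signal(9);          // SIGKILL
        return 0;
    }
    """

    b = BPF(text=prog, cflags=["-DTARGET_TID=%d" % TARGET_TID])
    print("Watching TID %d; it will be killed on its first switch off-CPU" % TARGET_TID)
    while True:
        time.sleep(1)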

tanelpoder
0 replies
20h7m

Same here, until now. I built the earlier xcapture v1 (also in the repo) about 5 years ago; it just samples various /proc/PID/task/TID pseudofiles regularly. That still gets you pretty far with the thread-level activity measurement approach, especially when combined with always-on, low-frequency on-CPU sampling with perf.
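
The core of that v1 approach is nothing more exotic than this kind of loop (a toy sketch, not the actual xcapture v1 code):

    #!/usr/bin/env python3
    # Toy sketch of the /proc-sampling idea: once per second, walk
    # /proc/<pid>/task/<tid>/stat and print every thread that is not in
    # interruptible sleep ('S'), i.e. threads that are running, in D-state, etc.
    import glob, time

    def sample():
        for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
            try:
                with open(path) as f:
                    data = f.read()
            except OSError:
                continue                          # thread exited under us
            comm  = data[data.find("(") + 1 : data.rfind(")")]
            state = data[data.rfind(")") + 2]     # single-letter state after ") "
            tid   = path.split("/")[4]
            if state != "S":
                print("%s tid=%s state=%s comm=%s" %
                      (time.strftime("%H:%M:%S"), tid, state, comm))

    while True:
        sample()
        time.sleep(1)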

tptacek
2 replies
18h44m

XDP, in its intended configuration, passes pointers to packets still on the driver DMA rings (or whatever) directly to BPF code, which can modify packets and forward them to other devices, bypassing the kernel stack completely. You can XDP_PASS a packet if you'd like it to hit the kernel, which creates an skbuff and bounces it through all the kernel's network stack code, but the idea is that you don't want to do that; if you do, just use TC BPF, which is equivalently powerful and more flexible.

mgaunard
1 replies
6h55m

Yes, for XDP there is a dedicated API, but the other hooks, like tracepoints, are all designed to give you read-only access.

The whole CO-RE thing is about having a kernel-version-agnostic way of reading fields from kernel data structures.

tptacek
0 replies
1h30m

Right, I'm just pushing back on the DPDK thing.

tanelpoder
6 replies
20h42m

Brendan Gregg's site (and book) is probably the best starting point. He was involved in DTrace work & rollout 20 years ago when at Sun, and was/is instrumental in pushing eBPF in Linux even further than DTrace ever went:

https://brendangregg.com/ebpf.html

bcantrill
5 replies
18h9m

Just a quick clarification: while Brendan was certainly an active DTrace user and evangelist, he wasn't involved in the development of DTrace itself -- or its rollout. (Brendan came to Sun in 2006; DTrace was released in 2003.) As for eBPF with respect to DTrace, I would say that they are different systems with different goals and approaches rather than one eclipsing the other. (There are certainly many things that DTrace can do that eBPF/BCC cannot, some of the details of which we elaborated on in our 20th anniversary of DTrace's initial integration.[0])

Edit: We actually went into much more specific detail on eBPF/BCC in contrast to DTrace a few weeks after the 20th anniversary podcast.[1]

[0] https://www.youtube.com/watch?v=IeUFzBBRilM

[1] https://www.youtube.com/watch?v=mqvVmYhclAg#t=12m7s

anonfordays
2 replies
12h58m

> As for eBPF with respect to DTrace, I would say that they are different systems with different goals and approaches

For sure. Different systems, different times.

> rather than one eclipsing the other.

It does seem that DTrace has been eclipsed though, at least in Linux (which runs the vast majority of the world's compute). Is there a reason to use DTrace over eBPF for tracing and observability in Linux?

> There are certainly many things that DTrace can do that eBPF/BCC cannot

This may be true, but that gap is closing. There are certainly many things that eBPF can do that DTrace cannot, like Cilium.

tanelpoder
1 replies
12h52m

Perhaps familiarity with the syntax of DTrace, if you're coming from a Solaris-heavy enterprise background. But then again, too many years have passed since Solaris was a major mainstream platform. Oracle ships and supports DTrace on (Oracle) Linux by the way, but DTrace 2.0 on Linux is a scripting frontend that gets compiled to eBPF under the hood.

Back when I tried to build xcapture with DTrace, I could launch the script and use something like /pid$oracle::func:entry/, but IIRC the probe was attached only to the processes that already existed and not to any new ones started after loading the DTrace probes. Maybe I should have used some lower-level APIs or something - but eBPF on Linux automatically handles both existing and new processes.
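
For example, a BCC uprobe is attached to the binary itself (its inode) rather than to a PID, so processes that start after loading the probe get traced too - a toy sketch with a placeholder binary path and symbol:

    #!/usr/bin/env python3
    # Sketch: the probe lives on the binary, not on a process, so threads of
    # processes started later are traced as well.
    # The path and symbol below are hypothetical placeholders.
    from bcc import BPF

    prog = r"""
    int trace_entry(struct pt_regs *ctx) {
        u32 pid = bpf_get_current_pid_tgid() >> 32;
        bpf_trace_printk("pid %d entered the probed function\n", pid);
        return 0;
    }
    """

    b = BPF(text=prog)
    b.attach_uprobe(name="/usr/local/bin/myapp",   # hypothetical binary
                    sym="interesting_func",        # hypothetical symbol
                    fn_name="trace_entry")
    b.trace_print()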

bch
0 replies
2h13m

> eBPF on Linux automatically handles both existing and new processes

Without knowing your particular case, DTrace does too - it’d certainly be tricky to use if you were trying to debug software that “instantly crashes on startup” if it couldn’t do that. “execname” (not “pid”) is where I’d look, or perhaps that part of the predicate is skippable; regardless, it should be possible.

tanelpoder
0 replies
17h54m

Thanks, yes, I was more or less aware of that (I'd been using DTrace since the Solaris 10 beta in 2004 or 2003?)... By rollout I really meant "getting the word out there"... that's half the battle in my experience (that's why this post is here! :-)

What I loved about DTrace was that once it was out, even in beta, it was pretty complete and worked - all the DTrace ports that I've tried, including on Windows (!) a few years ago, were very limited or had some showstopper issues. I guess eBPF was like that too some years ago, but by now it's pretty sweet even for more regular consumers who don't keep track of its development.

Edit: Oh, wasn't aware of the timeline, I may have some dates (years) wrong in my memory

rascul
0 replies
20h6m

You might find some interesting stuff here:

https://ebpf.io/

lathiat
0 replies
17h15m

I'll toot my own horn here, but there are plenty of presentations about it; Brendan Gregg's are usually pretty great.

"bpftrace recipes: 5 real problems solved" - Trent Lloyd (Everything Open 2023) https://www.youtube.com/watch?v=ZDTfcrp9pJI

__turbobrew__
7 replies
19h4m

I use BCC tools weekly to debug production issues. Recently I found we were massively pressuring page caches due to having a large number of loopback devices with their own page cache. Enabling direct io on the loopback devices fixed the issue.

eBPF is really a superpower, it lets you do things which are incomprehensible if you don’t know about it.

tptacek
5 replies
18h18m

I'd love to hear more of this debugging story!

__turbobrew__
4 replies
15h39m

Containers are offered block storage by creating a loopback device with a backing file on the kubelet’s file system. We noticed that on some very heavily utilized nodes, iowait was using 60% of all the available cores on the node.

I first confirmed that the NVMe drives were healthy according to SMART, then worked up the stack and used BCC tools to look at block I/O latency. Block I/O latency was quite low for the NVMe drives (microseconds) but was hundreds of milliseconds for the loopback block devices.

This led me to believe that something was wrong with the loopback devices and not the underlying NVMe drives. I used cachestat/cachetop and found that the page cache miss rate was very high and that we were thrashing the page cache, constantly paging data in and out. From there I inspected the loopback devices using losetup and found that direct I/O was disabled and the sector size of the loopback device did not match the sector size of the backing filesystem.

I modified the loopback devices to use the same sector size as the block size of the underlying file system and enabled direct I/O. Instantly, the majority of the page cache was freed, iowait went way down, and I/O throughput went way up.

Without BCC tools I would have never been able to figure this out.

Double caching loopback devices is quite the footgun.

Another interesting thing we hit is that our version of losetup would happily fail to enable direct I/O but still give you a loopback device; this has since been fixed: https://github.com/util-linux/util-linux/commit/d53346ed082d...
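
If anyone wants to check their own boxes, a quick sketch (assuming the loop driver exposes its usual sysfs attributes, backing_file and dio):

    #!/usr/bin/env python3
    # Sketch: flag loop devices that have direct I/O disabled, i.e. devices
    # whose backing file is cached twice (once for the loop device, once for
    # the backing filesystem). Assumes /sys/block/loopN/loop/{backing_file,dio}.
    import glob, os

    for loop in sorted(glob.glob("/sys/block/loop*")):
        backing_path = os.path.join(loop, "loop", "backing_file")
        if not os.path.exists(backing_path):
            continue                              # loop device not bound to a file
        with open(backing_path) as f:
            backing_file = f.read().strip()
        with open(os.path.join(loop, "loop", "dio")) as f:
            dio = f.read().strip()                # "1" means direct I/O is on
        note = "" if dio == "1" else "   <-- buffered: double caching"
        print("%-8s dio=%s  %s%s" % (os.path.basename(loop), dio, backing_file, note))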

FooBarWidget
1 replies
8h6m

Which container runtime are you using? As far as I know both Docker and containerd use overlay filesystems instead of loopback devices.

And how did you know that tweaking the sector size to equal the underlying filesystem's block size would prevent double caching? Where can one get this sort of knowledge?

__turbobrew__
0 replies
3h36m

The loopback devices came from a CSI which creates a backing file on the kubelet’s filesystem and mounts it into the container as a block device. We use containerd.

I knew that enabling direct I/O would most likely disable double caching, because that is literally the point of enabling direct I/O on a loopback device. Initially I just tried enabling direct I/O on the loopback devices, but that failed with a cryptic “invalid argument” error. After some more research I found that, in some cases, direct I/O needs the sector size to match the filesystem’s block size to work.

M_bara
0 replies
1h1m

We had something similar about 10 years ago where I worked. Customer instances were backed via loopback devices to local disks. We didn’t think of this - facepalm - on the loopback devices. What we ended up doing was writing a small daemon that used posix_fadvise to tell the kernel to skip the page cache… your solution is way simpler and more elegant… hats off to you

jyxent
0 replies
17h0m

I've been learning BCC / bpftrace recently to debug a memory leak issue on a customer's system, and it has been super useful.

metroholografix
3 replies
17h10m

Folks who find this useful might also be interested in otel-profiling-agent [1], which Elastic recently open-sourced and donated to OpenTelemetry. It's a low-overhead, eBPF-based continuous profiler which, besides native code, can unwind stacks from other widely used runtimes (HotSpot, V8, Python, .NET, Ruby, Perl, PHP).

[1] https://github.com/elastic/otel-profiling-agent

3abiton
1 replies
11h54m

I am trying to wrap my head around it; it's still unclear to me what it does.

zikohh
0 replies
28m

That's like most of Grafana's documentation

malkia
1 replies
18h36m

Relatively speaking, how expensive is it to capture the call stack when doing sample profiling?

With Intel CET there should be a way to capture the shadow stack, which really just contains return addresses, but I'm wondering if that's going to be used...

tanelpoder
0 replies
18h16m

The on-CPU sample profiling is not a big deal for my use cases, as I don't need the "perf" sampling to happen at 10 kHz or anything (more like 1-10 Hz, but always on).

But the sched_switch tracepoint is the hottest event: without stack sampling it's 200-500 ns per event (on my Xeon 63xx CPUs), depending on what data is collected. I use #ifdefs to compile in only the fields that are actually used (smaller thread_state struct, fewer branches and instructions to decode & cache). Surprisingly, collecting the kernel stack adds more overhead than the user stack (kstack takes an event from, say, 400 ns to 3200 ns, while ustack jumps to around 2800 ns per event).

I have done almost zero optimization (and I figure using libbpf/BTF/CO-RE will help too). But I'm ok with these numbers for most of my workloads of interest, and since eBPF programs are not cast in stone, I can do further reductions, like actually sampling stacks in the sched_switch probe only on every 10th occurrence or something (rough sketch below).

So in the worst case, this full-visibility approach might not be usable as always-on instrumentation for some workloads (like some redis/memcached/mysql lookups doing 10M context switches/s on a big server), but even with such workloads, a temporary increase in instrumentation overhead might be ok when there are known recurring problems to troubleshoot.
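
For illustration, a rough sketch of that "sample stacks only on every Nth switch" pattern (not xcapture itself, just the idea, assuming BCC and root):

    #!/usr/bin/env python3
    # Rough sketch: keep the sched_switch probe cheap by walking the kernel
    # stack only on every Nth event (per CPU), then dump the hottest stacks.
    from time import sleep
    from bcc import BPF

    prog = r"""
    #define SAMPLE_EVERY 10

    BPF_PERCPU_ARRAY(counter, u64, 1);
    BPF_STACK_TRACE(stacks, 16384);
    BPF_HASH(stack_hits, int, u64);

    TRACEPOINT_PROBE(sched, sched_switch) {
        int zero = 0;
        u64 *cnt = counter.lookup(&zero);
        if (!cnt)
            return 0;
        (*cnt)++;
        if (*cnt % SAMPLE_EVERY != 0)
            return 0;                             // cheap path: no stack walk

        int stack_id = stacks.get_stackid(args, 0);   // kernel stack at the switch
        if (stack_id >= 0) {
            u64 one = 1;
            u64 *hits = stack_hits.lookup_or_try_init(&stack_id, &one);
            if (hits)
                (*hits)++;
        }
        return 0;
    }
    """

    b = BPF(text=prog)
    print("Sampling kernel stacks on every 10th sched_switch for 10s...")
    sleep(10)

    stacks = b["stacks"]
    top = sorted(b["stack_hits"].items(), key=lambda kv: -kv[1].value)[:5]
    for stack_id, hits in top:
        print("\n%d samples:" % hits.value)
        for addr in stacks.walk(stack_id.value):
            print("  %s" % b.ksym(addr).decode("utf-8", "replace"))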