HN comments for: A Linux kernel syscall implementation tracker

Neywiny

15 replies

23h29m

2024-07-20 18:59:00 UTC

This is good. Usually I end up finding one I think Google made for Chromebooks. Or even worse, resorting to the man page.

Retr0id

14 replies

22h55m

2024-07-20 19:32:57 UTC

Where can I find syscall numbers in the man pages?

Neywiny

8 replies

21h13m

2024-07-20 21:15:18 UTC

`man syscall` and `man syscalls`. I'm on mobile and remoted into my WSL, and at first blush neither lists the actual table, but I have seen it. It's probably in one of the related pages. This is why I like the site though. It's all just there.

tanelpoder

7 replies

21h1m

2024-07-20 21:26:44 UTC

The question was about syscall numbers. The man pages don't show the numbers.

The internal syscall numbers are defined in syscall_64.tbl in kernel source and you can parse the (libc-used) syscall numbers from various /usr/include/asm/syscall.h files, but these files don't necessarily contain all the available syscalls in your currently running kernel.

You can read your kernel's/platform's currently available syscalls from the relevant /sys/kernel/debug/tracing tracepoint files (I posted a link to my script in another comment here). That way you'll see all currently available system calls and their arguments. However, tracefs doesn't show the internal syscall numbers (syscall table slot numbers), but rather the new syscall ID.

The internal syscall numbering can change when you switch/build a new kernel and syscalls have different numbers across platforms. Syscall 0 is "read" on x86_64, but is "io_setup" on aarch64, for example. The new syscall ID aims to provide stable numbering with no conflicts and overlaps across platforms and kernel versions, as I understand.
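
As a rough illustration of the table file mentioned above, here is a minimal Python sketch that reads rows in the whitespace-separated `<number> <abi> <name> <entry point>` layout used by syscall_64.tbl. The sample rows are a small hand-copied x86_64 excerpt, not a full table, and `parse_tbl` is a hypothetical helper name, not something from the kernel tree.

```python
# Hand-copied excerpt in the syscall_64.tbl layout (x86_64), for illustration only.
SAMPLE_TBL = """\
# <number> <abi> <name> <entry point>
0 common read sys_read
1 common write sys_write
2 common open sys_open
3 common close sys_close
"""


def parse_tbl(text):
    """Return {name: number} for every non-comment, non-empty row.

    Rows for different ABIs that share a name would overwrite each other;
    a real tool would key on (abi, name) instead.
    """
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # fields: number, abi, name, and optionally the entry point(s)
        table[fields[2]] = int(fields[0])
    return table


numbers = parse_tbl(SAMPLE_TBL)
print(numbers["read"])
```

Running the same parser over another architecture's table would give different numbers for the same names, which is the point made above.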

mebeim

5 replies

20h52m

2024-07-20 21:36:01 UTC

Fun fact, some weird syscalls don't even appear under /sys/kernel/debug/tracing because they lack ftrace metadata. It was pretty fun (read: a nightmare) to deal with some of those in my tool. You can grep -R -F "NOT found in ftrace metadata" in the logs in my db (https://github.com/mebeim/linux-syscalls/tree/master/db) to see which ones.

The most interesting one, which doesn't even appear in my logs because I had to hardcode it due to its esoteric definition, is fast_endian_switch for PPC64 (https://elixir.bootlin.com/linux/v6.10/source/arch/powerpc/k...).

tanelpoder

4 replies

20h45m

2024-07-20 21:43:01 UTC

Good to know, I suspected that this might be the case, but never got to confirm this. I guess one could set up a test comparing the syscalls listed in syscall_64.tbl (or syscall table read from kernel memory) with the syscalls listed under /sys/kernel/debug/tracing/events/syscalls
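
A minimal Python sketch of such a comparison, assuming the tracefs event directories follow the `sys_enter_<name>` / `sys_exit_<name>` naming. The helper names and the sample data are made up for illustration; on a live system the directory list would come from listing /sys/kernel/debug/tracing/events/syscalls (usually as root).

```python
def tracefs_syscall_names(event_dirs):
    """Extract syscall names from tracefs event directory names (sys_enter_<name>)."""
    prefix = "sys_enter_"
    return {d[len(prefix):] for d in event_dirs if d.startswith(prefix)}


def missing_from_tracefs(tbl_names, event_dirs):
    """Names listed in the table that have no matching sys_enter_* event."""
    return sorted(set(tbl_names) - tracefs_syscall_names(event_dirs))


# Illustrative sample data, not taken from a real kernel.
sample_dirs = ["sys_enter_read", "sys_exit_read", "sys_enter_write", "sys_exit_write"]
sample_tbl = ["read", "write", "rt_sigreturn"]

print(missing_from_tracefs(sample_tbl, sample_dirs))
```

The name matching here is naive: tracefs event names derive from the handler function, which does not always match the name column of the table, so a real comparison would need a mapping step.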

mebeim

3 replies

20h40m

2024-07-20 21:48:41 UTC

Nope, not even that, because believe it or not, sometimes not even the .tbl files have all of them :'). In fact, the only arch where IMHO syscalls make sense and are organized in a sane way is arm64 that doesn't even have a .tbl file. And not even the table in kernel memory is enough sometimes! Some are special handlers in syscall entry code (like the one I mentioned above). It's just a mess, hence why I sort of gave up at some point and for some "esoteric" syscall I just hardcode them.

derefr

2 replies

19h2m

2024-07-20 23:26:03 UTC

> It's just a mess, hence why I sort of gave up at some point and for some "esoteric" syscall I just hardcode them.

Presuming you don't want to keep doing this forever, but would rather do insane amounts of up-front work if it would enable you to never have to touch this again:

1. Have you considered writing some code that takes a configured + built kernel source tree; finds the intermediate build artifacts pertaining to the code unit that contains the syscall handler; and parses those? And then taking the resulting IR data-structure / AST / whatever, and doing some symbolic interpretation of it — to enable you to essentially do an xpath-like expression match on "does something specific with a concrete syscall number that isn't already in the known set for the arch"? AFAICT you could generate your own syscall table from that, and it would be exhaustive.

2. Have you considered dropping a little bit of driver-program code into the kernel source tree, that just "does syscall handling according to the passed-in parameters", i.e. where the artifact built from compiling this file would be an EFI-app pseudo-unikernel that naively pretends all kernel services were already initialized (they weren't), would do one syscall operation by calling directly into the syscall handler, and then would immediately halt afterward; and then feeding the resulting "executable" to https://github.com/google/AFL ?

mebeim

1 replies

7h1m

2024-07-21 11:27:16 UTC

Yeah the few "esoteric" syscalls that I hardcode in my tool are historical ones. That's pretty much the only reason why I bothered hardcoding them. I don't assume any new syscall will ever be implemented like that nowadays. Such insane implementations would be rejected unless there is a very specific compelling reason.

> finds the intermediate build artifacts pertaining to the code unit that contains the syscall handler

Hmm I think this is unneeded, vmlinux already has all the code. Also things move around too much across kernel versions and archs so can't easily pinpoint which object files to choose. Additionally, you would need an entire built kernel source tree, which is a lot more than simply a built vmlinux plus an optional non-built kernel source dir (that is what I use right now). Just as an example: currently I have some 600 kernel images with debug info that I keep for reference, which requires around 76 Gigabytes of space on my disk. Having 600 built kernel trees would require a lot more space, in the order of Terabytes.

> taking the resulting IR data-structure / AST / whatever, and doing some symbolic interpretation of it

I have been thinking about this a lot. I do a simplified version of this for x86 >= v6.9 because the syscall table was removed and turned into a giant switch case, which I symbolically emulate to extract syscall numbers, but that's pretty simple and definitely not an exhaustive analysis (some other stuff could be happening before reaching the handler). The problem is that this kind of solution is very hard to implement and I think would be way too slow on a general case. There also aren't even decent symbolic execution engines to do this for some archs. You are right when you say "insane amounts of up-front work" - that is definitely too much for me for a hobby project like this :').

The first main problem however is that all of this starts from the assumption that you already have built a kernel with all the syscalls available. This is not the case unless you meticulously configure it accordingly, which is not so simple and requires constant manual (sigh) updates to the build configuration each kernel release. There isn't a way to e.g. pretend that "all kernel services were already initialized" as you say in point #2. If a kernel is built w/o a certain syscall, the code will simply not be there. Kernel configuration remains a problem also for your point #1. The only real solution I see would be submitting a kernel patch to add a target in the root Makefile that enables all syscalls with their related configs, and hope kernel devs like it (doubt it).

derefr

0 replies

1h2m

2024-07-21 17:26:20 UTC

> Hmm I think this is unneeded, vmlinux already has all the code.

Yeah, I was just thinking about it as a way to reduce the scope of the "preload" step of symbolic interpretation, for the case where you want to work with semi-structured IR (GIMPLE) rather than machine code.

My assumption was that by the time you're down to machine code, you'll still be able to recover the key column of the table — the syscall numbers themselves — but the rest of the data you want to show in the table won't exist any more, having existed only as things like identifier names. So you'd want to back up at least one or two steps.

> This is not the case unless you meticulously configure it accordingly, which is not so simple and requires constant manual (sigh) updates to the build configuration each kernel release.

I was less assuming the possibility of one kernel that has all syscalls, and more assuming that you could build O(N) "probe kernels", one per uarch.

I think the concept of there being "optional syscalls" that only appear if you configure in added capabilities beyond the uarch, didn't even occur to me.

How does that even work, libc-wise? I had assumed that the userland-kernel-ABI expectation was such that the set of syscalls possible to call for a given uarch is static, but with some just being stubbed to always return an error if the given capability isn't in the kernel. But I guess, if the "return an error like a stub" logic is the same as the "this syscall isn't implemented" logic, then there needn't be any concrete code in the kernel that calls out those syscall numbers as existing...

If so, maybe consider that a bug? Submit a patch to have an arch's stubbed optional syscalls return a different error than for syscalls that don't exist for that arch, thus forcing such syscalls to be somehow documented in the kernel even when stubbed?

> There isn't a way to e.g. pretend that "all kernel services were already initialized" as you say in point #2.

To be clear, I wasn't talking about compile-time code inclusion; I was talking about runtime, when using the strategy I outlined to compile a subset of the Linux kernel as a "library kernel" / exokernel. The kernel does a lot of stuff on boot — brings up hardware, starts daemons, etc — and you'd want to skip including any of that, if you wanted to throw the code into a fuzzer, because that'd all distract the fuzzer from your goal of fuzzing the syscall handler. So you'd want the executable you built to just call the syscall handler as if it was running in the context of a bootstrapped-and-running kernel — statically declaring all the same static globals, but just never calling the code to initialize any of it. So you'd likely get a program that always crashes with a null dereference — but that doesn't matter, since your goal is to discover through fuzzing the conjunction of value constraints that overdetermines the control-flow to reach one null dereference vs another.

Neywiny

0 replies

19h4m

2024-07-20 23:24:33 UTC

I understood the question, if you read my comment you'll note I acknowledged that. I think I was thinking of the signal numbers, which I was last looking for around the same time and had a similar man page hunt

chad1n

3 replies

22h30m

2024-07-20 19:58:08 UTC

You can't, since man pages present the libc implementation. That's more useful if you work with syscalls that use structures, since it shows you what to search for in libc to copy into another language.

thayne

1 replies

21h28m

2024-07-20 21:00:04 UTC

> man pages present the libc implementation

No, the section 3 man pages for "syscalls" are the libc wrapper functions. But section 2 is the syscalls themselves, and includes man pages for syscalls that don't have wrapper functions.

I don't think those man pages include the numbers though, since those numbers are architecture dependent.

mebeim

0 replies

21h26m

2024-07-20 21:01:44 UTC

Not really. Even though section 2 should be for "syscalls", it is really only for libc syscall wrappers. Very few pages in section 2 document the raw syscalls, and those that do say so specifically at the beginning. OTOH, in section 3 you won't find any syscall at all (wrapper or not).

matheusmoreira

0 replies

12h40m

2024-07-21 05:48:42 UTC

Always wondered why there's non-Linux kernel information in the Linux man pages.

I mentioned this to Greg Kroah-Hartman when he did his second AMA on reddit, hoping he would comment on it.

https://old.reddit.com/r/linux/comments/fx5e4v/im_greg_kroah...

> So we rely on different libc projects to provide this, and work with them when needed.

> This ends up being more flexible as there are different needs from a libc, and for us to "pick one" wouldn't always be fair.

I think putting libc information in the Linux man page is effectively "picking one". The init section of the manual also contains systemd information, giving the impression it's the "official" init. I expected to read about the ways the Linux kernel treats PID 1 specially but got the systemd manual instead.

matheusmoreira

0 replies

12h30m

2024-07-21 05:58:22 UTC

The manuals have the following pages on system calls:

https://www.man7.org/linux/man-pages/man2/syscall.2.html

https://www.man7.org/linux/man-pages/man2/syscalls.2.html

There are also the manual pages for each individual system call.

The syscall numbers unfortunately cannot be found in the manual. They are found in published tables on the internet.

They are defined in numerous locations in the Linux kernel tree. They are included via the linux/unistd.h header which in turn includes the appropriate asm-generic/ and asm/ headers.

https://github.com/torvalds/linux/blob/master/include/uapi/a...

https://github.com/torvalds/linux/blob/master/tools/arch/x86...

https://github.com/torvalds/linux/blob/master/tools/arch/arm...

There's quite a bit of complexity here. The system call numbers are stable for each architecture but may differ between architectures. Some architectures have multiple historical versions of the same system call which are maintained for backwards compatibility, others have just the latest version of the relevant system call with the version number removed.

I assume this complexity is the reason why this information is not typically included. People expect you to rely on the libc which abstracts all this.

foresto

5 replies

22h57m

2024-07-20 19:31:05 UTC

That's handy.

I wonder why it displays without javascript in chromium, but fails to do so in firefox. If the author is here, could that be fixed?

Edit:

Restarting chromium and trying again yields the same behavior as firefox. I wonder if the javascript somehow slipped past umatrix & ublock origin on my first try. Given that I launched chromium by dragging the link onto its icon the first time, perhaps the script-blocking extensions weren't fully loaded?

Testing again several more times, that does seem likely. I can reproduce it intermittently by dragging the URL onto my chromium shortcut if chromium isn't already running.

Edit 2: Sure enough:

https://github.com/gorhill/uBlock/issues/1913

https://github.com/gorhill/uBlock/issues/1327

rasz

1 replies

20h49m

2024-07-20 21:39:43 UTC

> slipped past umatrix & ublock origin on my first try. Given that I launched chromium by dragging the link onto its icon the first time, perhaps the script-blocking extensions weren't fully loaded?

Yes. Chrome will actually pause extension loading to deliver that first rendered picture a fraction of a second faster (arguable; probably slower in the end, considering extensions load from disk). I think it was direct uBO sabotage.

TLDR: First website loaded by starting browser with a link or from last session has almost 100% chance of bypassing uBlockOrigin. Chrome and Chromium based browsers are not User Agents, they are Google Agents.

vdfs

0 replies

18h59m

2024-07-20 23:29:25 UTC

It also happens when opening a link in Incognito mode

drtgh

1 replies

21h56m

2024-07-20 20:32:13 UTC

I'm not the author.

The HTML source code shows the content of the table is not served within the page, but is added on page load through JavaScript by formatting a requested JSON file. I changed the browser's user-agent to Chrome and the same page was served (though the link to git shows it's a static page). My guess is that maybe Chromium is not disabling JavaScript.

PS: I'm a Firefox user too. I always browse with JavaScript disabled by uMatrix (in addition to uBlockOrigin); I only enable it when the website deserves it, and I almost always leave third-party JS loads disabled.

mebeim

0 replies

21h33m

2024-07-20 20:55:24 UTC

I'm the author and I can confirm. The website will not work with JS disabled simply because it's a static HTML "skeleton" page loading JSON tables with JS and populating a <table> element. If it's working then it means you must have JS enabled. I don't plan to add support for browsers without JS, but the JSON tables have all the information you need anyway, and those are just static files (e.g. https://syscalls.mebeim.net/db/x86/64/x64/latest/table.json).

o11c

0 replies

21h34m

2024-07-20 20:54:08 UTC

Chrome breaks security addons by design, in the name of "performance". It's purely a coincidence that Chrome is made by a major ad company.

xelxebar

4 replies

16h26m

2024-07-21 02:01:56 UTC

Okay, this is super cool. Thanks for sharing.

In a similar vein, jart's Cosmopolitan libc has a really fun collection of tables that compare various constants across platforms, e.g. syscalls, syscall flags, error numbers, etc. It includes (variants of) Linux, XNU, NT, and the BSDs.

https://github.com/jart/cosmopolitan/blob/master/libc/sysv/c...

On the off chance you haven't heard of Cosmopolitan yet, I hope you find the discovery as much fun as I have.

saagarjha

1 replies

13h46m

2024-07-21 04:41:51 UTC

I am curious what the difference is supposed to be between "XNU's Not UNIX!" and "MacOS (Arm64)".

pcwalton

0 replies

5h53m

2024-07-21 12:34:57 UTC

I'm guessing it's just x86-64 vs. AArch64. There are two columns, one marked "(Aarch64)", for Linux too.

I would imagine the "MacOS" bit is there to emphasize that the values haven't been verified on iOS.

runlevel1

0 replies

11h3m

2024-07-21 07:24:47 UTC

This is a thing of beauty. I used to make spreadsheets like this ages ago when I was working across Linux and Solaris, but they were nowhere near as thorough as this.

birktj

0 replies

5h51m

2024-07-21 12:37:40 UTC

GNU/Systemd is pretty hilarious

nubinetwork

4 replies

6h56m

2024-07-21 11:32:18 UTC

Can you make one for kernel exports and list whether they are GPL-only or not?

phoronixrly

1 replies

6h37m

2024-07-21 11:51:25 UTC

Why would you care about that?

nubinetwork

0 replies

5h44m

2024-07-21 12:44:22 UTC

Third-party driver development

mebeim

1 replies

6h15m

2024-07-21 12:13:03 UTC

That is just a simple `grep -R EXPORT_SYMBOL` or `grep -R EXPORT_SYMBOL_GPL`, isn't it? A table for that wouldn't have much value.
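
For anyone who wants the split in one pass, a minimal Python sketch of that grep, assuming exports are written as plain `EXPORT_SYMBOL(name)` or `EXPORT_SYMBOL_GPL(name)`. Other variants (such as the `_NS` namespace macros) are deliberately ignored here, and the symbol names in the sample are hypothetical.

```python
import re

# Matches EXPORT_SYMBOL(name) and EXPORT_SYMBOL_GPL(name) only.
EXPORT_RE = re.compile(r"\b(EXPORT_SYMBOL(?:_GPL)?)\s*\(\s*(\w+)\s*\)")


def classify_exports(source_text):
    """Map each exported symbol name to True if it is GPL-only, else False."""
    return {
        name: macro == "EXPORT_SYMBOL_GPL"
        for macro, name in EXPORT_RE.findall(source_text)
    }


# Hypothetical source fragment for illustration.
sample = """
EXPORT_SYMBOL(foo_register);
EXPORT_SYMBOL_GPL(foo_debug_dump);
"""

print(classify_exports(sample))
```

Note that a bare `grep -R EXPORT_SYMBOL` also matches the `_GPL` lines, so the two greps above overlap; the regex keeps the macro name to tell them apart.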

nubinetwork

0 replies

5h44m

2024-07-21 12:43:59 UTC

> A table for that wouldn't have much value

Just because you can search the source, doesn't mean this wouldn't come in handy to someone some day.

netr0ute

3 replies

23h4m

2024-07-20 19:24:17 UTC

Missing RISC-V

stevefolta

0 replies

22h2m

2024-07-20 20:26:15 UTC

Yeah, it seems odd that it has PowerPC but not RISC-V.

mfranc42

0 replies

21h52m

2024-07-20 20:35:44 UTC

I'm missing s390x.

mebeim

0 replies

21h37m

2024-07-20 20:51:05 UTC

That's the next arch I want to add but it takes a bit of work, sooner or later I will add it though :')

extraduder_ire

3 replies

19h13m

2024-07-20 23:15:20 UTC

Only 357 syscalls in 6.0. Don't know why, but I thought there would be more.

vdfs

2 replies

19h2m

2024-07-20 23:26:40 UTC

It's 462

tanelpoder

0 replies

18h48m

2024-07-20 23:40:31 UTC

System call internal numbers are meaningless, are different across platforms and can change across kernel compiles/upgrades. There are also gaps in the internal numbering, so the max value seen in define _NR_syscall doesn't show the actual number of used syscall numbers in your current kernel...

Edit: this answer has some relevant details:

https://stackoverflow.com/questions/63713056/why-is-their-a-...

silisili

0 replies

11h38m

2024-07-21 06:50:07 UTC

462 is just the highest number. Many are skipped. It tells you at the bottom the total, 357.
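
A tiny Python sketch of the difference between the highest number and the count. The sample numbers are made up, and `summarize` is a hypothetical helper that assumes numbering starts at 0.

```python
def summarize(numbers):
    """Compare the highest syscall number with how many numbers are in use."""
    used = set(numbers)
    highest = max(used)
    return {
        "highest": highest,
        "count": len(used),
        # Slots 0..highest that have no syscall assigned.
        "gaps": (highest + 1) - len(used),
    }


# Made-up sparse numbering for illustration.
sample = [0, 1, 2, 5, 9]
print(summarize(sample))
```

With a sparse table like this, the highest number says little about how many syscalls actually exist.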

MBCook

3 replies

19h56m

2024-07-20 22:32:19 UTC

Is there a reason syscall numbers don’t match up between architectures?

Or is it just a quirk of history?

jmgao

1 replies

19h40m

2024-07-20 22:48:04 UTC

They were renumbered in x86_64 so that the syscalls that are frequently used together have their function pointers live in the same cacheline in the lookup table: https://lkml.iu.edu/hypermail/linux/kernel/0104.0/0547.html

I vaguely remember reading somewhere that the MIPS ones are weird to support compatibility with the existing unix syscall numbering, but I can't find any evidence for that anywhere, so maybe it was aspirational or I'm hallucinating.

saagarjha

0 replies

16h36m

2024-07-21 01:52:28 UTC

I am curious how much this actually helps.

drewg123

0 replies

18h29m

2024-07-20 23:59:23 UTC

I'm pretty sure it is because when Linus ported Linux from x86 to the DEC Alpha in the early 90s, he used the DEC OSF/1 syscall numbers (and error numbers) so that he could bootstrap the Linux kernel from an OSF/1 userland. He probably should have had a flag day & normalized the syscall numbers to be arch independent like they are on *BSD, but he never did.

Coming from BSD, I find this very confusing and tend to grumble when I'm tracing something and have to go groveling for the right errno.h. Eg:

% find ~/linux/ -name 'errno*' | wc -l

% find ~/freebsd/sys -name 'errno*' | wc -l

wg0

2 replies

23h13m

2024-07-20 19:15:02 UTC

What ABI is the most common for x86-64bit?

mebeim

0 replies

21h30m

2024-07-20 20:58:33 UTC

The homepage loads exactly the most common x86-64 ABI, which is x64 (according to kernel naming). It's the one for 64-bit syscalls made by 64-bit code. On x86-64 you also have IA32 (32-bit syscalls made by 32-bit code) and x32 (64-bit syscalls made by 64-bit code specifically built to only use 32-bit pointers).

Retr0id

0 replies

22h53m

2024-07-20 19:34:54 UTC

The x86-64 ABI is the most common x86-64 ABI ;)

(x32 is rare https://en.wikipedia.org/wiki/X32_ABI )

qbane

2 replies

23h5m

2024-07-20 19:23:25 UTC

I wish this site could have existed earlier when I was writing eBPF filters for my sandbox and had to check how specific syscalls were implemented in different archs here and there. Thanks for your great work.

tanelpoder

1 replies

21h18m

2024-07-20 21:10:28 UTC

Do you mean arguments and the internal syscall number used for a syscall on your given platform?

I recently had enough of parsing the various syscall.h files on different architectures and wrote a debugfs syscall info reader instead. That way you can see all tracepoint-instrumented syscalls and arguments available exactly on your currently running kernel on your platform:

https://tanelpoder.com/posts/list-linux-system-call-argument...

Edit: changed "all" to "all tracepoint-instrumented" based on a comment below - some added syscalls don't (immediately) get instrumented with a tracepoint so tracefs wouldn't show them (until someone instruments them in a later kernel version as seems to be the case). The tracefs approach has been good enough for me, but the only 100% guaranteed way to see all currently available syscalls would be to read the syscall table from kernel memory and see which syscall handler kernel functions they call (as the syscall name itself is meaningless inside the kernel).

qbane

0 replies

11h12m

2024-07-21 07:16:17 UTC

Yes. My primary use case was to allow only some syscalls and block all others. Until I had to support multiple architectures and executables having different runtime behavior. I ended up attaching a debugger and searching every syscall I met one by one. My understanding of kernels back then was not enough to reduce the development friction.

xurukefi

1 replies

10h48m

2024-07-21 07:40:19 UTC

removed

jcul

0 replies

10h41m

2024-07-21 07:47:14 UTC

The linked source seems to be checking (len_in && !len).

len_in being the passed argument and len being the page aligned len.
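
A check of that shape is typically an overflow guard: rounding a huge length up to the next page boundary can wrap around to zero. The sketch below models this in Python, assuming a 4096-byte page and 64-bit unsigned arithmetic; `page_align` and `overflows` are hypothetical names, not the kernel's code.

```python
PAGE_SIZE = 4096          # assumed page size
MASK64 = (1 << 64) - 1    # emulate 64-bit unsigned wraparound


def page_align(n):
    """Round n up to a multiple of PAGE_SIZE, wrapping at 64 bits."""
    return ((n + PAGE_SIZE - 1) & MASK64) & ~(PAGE_SIZE - 1)


def overflows(len_in):
    """True when a non-zero length aligns to zero, i.e. the rounding wrapped."""
    length = page_align(len_in)
    return bool(len_in and not length)


print(page_align(1))            # rounds up to one page
print(overflows(MASK64))        # largest 64-bit value wraps to 0 when aligned
print(overflows(0))             # a zero length is not an overflow
```

The `len_in &&` half of the condition matters because a zero length also aligns to zero without any overflow having happened.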

pastapoggers

1 replies

22h2m

2024-07-20 20:25:53 UTC

similarly, does a Windows syscall tracker exist?

Retr0id

0 replies

21h46m

2024-07-20 20:42:23 UTC

https://github.com/j00ru/windows-syscalls

matheusmoreira

1 replies

13h20m

2024-07-21 05:08:39 UTC

This is SUCH a good tool! Thank you!!

I've been using other tables but they were always incomplete and often x86_64 only. This one contains everything: number, symbol, links to kernel implementation, signature, user space ABI registers. And I can select kernel version, kernel binary interface and processor architecture!

I'm very interested in how you are collecting or generating all this information. Please post details on the process. I need similar information in order to compile system call tables into lone, my own programming language which features direct Linux system call support.

I use scripts that parse the information out of Linux user space API headers: the compiler prints all the preprocessor definitions from linux/unistd.h, the "__NR_" definitions are selected and then turned into a C array initializer for a number/name structure.

  # makefile
  $(call source_to_object,source/lone/lisp/modules/intrinsic/linux.c): $(targets.NR.c)

  $(targets.NR.c): $(targets.NR.list) scripts/NR.generate
      scripts/NR.generate < $< > $@

  $(targets.NR.list): scripts/NR.filter
      $(CC) -E -dM -include linux/unistd.h - < /dev/null | scripts/NR.filter > $@

  # scripts/NR.filter
  grep __NR_ | sed 's/#define //g' | cut -d ' ' -f 1

  # scripts/NR.generate
  # generates C array initializers like:
  #     { "read", __NR_read },
  while read -r NR; do
    printf '{ "%s", %s },\n' "${NR#__NR_}" "${NR}"
  done

  // source/lone/lisp/modules/intrinsic/linux.c

  static struct linux_system_call {
      char *symbol;
      lone_lisp_integer number;
  } linux_system_calls[] = {

      /* huge generated array initializer
       * with all the system calls found
       * on the host platform
       */
      #include <lone/lisp/modules/intrinsic/linux/NR.c>
  };
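
For readers who prefer a single script, the filter and generate steps above could be sketched in Python roughly as follows. This is an illustrative equivalent, not part of lone: `nr_names` and `initializer_lines` are hypothetical names, and the sample input stands in for the output of `$(CC) -E -dM -include linux/unistd.h`.

```python
def nr_names(preprocessor_output):
    """Collect macro names starting with __NR_ from '#define NAME VALUE' lines."""
    names = []
    for line in preprocessor_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "#define" and parts[1].startswith("__NR_"):
            names.append(parts[1])
    return names


def initializer_lines(names):
    """Produce C array initializer entries like: { "read", __NR_read },"""
    return ['{ "%s", %s },' % (name[len("__NR_"):], name) for name in names]


# Stand-in for real preprocessor output.
sample = "#define __NR_read 0\n#define __linux__ 1\n#define __NR_write 1\n"

for entry in initializer_lines(nr_names(sample)):
    print(entry)
```

This is slightly stricter than the `grep __NR_` version, which matches the string anywhere on the line rather than only at the start of the macro name.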

mebeim

0 replies

8h8m

2024-07-21 10:20:04 UTC

Thank you very much :). I am using static analysis of kernel images (vmlinux ELF) that are built with debug information. Each table you see was extracted from a kernel built by my tool, Systrack, that can configure and build kernels that have all the syscalls available. The code is heavily commented and available on GitHub if you are interested: https://github.com/mebeim/systrack

I realized soon in the process that simply looking at kernel sources was not enough to extract everything accurately, especially definition locations. I also wanted this to be a tool to extract syscalls actually implemented from a given kernel image, so that's what it does.

Your approach should be fine, that is what any other language does basically: rely on uapi headers provided by the kernel (just beware that some may be generated at build time inside e.g. include/asm/generated/xxx). You should rely on the headers that are exported when you do `make headers_install`. Also, make sure to have a generic syscall() function that takes an arbitrary syscall number and an arbitrary amount of args to make raw syscalls for the weird ones you don't easily find in uapi headers and you should be good. After all, even in the C library headers some of the "weird" syscalls aren't present sometimes.

saagarjha

0 replies

22h32m

2024-07-20 19:55:44 UTC

This is neat! Finally someone who added all the information I need :)

lsofzz

0 replies

6h37m

2024-07-21 11:51:00 UTC

Thanks! This is great. If you ever need extra housing for this, I would be glad to provide it.

jeffrallen

0 replies

22h24m

2024-07-20 20:04:07 UTC

From this I learned about Landlock, thanks!

greenpenguin

0 replies

7h33m

2024-07-21 10:55:35 UTC

There's a few of these floating around that are generally missing something - recent syscalls, types, etc. Thank you for such a complete one!

davidfiala

0 replies

34m

2024-07-21 17:54:26 UTC

I've wished for a complete version like this for so long. Great work. Thank you!