
No More Blue Fridays

mrpippy
36 replies
4h57m

> Once Microsoft's eBPF support for Windows becomes production-ready, Windows security software can be ported to eBPF as well.

This doesn’t seem grounded in reality. If you follow the link to the “hooks” that Windows eBPF makes available [1], it’s just for incoming packets and socket operations. IOW, MS is expecting you to use the Berkeley Packet Filter for packet filtering. Not for filtering I/O, or object creation/use, or any of the other million places a driver like Crowdstrike’s hooks into the NT kernel.

In addition, they need to be in the kernel in order to monitor all the other 3rd party garbage running in kernel-space. ELAM (early-launch anti-malware) loads anti-malware drivers first so they can monitor everything that other drivers do. I highly doubt this is available to eBPF.

If Microsoft intends eBPF to be used to replace kernel-space anti-malware drivers, they have a long, long way to go.

[1]: https://microsoft.github.io/ebpf-for-windows/ebpf__structs_8...

shahahqq
19 replies
4h45m

I hope though that Microsoft will double down on their eBPF support for Windows after this incident.

stackskipton
11 replies
4h24m

Doubt it. Microsoft is clearly over Windows. They continue to produce it but every release feels like "Ugh, fine, since you are paying me a ton of money."

Internally, Microsoft is running more and more workloads on Linux, and externally, I've had the .NET team tell me more than once that Linux is the preferred environment for .NET. The SQL Server team continues to push hard for Linux compatibility with every release.

EDIT: Windows Desktop gets more love because they clearly see that as an important market. I'm talking more about Windows Server.

throwaway2037
4 replies
4h11m

This claim about SQL Server: is it due to disk access being slower from the NT kernel compared to the Linux kernel?

riskable
1 replies
1h49m

I had read previously from an unverified SQL Server engineer that the thing they wanted most (with Linux support) was proper containerization (from a developer perspective). Apparently containers on Windows just don't cut it (which is why nobody uses them in production). Take it with a grain of salt though.

I don't think they'd ever admit that filesystem performance was an issue (though we all know it is; NTFS is over 30 years old!).

shawnz
0 replies
42m

> though we all know it is; NTFS is over 30 years old!

ext2, which is forwards compatible with ext3 and ext4, is slightly older than NTFS

stackskipton
0 replies
3h53m

It's just easier for everyone involved (outside Windows GUI clicker admins) if it runs on Linux. Containerization is easier, configuration is easier and operating system is much more robust.

marcosdumay
0 replies
2h52m

There's something very wrong with Windows disk access, you can see it easily by trying to run a Windows desktop with rotating disks.

But SQL Server is in the unique position of being able to optimize Windows for their own needs. So they shouldn't have this kind of problem.

kevincox
4 replies
4h14m

They aren't over Windows. They continue to be incredibly interested in, and actively developing, ways to suck as much money as possible from their users, especially via various forms of ads.

But yeah, kernel features are few and far between.

queuebert
2 replies
1h48m

I believe the term you are looking for is "rent seeking". Other than visual changes, what new functionality does Windows 11 actually have that Windows XP didn't have? (I'm being generous with XP, because actually 95 was already mostly internet ready.) Yet how many times have many of us paid for a Windows license on a new computer or because the old version stopped getting updates?

recursive
0 replies
40m

> what new functionality does Windows 11 actually have that Windows XP didn't have?

Off the top of my head, built-in bluetooth support, an OS-level volume mixer, and more support for a wider variety of class-compliant devices. I'm sure there are a lot more, and if you actually care about the answer, I don't think it would be hard to find.

pcwalton
0 replies
1h9m

> Other than visual changes, what new functionality does Windows 11 actually have that Windows XP didn't have?

Off the top of my head, limiting myself to just NT kernel stuff: WSL and Hyper-V, pseudo-terminals, condvars, WDDM, DWM, elevated privilege programs on the same desktop, font driver isolation, and limiting access to win32k for sandboxing.

mosburger
0 replies
3h48m

> SQL Server team continues to push hard for Linux compatibility with every release.

It's kinda funny that the DB that was once a fork of Sybase ported to Windows is now trying to make its way back to Unix.

benfortuna
6 replies
4h0m

Keep in mind they don't just allow any old code to execute in the kernel.

They do have rigorous tests (WHQL); it's just that Crowdstrike decided that was too burdensome for their frequent updates and decided to inject code from config files (thus bypassing the control).

The fault here is entirely with Crowdstrike.

capitainenemo
3 replies
3h49m

Is there any evidence that the config files had arbitrary code in them? The only analysis I'd seen so far indicated a parsing error loading a virus signature database that was routinely updated, but in this case was full of garbage data.

capitainenemo
1 replies
3h25m

Any article/blog/text-that-can-be-read?

alecco
0 replies
1h20m

Don't bother. He just repeats a tweet claiming a null+offset dereference, plus the speculation that the null was picked up from the .sys file.

remram
1 replies
2h22m

How rigorous are the tests if faulty data can brick the machine?

dwattttt
0 replies
2h10m

Not rigorous enough to have detected this flaw in the kernel sensor, although effectively any bug in this situation (an AV driver) can brick a machine. I imagine WHQL isn't able to find every possible bug in a driver you submit to them; they're not your QA team.

brendangregg
13 replies
4h36m

Yes, we know eBPF must attach to events equivalent to those on Linux, but given there are already many event sources and consumers in Windows, the work is to make eBPF another consumer -- not to invent instrumentation frameworks from scratch.

Just to use an analogy: Imagine people do their banking on JavaScript websites with Google Chrome, but if they use Microsoft Edge it says "JavaScript isn't supported, please download and run this .EXE". I'm not sure we'd be asking "if" Microsoft would support JavaScript (or eBPF), but "when."

surajrmal
12 replies
4h24m

This assumes eBPF becomes the standard. It's not clear Microsoft wants that. They could create something else which integrates with dot net and push for that instead.

Also this problem of too much software running in the kernel in an unbounded manner has long existed. Why should Microsoft suddenly invest in solving it on Windows?

philistine
6 replies
4h8m

Apple took the lead on this front. It has closed off easy access to the kernel for apps, and made a list of APIs to try to replace the lost functionality. Anyone maintaining a kernel module on macOS is stuck in the past.

Of course, the target area of macOS is much smaller than Windows, but it is absolutely possible to kick all code, malware and parasitic security services alike, from accessing the kernel.

The safest kernel is the one that cannot be touched at runtime.

nullindividual
2 replies
3h56m

I don't think Microsoft has a choice with regards to kernel access. Hell, individuals currently use undocumented NT APIs. I can't imagine what happens to backwards compat if kernel access is closed.

Apple's closed ecosystem is entirely different. They'll change architectures on a whim and users will go with the flow (myself included).

becurious
1 replies
2h39m

But Apple doesn’t have the industrial and commercial uses that Linux and Windows have, where you can't suddenly switch to a new architecture without massive validation costs.

At my previous job they used to use Macs to control scientific instrumentation that needed a data acquisition card. Eventually most of the newer product lines moved over to Windows but one that was used in a validated FDA regulated environment stayed on the Mac. Over time supporting that got harder and harder: they managed through the PowerPC to Intel transition but eventually the Macs with PCIe slots went away. I think they looked at putting the PCIe card in a Thunderbolt enclosure. But the bigger problem is guaranteeing supply of a specific computer for a reasonable amount of time. Very difficult to do these days with Macs.

nullindividual
0 replies
2h2m

> validated FDA regulated environment stayed on the Mac

Given how long it takes to validate in a GxP environment, and the cost, this makes sense.

Xunjin
2 replies
4h4m

> The safest kernel is the one that cannot be touched at runtime.

Can you expand on what you mean here? Because depending on the application you are running, you will need to at least talk to some APIs to get privileged access?

odo1242
0 replies
2h7m

Yeah, Apple doesn’t allow any user code to run in kernel mode without significant hoops (the kernel is code signed) and tries to provide a user space API (e.g. DriverKit) as an alternative for the missing functionality.

Some things (FUSE) are still annoying though.

Agingcoder
0 replies
37m

Being allowed to talk to the kernel to get info is different from running with the same privileges (basically being able to read/write any memory).

brendangregg
2 replies
4h19m

Microsoft have been driving the work to make eBPF an IETF industry standard.

riskable
1 replies
1h58m

...just like they did with Kerberos! And just like with Kerberos, they'll define a standard and then refuse to follow it. Instead, they will implement subtle changes in the Windows implementation that make solutions using Windows eBPF incompatible with anything else, making it much more difficult to write software that works with every platform's eBPF (or even just its output).

Everything's gotta be different in Windows land. Otherwise, migrating off of Windows land would be too easy!

In case you were wondering what Microsoft refused to implement in its Kerberos implementation: the DNS records. Instead of following the standard (which they wrote!), they decided that all Windows clients would use AD's Global Catalog to figure out which KDC to talk to (e.g. which one is "local" or closest to the client). Since nothing but Windows uses the Global Catalog, they effectively locked other platforms out of integrating with Windows' Kerberos implementation as effectively (it'll still work, just extremely inefficiently: the clients won't know which KDC is local, so you either have to hard-code them into the krb5.conf on every single device/server/endpoint and hope for the best, or DNS-and-pray you don't get a Domain Controller/KDC that's on an ISDN line in some other country).

MawKKe
0 replies
51m

Embrace, extend, ...

wongarsu
0 replies
43m

Microsoft has invested in solving this for at least two decades, probably longer. They are just using a different (arguably worse) approach to this than the Unix world.

In Windows 9x, anti-malware would just run arbitrary code in the kernel that hooked whatever it wanted. In Windows XP a lot of these things got proper interfaces (like file system filter drivers to facilitate scanning files before they are accessed, later replaced by minifilters), and the 64-bit edition of XP introduced PatchGuard [1] to prevent drivers from modifying Microsoft's kernel code. Additionally, Microsoft is requiring ever more static and dynamic analysis for drivers to be signed (and thus easily deployed).

This is a very leaky security barrier. Instead of a hardware-enforced barrier like the kernel-userspace boundary, it's an effort to get software running at the same protection level to behave. PatchGuard is a cat-and-mouse game Microsoft is always losing, and the analysis mostly helps against memory bugs but can't catch everything. But MS has invested a lot of work over the years in attempts to make this path work, so expecting future action isn't unreasonable.

[1] https://en.wikipedia.org/wiki/Kernel_Patch_Protection

numbsafari
0 replies
1h23m

> Why should Microsoft suddenly invest in solving it on Windows?

If they can continue to avoid commercial repercussions for failing to provide a stable and secure system, then society should begin to hold them to account and force them to.

I’m not necessarily advocating for eBPF here, either. If they want to get there through some “proprietary” means, so be it. Apple is doing much the same on their end by locking down kexts and providing APIs for user mode system extensions instead. If MS wants to do this with some kind of .net-based solution (or some other fever dream out of MSR) then cool. The only caveat would seem to be that they are under a number of “consent decree” type agreements that would require that their own extensions be implemented on a level playing field.

So what. Windows Defender shouldn’t be in the kernel any more than CrowdStrike. Add an API. If that means being able to send eBPF type “programs” into kernel space, cool. If that means some user mode APIs, cool.

But lock it down already.

nullindividual
1 replies
3h58m

Microsoft already has an extensible file system filter capability in place, which is what current AV uses. Does it make sense to add eBPF on top of that and if so, are there any performance downsides, like we see with file system filters?
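
For context, a bare-bones sketch of what registering such a filter looks like via the Filter Manager (minifilter) API -- most FLT_REGISTRATION fields, the altitude/INF plumbing, and any real callback logic are omitted, so treat it as illustrative only:

    #include <fltKernel.h>

    static PFLT_FILTER gFilterHandle;

    /* Called before every IRP_MJ_CREATE; a real AV engine would inspect the file here. */
    static FLT_PREOP_CALLBACK_STATUS
    PreCreate(PFLT_CALLBACK_DATA Data, PCFLT_RELATED_OBJECTS FltObjects, PVOID *CompletionContext)
    {
        UNREFERENCED_PARAMETER(Data);
        UNREFERENCED_PARAMETER(FltObjects);
        UNREFERENCED_PARAMETER(CompletionContext);
        return FLT_PREOP_SUCCESS_NO_CALLBACK;
    }

    static const FLT_OPERATION_REGISTRATION Callbacks[] = {
        { IRP_MJ_CREATE, 0, PreCreate, NULL },
        { IRP_MJ_OPERATION_END }
    };

    static const FLT_REGISTRATION FilterRegistration = {
        sizeof(FLT_REGISTRATION),    /* Size */
        FLT_REGISTRATION_VERSION,    /* Version */
        0,                           /* Flags */
        NULL,                        /* ContextRegistration */
        Callbacks,                   /* OperationRegistration */
        NULL,                        /* FilterUnloadCallback (real drivers provide one) */
    };

    NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
    {
        UNREFERENCED_PARAMETER(RegistryPath);
        NTSTATUS status = FltRegisterFilter(DriverObject, &FilterRegistration, &gFilterHandle);
        if (!NT_SUCCESS(status))
            return status;
        return FltStartFiltering(gFilterHandle);
    }

Note that everything here still runs in kernel mode; a bug in the callback is a bug in the kernel, which is exactly the trade-off being discussed.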

mauvehaus
0 replies
2h49m

They've done a technology transition once already from legacy file system filter drivers to the minifilter model. If they see enough benefit to another change, it wouldn't be unprecedented.

Mind you, it looks like after 20-ish years Windows still supports loading legacy filter drivers. Given the considerable work that goes into getting even a simple filesystem minifilter driver working reliably, it's safe to assume that we'd be looking at a similarly protracted transition period.

As to the performance, I don't think the raw infrastructure to support minifilters is the major performance hit. The work the drivers themselves end up doing tends to be the bigger hit in my experience.

Some background for the curious:

https://www.osr.com/nt-insider/2019-issue1/the-state-of-wind...

shrx
35 replies
5h8m

From the article:

> If the verifier finds any unsafe code, the program is rejected and not executed. The verifier is rigorous -- the Linux implementation has over 20,000 lines of code [0] -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington).

[0] links to https://github.com/torvalds/linux/blob/master/kernel/bpf/ver... which has this interesting comment at the top:

    /* bpf_check() is a static code analyzer that walks eBPF program
     * instruction by instruction and updates register/stack state.
     * All paths of conditional branches are analyzed until 'bpf_exit' insn.
     *
     * The first pass is depth-first-search to check that the program is a DAG.
     * It rejects the following programs:
     * - larger than BPF_MAXINSNS insns
     * - if loop is present (detected via back-edge)
    ...
I haven't inspected the code, but I thought that checking for infinite loops would imply solving the halting problem. Where's the catch?

skywhopper
12 replies
5h4m

If the verifier can't determine that the loop will halt, the program is disallowed. Also, if the program gets passed and then runs too long anyway, it's force-halted. So... I guess that solves the halting problem.

lucianbr
10 replies
4h38m

So this "solves" the halting problem by creating a new class "might-not-halt-but-not-sure" and lumping it with "does-not-halt". I find it hard to believe the new class is small enough for this to be useful, in the sense that it will avoid all kernel crashes.

I rather expect useful or needed code would be rejected due to "not-sure-it-halts", and then people will use some kind of exception or not use the verifier at all, and then we are back to square one.

umanwizard
8 replies
4h33m

Well, it is useful in practice; there are some pretty useful products based on eBPF on Linux, most notably Cilium (and, shameless plug for the one I'm working on: Parca, an eBPF-based CPU profiler).

lucianbr
7 replies
4h28m

Bad wording on my part, and I still don't know how to word it better. I'm sure this thing is useful, I don't think everyone who contributed code was just clueless.

However, the claim "in the future, computers will not crash due to bad software updates, even those updates that involve kernel code" must be false. There is no way it is true. Whatever Cilium is, I cannot believe it generally prevents kernel crashes.

umanwizard
4 replies
4h25m

Correct, you will never be able to write any possible arbitrary code and have it run in eBPF. It necessarily constrains the class of programs you can write. But the constrained set is still quite useful and probably includes the crowdstrike agent.

Also, although this isn't the case now, it's possible to imagine that the verifier could be relaxed to allow a Turing-complete subset of C that supports infinite loops while still rejecting sources of UB/crashes like dereferencing an invalid pointer. I suspect from reading this post that that is the future Mr. Gregg has in mind.

> Whatever Cilium is, I cannot believe it generally prevents kernel crashes.

It doesn't magically prevent all kernel crashes from unrelated code. But what we can say is that Cilium itself can't crash the kernel unless there are bugs in the eBPF verifier.

lucianbr
3 replies
4h13m

If the verifier allowed a Turing-complete language, it would solve the halting problem, which is impossible.

umanwizard
2 replies
4h8m

My point is that the verifier could be relaxed to accept programs that never halt, thus not needing to solve the halting problem. You could then have the kernel just kill it after running over a certain maximum amount of time.

lucianbr
1 replies
2h27m

Why do you think the kernel crashes when crowdstrike attempts to reference some unavailable address (or whatever it does) instead of just denying that operation and continuing on? That would be the solution using this philosophy "just kill long running program". And no need for eBPF or anything complicated. But it doesn't work that way in practice.

This is just such a naive view. "We can prevent programs from crashing by just taking care to stop them when they do bad things". Well, sure, that's why you have a kernel and userland. But it turns out, some things need to run in the kernel. Or "just deny permission". Then it turns out some programs need to run as admin. And so on.

There is a generality in the halting problem, and saying "we'll just kill long-running programs" just misses the point entirely.

Likely what will happen is that you will kill useful long-running programs, then an exception mechanism will be invented so some programs will not be killed, because they need to run longer, then one of those programs will go into an infinite loop despite all your mechanisms preventing it. Just like the crowdstrike driver managed to bring down the OS despite all the work that is supposed to prevent the entire computer crashing if a single program tries something stupid.

umanwizard
0 replies
1h51m

> Why do you think the kernel crashes when crowdstrike attempts to reference some unavailable address (or whatever it does) instead of just denying that operation and continuing on?

Linux and Windows are completely monolithic kernels; the Crowdstrike agent isn't running in a sandbox and has complete unfettered access to the entire kernel address space. There is no separate "the kernel" to detect when the agent does something wrong; once a kernel module is loaded, IT IS the kernel.

Lots of people have indeed realized this is undesirable and that there should be a sandboxed way to run kernel code such that bugs in it can't cause arbitrarily bad undefined behavior. Thus they invented eBPF. That's precisely what eBPF is.

I don't know whether it's literally true that someday you will be able to write all possibly useful kernel-mode code in eBPF. But the spirit of the claim is true: there's a huge amount of useful software that could be written in eBPF today on Linux instead of as kernel modules, and this includes crowdstrike. Thus Windows supporting eBPF, and crowdstrike choosing to use it, would have solved this problem. That set of software will increase as the eBPF verifier is enhanced to accept a wider variety of programs.

Just like you can write pretty much any useful program in JavaScript today -- a sandboxed language.

You're also correct that due to the halting problem, we'll either have to accept that eBPF will never be Turing complete, OR accept that some eBPF programs will never halt and deal with the issues in other ways. Just like Chrome's JavaScript engine has to do. I don't really view this as a fundamentally unsolvable issue with the nature of eBPF.

tptacek
1 replies
4h24m

The claim isn't that eBPF generally prevents kernel crashes. It's that it prevents crashes in the subset of programs it's designed for, in particular for instrumentation, which Crowdstrike is (in this author's conception) an instance of.

lucianbr
0 replies
4h16m

I have quoted the claim verbatim from the article. It is obviously the claim of the article.

tptacek
0 replies
4h25m

Lots of useful code is rejected due to "not-sure-it-halts". That's the premise.

neaanopri
0 replies
4h58m

It's more accurate to say that in principle, there could be programs that would halt, but that the verifier will deny.

red_admiral
3 replies
4h11m

eBPF is not Turing-complete, I suppose.

lizxrice
0 replies
1h44m

I should clarify that individual eBPF programs have to terminate, but more complex problems can be solved with multiple eBPF programs, and can be "scheduled" indefinitely using BPF timers
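
A minimal, untested sketch of that pattern (the attach point and names are purely illustrative, and it assumes a kernel with bpf_timer support): one program arms a timer, and the callback does a bounded chunk of work and re-arms itself.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct elem {
        struct bpf_timer timer;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct elem);
    } timer_map SEC(".maps");

    /* Each firing does a bounded chunk of work, then re-arms itself. */
    static int work_cb(void *map, int *key, struct bpf_timer *timer)
    {
        /* ...bounded work here... */
        bpf_timer_start(timer, 1000000 /* re-arm in 1 ms */, 0);
        return 0;
    }

    SEC("tracepoint/syscalls/sys_enter_nanosleep")
    int kick_off(void *ctx)
    {
        int key = 0;
        struct elem *val = bpf_map_lookup_elem(&timer_map, &key);

        if (!val)
            return 0;
        bpf_timer_init(&val->timer, &timer_map, 1 /* CLOCK_MONOTONIC */);
        bpf_timer_set_callback(&val->timer, work_cb);
        bpf_timer_start(&val->timer, 0, 0);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";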

javierhonduco
0 replies
2h47m

It is not; programs that are accepted are proven to terminate. Larger and more complex programs are accepted by BPF now, which might give the impression that it's Turing complete, but that is definitely not the case.

lolinder
3 replies
4h26m

I'm not able to comment on what this code is doing, but as for the theory:

The halting problem is only unsolvable in the general case. You cannot prove that any arbitrary piece of code will stop, but you can prove that specific types of code will stop and reject anything that you're unable to prove. The trivial case is "no jumps"—if your code executes strictly linearly and is itself finite then you know it will terminate. More advanced cases can also be proven, like a loop over a very specific bound, as long as you can place constraints on how the code can be structured.

As an example, take a look at Dafny, which places a lot of restrictions on loops [0], only allowing the subset that it can effectively analyze.

[0] https://ece.uwaterloo.ca/~agurfink/stqam/rise4fun-Dafny/#h25
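
In eBPF terms, here is a hedged sketch of the kind of loop a verifier can accept (recent kernels allow simple bounded loops; older ones required full unrolling). The trip count is a compile-time constant and every access is bounds-checked, so termination and memory safety are provable:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int sum_first_bytes(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        unsigned int sum = 0;
        int i;

        /* Constant trip count: the verifier can prove this loop terminates. */
        for (i = 0; i < 64; i++) {
            unsigned char *p = (unsigned char *)data + i;

            /* The bounds check the verifier insists on before dereferencing p. */
            if ((void *)(p + 1) > data_end)
                break;
            sum += *p;
        }

        bpf_printk("sum of first bytes: %u", sum);
        return XDP_PASS; /* purely illustrative: never drops anything */
    }

    char LICENSE[] SEC("license") = "GPL";

A loop whose bound depends only on untrusted input, with no cap the verifier can see, would be rejected.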

jkrejcha
2 replies
2h2m

Adding on (and it's not terribly relevant to eBPF), it's also worth noting that there are trivial programs you can prove DON'T halt.

A trivial example[1]:

    int main() {
        while (true) {}
        int x = foo();
        return x;
    }
This program trivially runs forever[2], and indeed many static code analyzers will point out that everything after the `while (true) {}` line is unreachable.

I feel like the halting problem is incredibly widely misunderstood to be similar to be about "ANY program" when it really talks about "ALL programs".

[1]: In C++, this is undefined behavior technically, but C and most other programming languages define the behavior of this (or equivalent) function.

[2]: Fun relevant xkcd: https://xkcd.com/1266/

fwip
1 replies
1h2m

Nit: In many languages, doesn't this depend on what foo() does? e.g:

  foo() {
    exit(0);
  }

loeg
0 replies
38m

No? The foo() invocation is never reached because the while loop never terminates.

aksdlf
3 replies
4h49m

I'm glad to hear that Meta and Google code is "rigorous". I'd prefer INRIA, universities that fund theorem provers, industries where correctness matters like aerospace or semiconductors.

chc4
0 replies
4h30m

Windows doesn't use the Linux eBPF verifier, they have their own implementation named PREVAIL[0] that is based on an abstract interpretation model that has formal small step semantics. The actual implementation isn't formally proven, however.

0: https://github.com/vbpf/ebpf-verifier

auspiv
0 replies
4h1m

Correctness as defined by Boeing? Or another definition?

"The Maneuvering Characteristics Augmentation System (MCAS) is a flight stabilizing [software] feature developed by Boeing that became notorious for its role in two fatal accidents of the 737 MAX in 2018 and 2019, which killed all 346 passengers and crew among both flights."

https://en.wikipedia.org/wiki/Maneuvering_Characteristics_Au...

"The Boeing Orbital Flight Test (OFT) was an uncrewed orbital flight test launched on December 20, 2019, but after deployment, an [incorrect] 11-hour offset in the mission clock of Starliner caused the spacecraft to compute that "it was in an orbital insertion burn", when it was not. This caused the attitude control thrusters to consume more fuel than planned, precluding a docking with the International Space Station.[79][80]"

[79] https://spacenews.com/starliner-suffers-off-nominal-orbital-... "Starliner suffers "off-nominal" orbital insertion after launch". SpaceNews. December 20, 2019. Archived from the original on June 6, 2024. Retrieved December 20, 2019.

[80] https://www.cnbc.com/2019/12/20/boeings-starliner-flies-into... Sheetz, Michael (December 20, 2019). "Boeing Starliner fails mission, can't reach space station after flying into wrong orbit". CNBC. Archived from the original on February 8, 2021. Retrieved December 20, 2019.

SoftTalker
0 replies
2h54m

Also that lines of code is a proxy for rigor, something new I learned today. /s

pkhuong
1 replies
5h3m

The basic logic flags any loop ("back-edge").

rezonant
0 replies
1h18m

This; others have said it less concisely, but a program without loops and arbitrary jumps is guaranteed to halt if we assume the external functions it calls into will halt.

atrus
1 replies
5h2m

The halting problem is about all programs exhaustively: there isn't a single algorithm that is valid for every program. You can still check for some kinds of infinite loops though!

roywiggins
0 replies
4h38m

More specifically, you can accept a set of programs that you are certain do halt, and reject all others, at the expense of rejecting some that will halt. As long as that set is large enough to be practical, the result can be useful. If you eg forbid code paths that jump "backwards", you can't really loop at all. Or require loops to be bounded by constants.

umanwizard
0 replies
4h36m

eBPF is not Turing complete. Writing it is very annoying compared to writing normal C code for exactly this reason.

hiddencost
0 replies
5h5m

Unterminated loops might be a better phrasing.

dtx1
0 replies
5h5m

I have no insight into this particular project, but you could work around the halting problem by only allowing loops you can prove will not go infinite. That would of course imply rejecting some loops that won't go infinite but can't be proven not to.

dathinab
0 replies
4h18m

the halting problem is only true for _arbitrary_ programs

but there are always sets of programs for which it is clearly possible to guarantee their termination

e.g. the program `return 1+1;` is guaranteed to halt

e.g. given program like `while condition(&mut state) { ... }` with where `condition()` is guaranteed to halt but otherwise unknown is not guaranteed to halt, but if you turn it into `for _ in 0..1000 { if !condition(&mut state) { break; } ... }` then it is guaranteed to halt after at most 1000 iterations

or in other words, eBPF only accepts programs which it can prove will halt in at most BPF_MAXINSNS "instructions" (though it's more strict than my example, i.e. you would need to unroll the for-loop to make it pass validation)

the thing with provably halting programs is that they tend to not be very convenient to write and/or are quite limited in what you can do with them, i.e. they are not suitable as general-purpose programming languages at all

Retr0id
0 replies
4h36m

The halting problem cannot be solved in the general case, but in many cases you can prove that a program halts. eBPF only allows verifiably-halting programs to run.

asynchronous
22 replies
5h33m

Is there a reason for the lack of naming+shaming Crowdstrike in this blogpost? Was it to not give them any more publicity, good or bad?

StevenWaterman
18 replies
5h31m

If you consider kernel programming to be inherently unsafe, then you would consider this to be inevitable, meaning it's not really the specific company's fault. They were just the unlucky ones.

lordnacho
13 replies
5h22m

They could have helped their luck by doing some of the common sense things suggested in the article.

For instance, why not find a subset of your customers that are low risk, push it out to them, and see what happens? Or perhaps have your own fleet of example installations to run things on first. None of which depends on any specific technology.

hello_moto
10 replies
4h45m

"find a subset of low risk customers" and use them as test subject?

Repeat that a few times to understand the repercussions.

If I were the customers and I found out that I was used as test subject, how would I feel?

whynotminot
6 replies
4h34m

Canary deployments are already an industry accepted practice and it’s shocking Crowdstrike apparently doesn’t do them.

hello_moto
5 replies
4h25m

Which industry? Cybersecurity or Cloud software?

whynotminot
4 replies
2h57m

Any industry that wants to reliably deliver software that doesn’t brick systems at scale? I’m confused by your question.

Are you telling me the cybersecurity scene is special and shouldn’t follow best practices for software deployment?

hello_moto
3 replies
2h42m

Canary deployment for a subset of Salesforce customers won't see much of a revolt compared to AV definition rollouts (not software, but AV definitions) in cybersecurity, where the gap between a 0day and the rollout means you're exposed.

If customers found out that some are getting the rollout faster than others, essentially splitting the group into two, there would be a need for customer opt-in/opt-out.

If everyone is opting-out because of Friday, your Canary deployment becomes meaningless.

Any proof that other Cybersecurity vendors do Canary deployment for their AV definition? :)

PS: not to say that the company should test more internally...

whynotminot
2 replies
2h34m

Canary deployment doesn’t necessarily mean massive gaps between deployment waves. You can fast-follow. Sure, there may be scenarios with especially severe vulnerabilities where time is of the essence. I’m out of the loop if this crowdstrike update was such a scenario where best practices for software deployment were worth bypassing.

If this is just how they roll with regular definition updates, then their deployment practices are garbage and this kind of large scale disaster was inevitable.

hello_moto
1 replies
2h16m

Let's walk this through: Canary deployment to Windows machines. If those Windows machines got hit with BSOD, they will go offline. How do you determine if they go offline because of Canary or because of regular maintenance by the customer's IT cycle?

You can guess, but you cannot be 100% sure.

What if the targeted canary deployments are Employees desktops that are OFFLINE during the time of rollout?

> I’m out of the loop if this crowdstrike update was such a scenario where best practices for software deployment were worth bypassing.

I did post a question: what about other Cybersecurity vendors? Do you think they do canary deployment on their AV definitions?

Here's more context to understand Cybersecurity: https://radixweb.com/blog/what-is-mean-time-to-detect

Cybersecurity companies participate annually in security evaluations that measure and grade their performance. That grade is an input for organizations to select vendors, in addition to their own metrics/measurements.

I don't know if MTTD is included in the contract/SLA. If it does, you got some answer as to why certain decision is made.

It's definitely interesting to see the software developers of HN giving out their 2c on a niche cybersecurity industry.

whynotminot
0 replies
1h21m

> You can guess, but you cannot be 100% sure.

I worked in the cyber security space for a decent chunk of my career, and the most frustrating part was cyber security engineers thinking their problems were unique and being completely unaware of the lessons software engineering teams have already learned.

Yes, you need to tune your canary deployment groups to be large and diverse enough to give a reliable indicator of deployment failure, while still keeping them small enough that they achieve their purpose of limiting blast radius.

Again, if you follow industry best practices for software deployment, this is already something that should be considered. This is a relatively solved problem -- this is not new.

> I did post a question: what about other Cybersecurity vendors? Do you think they do canary deployment on their AV definitions?

I think that question is being asked right now by every company using Crowdstrike — what vendors are actually doing proper release engineering and how fast can we switch to them so that this never happens to us again?

lordnacho
2 replies
4h23m

> If I were the customers and I found out that I was used as test subject, how would I feel?

In reality, every business has relationships that it values more than others. If I wasn't paying a lot for it, and if I was running something that wasn't critical (like my side project) then why not? You can price according to what level of service you want to provide.

hello_moto
1 replies
4h17m

Customers will ask to opt-out.

ahtihn
0 replies
2h28m

Customers will pay to opt out.

gtsop
1 replies
4h32m

Why even do that? We have virtualization, they could emulate real clients and networks of clients. This particular bug would have been prevented for sure

lordnacho
0 replies
4h27m

Yeah I thought maybe the VM thing might not catch the bug for some reason, but it seems like the natural thing to do. Spin up VM, see if there's a crash. I heard the technical reason had something to do with a file being full of nulls, but that sort of thing you should catch.

Honestly, the most generous excuse I can think of is that CS were informed of some sort of vulnerability that would have profound consequences immediately, and that necessitated a YOLO push. But even that doesn't seem too likely.

asynchronous
1 replies
5h23m

I still hold true that testing even improperly would have caught this before it hit worldwide. But I suppose you are right, that doesn’t help the argument being made here.

ForOldHack
0 replies
5h14m

Wasn't that the job of AI/Copilot/Clippy/DEP? "Would you like me to try and execute a random blank file?"

And of course QA.

I was unaffected, but was fielding calls from customers.

My update Tuesday is the week after, so in-between MS and my updates, I am very suspicious of everything.

I was also unaffected by 22H2, and spent time fielding calls.

efee22
0 replies
5h24m

Agree, Crowdstrike was an unlucky one, but this is more about the issue in general. If I remember correctly, others like sysdig also use their own kernel modules for collection.

brendangregg
0 replies
5h20m

Right, and we wanted to talk about all security solutions and not make this about one company. We also wanted to avoid shaming since they have been seriously working on eBPF adoption, so in that regard they are at the forefront of doing the right thing.

hiddencost
2 replies
5h4m

I think the article isn't about CrowdStrike. It's about eBPF.

hiddencost
0 replies
4h30m

CrowdStrike is mentioned, but the goal of the article is to promote eBPF. CrowdStrike is tangentially related because it draws attention to a platform that Gregg has put a lot into.

kayo_20211030
8 replies
5h15m

This isn't right. If I need a system to run with a piece of code, then it shouldn't run at all if that piece of code is broken. Ignoring the failure is perverse. Let's say that the driver code ensures that some medical machine has safety locks (safeguards) in place to make sure that piece of equipment won't fry you to a crisp; I'd prefer that the whole thing not run at all rather than blithely operate with the safeguards disabled. It's turtles all the way down.

phartenfeller
2 replies
4h43m

The medical machine software should just refuse to run, with an error message, if a critical driver was not loaded. The OS bricking causes way more trouble: an IT technician now needs to fix something that otherwise would just mean updating the faulty driver... Also, does your car refuse to start if you are missing water for the wiper?

jve
1 replies
4h31m

Water for the wiper is a userland feature.

A 3rd party hooking into the kernel is the 3rd party's responsibility. It is like equipping your car with LPG - THAT hooks into the engine (the kernel). And when I had a faulty gas pressure sensor, my car actually halted (BSOD if you will) instead of automatically failing over to gasoline as it is designed to.

You can argue that the car had no means to continue execution but the kernel has; however, an invalid kernel state can cause more corruption down the road. Or, as the parent even points out, carry out lethal doses of something.

pinebox
0 replies
3h54m

Initially I was inclined to disagree ("these things should always fail safe") however with more and more stuff being pushed into the kernel it's hard to say that you're wrong or exactly where a line needs to be drawn between "minimally functional system" and "dangerously out of control system".

I think until we discover a technology that forces commercial software vendors to employ functioning QA departments none of this will really solve anything.

Smaug123
1 replies
4h52m

I think the premise is false? It's up to the eBPF implementor what to do in the case of invalid input; the kernel could choose to perform a controlled shutdown in that case. (I have no idea what e.g. Linux actually does here, but one could imagine worlds where the action it takes on invalid input is configurable.)

Also your statement is sometimes not true, although I certainly sympathise in the mainline case. In some contexts you really do need to keep on trucking. The first example to spring to mind is "the guidance computers on an automated Mars lander"; the round-trip to Earth is simply too long to defer responsibility in that case. If you shut down then you will crash, but if you do your best from a corrupted state then you merely probably crash, which is presumably better.

umanwizard
0 replies
4h27m

> I have no idea what e.g. Linux actually does here

If you attempt to load an eBPF program that the verifier rejects, the syscall to load it fails with EINVAL or E2BIG. What your user-space program then does is up to you, of course.
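
A rough userspace sketch of that flow with libbpf (the object file name and the exact errno are illustrative):

    #include <bpf/libbpf.h>
    #include <stdio.h>

    int main(void)
    {
        struct bpf_object *obj = bpf_object__open_file("probe.bpf.o", NULL);
        int err;

        if (!obj) {
            fprintf(stderr, "failed to open BPF object file\n");
            return 1;
        }

        /* Triggers the BPF_PROG_LOAD syscall; if the in-kernel verifier
         * rejects the program, loading fails and the code never runs. */
        err = bpf_object__load(obj);
        if (err) {
            fprintf(stderr, "verifier rejected the program: %d\n", err);
            bpf_object__close(obj);
            return 1;
        }

        /* ...attach programs, read maps, etc... */
        bpf_object__close(obj);
        return 0;
    }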

enragedcacti
0 replies
1h6m

I agree that some system components should be treated as critical no matter what, but the software at issue in this case (Falcon Sensor or Antivirus more generally) is precautionary and only best effort anyways. I would wager the vast majority of the orgs affected on Friday would have preferred the marginally increased risk of a malware attack or unauthorized use over a 24 hour period instead of the total IT collapse they experienced. Further, there's no reason the bug HAD to cause a BSOD, it's possible the systems could have kept on trucking but with an undefined state and limitless consequences. At least with eBPF you get to detect a subset of possible errors and make a risk management decision based on the result.

__MatrixMan__
0 replies
3h17m

I like how Unison works for this reason. You call functions by cryptographic hash, so you have some assurance that you're calling the same function you called yesterday.

Updates would require the caller to call different functions which means putting the responsibility in the hands of the caller, where it should be, instead of on whoever has a side channel to tamper with the kernel.

You end up with the work-perfectly-or-not-at-all behavior that you're after, because if the function that goes with the indicated hash is not present, you can't call it, and if it is present you can't call it in any way besides how it was intended.

ChrisMarshallNY
0 replies
3h24m

> Ignoring the failure is perverse.

If the failed system is a security module, I think that's absolutely correct. If the system runs, without the security module, well, that's like forgetting to pack condoms on Shore Leave. You'll likely be bringing something back to the ship with you.

Someone needs to be testing the module, and the enclosing system, to make sure it doesn't cause problems.

I suspect that it got a great deal of automated unit testing, but maybe not so much fuzz and monkey (especially "Chaos Monkey"-style) testing.

It's a fuzzy, monkey-filled world out there...

xg15
7 replies
5h39m

> In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

Assuming every security critical system will be on a recent enough kernel to support this...

efee22
2 replies
5h17m

RHEL kernel... right. IMHO, I'd trust an upstream stable kernel far more for production than a RHEL one, which has dozens of feature backports and an internal kABI to maintain. Granted, RH has a QA team, but it is still impossible to test everything beforehand.

worthless-trash
1 replies
4h54m

On the upside, non-root users can't insert eBPF code, so it's a privileged operation, unlike on other distros.

efee22
1 replies
5h26m

I think with a LTS distribution you should get very far these days when it comes to implementing such sensors.

chasil
0 replies
3h53m

On rhel8 variants, you can use the Oracle UEK to get eBPF.

https://blogs.oracle.com/linux/post/oracle-linux-and-bpf

  $ cat /etc/redhat-release /etc/oracle-release /proc/version
  Red Hat Enterprise Linux release 8.10 (Ootpa)
  Oracle Linux Server release 8.10
  Linux version 5.15.0-203.146.5.1.el8uek.x86_64 (mockbuild@host-100-100-224-48) (gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9.2.0.1), GNU ld version 2.36.1-4.0.1.el8_6) #2 SMP Thu Feb 8 17:14:39 PST 2024

dredmorbius
0 replies
4h25m

Considering the number of systems running very obsolete OSes these days: WinNT (4x or 3x), Windows, DOS, or various proprietary Unixen, stale Linux flavours, etc., etc., ... yes, quite.

throwaway2037
7 replies
4h13m

The blog post says:

    > eBPF, which is immune to such crashes.
I tried to Google about this, but I cannot find anything definitive. It looks like you can still break things. Can an expert on eBPF please comment on this claim? This is the best that I could find: https://stackoverflow.com/questions/70403212/why-is-ebpf-sai...

umanwizard
6 replies
3h32m

eBPF programs cannot crash the kernel, assuming there are no bugs in the eBPF verifier. There have been such bugs in the past but they seem to be getting more and more rare.

rwmj
1 replies
2h36m

This isn't really true. eBPF programs in Linux have access to a large set of helper functions written in plain C. https://lwn.net/Articles/856005/

umanwizard
0 replies
2h3m

I don't see how this contradicts what I said. Indeed, there are helpers, but the verifier is supposed to check that the eBPF program isn't calling them with invalid arguments.

queuebert
1 replies
1h45m

I would be very hesitant to say "cannot" in a million-line C code base.

umanwizard
0 replies
1h43m

Yes, bugs in Linux are possible, so there might be some eBPF code that crashes the kernel. Just like bugs in Chrome are possible, so there might be some JavaScript that crashes the browser. Still, JavaScript is much safer than native code, because fixing the bugs in one implementation is a tractable problem, whereas fixing the bugs in all user code is not.

umanwizard
0 replies
1h42m

This is actually exactly the bug I was thinking of, so fair point! (I work at PS now and am aware you worked on debugging it a while back).

CodeWriter23
7 replies
4h26m

> an unprecedented example of the inherent dangers of kernel programming

I take issue with that. Kernel programming was not to blame; looking up addresses from a file and accessing those memory locations without any validation is. The same technique would yield the same result at any Ring.

nine_k
2 replies
4h23m

At Ring 3 it would crash an app, not the entire OS.

Yes, the kernel is fine and is not to blame. But running basically a rootkit controlled by a third party indeed is to blame.

CodeWriter23
1 replies
4h17m

> At Ring 3 it would crash an app, not the entire OS.

That's still an outage for those key systems.

nequo
0 replies
3h36m

It is an outage for the monitoring system, not the system that it monitors.

lucianbr
2 replies
4h25m

Obviously in userspace it would only crash the running program and not the entire operating system? It's a significant difference.

All of the service interruptions would have been just "computer temporarily not protected by crowdstrike agent". Not the same thing at all.

CodeWriter23
1 replies
4h15m

> It's a significant difference.

When various apps running the world are crashing, unable to execute because malware protection is failing, there is no difference.

macobrien
0 replies
3h38m

_No_ difference oversells it, IMO -- the fact that the entire OS crashed is what made fixing the bug so arduous, since it required in-person intervention. To be sure, running the code in userspace would still cause unacceptable service interruptions, but the fix could be applied remotely.

dwattttt
0 replies
1h57m

FWIW their configuration files can't be holding addresses; those have been randomised in the kernel for at least a decade

skywhopper
6 replies
5h5m

The implicit assumption of the article is that eBPF code can't crash a kernel, but the article itself eventually admits that it can and has done, including last month. eBPF is a safer way of providing kernel-extension functionality, for sure, but presenting it as the perfect solution is just asking to have your argument dismissed. eBPF is not perfect. And there's plenty of things it can't do. The very sandbox rules that limit how long its programs may run and what they can do also make it entirely inappropriate for certain tasks. Let's please stop pretending there's a silver bullet.

lucianbr
3 replies
4h34m

It's casually claiming to have solved the halting problem, at least within some limited but useful context. That should be impossible, and it turns out, it is.

I expect it can be solved within some limited contexts, but those contexts are not useful, at least not at the level of "generic kernel code".

michaelt
1 replies
3h56m

eBPF started out as Berkeley Packet Filters. People wanted to be able to set up complex packet filters. Things like 'udp and src host 192.168.0.3 and udp[4:2]=0x0034 and udp[8:2]=0x0000 and udp[12]=0x01 and udp[18:2]=0x0001 and not src port 3956'

So BPF introduced a very limited bytecode, which is complex enough that it can express long filters with lots of and/or/brackets - but which is limited enough it's easy to check the program terminates and is crash-free. It's still quite limited - prior to ~2019, all loops had to be fully unrolled at compile time as the checker didn't support loops.
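
As a small illustration of that lineage, the same style of filter expression still compiles down to BPF instructions today, e.g. via libpcap (a sketch; error handling trimmed):

    #include <pcap/pcap.h>
    #include <stdio.h>

    int main(void)
    {
        /* Compile a classic filter expression into BPF instructions without
         * opening a real interface, then dump the resulting bytecode. */
        pcap_t *p = pcap_open_dead(DLT_EN10MB, 65535);
        struct bpf_program prog;

        if (pcap_compile(p, &prog,
                         "udp and src host 192.168.0.3 and not src port 3956",
                         1 /* optimize */, PCAP_NETMASK_UNKNOWN) != 0) {
            fprintf(stderr, "compile failed: %s\n", pcap_geterr(p));
            return 1;
        }

        bpf_dump(&prog, 1);   /* one BPF instruction per line */

        pcap_freecode(&prog);
        pcap_close(p);
        return 0;
    }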

It turned out that, although limited, this worked pretty well for filtering packets - so later, when people wanted a way to filter all system calls they realised they could extend the battle-tested BPF system.

Nobody is claiming to have solved the halting problem.

lucianbr
0 replies
2h46m

Did you read the article? It says computers will not crash in the future due to updates. It literally says that in the very first line of the article.

> In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

What you are claiming is completely different. A kind of "firewall" for syscalls. But updates to drivers and software must contain code and data. The author is not talking about updates to the firewall between drivers and the kernel, they talk about updating drivers themselves. It literally says "updates that involve kernel code". Will the kernel only consist of eBPF filtering bytecode? How could that possibly work?

red_admiral
0 replies
4h8m

It solves the halting problem by not being Turing complete. I presume each eBPF program runs in a context with bounded memory, requested up front, for one thing; it also disallows jumps unless you can prove the code still halts.

efee22
1 replies
4h53m

It's not a silver bullet; however, it is still better to push all the panicable bugs into one community-maintained section (e.g. the eBPF verifier). All vendors have an incentive to help get it right, and this is much better than every vendor shipping their own panicable bugs in their own out-of-tree kernel modules. Additionally, it's not just industry looking at eBPF, but also academia, in terms of formally verifying these critical sections.

lucianbr
0 replies
4h32m

"Improves kernel stability" is great. "Prevents kernel crashes" is a plain lie.

> In the future, computers will not crash due to bad software updates, even those updates that involve kernel code.

Come on. Computers will continue to crash in the future, even when using eBPF. I am quite certain.

twen_ty
4 replies
4h30m

Can someone tell me what the advantage of eBPF is over a user-mode driver? The article makes it look like eBPF is a have-your-cake-and-eat-it-too solution, which seems too good to be true. Can you run graphics drivers in eBPF, for example?

bewo001
1 replies
3h34m

AFAIK, an ebpf function can only access memory it got handed as an argument or as result from a very limited number of kernel functions. Your function will not load if you don't have boundary checks. Fighting the ebpf validator is a bit like fighting Rust's borrow checker; annoying, at times it's too conservative and rejects perfectly correct code, but it will protect you from panics. Loops will only be accepted if the validator can prove they'll end in time; this means it can be a pain to make the validator to accept a loop. Also, ebpf is a processor-independent byte code, so vectorizing code is not possible (unless the byte code interpreter itself does it).
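
A hedged example of the kind of check the validator demands before it will load a program that touches packet memory (XDP here; the protocol test is just for illustration):

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int parse_eth(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;

        /* Without this comparison against data_end the verifier refuses to
         * load the program ("invalid access to packet"). */
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;

        return eth->h_proto == bpf_htons(ETH_P_ARP) ? XDP_DROP : XDP_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";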

Given all its restrictions, I doubt something complex like a graphics driver would be possible. But then, I know nothing about graphics driver programming.

umanwizard
0 replies
3h29m

> Fighting the ebpf validator is a bit like fighting Rust's borrow checker

I think this undersells how annoying it is. There's a bit of an impedance mismatch. Typically you write code in C and compile it with clang to eBPF bytecode, which is then checked by the kernel's eBPF verifier. But in some cases clang is smart enough to optimize away bounds checks, but the eBPF verifier isn't smart enough to realize the bound checks aren't needed. This requires manual hacking to trick clang into not optimizing things in a way that will confuse the verifier, and sometimes you just can't get the C code to work and need to write things in eBPF bytecode by hand using inline assembly. All of these problems are massively compounded if you need to support several different kernel versions. At least with the Rust borrow checker there is a clearly defined set of rules you can follow.

tptacek
0 replies
4h22m

No, you can't run arbitrary general-purpose programs in eBPF, and you cannot run graphics drivers in it. You generally can't run programs with unprovably bounded loops in eBPF, and your program can interact with the kernel only through a small series of explicitly enumerated "helpers" (for any given type of eBPF program, you probably have about 20 of these in total).

chasil
0 replies
4h14m

This is the wiki. I haven't kept up, but this isn't a kernel module.

"eBPF is a technology that can run programs in a privileged context such as the operating system kernel. It is the successor to the Berkeley Packet Filter (BPF, with the "e" originally meaning "extended") filtering mechanism in Linux and is also used in non-networking parts of the Linux kernel as well."

https://en.wikipedia.org/wiki/EBPF

blinkingled
4 replies
4h58m

Ok. But the good old "push code to staging / canary it before mainstream updates" was a simpler way of solving the same problem.

Crowdstrike knows the computers they're running on; it is trivial to implement a system where only a few designated computers download and install the update and report metrics before the update controller decides to push it to the next set.

Archelaos
2 replies
4h47m

It would mitigate the problem, but not solve it. You can still imagine a condition that only occurs after the update has been rolled out everywhere. Furthermore, such a bug would still be extremely problematic for the concerned customers, even if not all of them were affected. In addition, it would be necessary to react very quickly in the case of zero-day vulnerabilities.

tantalor
0 replies
4h35m

(semantic argument warning)

"Mitigation" is dealing with an outage/breakage after it occurs, to reduce the impact or get system healthy again.

You're talking about "prevention" which keeps it from happening at all.

Canarying is generic approach to prevention, and should not be skipped.

Avoiding the risk entirely (eBPF) would also help prevent outage, but I think we're deluding ourselves to say it "solves" the problem once and for all; systems will still go down due to bad deploys.

blinkingled
0 replies
4h12m

Yes, I am not arguing against having the ability to deal with it quickly - I am saying canary/staging helps you do exactly that. Because, as we see in the case of Intel CPUs and Crowdstrike, some problems, or the scale of some problems, are best prevented.

phartenfeller
0 replies
4h41m

Why trust somebody else not to mess up? With that in place for Windows and Crowdstrike, billions of dollars would have been saved and many lives not negatively impacted ...

uticus
2 replies
3h31m

> eBPF programs cannot crash the entire system because they are safety-checked by a software verifier and are effectively run in a sandbox.

Isn’t one of the purposes of an OS to police software? I get that this has to do with the OS itself, but what does watching the watchers accomplish other than adding a layer which must then be watched?

Why not reduce complexity instead of naively trusting that the new complexity will be better long term?

riskable
0 replies
1h33m

eBPF isn't "watching the watchers" it's just a tool that lets other tools access low-level things in the kernel via a very picky sandbox. Think of it like this:

Old way: Load kernel driver, hook into bazillions of system calls (doing whatever it is you want to do), pray you don't screw anything up (otherwise you can get a panic though not necessarily--Linux is quite robust).

eBPF way: Just ask eBPF to tell you what you want by giving it some eBPF-specific instructions.

There's a rundown on how it works here: https://ebpf.io/what-is-ebpf/
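
As a tiny, illustrative sketch of the "eBPF way" (the hooked kernel function and names are examples, not a recommendation, and vary by kernel version):

    #include <linux/bpf.h>
    #include <linux/ptrace.h>
    #include <bpf/bpf_helpers.h>

    /* Log every process that enters the kernel's openat path. The program can
     * only observe and emit data through helpers; it cannot patch kernel code
     * or scribble over kernel memory. */
    SEC("kprobe/do_sys_openat2")
    int trace_open(struct pt_regs *ctx)
    {
        __u32 pid = bpf_get_current_pid_tgid() >> 32;

        bpf_printk("openat from pid %d", pid);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";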

MetaWhirledPeas
0 replies
3h0m

Right? I might spend a few minutes seeing if an AI chatbot can explain all the justifications that lead to using something like CrowdStrike in the first place.

throw0101d
2 replies
1h58m

Meta:

> eBPF (no longer an acronym) […]

Any reason why the official acronym was done away with?

sandywaffles
0 replies
41m

Because eBPF is no longer just packet filtering? It's now used in loads of hook points unrelated to packets or filtering at all.

riskable
0 replies
1h25m

Because it used to stand for extended Berkeley Packet Filter and it has since moved far, far beyond just packets. It now hooks into the entire network stack, security, and does observability/tracing for nearly anything and everything in the kernel ("nearly" because some stuff runs when the kernel boots up--before eBPF is loaded--and never again after that).

kevin_nisbet
2 replies
4h34m

I hate to dispute with someone like Brendan Gregg, but I'm hoping vendors in this space take a more holistic approach to investigating the complete failure chain. I personally tend to get cautious when there is a proposal that x will solve the problem that occurred on y date, especially 3 days after the failure. It may be true, but if we don't do the analysis we could leave ourselves open to blindspots. There may also be plenty of alternative approaches that should be considered and appropriately discarded.

I think the part I specifically dispute is that the only negative outcome is wasted CPU cycles. That's likely the case for this class of bug, but there are plenty of failure modes where a bad ruleset could badly brick a system and make it hard to recover.

That's not to say eBPF-based security modules aren't the right choice for many vendors, just that we should understand what risks they do and do not avoid, and what part of the failure chain they particularly address.

ohmyiv
0 replies
1h24m

I personally tend to get cautious when there is a proposal that x will solve the problem that occurred on y date, especially 3 days after the failure.

Microsoft has been working on eBPF for a few years at least.

https://opensource.microsoft.com/blog/2021/05/10/making-ebpf...

https://lwn.net/Articles/857215/

If you're really concerned, they have discussions and communication channels where you're invited to air your concerns. They're listed on their GitHub:

https://github.com/microsoft/ebpf-for-windows

Who knows, maybe they already have answers to your concerns. If not, they can address them there.

mirashii
0 replies
2h3m

Just because you have not been aware of the discussions on this topic that have been happening for years doesn't mean that they haven't been happening. This isn't some new analysis formed 3 days after an incident; this is the generally accepted consensus among many experts who have been working in the space, introducing these new APIs specifically to improve stability, security, etc. of systems.

xyzzy123
1 replies
4h44m

So many problems though! Including commercial monocultures, lack of update consent, blast radius issues, etc. There's a commons in our pockets, but that is very difficult to regulate for. They will keep putting the gun to your head as long as you keep choosing the monoculture.

shahahqq
0 replies
4h33m

Worrisome indeed that the world now knows how many users are affected by CrowdStrike, so the bad guys just need to poke deeper there.

nkozyra
1 replies
4h21m

I don't do any kernel stuff so I'm out of my element, but doesn't the fact that CrowdStrike & Linux kernel eBPF already caused kernel crashes[1] sort of undercut the rosiness of the state of things?

[1]: https://access.redhat.com/solutions/7068083

guipsp
0 replies
3h51m

This is specifically addressed in the post you are replying to

kaliszad
1 replies
2h0m

"These security agents will then be safe and unable to cause a Windows kernel crash."

Unless, of course, there is a bug in eBPF itself (https://access.redhat.com/solutions/7068083), @brendangregg, and the kernel panics/BSoDs anyway, which you do mention later in the article.

ec109685
0 replies
1h36m

The benefit of fixing that bug is that all eBPF programs benefit, versus every security vendor needing to ensure they write perfect C code.

Yawrehto
1 replies
3h49m

1. How does eBPF solve this? It makes it more difficult, sure, but it'll almost always be possible to cause a crash, if you try hard enough. 2. More importantly, the problem is rarely fixable by changing technology, because typically, problems are caused by people and their connections: social/corporate pressures, profit-seeking, mental health being treated as unimportant, et cetera. eBPF can't fix those, and as long as corporations have social structures that penalize thoroughness and caution, and incentivize getting 'the most stuff' done, this will persist as a problem.

umanwizard
0 replies
3h40m

it'll almost always be possible to cause a crash, if you try hard enough.

If you think you know a way to crash the Linux kernel by loading and running an eBPF program, you should report a bug.
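
For a concrete feel of what the verifier refuses (an illustrative sketch, not from the article): dereferencing a map lookup result without a NULL check is rejected at load time, so the missing check can never become a runtime fault. Assumes libbpf headers and a clang -target bpf build; the names are made up.

```
// Illustrative sketch: this program is rejected by the verifier at load
// time (possibly-NULL map value dereferenced), so it never runs at all.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} counters SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_openat")
int unsafe_deref(void *ctx)
{
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&counters, &key);

    // Missing "if (!val) return 0;" -- the verifier flags the dereference
    // of a possibly-NULL map value and refuses to load the program.
    return (int)*val;
}

char LICENSE[] SEC("license") = "GPL";
```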

kayge
0 replies
2h42m

Thanks! This was not a familiar acronym to me... and after some digging[0] apparently it's no longer an acronym:

"BPF originally stood for Berkeley Packet Filter, but now that eBPF (extended BPF) can do so much more than packet filtering, the acronym no longer makes sense. eBPF is now considered a standalone term that doesn’t stand for anything."

[0] https://ebpf.io/what-is-ebpf/

CoastalCoder
1 replies
5h36m

If your company is paying for commercial software that includes kernel drivers or kernel modules, you can make eBPF a requirement.

Are they saying that device drivers should be written in eBPF?

Or maybe their drivers should expose an eBPF API?

I assume some driver code still needs to reside in the actual kernel.

vfclists
0 replies
5h0m

Yep, another fix to all our problems, a new bandwagon to be jumped on by all EDR vendors, until ...

Here I am using the term "EDR". Until this CrowdStrike debacle I'd never heard it.

Which only tells you how seriously you should take my opinions.

usrme
0 replies
5h38m

Does anyone know how far along the eBPF implementation for Windows actually is? In the sense that it could start feasibly replacing existing kernel drivers.

tracker1
0 replies
2h15m

I don't buy it... didn't a bug from Red Hat + CrowdStrike have a similar panic issue? I understand in that case it was because of Red Hat, but still. I don't think this, by itself, will change much.

the8472
0 replies
4h31m

If the filters are loaded at boot and hook into everything then a bug can still lock down the system to a point where it can't be operated or patched anymore (e.g. because you loaded an empty whitelist). So it could end up replacing a boot loop with another form of DoS.

If Microsoft includes a hardcoded whitelist that covers some essentials needed for recovery, that could make a bug in such a tool easier to fix, but it could still cause effective downtime (system running but unusable) until such a fix is delivered.
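
As an illustrative sketch of that failure mode (assuming a kernel with BPF LSM enabled and a vmlinux.h generated with bpftool; none of this is from the article): an over-restrictive policy program can deny essentially everything without crashing anything, leaving the machine up but unusable until the program is detached.

```
// Deliberately pathological sketch: a BPF LSM "allowlist" policy whose
// list effectively came down empty, so every file open is denied. The
// kernel never crashes, but the system is unusable until this unloads.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("lsm/file_open")
int BPF_PROG(broken_allowlist, struct file *file)
{
    // Imagine the allowlist lookup always misses after a bad content
    // update: nothing matches, so everything gets -EPERM.
    return -1; /* -EPERM */
}

char LICENSE[] SEC("license") = "GPL";
```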

tgtweak
0 replies
29m

Even if Microsoft rolls out eBPF and mainstreams it - it will be years before everything is ported over, and it still won't address legacy Windows versions (which appear to be a good chunk of what was impacted).

It's a move in the right direction but it probably won't fully mitigate issues like this for another 5+ years.

risenshinetech
0 replies
3h10m

Thank God some superheroes have finally come along to make sure code never crashes any computers ever again! /s

rezonant
0 replies
1h22m

the company behind this outage was already in the process of adopting eBPF, which is immune to such crashes

Oh I'm sure they'll find a way.

odyssey7
0 replies
1h34m

"The verifier is rigorous"

But the appeal-to-authority evidence that the article presents is not.

"-- the Linux implementation has over 20,000 lines of code -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington). The safety this provides is a key benefit of eBPF, along with heightened security and lower resource usage."

muth02446
0 replies
1h56m

"The verifier is rigorous -- the Linux implementation has over 20,000 lines of code -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington). The safety this provides is a key benefit of eBPF, along with heightened security and lower resource usage."

Wow, 20k is not exactly encouraging. Besides the extra attack surface, who can vouch for such a large code base?

mschuster91
0 replies
4h20m

If your company is paying for commercial software that includes kernel drivers or kernel modules, you can make eBPF a requirement. It's possible for Linux today, and Windows soon. While some vendors have already proactively adopted eBPF (thank you), others might need a little encouragement from their paying customers.

How about Microsoft's large government and commercial customers make it a requirement that MS does not develop a single new feature for the next two fucking years or however long it takes to go through the entirety of the Windows+Office+Exchange code base and to make sure there are no security issues in there?

We don't need ads in the start menu, we don't need telemetry, we don't need desktop Outlook becoming a rotten slow and useless web app, we don't need AI, we certainly don't need Recall. We need an OS environment that doesn't need a Patch Tuesday where we have to check if the update doesn't break half the canary machines.

And while MS is at that they can also take the goddamn time and rework the entire configuration stack. I swear to god, it drives me nuts. There's stuff that's only accessible via the registry (and there is no comprehensive documentation showing exactly what any key in the registry can do - large parts of that are MS-internal!), there's stuff only accessible via GPO, there's stuff hidden in CPLs dating back to Windows 3.11, and there's stuff in Windows' newest UI/settings framework.

lazycog512
0 replies
1h28m

"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at and repair."

- Douglas Adams

ksec
0 replies
23m

The article mentions Windows and Linux. Does anyone know if there will be eBPF for FreeBSD?

klooney
0 replies
3h7m

First io_uring, now eBPF. Kind of wild.

egorfine
0 replies
1h12m

One option to prevent this is to not run corporate spyware. But I guess for some industries this isn't an option.

dveeden2
0 replies
1h58m

So eBPF is giving us eBFP (enhanced Blue Friday Protection)?

brundolf
0 replies
1h54m

This sounds like a cool technology, but this was the really egregious problem:

There are other ways to reduce risks during software deployment that can be employed as well: canary testing, staged rollouts, and "resilience engineering" in general

You don't need a new technology to implement basic industry-standard quality control

bfrog
0 replies
1h58m

I wonder if microkernels ever had this kind of bullshit. Had it been a microkernel, would we all be sitting twiddling our thumbs on Friday? Hot take: No.

__MatrixMan__
0 replies
4h10m

Maybe we should start taking Fridays off to commemorate the event, which probably would have been less bad if more people spent less time with their nose to the grindstone and had more time to stop and think about how it all was shaping up and how they could influence that shape.

Scene_Cast2
0 replies
4h48m

How much extra security does this provide on top of HLK?

ReleaseCandidat
0 replies
3h50m

Sorry, but neither eBPF nor Rust nor formal verification nor ... is going to solve that problem. Repeat after me: there are no technical solutions to social problems. As long as the result of such an outage is basically an "oh, a software problem! shrug", _nothing_ will change.

0xbadcafebee
0 replies
36m

In the future, computers will not crash due to bad software updates

I'm still waiting on my flying car...