It's wild that we're in 2024, and there's a security issue that is essentially "we parse data in an ad-hoc format with an ad-hoc parser based on regexps, and we got that wrong". And the solution is "we tweaked the regexp", and not "we properly specified/documented the format (or replaced it with something properly specified), and wrote a parser that correctly implements the specification".
Even if I had to use regex, I'd at least do it in 2 stages:
- Extract all "author" lines, fail-fast if there is more than one
- Parse the line, again fail-fast if there is anything strange
This would stop this attack, and similar ones, completely. Postel's law has no place in modern development if we want things to be secure.
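A minimal Python sketch of that two-stage idea (the strict pattern in stage 2 is only illustrative, not git's canonical grammar):

    import re

    # Illustrative strict pattern: name (possibly empty), <email>, timestamp, timezone.
    AUTHOR_RE = re.compile(r"author ([^<>]*?) ?<([^<>]+)> (\d+) ([+-]\d{4})")

    def parse_author(raw_commit: str) -> tuple[str, str]:
        # Headers end at the first blank line; everything after is the message.
        headers = raw_commit.split("\n\n", 1)[0].splitlines()

        # Stage 1: collect every line that claims to be an author line,
        # without looking at the rest of it. Fail fast on duplicates.
        author_lines = [line for line in headers if line.startswith("author ")]
        if len(author_lines) != 1:
            raise ValueError(f"expected exactly one author line, got {len(author_lines)}")

        # Stage 2: strictly parse that single line; anything strange is an error.
        m = AUTHOR_RE.fullmatch(author_lines[0])
        if m is None:
            raise ValueError("malformed author line")
        return m.group(1), m.group(2)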
Also known as "make other people's incompetence and inability to conform to a spec your problem."
Yeah that “law” makes absolutely zero sense to me. It’s like “be nice” but in a self-sacrificial way which then indirectly sacrifices other people in the long run.
It makes sense for a lot of situations purely out of practicality. If a good portion of your customers use tools that break spec, your choices are to accept the brokenness or lose the customer.
Take browsers, for example. A lot of people complain that browsers try to work around broken certificate setups on servers, but if they didn’t, users would just switch to browsers that did.
Practical? So the adage is just “be liberal because you have to out of practical necessity”? Are you sure that’s the history behind it?
Idk it makes sense for UX sometimes. Really it’s a case by case basis and I don’t think you should be adhering to any law for this.
You're not wrong, but sometimes you have to be pragmatic. I've done a lot of work building interfaces to acquire clinical data from healthcare provider organizations for care quality improvement. There are clear industry standards from HL7 (V2 Messaging, CDA, FHIR) but many provider organizations don't implement them correctly, so we had to accept a lot of junk and just find a way to deal with it. Many of those provider organizations lack IT resources and are at the mercy of their vendors or consultants, so trying to hold them to strict standards results in an impasse.
It's from an era when interoperability in a then-very-small universe was very important to moving the Internet forward. Back then there were no spammers, and very few bad actors. It's possible that the "Postel law" helped for a while -- hard to say for sure though.
This still has the same problem. Your solution requires them to program the first "extract all author lines" step to accept both lines; that is, they would have had to make sure their extractor considered author lines with a zero-length author name to be "an author line". But if they had the foresight to do that, they could just as well have done it with their original regex to begin with.
I don't see how?
As I said, actually parsing the line is the 2nd step. The first step is to extract all lines which start with "author ", without checking what the rest of the line looks like. That could be something like starts_with?("author "); or splitting on the first space and checking that the first word is "author"; or verifying against a "^author " regex; or taking the first 7 chars and comparing against "author ". All of those methods would easily catch the duplicate line in the attack and reject it.
(You might wonder whether a bare "author", without any argument, would be bad. First, even if it were, the impact would be pretty low, as there is no second name to inject. Second, there is no need to worry: git actually checks for a space after "author" [0], so a plain "author\n" by itself would be rejected.)
[0] https://github.com/git/git/blob/e02ecfcc534e2021aae29077a958...
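A quick sketch of those four equivalent checks (the smuggled second author line here is invented for illustration):

    import re

    def by_startswith(line): return line.startswith("author ")         # starts_with?("author ")
    def by_split(line):      return line.split(" ", 1)[0] == "author"  # first word is "author"
    def by_regex(line):      return re.match(r"^author ", line) is not None
    def by_prefix(line):     return line[:7] == "author "              # first 7 chars

    headers = [
        "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904",
        "author  <attacker@example.com> 1700000000 +0000",     # hypothetical smuggled line
        "author Mona Lisa <mona@example.com> 1700000000 +0000",
        "committer Mona Lisa <mona@example.com> 1700000000 +0000",
    ]

    for check in (by_startswith, by_split, by_regex, by_prefix):
        assert sum(map(check, headers)) == 2   # both author lines are caught -> reject the commit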
I'm saying that this already requires knowing that "without checking what the rest of the line looks like" is an important consideration. It would be just as valid to define "extract all author lines" as "extract all lines that have `author <value>`", as they did.
Only by knowing that the latter definition causes the bug that the TFA found (either by thinking about it or by testing against the canonical git implementation) would it become known that that definition is wrong.
The problem is not the structure of the parser. The problem is that they didn't check that their implementation matches the behavior of the system it speaks to (git).
Edit: Also covered in https://news.ycombinator.com/item?id=39101419 / https://news.ycombinator.com/item?id=39101736 / https://news.ycombinator.com/item?id=39107945
Well, yeah, you have to think about security to write secure code.
Let's say you are working at Microsoft and you need to implement commit parsing, and all you have is a few examples. Which of the statements below matches your thoughts the best?
1. I am sure no one will ever pass anything unusual or try to hack this; after all, this is just a publicly accessible endpoint which protects important user data. I am going to go with the solution that is fastest to implement: a single regexp.
2. It's important that this function does not fail, so I am going to do my best to find a valid user email. If there are multiple lines or some lines are malformed, I'll just keep searching. I'll use a single regexp to make sure my code is robust to frontend changes.
3. Looks like the relevant line is "author " followed by a name and email in some format, followed by a datestamp... I know email parsing is extremely complex, so it's important to get this right. I wonder how they handle all the details: national characters in emails, quotes in real names... Oh well, I'll just grab the whole "author" line and apply a strict regexp check, and if anything is weird I'll fail the request. This may reject a user with a weird name, but we can fix it if the user complains. Oh, and I know git only allows a single author per commit, so I am going to enforce that too.
4. All of the data samples are in the exact same format: "tree", "author", "committer", "gpgsig", space-prefixed gpg signature, empty line, commit message. And they all come from the same origin (codespace javascript), so I don't think they will change too much. I am going to hardcode this specific field order, and if there are any discrepancies, I will fail the request. The frontend team may get annoyed at me if they update the git library and tests start to fail, but at least the system will be secure.
---
So I assume no one is going to argue for option 1 (although all those links you have posted are making me doubt that).
Option 2 is what Postel's law is about. It was really popular in the early days of the internet, as it allowed much more compatibility between systems. I'd argue it is not a good idea on today's internet at all.
Option 3 is what I would do myself. Note I am not using any knowledge about git internals, this is all just generic defensive thinking about network protocols.
Option 4 is what I'd do in really secure systems where I control all the parts (rough sketch below). It could be a bit of overkill for the github case though...
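For what it's worth, a rough Python sketch of the option 4 shape check, with the field list taken from the samples described above (purely illustrative):

    # Hard-code the exact header layout seen in the samples; any deviation fails the request.
    EXPECTED_FIELDS = ["tree", "author", "committer", "gpgsig"]

    def check_commit_shape(raw_commit: str) -> None:
        headers, _, _ = raw_commit.partition("\n\n")    # headers end at the blank line
        fields = []
        for line in headers.splitlines():
            if line.startswith(" "):                    # space-prefixed continuation (gpg signature body)
                continue
            fields.append(line.split(" ", 1)[0])
        if fields != EXPECTED_FIELDS:
            raise ValueError(f"unexpected commit layout: {fields}")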
I didn't know, so if anyone wants to know, here's the summary from Wikipedia.
https://en.wikipedia.org/wiki/Robustness_principle
It's a tough balance. Postel's law is from a world before security mattered. It is incompatible with correctness.
The early Internet definitely wouldn't have worked without a thick layer of "you set this flag wrong in the header but I know what you meant". But these liberal formats have been the death of many a "0 remote root holes in the default install" OS slogan ;)
Git, meanwhile, would be something like refusing to touch any repository that contains a malformed commit.
People would be mad and would stop using Git, so "git clone" has to accept whatever. But the signature-generating code doesn't need to be this liberal; it can run this validation and say "I'm not going to sign that", and nobody would be mad. GitHub's implementation is just a shortcut: some engineer tried their flow against a test commit they made, it worked, they checked it in, they shipped the feature. Then someone thought "what if there are two author lines?", and broke the signing code. (Maybe write some fuzz tests that emit headers and try signing them? You might not find this bug, but it can help.)
The problem is the missing spec, but specs are useless if not enforced, because people won't know that they accidentally violated the spec. That's what's tough. Postel's Law guarantees working software. But sometimes working means failing open. (Failing closed can definitely suck. Ever been late to work because of a "signal problem"? The signalling system failed closed and prevented your train from moving even though there wasn't any actual reason to prevent movement. Postel's Law would say "it's just a burned-out light bulb; power ahead at line speed", not expecting a randomly misrouted train barreling into you head-on. 99.9% of the time, it's right! But there are a lot of dead 0.1%-ers, which is why, after the early days, we decided to make the system fail closed.)
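A hedged sketch of the fuzz-test idea mentioned above: generate messy headers and check that the signing-side parser never disagrees with whatever reference mirrors git's behaviour. Both parsers are passed in as placeholders here, since neither implementation is shown in the thread:

    import random
    import string

    FIELDS = ["tree", "parent", "author", "committer", "gpgsig"]

    def random_headers() -> list[str]:
        """Emit header blocks with duplicated, reordered, and mangled fields."""
        lines = []
        for _ in range(random.randint(1, 8)):
            name = random.choice(FIELDS)
            junk = "".join(random.choices(string.ascii_letters + "<>@ .+-", k=random.randint(0, 40)))
            lines.append(f"{name} {junk}")
        random.shuffle(lines)
        return lines

    def outcome(parser, headers):
        """Normalise a parser's result: the author it found, or None if it refused."""
        try:
            return parser(headers)
        except ValueError:
            return None

    def fuzz_once(author_under_test, author_reference) -> None:
        """Differential check: both parsers must find the same author, or both must refuse."""
        headers = random_headers()
        assert outcome(author_under_test, headers) == outcome(author_reference, headers), headers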
It makes sense when you want rapid adoption, as was the case with the Internet.
Especially wild for GitHub, I would have thought they had the entirety of 'git foo' codified up with actual parsers by now, but I guess a regex will always do..
This doesn’t surprise me at all coming from GitHub. I’ve tried using their public APIs, and they’re all shockingly bad. In some cases I thought it was my mistake, but then I checked their website and found it was bug-for-bug compatible with the thing I’d built.
My usual rule of thumb for choosing products is to go with whatever rising alternative there is to the entrenched monopoly. The little one needs to be much better to overcome monopoly effects.
I don’t sign my own paycheck, but would be moving to gitlab or something if I did.
Well the user-git-facing parts of GitHub are built on libgit2 and libssh.
I must be missing something, because the corrected regex (/\Aauthor ([^<]*)[ ]{0,1}<(.+)>/) seems to behave the same as the problematic one. I wonder whether GitHub may not be using PCRE.
The problem wasn't that the regexp matched the second "author" line, but that it didn't match the first one.
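If anyone wants to poke at it, here's a quick Python check. Note the "old" pattern below is only a guess at the shape of the problematic one (it isn't reproduced in the thread), and Ruby's \A just anchors at the start of the string, which re.match already does:

    import re

    corrected = re.compile(r"author ([^<]*)[ ]?<(.+)>")   # the fixed pattern, minus Ruby's \A
    guessed_old = re.compile(r"author (.+) <(.+)>")       # hypothetical: requires a non-empty name

    nameless = "author  <mona@users.noreply.github.com> 1700000000 +0000"   # made-up example
    normal   = "author Mona Lisa <mona@users.noreply.github.com> 1700000000 +0000"

    for pat in (corrected, guessed_old):
        print(pat.pattern, bool(pat.match(nameless)), bool(pat.match(normal)))

    # Both patterns agree on the normal line; they only disagree on the nameless one.
    # That's the differential: whichever author line the verifier's regex skips,
    # git itself may still honour.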
There's nothing inherently wrong with using a regular expression here.
The security issue is that there are two different parsers and they interpret invalid input differently. This is a parsing differential vulnerability.
This security issue can (and does) occur whenever there is more than one parser implementation.
git treats objects as a generic structure ("a list of space-separated name-value pairs, maybe with space continuations, followed by a double newline, followed by the message") plus additional rules for individual values. Something very common in older protocols -- see HTTP, deb control files, email, etc...
And GitHub's code decided to short-circuit the process and combine the two steps into one, which introduced weird inter-layer interactions where lines with a good name but invalid content would be silently rejected.
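A small Python sketch of that layering, generic structure first and per-field rules afterwards (simplified; real git is stricter about ordering and encoding):

    def parse_object(raw: str) -> tuple[dict[str, list[str]], str]:
        """Layer 1: generic structure only -- name-value header lines, space continuations,
        blank line, then the message. No field-specific rules yet."""
        header_blob, _, message = raw.partition("\n\n")
        fields: dict[str, list[str]] = {}
        current = None
        for line in header_blob.split("\n"):
            if line.startswith(" ") and current is not None:
                fields[current][-1] += "\n" + line[1:]    # continuation (e.g. gpgsig body)
            else:
                name, _, value = line.partition(" ")
                fields.setdefault(name, []).append(value)
                current = name
        return fields, message

    def single_author(fields: dict[str, list[str]]) -> str:
        """Layer 2: rules for individual fields, applied once the structure is known."""
        values = fields.get("author", [])
        if len(values) != 1:
            raise ValueError("expected exactly one author field")
        return values[0]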
I wrote a chintzy Preact app that pulls the source for a Pokemon Red hack from GitHub and makes Pokemon stats, moves, type matchups, etc. searchable [1], because enough has changed from the original that I wanted a quick reference for my current playthrough. I literally wrote about it in the TODOs in the README because I felt guilty about doing things like that. Maybe I shouldn't be so hard on myself.
[1] https://github.com/alexjplant/rgbdex
A proper AST would "read" the source code programmatically, as opposed to a brittle regex? Is there any benefit to building that when your regex works and what you're parsing is static? (Genuine questions; my compsci background is not strong.)
This is the correct take.