A good day to trie-hard: saving compute 1% at a time

amluto
60 replies
1d3h

If you had asked me to make a wild guess as to how Cloudflare stores internal headers and then removes them, I would have come up with some options:

- An entire separate dictionary or other data structure.

- One single header containing all internal metadata.

- All headers have a prefix, and the internal ones start with I and the external ones start with E.

- All internal headers start with “CFInt”.

I would not have come up with a scheme in which headers in a particular list are internal. (What if someone else uses this name? What if something forgets to sanitize? What if different simultaneously running programs disagree in the list? What if the Connection header names a Cloudflare-internal header? What if the set-difference algorithm is annoyingly slow?)

The web is already full of obnoxiously ambiguous in-band signaling and header naming, and I find it bizarre that a company with Cloudflare’s scale uses such a tedious and error-prone mechanism internally.

efitz
16 replies
1d3h

I've worked at several huge corporations in IT security, where we care about headers a lot, and they all use headers in a manner similar to Cloudflare.

Including using proxies at the edge to strip out internal headers bidirectionally: yes, inbound too.

sitkack
12 replies
1d2h

That doesn't make it better, it makes it worse.

hinkley
11 replies
1d2h

Are you on a corporate network? Do you use a firewall at home?

You’re on enclaves all the time. This is just a different one. Separate networks per class of traffic used to be de rigueur before Cloud. Now it’s all munged together.

sitkack
9 replies
1d2h

I understand that security needs to be pragmatic, but just because security does it doesn't make it right.

It isn't about being in the enclave; it's about having to keep track of which headers you set versus which came from outside. It is fragile and error-prone, and it will absolutely break someone else when there is a name collision.

All security exploits are a subset of bugs.

hinkley
8 replies
1d1h

And what would you have them do instead? SSL to every service? Reauthenticate on every single call? What’s your clever solution?

There’s probably a dozen things you can terminate at a boundary that would cost so much to run through the entire system that it would bankrupt a company.

And then there’s tracing, which also uses headers.

immibis
6 replies
1d1h

One of the listed options was to begin all the Cloudflare internal header names with "CFInt".

hinkley
5 replies
1d1h

That’s still header filtering. And it should be X-CF- by the way.

amluto
2 replies
1d1h

X-CF- seems to be used by some other software. And Cloudflare has plenty of documented headers that don’t start with X.

The whole point is that a genuine user header should never conflict with a Cloudflare internal header, and a browser should never see an internal header.

sitkack
0 replies
1d

Having re-read the thread, I totally agree. What you have listed is table stakes. I'd also say that internal headers should be encrypted to a narrow protection domain.

If all internal headers were prefixed with X-CF-, you could strip them all via SIMD that had no knowledge of any specific header. Hell, you could probably do it on the NIC.
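
A minimal sketch of what that prefix rule buys you (scalar Rust, no SIMD), assuming headers live in a plain name/value list; the function names here are made up:

    fn is_internal(name: &str) -> bool {
        // Case-insensitive check of the first five bytes; no per-header
        // list to maintain, and future internal headers match for free.
        name.as_bytes()
            .get(..5)
            .map_or(false, |p| p.eq_ignore_ascii_case(b"x-cf-"))
    }

    fn strip_internal(headers: &mut Vec<(String, String)>) {
        headers.retain(|(name, _)| !is_internal(name));
    }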

hinkley
0 replies
1d1h

Please reread the entire thread. We are talking about someone who thinks the header solution is stupid. You are both splitting hairs. Stay on target.

hinkley
0 replies
23h49m

Good to know.

(Also sibling is right, I spaced on X-CF- being a header sent to CF customers’ servers. I don’t use Cloudflare, but CloudFront does the exact same thing.)

8n4vidtmkvmk
0 replies
13h0m

Maybe I'm dumb or missing something, but why not just do what Google does: convert all inbound HTTP requests into protobufs, separate the internal stuff from the external stuff, run it through your myriad internal services, and then on the way out you still have your nicely delineated protobufs, which you can convert back into HTTP. Why are we mucking about with headers in the first place?

the8472
0 replies
23h45m

Configure all machines in a way that they can survive on some random untrusted public wifi. This should be the obvious default stance for laptops and phones.

But even for workstations wired to a trusted LAN it still makes sense because you never know which of the various tunnels might assign and expose some IPv6 address to the internet.

For servers you might be able to make an exception if you have a vigilant IT and the people with admin access aren't actively trying to circumvent security, but even then that better not be your only layer of security.

hinkley
2 replies
1d2h

The first large-scale piece of software I worked on was for telcos, pre-smartphone. We used internal headers to offload authentication and terminate SSL. We also had to pressure F5 to fix about half a dozen bugs in BIG-IP to do so. Bugs that should in no universe have existed in version 9 of a product.

I used to joke that F5 owed me and my coworker 3 months of salary for all the free QA we did for them.

tmsite
1 replies
1d1h

It helps if you realize that BIG-IP 9.0 was essentially a from-scratch rewrite of BIG-IP 4.5. Among other major redesigns, the data plane was moved from BSD kernel space to Linux user space. Internally, the joke was that it would be two times better when we were done (4.5 * 2 = 9.0). It probably was, but not on day zero.

hinkley
0 replies
1d1h

On day sixty it couldn’t do SSL termination and cookie based traffic shaping didn’t work.

I was a little bummed it was “just” a Linux box but that’s pretty common today. I hadn’t discovered dd-wrt yet and wouldn’t for a few years.

jgrahamc
15 replies
1d3h

Yeah, me too, but systems grew over time and grew and grew and we were using HTTP headers for all sorts of stuff. This optimization makes the situation better, but the solution (which is underway) is to use a different mechanism for IPC and get rid of the use of HTTP headers completely.

amluto
6 replies
1d2h

I’m certainly familiar with systems growing over time and outgrowing their original schema.

Is this the same phenomenon that resulted in Cloudflare Tunnels apparently being domain names? Or why domains that are proxied by Cloudflare show up as “CNAME” in the DNS panel? (The latter one seems extra strange — wasn’t the reverse proxy service the very first service that Cloudflare offered? There must be some history here.)

hinkley
5 replies
1d2h

All new products are full of scaffolding that has to be removed when robustness becomes a higher priority than existence. Where we fail is not calling it that. Instead we just call some code “good” while our thicker skinned coworkers call the rest “fine”. It’s not fine. I don’t want to be working overtime the week of thanksgiving because you can’t think past the horizon.

gmueckl
4 replies
1d

There is a very fine line between "good enough" and "not good enough" in any product beyond a certain complexity. Finding the pieces that cross that line can be hard and improving "good enough" parts is (sadly) mostly a waste of time in commercial settings.

hinkley
2 replies
1d

You have to backdate starting a project so it happens before things stop being good enough. And take into account bad estimates and the five other deadlines between then and now.

Often it’s better to clean as you go. If you have the time and the inclination, shore up one of the least good enough things.

It’s not unheard of to open up new feature sets via refactoring. Something that previously would have taken an “impossible” amount of time now becomes profitable, not just possible.

dmorgan81
1 replies
17h18m

I miss working at a place where that was encouraged. Now if I do that the testers complain that it’s not in scope for the ticket and to split the changes into another ticket with a separate test plan. The rot thus continues.

hinkley
0 replies
16h58m

I think this is about when I discovered zone defense for performance optimization. QA doesn’t want to chase your code changes halfway across the codebase, and that’s fair. At some point it starts looking like you’re changing things just to change them, even if it ends up being a few percent better.

But if you limit yourself to one part of the interaction, they don’t get as mad. A dozen changes in the signup system piss them off less than five changes across the app.

didgetmaster
0 replies
1d

This is why some bugs never get fixed (that is if you define a bug to include inefficient code and not just code that breaks something).

DamonHD
5 replies
1d3h

And when that happy day dawns, slightly sadly, this nice optimisation will evaporate!

jgrahamc
4 replies
1d2h

Maybe the real treasure was the optimizations we made and discarded along the way.

DamonHD
2 replies
1d1h

Indeed. I am very happy with some of the tricks that I developed (speeding up, and indeed allowing, Solaris C++ linking by prefiltering on MD5 hashes of canonicalised .o files -> nm, anyone?) even though they were lost without trace (yep, Lehman Brothers, where are my Imakefiles now?)...

jkaptur
1 replies
1d1h

They're probably as critical as ever at Barclays :)

DamonHD
0 replies
1d

I really hope not, given that that very old Solaris compiler should have been laid to rest many many years ago. Yes, some of my stuff had kinda made it into BarCap (even onto my desktop dev machine!) when I turned up there later, and even some at Nomura when I rocked up there again IIRC!

Twirrim
0 replies
23h16m

My likely overly cynical concern here is that this suggests trie-hard will soon end up abandoned, as you're making it sound like it is a stop-gap solution for you.

throwaway77385
0 replies
1d3h

For anyone not in the loop, the above poster (JGC) is Cloudflare's CTO himself :)

alex_suzuki
0 replies
1d1h

You chiming in on this post makes me slightly bitter at having gone with CloudFront (I have everything on AWS so it seemed the obvious choice) instead of Cloudflare.

hn_throwaway_99
9 replies
22h7m

I feel like I can point out lots of similar problems with the other solutions you suggest (heck, I think some of the problems you list even apply to those other solutions).

The list approach has some downsides, but it also has a bunch of upsides. I feel like when people point out potential flaws of these approaches, they're ignoring the history and difficulties that come with Cloudflare's scope. An enumerated list is the simplest and most flexible of the approaches, and it also doesn't require any a priori agreement on the structure of a header key - this is probably important when you think about the sheer number of teams at Cloudflare, potential technology acquisitions, etc.

Aeolun
6 replies
18h38m

it also doesn't require any a priori agreement on the structure of a header key - this is probably important when you think about the sheer number of teams at Cloudflare

That’s more of an indictment of how little effort is spent aligning things. It’s not that hard to tell every team that any headers have to start with ‘CFint’ and then enforce that.

poincaredisk
1 replies
17h25m

* Headers that are originally private, but are later changed to public

* Headers that are intended to be public, but then are changed to private

* Headers that were created in the very early days of the company, which are so deeply integrated into everything that renaming them is basically impossible (or at least very hard and risky)

* Headers that are set by some appliance/third party solution not controlled by CF, but which are technically internal and must be stripped.

* Newly acquired company has its own set of internal headers. Management is not amused when a huge, disrupting refactor is suggested by engineers.

And this is just the tip of the iceberg of possible problems when your scale is big enough.

bhawks
0 replies
12h49m

So because everything is a Byzantine labyrinth due to scale, we're all going to agree to remember to register every internal header with yet another strip-list service?

Maintaining the header to strip list is just another piece of technical debt to keep paying down.

Having a policy to strip everything under a prefix seems much less error prone.

ketzo
1 replies
18h6m

It’s not that hard

How many years do you have to spend in software before you stop uttering those four cursed words?

Aeolun
0 replies
12h27m

It takes just as much experience to acknowledge that the cure is worse than the disease.

hn_throwaway_99
1 replies
15h44m

It’s not that hard to tell every team that any headers have to start with ‘CFint’ and then enforce that.

To be a little bit blunt, this just means you have never built software in a very large company that may have a lot of existing/completely independent software, or have made technology acquisitions after the fact. That is, if you are starting from "time 0" of course it's very easy to say "internal headers need to start with CFInt", but that's rarely the case.

Aeolun
0 replies
12h25m

I think if you work for such a very large company the only reason you are ok with it is because nobody at the top says ‘no more’.

If they do, it suddenly becomes very easy to get everything aligned, because there is likely a vast collection of bodies just collecting a paycheck that can be more productively assigned to manually renaming everything.

amluto
1 replies
20h4m

Lists, extensible protobufs, etc. are indeed great in their extensibility. The issue I would see (if I worked at Cloudflare) isn't using a list; it's that internal headers travel in the same list as the externally visible HTTP headers.

hn_throwaway_99
0 replies
15h41m

I don't see that as a problem at all:

"Hey any teams out there, before you can add internal headers, you need to register them with some service." Then the list is loaded from static storage at startup time and/or refreshed on some periodic basis.

taeric
5 replies
1d2h

Other ideas:

  - Ensure all headers added internally that are not for export sit at the front of the header list.
  - Preseed the hash table of all requests with the internal headers you plan to remove.
In fact, if you preseed, you are basically combining these ideas but fixing how many internal headers are on each request. At that point, you can use a linked hash table that preserves creation order and just remove the first N from the final list that you send back to clients.
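
A rough sketch of that combined idea, assuming the internal headers really are always seeded at the front (types and names here are hypothetical):

    // Internal headers occupy the first `internal_count` slots of an
    // insertion-ordered list, so stripping is just starting iteration
    // at a different offset; no per-name lookups at egress.
    struct Headers {
        entries: Vec<(String, String)>, // internal first, then the rest
        internal_count: usize,          // the fixed N seeded up front
    }

    impl Headers {
        // For hops between internal services: everything.
        fn iter_all(&self) -> impl Iterator<Item = &(String, String)> {
            self.entries.iter()
        }

        // For the client: skip the first N entries entirely.
        fn iter_external(&self) -> impl Iterator<Item = &(String, String)> {
            self.entries[self.internal_count..].iter()
        }
    }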

amluto
4 replies
1d2h

At that point, you can use a linked hash table that preserves creation order and just remove the first N from the final list that you send back to clients.

While Python provides data structures like this out of the box, designing a big system to require a special, nonstandard data structure to identify private metadata seems like a mistake to me.

taeric
3 replies
1d1h

I'm not sure I follow? Does rust not have the equivalent of a LinkedHashMap from Java? All I'm proposing is to start each map of headers off with a sentinel value for all of the headers you don't intend to send back to the client. Then, have a method that iterates over all values, for the places you need to send to internal servers, and another (probably the default?) that starts iteration after all of the internal items, which should just be a different pointer. At this point, there is no need to inspect each header by name, as that was effectively done at compile time when you put the protected headers at the front of the list.

Is this somewhat special? I mean, sure. But it wasn't long ago that a hash table was already in the special territory of data structures.

Edit: I should add that I'm not certain this idea would be better. Just more ideas on what you could do.

amluto
2 replies
1d1h

There’s a third party crate for this. C++’s STL also doesn’t have it by default.

The creation-order-preserving hash map is basically two data structures in one; it's more complex and slower than a normal hash map, and IMO it's rather bizarre. I have never really understood why anyone wants one.

taeric
0 replies
1d1h

I would be surprised if it was dramatically slower, all told. Though, fair that I have not benchmarked in a long time, here.

My main "idea" here is to stop from having to check all of the headers on a "per header" basis. Yes, you can make each check faster, as the TRIE does. You can also remove the entire reason to check individual headers by starting a traversal after where the internal headers are.

You could also go with something like making sure the first header is "InternalHeaderCount: N" where you then keep the internal headers segregated form the others by a simple count. (This assumes you have solid control over what is adding the headers internally, of course.)

(I also forgot to give kudos on the blog title. Any Die Hard reference is a good reference. :D )

Conscat
0 replies
1d1h

It's especially unfortunate that the intuitive name, `std::map`, is the ordered and generally least useful option among the standard library's map containers.

My last job did rely on ordered-ness of `QMap`, because the elements are listed in a Qt GUI and it could confuse users if those widgets randomly rearranged themselves. That's the only use-case I've personally encountered for an ordered hash map.

e63f67dd-065b
3 replies
23h56m

Former employee here; the interesting (horrifying?) fact is that you can set some of these internal headers (I remember a notable bug with cf-cache-status) in workers (the serverless offering) and cause all kinds of bad things to happen :)

Aeolun
2 replies
18h40m

It seems to me it should be trivial to strip anything internal that comes out of a worker, right?

ketzo
1 replies
18h7m

Trivial unless someone wants to modify CF internal headers as part of their solution…

https://xkcd.com/1172/

virtue3
0 replies
17h26m

Haha thank you

620gelato
3 replies
1d2h

Or perhaps even insert yet another header carrying just the list of internal headers added to the request; that assumes this happens in a single place, otherwise it's a recipe for disaster.

I have a slightly different example of this, where an RPC framework used at my company disallows the service owner from modifying certain headers (say, the request identifier) and will instead create a duplicate header with a suffix. In that scenario at least, I can see this as a fairly reasonable tradeoff, as the goal is to keep certain headers from being modified, not because they are platform internal but because there are certain assumptions associated with those headers that are followed company-wide.

I'll go check what mechanism is used for this matching.

adrianmonk
1 replies
22h31m

yet another header with just the list of internal headers

Or the same but with a list of headers which AREN'T internal.

You'll probably have a custom header-adding function that people should always use instead of the regular one. And this way, if someone forgets to use it, their header will get stripped.

You can think of a header escaping the internal network as something that needs to be authorized. This is a deny by default approach.
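
A loose sketch of that deny-by-default shape, using an in-process allow set rather than a literal header for brevity; every name here is hypothetical:

    use std::collections::HashSet;

    struct Request {
        headers: Vec<(String, String)>,
        external_ok: HashSet<String>, // names authorized to leave the network
    }

    impl Request {
        // The function everyone is supposed to use for client-visible headers.
        fn add_external_header(&mut self, name: &str, value: &str) {
            self.external_ok.insert(name.to_ascii_lowercase());
            self.headers.push((name.to_string(), value.to_string()));
        }

        // Headers added any other way are unlisted, so egress drops them.
        fn egress_headers(&self) -> impl Iterator<Item = &(String, String)> {
            self.headers
                .iter()
                .filter(|(name, _)| self.external_ok.contains(&name.to_ascii_lowercase()))
        }
    }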

620gelato
0 replies
21h54m

Ah yes - perfect.

amluto
0 replies
1d1h

This doesn’t work if the front end is older than an internal node and a header gets added, at least not without extra logic to patch everything up and handle conflicts.

The HTTP Connection header works like this, and one should generally assume that almost nothing implements it correctly.

giancarlostoro
0 replies
1d1h

What if someone else uses this name?

Couldn't you bypass this by prefixing the ones that aren't yours? Or prefixing your internal ones with something unique enough?

danbruc
0 replies
20h41m

This might of course introduce quite a bit of overhead, but the clean solution would arguably be to not mix requests to begin with: the external requests you are processing become the payload of the requests you are sending around internally. This would also allow you to cook up an optimized encapsulation protocol that might be more efficient to process than parsing and modifying HTTP. This admittedly comes from someone with essentially zero knowledge of what exactly Cloudflare is doing to the requests they receive.

anonymoushn
0 replies
19h13m

X years ago the manager of the cache team raised the lack of data plane and control plane separation as an issue and we came up with various plans to fix things, but I guess nothing has happened since

mightyham
18 replies
1d2h

I'm not very well versed in data structure optimization, but I was surprised they dismissed hash tables so quickly, especially when the table they are searching through is static. I find it somewhat hard to believe that a specially optimized hash table would not be faster than their trie implementation.

supermatt
14 replies
1d2h

They mentioned it is the overhead of the hashing that makes it slow compared to the trie.

hooli42
7 replies
1d1h

Hash functions can be as simple as a single modulo.

torusle
3 replies
19h13m

Or as simple as using the hardware accelerated CRC32 that we have in our x86 CPUs.

Last time I checked, CRC32 worked surprisingly well as a hash.
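
For reference, a hedged sketch of using the SSE4.2 instruction from Rust, byte-at-a-time for simplicity (the instruction also has wider 16/32/64-bit forms):

    #[cfg(target_arch = "x86_64")]
    fn crc32_hash(bytes: &[u8]) -> u32 {
        use std::arch::x86_64::_mm_crc32_u8;
        assert!(is_x86_feature_detected!("sse4.2"));
        let mut crc = !0u32;
        for &b in bytes {
            // Safety: guarded by the runtime feature check above.
            crc = unsafe { _mm_crc32_u8(crc, b) };
        }
        !crc
    }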

hooli42
2 replies
7h29m

Hah, neat.

The 'weird' instructions can often be 3-4 orders of magnitude slower than arithmetic instructions, though I doubt that matters here.

Validark
1 replies
4h22m

CRC32 is the same speed as an integer multiply, going all the way back to Nehalem (2008). 3 cycles latency, and you can start a new one each cycle (or more than one, on Zen 5).

hooli42
0 replies
49m

Sure, in the same way SIMD instructions get a linear speedup in theory.

If you Google the words "CRC32 is slow", you can see hundreds of people complaining about this.

ironman1478
2 replies
1d1h

As they said though, the hash function on a string type requires looking at every element in the string. Also, modulo is historically very slow. So they still have to deal with O(n) operations, where n is the length of the input string. If they can improve the memory-hopping problem associated with list/graph structures (which they claim to have done in their library), then a trie would be fast enough, which is what they observed. Combined with the fact that they claim 90% of trie queries are misses, you exit early a lot, whereas you always need to compute the whole hash of the input string when doing the hash-map strategy.

mightyham
0 replies
1d1h

The hash does not need to be computed on the whole string. I pointed this out in my other comment, but just as an example: a hash function could be as simple as xoring the first 16 and last 16 bits of the string, then indexing a 2^16 array. That means hashing is two pointer offsets and an xor (no modulo required). If there are 100 strings that need to be removed, then ~99% of rejections will be a very fast O(1).

And in the case of a match, a fast hash + memcmp will be way faster than a trie traversal. In fact, according to the trie-hard github readme, std::HashMap is already much faster than their trie implementation when searching for a match.
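
Concretely, the kind of toy hash being described might look like this; whether it is actually collision-free depends entirely on the real header names, so treat it as illustrative only:

    // Xor the first and last two bytes and index a 65,536-entry table.
    fn toy_hash(name: &[u8]) -> usize {
        let n = name.len();
        if n < 2 {
            return 0;
        }
        let first = u16::from_le_bytes([name[0], name[1]]);
        let last = u16::from_le_bytes([name[n - 2], name[n - 1]]);
        (first ^ last) as usize // index into [Option<&'static str>; 1 << 16]
    }

    // On a hit, confirm with one memcmp-style comparison:
    //   table[toy_hash(name)].map_or(false, |s| s.as_bytes() == name)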

hervature
0 replies
1d1h

As they said though, the hash function on a string type requires looking at every element in the string.

Maybe for the default hash function. As another commenter pointed out, your data may make the following hash very effective: s[0] + s[len(s)//2] + s[-1], which would be very fast. The point is that spending a day seeing whether such a hash exists is worth it.

mightyham
4 replies
1d1h

The point I'm making is that a specially optimized hashing function would probably blow away any trie traversal. When you know the data being hashed before hand it is possible to make custom hash functions that are as simple as a few bitwise operations on the first, middle and last character of the string (just an example).

treis
0 replies
1d1h

Isn't the sticky wicket collisions with headers defined externally to CF?

dividuum
0 replies
1d1h

Right. Out of curiosity I looked into what code gperf generates and it seems it would end up O(1) for misses (it's always three lookups into a generated table) and O(L) for hits, as a potential hit has to be confirmed by essentially a strcmp. Not sure how something like https://github.com/rust-phf/rust-phf would work, but probably similar?

dathinab
0 replies
10h20m

If all your internal headers start with cf-, most non-internal headers don't (but not all), and most headers are non-internal, a trie might be hard to beat.

Denvercoder9
0 replies
19h38m

They don't know all the data that's being hashed, though. They know everything that's in the map, but not all the keys they're trying to lookup.

vlovich123
0 replies
1d

If they just used the default, that's SipHash, which is optimized for security and DoS resistance, not performance. XXH3 is ~10x faster than that, and there are some newer ones that are even faster, like GxHash (another ~1.5x-2x faster).

You would precompute the hash for the constant string list and then compute the hash only for the set of strings that begin with C/c, since "cf-" is a common prefix for those internal headers if I recall correctly, using RawTable to find entries by hash and then comparing against the precomputed value to see if a collision matched before checking a string for equality.
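
A loose sketch of that shape, assuming names arrive already lowercased and using the third-party xxhash-rust crate (xxh3 feature) with a plain HashMap instead of RawTable; note the standard map re-hashes the u64 key, which the RawTable trick avoids:

    use std::collections::HashMap;
    use xxhash_rust::xxh3::xxh3_64;

    struct InternalSet {
        // Precomputed once: xxh3 of each internal header name.
        by_hash: HashMap<u64, &'static str>,
    }

    impl InternalSet {
        fn contains(&self, name: &str) -> bool {
            // Cheap first-byte filter: internal names share the "cf-"
            // prefix, so most misses never pay for a hash at all.
            if name.as_bytes().first() != Some(&b'c') {
                return false;
            }
            self.by_hash
                .get(&xxh3_64(name.as_bytes()))
                .is_some_and(|&stored| stored == name)
        }
    }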

rapsey
0 replies
1d

Note FxHashMap is just the std HashMap with a different hashing algorithm.

samatman
0 replies
21h35m

A hash function can't reject a string with a single test of the first byte. It just can't.

For this application, that's a leg up which a hash won't be able to touch. The rest of the art here is shaving down the constant factor of a trie far enough that the initial boost pays off in real-world performance.

hyperpape
7 replies
1d

I'm interested to know why the regex crate didn't do better. I believe it should compile a search for multiple literal strings into an Aho-Corasick automaton, which is structured very much like a trie.

burntsushi
6 replies
1d

regex crate author here. From a quick look at trie-hard, my guess is that trie-hard can do lookups on short strings much more quickly than the regex crate can do matches on short strings. The regex crate has a fair bit of overhead for selecting the right matching engine and applying other optimizations appropriate to general purpose matching.

But the aho-corasick crate, and especially the specific automatons, would be interesting to try. There's much less overhead there. But, there are some differences:

* The contiguous NFA in aho-corasick has a lot more going on inside its "next transition" routine: https://github.com/BurntSushi/aho-corasick/blob/cd400ad792d6... --- This is primarily because an Aho-Corasick automaton isn't just a trie. It also has failure transitions to support unanchored search. You can search without respecting failure transitions (and thus treat it as a trie), but the code is still there to deal with failure transitions. So that may have deleterious effects.

* The contiguous NFA also specializes its transition function depending on the size of the node. It would be interesting to see whether these techniques would be useful for trie-hard to play with. But there is definitely a benefit to keeping one simple and fast code path for all node types. (Less branching.)

* The aho-corasick crate does also expose a DFA directly: https://docs.rs/aho-corasick/latest/aho_corasick/dfa/struct.... --- In theory this should do less work per transition than trie-hard, and so it would be very interesting to measure its perf when compared to trie-hard. The main downside is that it's a DFA and can use up a lot more memory and take longer to build. It seems like Cloudflare cares less about that in favor of optimizing read perf though, so it's not clear why they didn't go this route.

* The aho-corasick crate doesn't let you store arbitrary values. You only get back a PatternID when a match occurs. So to get equivalent functionality as trie-hard, you'd need a separate map of PatternID->YourValue. Since that isn't stored with the data in the trie, this alone might penalize you quite a bit with poorer CPU cache performance.

* trie-hard seems to provide some customization options for its representation, i.e., it lets you use a `u8` to store its transitions versus a `u128` or `u256`. This might also help with cache usage for a small enough trie.

vlovich123
2 replies
21h21m

can use up a lot more memory and take longer to build

Does it support const construction so that you build it during compilation from a static set of strings?

burntsushi
1 replies
21h15m

Nope. That doesn't usually help in practice because it's not hard to amortize construction with a std::sync::LazyLock.

(The `memchr` crate doesn't even support const construction of a single substring searcher. I haven't spent much time thinking about it because `const` is still pretty limited in what it supports, although progress is being made. For example, you can't use trait methods in `const`. So while you can undoubtedly use `const` Rust to build a single substring searcher, whether `const` Rust can be used to build the one in `memchr` specifically isn't clear to me. And if it could be done, it will undoubtedly come with development/code-clarity costs and it's not clear to me that those are worth paying. Because as I said, amortizing construction for a static set of strings is usually very simple to do.)

And in any case, "takes a long time to build and uses a lot of memory" isn't avoided by doing it at compile time. Instead, you get bloated binaries and longer compile times. Which costs matter more? Dunno, especially when you amortize construction at runtime.
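
For anyone unfamiliar, the amortization mentioned above looks something like this; AhoCorasick::new and is_match are the real crate API, while the pattern list is invented:

    use std::sync::LazyLock;
    use aho_corasick::AhoCorasick;

    // Built once, on first use, then shared by every subsequent search.
    static INTERNAL: LazyLock<AhoCorasick> = LazyLock::new(|| {
        AhoCorasick::new(["cf-internal-a", "cf-internal-b"]).expect("valid patterns")
    });

    fn hits_internal(name: &str) -> bool {
        INTERNAL.is_match(name)
    }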

vlovich123
0 replies
21h3m

Yeah I guess I was mainly just asking "is const powerful enough yet for this to be a trivial addition".

Instead, you get bloated binaries and longer compile times

Longer compile times probably are solved by having the strings live in a standalone compilation unit which the compiler will end up caching and you can only dirty it by adding a new string. Needing to shuffle that extra data around might still cost extra compile time, but in a real world program I suspect that's likely to not matter (both in terms of binary size & that other things in the program will be more expensive to compile).

That being said, I agree that doing a LazyLock is probably sufficient, and given the number of entries being compared against, acquiring the lazy lock probably also gets amortized (it's cheap but not free).

samatman
2 replies
21h15m

This is a gratifying read (along with the article) because I've been working on closely related things, and had come to the tentative conclusion that taking Aho-Corasick, anchoring it (so no failure transitions), and laying it out in contiguous memory, basically results in Liang's packed trie from TeX's hyphenation algorithm.

trie-hard seems to be effectively that, but with some clever masking to get mileage out of having a limited byte vocabulary to deal with.

The DFA might pull ahead, but I wouldn't just bet on that. Branch predictors don't like lookup tables much, and the memory locality hit is small but real: more of a packed trie fits in a cache line than would a DFA.

You seem to have landed on similar conclusions here: beating a packed trie for this specific use case is going to be tough.

I also like that they hit on the same popcount-and-shift technique I've used in a data structure for UTF-8 character sets: https://github.com/mnemnion/runeset. I admittedly haven't benchmarked this much, correctness first and all, but seeing that Cloudflare has seen good numbers for something closely related reinforces the few tests I have run.

As a text-matching connoisseur, you may find the runeset interesting. It turns out to be hard to search for papers on "UTF-8 character set", so if you're aware of any prior art here I'd love a heads-up. It's fundamentally a generalization of using u64 bitmasks for sets of ASCII, which is as old as the hills at this point.

burntsushi
1 replies
20h50m

RE DFA: possibly, but not completely obvious to me. The DFA in aho-corasick doesn't typically have 256 different transitions per state. It uses "byte classes" to represent a typically smaller number of equivalence classes. Two bytes are in the same equivalence class if a match/non-match cannot be distinguished between them.

To be clear, I'd bet trie-hard is still more space efficient. And using byte classes does require an extra look-up per transition.

But, worth measuring.

As for your RuneSet thing, I'd look at lightgrep and icgrep. I'm not 100% sure they have relevant work for you, but they might. In practice, I just combine sets of codepoints down into UTF-8 automata since that's needed anyway for lazy DFA matching.

samatman
0 replies
16h40m

not completely obvious to me .. worth measuring.

I concur.

lightgrep and icgrep

These are great tips for a "related work" section, but nothing like prior art as it turns out. So yes, relevant and appreciated.

UTF-8 automata .. lazy DFA matching

I wouldn't expect RuneSet to be useful for any automata-based regex pattern matching approach, there's not a good way to incorporate the context switch, among other considerations, and while it's fast, I will be surprised (very pleased, but surprised) if it's faster. It's designed as a component of an LPeg style parsing machine, where it can be added inline and treated as an unusually large instruction. I haven't done more than time its test suite, which has shown that it's fast enough for the intended use, and that optimization time is best put into other parts of the design for now.

miso2024
6 replies
1d2h

Finally a blog post with a trie. Those trie leetcodes didn't go in vain ;)

marcosdumay
1 replies
1d1h

You will find them in any fast spelling or grammar verifier.

steve_adams_86
0 replies
1d

I’ve used them in real code! They are pretty intuitive for some use cases. The last time I used a trie was for address lookup. I even used a generator, haha. Double-whammy leetcode in a real-life situation, I guess.

It’s definitely not a common tool I reach for though.

jitl
0 replies
12h38m

I find myself reaching for a trie/radix tree from time to time in medium-complexity build tools or code analysis scripts, where I need to store some information about zillions of filepaths, and factoring out the common prefixes saves a ton of memory while being pretty intuitive.

cdelsolar
0 replies
2h39m

Tries are huge for Scrabble move generators and playing programs, of which I've written a few. The fastest structures are compressed tries with rotated representations of words.

ayuhito
0 replies
23h12m

I recently got a chance to find a practical use for a modified hybrid trie which was fun to implement! Lots of optimisations to go around.

It’s a nifty user-agent parser for Go: https://github.com/medama-io/go-useragent

Typically this sort of problem relies on lots of regex parsing, so it was nice to take a more novel approach.

cwilby
6 replies
1d2h

Internally, we call the request’s destination server its “origin”

Origin: The point at which something comes into existence or from which it derives or is derived.

How can the request's destination server be the origin, if it is the destination server?

toast0
1 replies
1d2h

The origin is the origin of the content.

cwilby
0 replies
1d2h

Thus cross origin resource sharing. Thanks!

At least it's consistently inconsistent.

tills13
1 replies
1d2h

On the internet, the origin is the server sending the response to the user. I suppose you can look at it from the perspective of the owner of the server -- from their frame of reference, their journey _starts_ when they receive, process, and respond to the request.

Granted, computer scientists are infamous for being terrible at naming things.

hinkley
0 replies
1d2h

Had any fun with “Referer” headers lately?

taeric
0 replies
1d2h

This hilariously reminds me of the complaining about X server semantics back in the day.

kristianp
5 replies
21h35m

I would have stopped at the first step, saving almost 1% of cpu time (0.993%). Creating a new fast trie data type for the other 0.287% seems inefficient. Surely there are other parts of the code that take more than 1% of the cpu time elsewhere to be looked at first?

FridgeSeal
4 replies
19h13m

From the article:

At the time of writing, the rate of requests leaving pingora-origin (globally) is 35 million requests per second. Any code that has to be run per-request is in the hottest of hot paths...function consumes more than 1.7% of pingora-origin’s total cpu time. To put that in perspective, the total cpu time consumed by pingora-origin is 40,000 compute-seconds per second. You can think of this as 40,000 saturated CPU cores fully dedicated to running pingora-origin. Of those 40,000, 1.7% (680) are only dedicated to evaluating clear_internal_headers. The function’s heavy usage and simplicity make it seem like a great place to start optimizing.

Given how many requests are going out, and just how hot the function is, every single gain matters. If my read of the article is correct, this is the hottest part of the hot path, so every gain here is super important.

__turbobrew__
3 replies
14h15m

680 cores is peanuts. That is fewer than 10 modern specced compute servers. It's not nothing, but you are probably talking about $20-30k/year in opex costs, which at Cloudflare's scale is probably a drop in the bucket.

Obviously it is an improvement but a lot of time there is lower hanging fruit, like the fact that the cloudflare control plane is/was not globally distributed and could not tolerate data center outages.

FridgeSeal
1 replies
12h33m

Idk man, I’m not Cloudflare.

That said: probably different teams with different priorities.

I also disagree with "it's _only_ $20-30k" as a reason not to do this kind of work. Think of it another way: that's now another 680 cores that can do _net more work_, and another $20-30k we don't have to spend when traffic increases. "I landed a small PR that wiped $30k net off our costs" is a statement that would make a lot of people in a business pretty happy.

__turbobrew__
0 replies
10h58m

I work at a tech company that is a bit bigger than Cloudflare, and anything less than six figures/yr of opex reduction is a rounding error. It is pretty regular to save the company seven figures per year when opex is in the billions.

It is about the opportunity cost; nobody is going to disagree that small optimizations are a net positive, but what else could you have done with your time? Maybe Cloudflare is at the point of maturity where there are no easy wins and only 1% gains left, but I kind of doubt it given their reliability track record. How many dollars do they lose for every hour they are down? Probably more than $20k.

cryptonym
0 replies
8h23m

Marketing content is more valuable than 680 cores. Let the guys have fun developing tries, as long as they produce an article.

Scaevolus
5 replies
1d3h

Did you try a (tiny) bloom filter? Doing a quick convolution of the header key and a test against a bloom filter (SIMD? Using builtin CRC ops? 256-bit bloom filter size?) might avoid walking the trie altogether in most cases at the cost of a few cycles.
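
For scale, a 256-bit filter is just four u64 words. A minimal sketch, with both probe positions derived from a single precomputed hash (everything here is illustrative):

    struct TinyBloom([u64; 4]);

    impl TinyBloom {
        fn insert(&mut self, h: u64) {
            for bit in [h & 255, (h >> 8) & 255] {
                self.0[(bit / 64) as usize] |= 1 << (bit % 64);
            }
        }

        // False positives are possible, false negatives are not, so a
        // "maybe" still has to be confirmed against the real set.
        fn maybe_contains(&self, h: u64) -> bool {
            [h & 255, (h >> 8) & 255]
                .iter()
                .all(|&bit| self.0[(bit / 64) as usize] & (1 << (bit % 64)) != 0)
        }
    }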

vlovich123
0 replies
21h25m

A bloom filter will have the same weaknesses of needing to do a hash. For static filters like this though where you have the set of strings to check against known up front, something like a binary fuse filter could work really well as it's significantly faster than bloom filters when you don't need online construction (you still need to do a hash of course but the evaluation against std::HashMap is flawed since it uses SipHash instead of modern constructions like xxh3 or gxhash).

Scaevolus
0 replies
1d2h

Yeah, for this problem you'd want a bloom filter that can entirely fit in vector registers during the computation, which should work well for a few hundred items.

pragma_x
0 replies
23h43m

Considering that each node's filter in their trie implementation is already bloom-like, the solution is awfully close. I agree that testing wider swaths of the string at once might be a dramatic accelerator.

A more naive approach might also be warranted. Some frequency analysis of hit/miss for trie nodes might allow them to find specific character positions with a higher miss rate than the first one. Testing such special positions first would speed things up. That assumes, of course, that the header data is fairly regular in nature.

jkaptur
0 replies
1d

I think that would require calculating the hash, which was ruled out as too slow.

unusualmonkey
4 replies
18h29m

My 2c:

1) Is this worthwhile? Looks like ~500 CPU cores were saved (are these real cores, or does this include hyperthread cores?). I don't know cloudflare's costs, but this seems like single digit servers and probably savings only in the $XX,XXX range. Not nothing, but do you expect a positive ROI on engineering?

2) If you do want to go into this much detail, did you consider putting the filter at the deserialization step, preventing the headers from being created in the first place?

saagarjha
0 replies
14h25m

If an engineer spends a couple of days on it then the savings seem worthwhile?

cryptonym
0 replies
8h28m

That's such a tiny corner case, yet something you can easily publish: fuel for marketing content that ends with a hundred geeks discussing whether what CF is doing makes sense or how much they care about perf.

That's worth millions.

aetherspawn
0 replies
16h28m

It might only be $XX,XXX but it is forever.

Power savings forever.

Lower carbon emissions.

etc

Nice to see a company that cares about making things 1% faster rather than trying to analyse headers with AI to sell me sandals or something stupid like that.

_zoltan_
0 replies
11h8m

I don't understand your point about the ROI. Let's say it's $40k a year; over 5 years that's $200k. That's a multi-year salary in Hungary/Poland (I have no idea if CF has offices there).

o11c
3 replies
23h34m

That late in the process, does it even need to support efficient lookup? Could it just be a Vec?

menaerus
2 replies
21h51m

They don't explicitly say this in the blog post, but I assume the engineering team behind this code does not run any workload performance experiments themselves; they only run the local (micro)benchmarks.

Because of this, I think there's no way for them to answer your question or to say "hey, this could very well be a vector, since we could not find any statistically significant impact between a vector, hash map, and trie in this code path for the given workload".

This is a common fallacy I've seen in too many places. In other words, investing non-trivial engineering effort into optimizing a code-path that resembles .000001% of your workload is a waste of resources.

And due to Amdahl's law,

Optimizing operations that are already measured in microseconds may seem a little silly, but these small improvements add up.

this is not how things usually work out in the end. Small improvements in insignificant code paths remain insignificant in the overall runtime performance.

vlovich123
1 replies
21h8m

If we compare the sampled performance of the different versions of clear_internal_headers, we can see that the results from the performance sampling closely match what our benchmarks predicted

Look at the end where they validate that the performance wins show up in production in a similar way that the microbenchmarks predict.

menaerus
0 replies
11h14m

They only measure the CPU utilization of clear_internal_headers; they don't show what the impact of the change was on the overall runtime performance. The latter is what I was referring to.

maciek12
3 replies
23h48m

Am I missing something, or is the big-O notation in this article wrong? For example, in "Sorted sets like BTreeSet use comparisons for searching, and that makes them logarithmic over key length O(log(L)), but they are also logarithmic in size too", how is the search done in logarithmic time? You could have a header that differs from an internal one only on the last character and then you have to read the whole thing. Also, space-wise, B-trees are linear, not O(nlogn). Additionally, at the end, when talking about the trie, how do you achieve O(log(L)) for misses? Tries are not balanced, they do not halve the possible set on comparison (as I think the author states), and even if they did, I don't see how that achieves logarithmic time.

vlovich123
2 replies
21h10m

You could have a header that differs from an internal one only on the last character and then you have to read the whole thing

When people talk about big-O they usually mean the average case; worst/best-case big-O is explicitly called out if mentioned, so you can't just think in terms of the worst possible case. Regardless, I think they made several mistakes here, as you correctly note.

BTreeSet is log(N) in number of nodes compared. But for strings that comparison is O(L) since you at least have to compare it to the string you find. So I believe it's at best O(L log(N)) where L is the average string length and N is the number of nodes which is worse than the hash table which is just O(L). It's tricky though if most strings don't match your string. In that case I believe you end up degrading to an O(L) + O(log(N)).

Similarly, you are correct that they can't be logarithmic in size and must be linear since you have to store the data. Tries can be sublinear depending on the input since it's a form of compression as well.

Similarly you are right about trie complexity not being O(log(L)) for trie misses. I think what they're observing is a massive speedup because mismatches error out on the first character usually. But it wouldn't be logarithmic as there's unlikely to be a logarithmic relation between headers that mismatch and the universe of matching words.

maciek12
1 replies
20h46m

Thanks for replying, I really appreciate you taking the time to check my concerns! I was afraid I was going crazy haha.

The part about big-O being usually about the average case was helpful; I'm still at uni, where we mostly talk about worst-case performance.

vlovich123
0 replies
19h25m

I'll note I could very well be wrong; this was a back-of-the-paper, first-principles derivation and I could easily have made a mistake.

If you're in a relevant course, might be a good opportunity to flag this to a professor or TA as a way to review. These kind of analyses can be tricky sometimes, especially if you need to prove it vs just do an answer by feel which is how engineers end up doing it (vs CS students that take the proofs much more seriously).

A way to sometimes double-check is to just simulate performance as input grows and see if it matches the expected log(N).

IncreasePosts
3 replies
1d2h

Off topic: how does 60M requests a second stack up with other big players like google, Facebook, TikTok?

joshuamorton
0 replies
22h19m

I will note that HTTP requests and internal requests are distinct.

The Zanzibar white paper from Google in 2019, for example, lists top-line RPCs to Zanzibar at around 10M QPS, but internal (in-memory) cache hits at 200M QPS and non-cache internal fan-out at >20M QPS. Zanzibar of course isn't HTTP, but it is a good public example of internal RPC fan-out.

I'd expect Cloudflare similarly has some extremely high-QPS internal services.

why_only_15
0 replies
16h40m

People typically quote Google Search as ~300k QPS.

wolf550e
2 replies
1d

I don't know Rust. Can someone explain how this data structure stores whether a node is itself a valid word or whether it only leads to more nodes? For example the node "do" in their (“and”, “ant”, “dad”, “do”, & “dot”) example. I guess it's a discriminated union provided by Rust algebraic data types or something like that, but why not show the full bit pattern in memory?

steveklabnik
0 replies
22h5m

why not show the full bit pattern in memory?

Rust doesn't guarantee a particular memory layout except for types explicitly requesting that you get a specific representation, so showing this would at least require a caveat that it's not guaranteed to be accurate.

Furthermore, this specific data structure's internals are generic, so to show the memory layout, it would have to be for a specific instantiation.

(They do also provide a structure that includes instantiations between u8 and u256, mentioning in a doc comment that the 256 size dominates overall size, and so if you want to slim it down, use the internals directly.)

SkiFire13
0 replies
23h39m

By looking at the source code [1] they have a discriminated union with 3 variants representing the possible combinations of representing a valid word and leading to more nodes (excluding the case where it is neither, which is impossible). So the node for the 'o' in "do" should be a `SearchOrLeaf` storing "do", its corresponding value (the `T`, in a set this would be empty) and the `SearchNode` containing the successors, which should only contain a 't' to "dot".

[1]: https://github.com/cloudflare/trie-hard/blob/3d8163ca2d9208c...

sophacles
2 replies
1d2h

So this trie uses a custom u256 type that is a wrapper around an array of 4 u64s. Is Rust smart enough to vectorize it into AVX instructions for the bit twiddling?

What about for the other integer sizes that the trie supports?

Validark
0 replies
4h14m

If the bit twiddling is independent in each u64, then probably. If it's treated as one giant bitstring, probably not.

FridgeSeal
0 replies
19h20m

A lot of Rust's optimisations come from LLVM, so if that's smart enough to, then I assume Rust also benefits.

pragma_x
2 replies
23h55m

It took me a moment to ponder the effectiveness of mapping utf8 characters into a bitmask. At first, it seemed unwise. Then I realized that 32 bits would get you `a-z` plus six special characters, and 64 bits would get you 'A-Z' (uppercase) plus six more. That's plenty of room for HTTP headers, and for a blazing fast matching algorithm since it just masks and compares a couple integers. Even figuring out which bit corresponds to which character is a simple lookup into a 256-word table.
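
In code, the mapping and the membership test might look like this; the alphabet and names are hypothetical (trie-hard derives its mapping from the actual keys):

    fn bit_for(b: u8) -> Option<u64> {
        // Fewer than 64 distinct bytes appear in header names, so each
        // legal byte gets one bit position.
        const ALPHABET: &[u8] = b"abcdefghijklmnopqrstuvwxyz0123456789-_";
        ALPHABET.iter().position(|&c| c == b).map(|i| 1u64 << i)
    }

    fn node_has_child(node_mask: u64, next_byte: u8) -> bool {
        // One AND answers "does this node have an edge for this byte?"
        bit_for(next_byte).map_or(false, |bit| node_mask & bit != 0)
    }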

One thing the author leaves out is that this technique is technically a Bloom filter. I find these kinds of things fascinating since they came about in an era where computing was far more resource bound than today (1970 in this case). But here we are, still finding real-world corners to use the same old optimizations.

https://en.wikipedia.org/wiki/Bloom_filter

samatman
0 replies
21h37m

There are significant differences between trie-hard and a Bloom filter. Bloom filters are probabilistic, and use hashing. They're great for when rare false positives are acceptable in exchange for no false negatives, which isn't uncommon, but isn't what's needed here: this is exact, and hashing is the step to beat.

Rather, this is a refinement/variation on Liang's algorithm, used in TeX for storing a hyphenation dictionary. The thesis mentions Bloom's algorithm, which it calls "superimposed coding", and very much hearkens back to a time when memory was often the most precious commodity. I think you'll like it. ^_^

https://tug.org/docs/liang/liang-thesis.pdf

gcbirzan
0 replies
6h50m

The problem is that bloom filters are good when your dataset is large and calculating several hashes per item is not as expensive as looking it up in the original dataset. Here, they say that even one hash is expensive, so bloom filters would be much slower than the original hash map.

aidenn0
2 replies
1d2h

Given that the set of items to match is static, I wonder if they tried a perfect hash table; that would reduce it to a few arithmetic operations followed by a single string compare. It would be interesting to see how it compares to the trie.
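
The rust-phf crate mentioned elsewhere in the thread does exactly this at compile time; a minimal sketch (with its `macros` feature, and invented header names):

    use phf::{phf_set, Set};

    static INTERNAL_HEADERS: Set<&'static str> = phf_set! {
        "cf-internal-a",
        "cf-internal-b",
    };

    fn is_internal(name: &str) -> bool {
        // A few arithmetic ops to hash, then a single string comparison.
        INTERNAL_HEADERS.contains(name)
    }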

sjf
1 replies
10h12m

That is what I immediately thought as well. The problem though is that a perfect hash is still going to be O(key length) for the hash function, which they are trying to avoid.

In theory, if they used a regex it would be using a state machine to do the matching, which should have similar performance to the trie: only O(k) in the worst case. But from what I understand, many regex libraries don't actually build the state machine; they use backtracking, so the performance is no longer O(k).

I'm surprised they couldn't find a performant regex library that already existed that uses state machines. It should have similar performance to the trie. But in reality, the performance is affected more deeply by things like memory access patterns and the cost of specific arithmetic operations, so it's really impossible to speculate.

lesuorac
0 replies
3h50m

I wonder how much faster it gets when you shorten the maximum internal key length. If you re-map everything to like "cf-1", "cf-2", .. "cf-;" then you'd only need to hash like 4 characters.

y04nn
1 replies
20h55m

Why add and then remove HTTP headers on an existing request for routing, instead of adding a dedicated (custom) routing protocol above the HTTP stack? This would allow you to just drop the added protocol on the way out, and it would be much more space efficient.

intelVISA
0 replies
14h19m

The fabled tech talent shortage, probably.

xyst
1 replies
23h24m

It’s a bit wild to me that one needs to add so many http headers internally that they need to get stripped off.

The “trie-hard” crate is a solution to the layers of cruft at CF

FridgeSeal
0 replies
19h19m

I don't think it would matter if there's only 1 header to be removed, it's still a problem to be solved.

alexchamberlain
1 replies
22h19m

Jumping on the bandwagon of other ideas... I wonder if it would be quicker to filter out the internal headers when you write the request to the network (buffer)? I.e., something like `request_headers.filter(not_is_internal).for_each(...)`; that way you don't have to remove anything from the data structure at all, though it does require you to break down a layer of abstraction for performance benefits.

vlovich123
0 replies
21h23m

No, you'd just be smearing the work around at best. The expensive part is determining set containment, not updating the data structure. It doesn't matter whether you do it in a loop and update the data structure or do the check before writing.

Buckaroo9
1 replies
11h21m

Shouldn't this line:

> "using regular expressions clocks in as taking about twice as long as the hashmap-based solution."

be written as the opposite? That the RE takes half as long/is twice as fast?

nitnelave
0 replies
10h29m

I think the point is that a naive regex is a very general-purpose tool, but it's still in the same ballpark. Having a custom optimized state machine for this specific use case could bring another 5x improvement on top, leading to 2.5x faster, potentially.

sixo
0 replies
21h36m

this post makes me think I could work at cloudflare, and also that I shouldn't

pianoben
0 replies
16h13m

This solution reminds me of an old blog post, where two members of the F# team at Microsoft battled back and forth on making the best solver for some word game. Cloudflare's solution here looks a lot like the second-to-last solution; the final post involved collapsing the tree nodes into a 2D array of u32s for maximum cache-hits. It was a really neat solution!

...and amazingly, it's still up from 2010: https://lorgonblog.wordpress.com/2010/02/21/dawg-gone/

mwexler
0 replies
5h3m

I know, they aren't the first to use it, but I still appreciate the pun.

mgaunard
0 replies
22h35m

If it takes 3.5 us to remove a substring within a packet-sized string you're doing it wrong.

hencoappel
0 replies
1d

Given it's such a hot part of their code, I'm genuinely surprised they didn't go down the state-machine route for a more optimal solution, or even further and make use of vector instructions. Maybe Rust can autovectorize the code, though.

gwittel
0 replies
18h54m

Neat optimization! Would it have been feasible to spend a bit of CPU/memory and tag headers as internal upon request construction? That way, filtering on output is trivial.

gregsexton
0 replies
3h37m

The presented data for demonstrating a win doesn't have enough power to actually show this - not enough samples were taken.

A very simple analysis in R:

    > prop.test(c(9, 4), c(1103,1171))

    2-sample test for equality of proportions with continuity correction

    data:  c(9, 4) out of c(1103, 1171)
    X-squared = 1.4915, df = 1, p-value = 0.222
    alternative hypothesis: two.sided
    95 percent confidence interval:
    -0.002409831  0.011897193
    sample estimates:
        prop 1      prop 2 
    0.008159565 0.003415884
A p-value of 0.22 isn't below the magic 0.05 and the 95% confidence interval suggests that the trie might actually be slightly worse.

I imagine the trie is better, given the prior analysis, and there is weak evidence for this. But take (a lot) more samples and know with confidence how much better.

bhawks
0 replies
12h2m

Alternatively at request ingress time you could take all original header names and save them, and at egress time you could use that list to drive which keys to emit?

Then you wouldn't need to maintain a list of what is internal and you wouldn't have to do any trie lookups.
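
A sketch of that alternative, which assumes internal hops never need to add a client-visible header (names here are made up):

    use std::collections::HashSet;

    struct Tracked {
        original_names: HashSet<String>, // captured when the request arrived
    }

    impl Tracked {
        // At egress, emit only headers whose names existed at ingress.
        fn egress<'a>(
            &'a self,
            headers: &'a [(String, String)],
        ) -> impl Iterator<Item = &'a (String, String)> + 'a {
            headers
                .iter()
                .filter(move |(name, _)| self.original_names.contains(name.as_str()))
        }
    }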

Validark
0 replies
3h56m

It's silly to use Big-O to describe the number of comparisons when the goal is to analyze performance. Comparisons cost 1 cycle, and you can do many of them per cycle with instruction level parallelism and SIMD. The main bottleneck, the actual source of slowness, is memory. It costs thousands of cycles to go to memory, sometimes tens of thousands or hundreds of thousands (if you need a TLB walk or OS interrupt). If you want to use Big-O, use it to estimate the number of cache misses.

I'd probably go with a custom perfect hash function. And Phil Bagwell's popcount trick. That would be faster than everyone else's solution that involves multiple lookups in memory.
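
The popcount trick in isolation, for the curious: a node stores a bitmap of which children exist plus a dense array of only those children, and the child's index is the count of set bits below its position (illustrative only):

    struct Node {
        child_bitmap: u64,   // bit i set => a child exists for symbol i
        children: Vec<Node>, // dense, in symbol order
    }

    impl Node {
        fn child(&self, symbol: u32) -> Option<&Node> {
            let bit = 1u64 << symbol;
            if self.child_bitmap & bit == 0 {
                return None; // miss decided with one AND, no memory chase
            }
            // Rank query: how many children precede this one.
            let idx = (self.child_bitmap & (bit - 1)).count_ones() as usize;
            self.children.get(idx)
        }
    }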

The CPU is fast, memory is slow.

MehdiHK
0 replies
1d1h

Slightly off topic: I was a bit surprised nobody has ported the space-efficient marisa-trie to Rust, or any other language.

Anybody know why?

https://github.com/s-yata/marisa-trie