A good day to trie-hard: saving compute 1% at a time

amluto
60 replies
1d3h

If you had asked me to make a wild guess as to how Cloudflare stores internal headers and then removes them, I would have come up with some options:

- An entire separate dictionary or other data structure.

- One single header containing all internal metadata.

- All headers have a prefix, and the internal ones start with I and the external ones start with E.

- All internal headers start with “CFInt”.

I would not have come up with a scheme in which headers in a particular list are internal. (What if someone else uses this name? What if something forgets to sanitize? What if different simultaneously running programs disagree in the list? What if the Connection header names a Cloudflare-internal header? What if the set-difference algorithm is annoyingly slow?)

The web is already full of obnoxiously ambiguous in-band signaling and header naming, and I find it bizarre that a company with Cloudflare’s scale uses such a tedious and error-prone mechanism internally.

efitz
16 replies
1d3h

I've worked at several huge corporations in IT security, where we care about headers a lot, and they all use headers in a manner similar to Cloudflare.

Including using proxies at the edge to strip out internal headers bidirectionally: yes, inbound too.

sitkack
12 replies
1d2h

That doesn't make it better, it makes it worse.

hinkley
11 replies
1d2h

Are you on a corporate network? Do you use a firewall at home?

You’re on enclaves all the time. This is just a different one. Separate networks per class of traffic used to be de rigueur before Cloud. Now it’s all munged together.

sitkack
9 replies
1d2h

I understand that security needs to be pragmatic, but just because security does it doesn't make it right.

It isn't about being in the enclave; it's about having to keep track of which headers you set versus which came from outside. It is fragile and error-prone, and it will absolutely break someone else when there is a name collision.

All security exploits are a subset of bugs.

hinkley
8 replies
1d1h

And what would you have them do instead? SSL to every service? Reauthenticate on every single call? What’s your clever solution?

There’s probably a dozen things you can terminate at a boundary that would cost so much to run through the entire system that it would bankrupt a company.

And then there’s tracing, which also uses headers.

immibis
6 replies
1d1h

One of the listed options was to begin all the Cloudflare internal header names with "CFInt".

hinkley
5 replies
1d1h

That’s still header filtering. And it should be X-CF- by the way.

amluto
2 replies
1d1h

X-CF- seems to be used by some other software. And Cloudflare has plenty of documented headers that don’t start with X.

The whole point is that a genuine user header should never conflict with a Cloudflare internal header, and a browser should never see an internal header.

sitkack
0 replies
1d

Having re-read the thread, I totally agree. What you have listed is table stakes. I'd also say that internal headers should be encrypted to a narrow protection domain.

If all internal headers were prefixed with X-CF-, you could strip them all via SIMD that had no knowledge of any specific header. Hell, you could probably do it on the NIC.
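
A minimal sketch of what that prefix rule buys you (scalar Rust, no SIMD), assuming headers live in a plain name/value list; the function names here are made up:

    fn is_internal(name: &str) -> bool {
        // Case-insensitive check of the first five bytes; no per-header
        // list to maintain, and future internal headers match for free.
        name.as_bytes()
            .get(..5)
            .map_or(false, |p| p.eq_ignore_ascii_case(b"x-cf-"))
    }

    fn strip_internal(headers: &mut Vec<(String, String)>) {
        headers.retain(|(name, _)| !is_internal(name));
    }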

hinkley
0 replies
1d1h

Please reread the entire thread. We are talking about someone who thinks the header solution is stupid. You are both splitting hairs. Stay on target.

hinkley
0 replies
23h49m

Good to know.

(Also sibling is right, I spaced on X-CF- being a header sent to CF customers’ servers. I don’t use Cloudflare, but CloudFront does the exact same thing.)

8n4vidtmkvmk
0 replies
13h0m

Maybe I'm dumb or missing something, but why not just do what Google does: convert all inbound HTTP requests into protobufs, separate the internal stuff from the external stuff, run it through your myriad internal services, and then on the way out you still have your nicely delineated protobufs, which you can convert back into HTTP. Why are we mucking about with headers in the first place?

the8472
0 replies
23h45m

Configure all machines in a way that they can survive on some random untrusted public wifi. This should be the obvious default stance for laptops and phones.

But even for workstations wired to a trusted LAN it still makes sense because you never know which of the various tunnels might assign and expose some IPv6 address to the internet.

For servers you might be able to make an exception if you have a vigilant IT and the people with admin access aren't actively trying to circumvent security, but even then that better not be your only layer of security.

hinkley
2 replies
1d2h

The first large-scale piece of software I worked on was for telcos, pre-smartphone. We used internal headers to offload authentication and terminate SSL. We also had to pressure F5 to fix about half a dozen bugs in BIG-IP to do so. Bugs that should in no universe have existed in version 9 of a product.

I used to joke that F5 owed me and my coworker 3 months of salary for all the free QA we did for them.

tmsite
1 replies
1d1h

It helps if you realize that BIG-IP 9.0 was essentially a from-scratch rewrite of BIG-IP 4.5. Among other major redesigns, the data plane was moved from BSD kernel space to Linux user space. Internally, the joke was that it would be two times better when we were done (4.5 * 2 = 9.0). It probably was, but not on day zero.

hinkley
0 replies
1d1h

On day sixty it couldn’t do SSL termination and cookie based traffic shaping didn’t work.

I was a little bummed it was “just” a Linux box but that’s pretty common today. I hadn’t discovered dd-wrt yet and wouldn’t for a few years.

jgrahamc
15 replies
1d3h

Yeah, me too, but systems grew over time and grew and grew and we were using HTTP headers for all sorts of stuff. This optimization makes the situation better, but the solution (which is underway) is to use a different mechanism for IPC and get rid of the use of HTTP headers completely.

amluto
6 replies
1d2h

I’m certainly familiar with systems growing over time and outgrowing their original schema.

Is this the same phenomenon that resulted in Cloudflare Tunnels apparently being domain names? Or why domains that are proxied by Cloudflare show up as “CNAME” in the DNS panel? (The latter one seems extra strange — wasn’t the reverse proxy service the very first service that Cloudflare offered? There must be some history here.)

hinkley
5 replies
1d2h

All new products are full of scaffolding that has to be removed when robustness becomes a higher priority than existence. Where we fail is not calling it that. Instead we just call some code “good” while our thicker skinned coworkers call the rest “fine”. It’s not fine. I don’t want to be working overtime the week of thanksgiving because you can’t think past the horizon.

gmueckl
4 replies
1d

There is a very fine line between "good enough" and "not good enough" in any product beyond a certain complexity. Finding the pieces that cross that line can be hard and improving "good enough" parts is (sadly) mostly a waste of time in commercial settings.

hinkley
2 replies
1d

You have to backdate starting a project so it happens before things stop being good enough. And take into account bad estimates and the five other deadlines between then and now.

Often it’s better to clean as you go. If you have the time and the inclination, shore up one of the least good enough things.

It’s not unheard of to open up new feature sets via refactoring. Something that previously would have taken an “impossible” amount of time now becomes profitable, not just possible.

dmorgan81
1 replies
17h18m

I miss working at a place where that was encouraged. Now if I do that the testers complain that it’s not in scope for the ticket and to split the changes into another ticket with a separate test plan. The rot thus continues.

hinkley
0 replies
16h58m

I think this is about when I discovered zone defense for performance optimization. QA doesn’t want to chase your code changes halfway across the codebase, and that’s fair. At some point it starts looking like you’re changing things just to change them, even if it ends up being a few percent better.

But if you limit yourself to one part of the interaction, they don’t get as mad. A dozen changes in the signup system piss them off less than five changes across the app.

didgetmaster
0 replies
1d

This is why some bugs never get fixed (that is if you define a bug to include inefficient code and not just code that breaks something).

DamonHD
5 replies
1d3h

And when that happy day dawns, slightly sadly, this nice optimisation will evaporate!

jgrahamc
4 replies
1d2h

Maybe the real treasure was the optimizations we made and discarded along the way.

DamonHD
2 replies
1d1h

Indeed. I am very happy with some of the tricks that I developed (speeding up, and indeed allowing, Solaris C++ linking by prefiltering on MD5 hashes of canonicalised .o files -> nm, anyone?) even though they were lost without trace (yep, Lehman Brothers, where are my Imakefiles now?)...

jkaptur
1 replies
1d1h

They're probably as critical as ever at Barclays :)

DamonHD
0 replies
1d

I really hope not, given that that very old Solaris compiler should have been laid to rest many many years ago. Yes, some of my stuff had kinda made it into BarCap (even onto my desktop dev machine!) when I turned up there later, and even some at Nomura when I rocked up there again IIRC!

Twirrim
0 replies
23h16m

My likely overly cynical concern here is that this suggests trie-hard will soon end up abandoned, as you're making it sound like it is a stop-gap solution for you.

throwaway77385
0 replies
1d3h

For anyone not in the loop, the above poster (JGC) is Cloudflare's CTO himself :)

alex_suzuki
0 replies
1d1h

You chiming in on this post makes me slightly bitter at having gone with CloudFront (I have everything on AWS so it seemed the obvious choice) instead of Cloudflare.

hn_throwaway_99
9 replies
22h7m

I feel like I can point out lots of similar problems with the other solutions you suggest (heck, I think some of the problems you list even apply to those other solutions).

The list approach has some downsides, but it also has a bunch of upsides. I feel like when people point out potential flaws of these approaches, they're ignoring the history and difficulties that come with Cloudflare's scope. An enumerated list is the simplest and most flexible of the approaches, and it also doesn't require any a priori agreement on the structure of a header key - this is probably important when you think about the sheer number of teams at Cloudflare, potential technology acquisitions, etc.

Aeolun
6 replies
18h38m

it also doesn't require any a priori agreement on the structure of a header key - this is probably important when you think about the sheer number of teams at Cloudflare

That’s more of an indictment of how little effort is spent aligning things. It’s not that hard to tell every team that any headers have to start with ‘CFint’ and then enforce that.

poincaredisk
1 replies
17h25m

* Headers that are originally private, but are later changed to public

* Headers that are intended to be public, but then are changed to private

* Headers that were created in the very early days of the company, which are so deeply integrated into everything that renaming them is basically impossible (or at least very hard and risky)

* Headers that are set by some appliance/third party solution not controlled by CF, but which are technically internal and must be stripped.

* Newly acquired company has its own set of internal headers. Management is not amused when a huge, disrupting refactor is suggested by engineers.

And this is just the tip of the iceberg of possible problems when your scale is big enough.

bhawks
0 replies
12h49m

So because everything is a Byzantine labyrinth due to scale, we're all going to agree to remember to register every internal header with yet another strip-list service?

Maintaining the header to strip list is just another piece of technical debt to keep paying down.

Having a policy to strip everything under a prefix seems much less error prone.

ketzo
1 replies
18h6m

It’s not that hard

How many years do you have to spend in software before you stop uttering those four cursed words?

Aeolun
0 replies
12h27m

It takes just as much experience to acknowledge that the cure is worse than the disease.

hn_throwaway_99
1 replies
15h44m

It’s not that hard to tell every team that any headers have to start with ‘CFint’ and then enforce that.

To be a little bit blunt, this just means you have never built software in a very large company that may have a lot of existing/completely independent software, or have made technology acquisitions after the fact. That is, if you are starting from "time 0" of course it's very easy to say "internal headers need to start with CFInt", but that's rarely the case.

Aeolun
0 replies
12h25m

I think if you work for such a very large company the only reason you are ok with it is because nobody at the top says ‘no more’.

If they do, it suddenly becomes very easy to get everything aligned, because there is likely a vast collection of bodies just collecting a paycheck that can be more productively assigned to manually renaming everything.

amluto
1 replies
20h4m

Lists, extensible protobufs, etc. are indeed great in their extensibility. The issue I would see (if I worked at Cloudflare) isn't using a list; it's that internal headers travel in the same list as the externally visible HTTP headers.

hn_throwaway_99
0 replies
15h41m

I don't see that as a problem at all:

"Hey any teams out there, before you can add internal headers, you need to register them with some service." Then the list is loaded from static storage at startup time and/or refreshed on some periodic basis.

taeric
5 replies
1d2h

Other ideas:

  - Ensure all headers added internally that are not for export sit at the front of the header list.
  - Preseed the hash table of all requests with the internal headers you plan to remove.
In fact, if you preseed, you are basically combining these ideas but fixing how many internal headers are on each request. At that point, you can use a linked hash table that preserves creation order and just remove the first N from the final list that you send back to clients.
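
A rough sketch of that combined idea, assuming the internal headers really are always seeded at the front (types and names here are hypothetical):

    // Internal headers occupy the first `internal_count` slots of an
    // insertion-ordered list, so stripping is just starting iteration
    // at a different offset; no per-name lookups at egress.
    struct Headers {
        entries: Vec<(String, String)>, // internal first, then the rest
        internal_count: usize,          // the fixed N seeded up front
    }

    impl Headers {
        // For hops between internal services: everything.
        fn iter_all(&self) -> impl Iterator<Item = &(String, String)> {
            self.entries.iter()
        }

        // For the client: skip the first N entries entirely.
        fn iter_external(&self) -> impl Iterator<Item = &(String, String)> {
            self.entries[self.internal_count..].iter()
        }
    }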

amluto
4 replies
1d2h

At that point, you can use a linked hash table that preserves creation order and just remove the first N from the final list that you send back to clients.

While Python provides data structures like this out of the box, designing a big system to require a special, nonstandard data structure to identify private metadata seems like a mistake to me.

taeric
3 replies
1d1h

I'm not sure I follow? Does rust not have the equivalent of a LinkedHashMap from Java? All I'm proposing is to start each map of headers off with a sentinel value for all of the headers you don't intend to send back to the client. Then, have a method that iterates over all values, for the places you need to send to internal servers, and another (probably the default?) that starts iteration after all of the internal items, which should just be a different pointer. At this point, there is no need to inspect each header by name, as that was effectively done at compile time when you put the protected headers at the front of the list.

Is this somewhat special? I mean, sure. But it wasn't long ago that a hash table was already in the special territory of data structures.

Edit: I should add that I'm not certain this idea would be better. Just more ideas on what you could do.

amluto
2 replies
1d1h

There’s a third party crate for this. C++’s STL also doesn’t have it by default.

The creation-order-preserving hash map is basically two data structures in one; it's more complex and slower than a normal hash map, and IMO it's rather bizarre. I have never really understood why anyone wants one.

taeric
0 replies
1d1h

I would be surprised if it was dramatically slower, all told. Though, fair that I have not benchmarked in a long time, here.

My main "idea" here is to stop from having to check all of the headers on a "per header" basis. Yes, you can make each check faster, as the TRIE does. You can also remove the entire reason to check individual headers by starting a traversal after where the internal headers are.

You could also go with something like making sure the first header is "InternalHeaderCount: N" where you then keep the internal headers segregated form the others by a simple count. (This assumes you have solid control over what is adding the headers internally, of course.)

(I also forgot to give kudos on the blog title. Any Die Hard reference is a good reference. :D )

Conscat
0 replies
1d1h

It's especially unfortunate that the intuitive name, `std::map`, is the ordered and generally least useful option among the standard library's map containers.

My last job did rely on ordered-ness of `QMap`, because the elements are listed in a Qt GUI and it could confuse users if those widgets randomly rearranged themselves. That's the only use-case I've personally encountered for an ordered hash map.

e63f67dd-065b
3 replies
23h56m

Former employee here; the interesting (horrifying?) fact is that you can set some of these internal headers (I remember a notable bug with cf-cache-status) in workers (the serverless offering) and cause all kinds of bad things to happen :)

Aeolun
2 replies
18h40m

It seems to me it should be trivial to strip anything internal that comes out of a worker, right?

ketzo
1 replies
18h7m

Trivial unless someone wants to modify CF internal headers as part of their solution…

https://xkcd.com/1172/

virtue3
0 replies
17h26m

Haha thank you

620gelato
3 replies
1d2h

Or perhaps even insert yet another header carrying just the list of internal headers added to the request; that assumes this happens in a single place, otherwise it's a recipe for disaster.

I have a slightly different example of this, where an RPC framework used at my company disallows the service owner from modifying certain headers (say, the request identifier) and will instead create a duplicate header with a suffix. In that scenario at least, I can see this as a fairly reasonable tradeoff, as the goal is to keep certain headers from being modified, not because they are platform internal but because there are certain assumptions associated with those headers that are followed company-wide.

I'll go check what mechanism is used for this matching.

adrianmonk
1 replies
22h31m

yet another header with just the list of internal headers

Or the same but with a list of headers which AREN'T internal.

You'll probably have a custom header-adding function that people should always use instead of the regular one. And this way, if someone forgets to use it, their header will get stripped.

You can think of a header escaping the internal network as something that needs to be authorized. This is a deny by default approach.
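
A loose sketch of that deny-by-default shape, using an in-process allow set rather than a literal header for brevity; every name here is hypothetical:

    use std::collections::HashSet;

    struct Request {
        headers: Vec<(String, String)>,
        external_ok: HashSet<String>, // names authorized to leave the network
    }

    impl Request {
        // The function everyone is supposed to use for client-visible headers.
        fn add_external_header(&mut self, name: &str, value: &str) {
            self.external_ok.insert(name.to_ascii_lowercase());
            self.headers.push((name.to_string(), value.to_string()));
        }

        // Headers added any other way are unlisted, so egress drops them.
        fn egress_headers(&self) -> impl Iterator<Item = &(String, String)> {
            self.headers
                .iter()
                .filter(|(name, _)| self.external_ok.contains(&name.to_ascii_lowercase()))
        }
    }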

620gelato
0 replies
21h54m

Ah yes - perfect.

amluto
0 replies
1d1h

This doesn’t work if the front end is older than an internal node and a header gets added, at least not without extra logic to patch everything up and handle conflicts.

The HTTP Connection header works like this, and one should generally assume that almost nothing implements it correctly.

giancarlostoro
0 replies
1d1h

What if someone else uses this name?

Couldn't you bypass this by prefixing the ones that aren't yours? Or prefixing your internal ones with something unique enough?

danbruc
0 replies
20h41m

This might of course introduce quite a bit of overhead, but the clean solution would arguably be to not mix requests to begin with: the external requests you are processing become the payload of the requests you are sending around internally. This would also allow you to cook up an optimized encapsulation protocol that might be more efficient to process than parsing and modifying HTTP. This admittedly comes from someone with essentially zero knowledge of what exactly Cloudflare is doing to the requests they receive.

anonymoushn
0 replies
19h13m

X years ago the manager of the cache team raised the lack of data plane and control plane separation as an issue and we came up with various plans to fix things, but I guess nothing has happened since

mightyham
18 replies
1d2h

I'm not very well versed in data structure optimization, but I was surprised they dismissed hash tables so quickly, especially when the table they are searching through is static. I find it somewhat hard to believe that a specially optimized hash table would not be faster than their trie implementation.

supermatt
14 replies
1d2h

They mentioned it is the overhead of the hashing that makes it slow compared to the trie.

hooli42
7 replies
1d1h

Hash functions can be as simple as a single modulo.

torusle
3 replies
19h13m

Or as simple as using the hardware accelerated CRC32 that we have in our x86 CPUs.

Last time I checked, CRC32 worked surprisingly well as a hash.
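
For reference, a hedged sketch of using the SSE4.2 instruction from Rust, byte-at-a-time for simplicity (the instruction also has wider 16/32/64-bit forms):

    #[cfg(target_arch = "x86_64")]
    fn crc32_hash(bytes: &[u8]) -> u32 {
        use std::arch::x86_64::_mm_crc32_u8;
        assert!(is_x86_feature_detected!("sse4.2"));
        let mut crc = !0u32;
        for &b in bytes {
            // Safety: guarded by the runtime feature check above.
            crc = unsafe { _mm_crc32_u8(crc, b) };
        }
        !crc
    }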

hooli42
2 replies
7h29m

Hah, neat.

The 'weird' instructions can often be 3-4 orders of magnitude slower than arithmetic instructions, though I doubt that matters here.

Validark
1 replies
4h22m

CRC32 is the same speed as an integer multiply, going all the way back to Nehalem (2008). 3 cycles latency, and you can start a new one each cycle (or more than one, on Zen 5).

hooli42
0 replies
49m

Sure, in the same way SIMD instructions get a linear speedup in theory.

If you Google the words "CRC32 is slow", you can see hundreds of people complaining about this.

ironman1478
2 replies
1d1h

As they said though, the hash function on a string type requires looking at every element in the string. Also, modulo is historically very slow. So they still have to deal with O(n) operations, where n is the length of the input string. If they can improve the memory-hopping problem associated with list/graph structures (which they claim to have done in their library), then a trie would be fast enough, which is what they observed. Combined with the fact that they claim 90% of trie queries are misses, you exit early a lot, whereas you always need to compute the whole hash of the input string when doing the hash-map strategy.

mightyham
0 replies
1d1h

The hash does not need to be computed on the whole string. I pointed this out in my other comment, but just as an example: a hash function could be as simple as xoring the first 16 and last 16 bits of the string, then indexing a 2^16 array. That means hashing is two pointer offsets and an xor (no modulo required). If there are 100 strings that need to be removed, then ~99% of rejections will be a very fast O(1).

And in the case of a match, a fast hash + memcmp will be way faster than a trie traversal. In fact, according to the trie-hard github readme, std::HashMap is already much faster than their trie implementation when searching for a match.
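
Concretely, the kind of toy hash being described might look like this; whether it is actually collision-free depends entirely on the real header names, so treat it as illustrative only:

    // Xor the first and last two bytes and index a 65,536-entry table.
    fn toy_hash(name: &[u8]) -> usize {
        let n = name.len();
        if n < 2 {
            return 0;
        }
        let first = u16::from_le_bytes([name[0], name[1]]);
        let last = u16::from_le_bytes([name[n - 2], name[n - 1]]);
        (first ^ last) as usize // index into [Option<&'static str>; 1 << 16]
    }

    // On a hit, confirm with one memcmp-style comparison:
    //   table[toy_hash(name)].map_or(false, |s| s.as_bytes() == name)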

hervature
0 replies
1d1h

As they said though, the hash function on a string type requires looking at every element in the string.

Maybe for the default hash function. As another commenter pointed out, your data may make the following hash very effective: s[0] + s[len(s)//2] + s[-1], which would be very fast. The point is that spending a day seeing whether such a hash exists is worth it.

mightyham
4 replies
1d1h

The point I'm making is that a specially optimized hashing function would probably blow away any trie traversal. When you know the data being hashed before hand it is possible to make custom hash functions that are as simple as a few bitwise operations on the first, middle and last character of the string (just an example).

treis
0 replies
1d1h

Isn't the sticky wicket collisions with headers defined externally to CF?

dividuum
0 replies
1d1h

Right. Out of curiosity I looked into what code gperf generates and it seems it would end up O(1) for misses (it's always three lookups into a generated table) and O(L) for hits, as a potential hit has to be confirmed by essentially a strcmp. Not sure how something like https://github.com/rust-phf/rust-phf would work, but probably similar?

dathinab
0 replies
10h20m

If all your internal headers start with cf-, most non-internal headers don't (but not all), and most headers are non-internal, a trie might be hard to beat.

Denvercoder9
0 replies
19h38m

They don't know all the data that's being hashed, though. They know everything that's in the map, but not all the keys they're trying to lookup.

vlovich123
0 replies
1d

If they just used the default, that's SipHash, which is optimized for security and DoS resistance, not performance. XXH3 is ~10x faster than that, and there are some newer ones that are even faster, like GxHash (another ~1.5x-2x faster).

You would precompute the hash for the constant string list and then compute the hash only for the set of strings that begin with C/c, since "cf-" is a common prefix for those internal headers if I recall correctly, using RawTable to find entries by hash and then comparing against the precomputed value to see if a collision matched before checking a string for equality.
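
A loose sketch of that shape, assuming names arrive already lowercased and using the third-party xxhash-rust crate (xxh3 feature) with a plain HashMap instead of RawTable; note the standard map re-hashes the u64 key, which the RawTable trick avoids:

    use std::collections::HashMap;
    use xxhash_rust::xxh3::xxh3_64;

    struct InternalSet {
        // Precomputed once: xxh3 of each internal header name.
        by_hash: HashMap<u64, &'static str>,
    }

    impl InternalSet {
        fn contains(&self, name: &str) -> bool {
            // Cheap first-byte filter: internal names share the "cf-"
            // prefix, so most misses never pay for a hash at all.
            if name.as_bytes().first() != Some(&b'c') {
                return false;
            }
            self.by_hash
                .get(&xxh3_64(name.as_bytes()))
                .is_some_and(|&stored| stored == name)
        }
    }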

rapsey
0 replies
1d

Note FxHashMap is just the std HashMap with a different hashing algorithm.

samatman
0 replies
21h35m

A hash function can't reject a string with a single test of the first byte. It just can't.

For this application, that's a leg up which a hash won't be able to touch. The rest of the art here is shaving down the constant factor of a trie far enough that the initial boost pays off in real-world performance.

hyperpape
7 replies
1d

I'm interested to know why the regex crate didn't do better. I believe it should compile a search for multiple literal strings into an Aho-Corasick automaton, which is structured very much like a trie.

burntsushi
6 replies
1d

regex crate author here. From a quick look at trie-hard, my guess is that trie-hard can do lookups on short strings much more quickly than the regex crate can do matches on short strings. The regex crate has a fair bit of overhead for selecting the right matching engine and applying other optimizations appropriate to general purpose matching.

But the aho-corasick crate, and especially the specific automatons, would be interesting to try. There's much less overhead there. But, there are some differences:

* The contiguous NFA in aho-corasick has a lot more going on inside its "next transition" routine: https://github.com/BurntSushi/aho-corasick/blob/cd400ad792d6... --- This is primarily because an Aho-Corasick automaton isn't just a trie. It also has failure transitions to support unanchored search. You can search without respecting failure transitions (and thus treat it as a trie), but the code is still there to deal with failure transitions. So that may have deleterious effects.

* The contiguous NFA also specializes its transition function depending on the size of the node. It would be interesting to see whether these techniques would be useful for trie-hard to play with. But there is definitely a benefit to keeping one simple and fast code path for all node types. (Less branching.)

* The aho-corasick crate does also expose a DFA directly: https://docs.rs/aho-corasick/latest/aho_corasick/dfa/struct.... --- In theory this should do less work per transition than trie-hard, and so it would be very interesting to measure its perf when compared to trie-hard. The main downside is that it's a DFA and can use up a lot more memory and take longer to build. It seems like Cloudflare cares less about that in favor of optimizing read perf though, so it's not clear why they didn't go this route.

* The aho-corasick crate doesn't let you store arbitrary values. You only get back a PatternID when a match occurs. So to get equivalent functionality as trie-hard, you'd need a separate map of PatternID->YourValue. Since that isn't stored with the data in the trie, this alone might penalize you quite a bit with poorer CPU cache performance.

* trie-hard seems to provide some customization options for its representation, i.e., it lets you use a `u8` to store its transitions versus a `u128` or `u256`. This might also help with cache usage for a small enough trie.

vlovich123
2 replies
21h21m

can use up a lot more memory and take longer to build

Does it support const construction so that you build it during compilation from a static set of strings?

burntsushi
1 replies
21h15m

Nope. That doesn't usually help in practice because it's not hard to amortize construction with a std::sync::LazyLock.

(The `memchr` crate doesn't even support const construction of a single substring searcher. I haven't spent much time thinking about it because `const` is still pretty limited in what it supports, although progress is being made. For example, you can't use trait methods in `const`. So while you can undoubtedly use `const` Rust to build a single substring searcher, whether `const` Rust can be used to build the one in `memchr` specifically isn't clear to me. And if it could be done, it will undoubtedly come with development/code-clarity costs and it's not clear to me that those are worth paying. Because as I said, amortizing construction for a static set of strings is usually very simple to do.)

And in any case, "takes a long time to build and uses a lot of memory" isn't avoided by doing it at compile time. Instead, you get bloated binaries and longer compile times. Which costs matter more? Dunno, especially when you amortize construction at runtime.
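
For anyone unfamiliar, the amortization mentioned above looks something like this; AhoCorasick::new and is_match are the real crate API, while the pattern list is invented:

    use std::sync::LazyLock;
    use aho_corasick::AhoCorasick;

    // Built once, on first use, then shared by every subsequent search.
    static INTERNAL: LazyLock<AhoCorasick> = LazyLock::new(|| {
        AhoCorasick::new(["cf-internal-a", "cf-internal-b"]).expect("valid patterns")
    });

    fn hits_internal(name: &str) -> bool {
        INTERNAL.is_match(name)
    }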

vlovich123
0 replies
21h3m

Yeah I guess I was mainly just asking "is const powerful enough yet for this to be a trivial addition".

Instead, you get bloated binaries and longer compile times

Longer compile times probably are solved by having the strings live in a standalone compilation unit which the compiler will end up caching and you can only dirty it by adding a new string. Needing to shuffle that extra data around might still cost extra compile time, but in a real world program I suspect that's likely to not matter (both in terms of binary size & that other things in the program will be more expensive to compile).

That being said, I agree that doing a LazyLock is probably sufficient, and given the number of entries being compared against, acquiring the lazy lock probably also gets amortized (it's cheap but not free).

samatman
2 replies
21h15m

This is a gratifying read (along with the article) because I've been working on closely related things, and had come to the tentative conclusion that taking Aho-Corasick, anchoring it (so no failure transitions), and laying it out in contiguous memory, basically results in Liang's packed trie from TeX's hyphenation algorithm.

trie-hard seems to be effectively that, but with some clever masking to get mileage out of having a limited byte vocabulary to deal with.

The DFA might pull ahead, but I wouldn't just bet on that. Branch predictors don't like lookup tables much, and the memory locality hit is small but real: more of a packed trie fits in a cache line than would a DFA.

You seem to have landed on similar conclusions here: beating a packed trie for this specific use case is going to be tough.

I also like that they hit on the same popcount-and-shift technique I've used in a data structure for UTF-8 character sets: https://github.com/mnemnion/runeset. I admittedly haven't benchmarked this much, correctness first and all, but seeing that Cloudflare has seen good numbers for something closely related reinforces the few tests I have run.

As a text-matching connoisseur, you may find the runeset interesting. It turns out to be hard to search for papers on "UTF-8 character set", so if you're aware of any prior art here I'd love a heads-up. It's fundamentally a generalization of using u64 bitmasks for sets of ASCII, which is as old as the hills at this point.

burntsushi
1 replies
20h50m

RE DFA: possibly, but not completely obvious to me. The DFA in aho-corasick doesn't typically have 256 different transitions per state. It uses "byte classes" to represent a typically smaller number of equivalence classes. Two bytes are in the same equivalence class if a match/non-match cannot be distinguished between them.

To be clear, I'd bet trie-hard is still more space efficient. And using byte classes does require an extra look-up per transition.

But, worth measuring.

As for your RuneSet thing, I'd look at lightgrep and icgrep. I'm not 100% sure they have relevant work for you, but they might. In practice, I just combine sets of codepoints down into UTF-8 automata since that's needed anyway for lazy DFA matching.

samatman
0 replies
16h40m

not completely obvious to me .. worth measuring.

I concur.

lightgrep and icgrep

These are great tips for a "related work" section, but nothing like prior art as it turns out. So yes, relevant and appreciated.

UTF-8 automata .. lazy DFA matching

I wouldn't expect RuneSet to be useful for any automata-based regex pattern matching approach, there's not a good way to incorporate the context switch, among other considerations, and while it's fast, I will be surprised (very pleased, but surprised) if it's faster. It's designed as a component of an LPeg style parsing machine, where it can be added inline and treated as an unusually large instruction. I haven't done more than time its test suite, which has shown that it's fast enough for the intended use, and that optimization time is best put into other parts of the design for now.

miso2024
6 replies
1d2h

Finally a blog post with a trie. Those trie leetcodes didn't go in vain ;)

marcosdumay
1 replies
1d1h

You will find them in any fast spelling or grammar verifier.

steve_adams_86
0 replies
1d

I’ve used them in real code! They are pretty intuitive for some use cases. The last time I used a trie was for address lookup. I even used a generator, haha. Double-whammy leetcode in a real-life situation, I guess.

It’s definitely not a common tool I reach for though.

jitl
0 replies
12h38m

I find myself reaching for a trie/radix tree from time to time in medium-complexity build tools or code analysis scripts, where I need to store some information about zillions of filepaths, and factoring out the common prefixes saves a ton of memory while being pretty intuitive.

cdelsolar
0 replies
2h39m

Tries are huge for Scrabble move generators and playing programs, of which I've written a few. The fastest structures are compressed tries with rotated representations of words.

ayuhito
0 replies
23h12m

I recently got a chance to find a practical use for a modified hybrid trie which was fun to implement! Lots of optimisations to go around.

It’s a nifty user-agent parser for Go: https://github.com/medama-io/go-useragent

Typically this sort of problem relies on lots of regex parsing, so it was nice to take a more novel approach.

cwilby
6 replies
1d2h

Internally, we call the request’s destination server its “origin”

Origin: The point at which something comes into existence or from which it derives or is derived.

How can the request's destination server be the origin, if it is the destination server?

toast0
1 replies
1d2h

The origin is the origin of the content.

cwilby
0 replies
1d2h

Thus cross origin resource sharing. Thanks!

At least it's consistently inconsistent.

tills13
1 replies
1d2h

On the internet, the origin is the server sending the response to the user. I suppose you can look at it from the perspective of the owner of the server -- from their frame of reference, their journey _starts_ when they receive, process, and respond to the request.

Granted, computer scientists are infamous for being terrible at naming things.

hinkley
0 replies
1d2h

Had any fun with “Referer” headers lately?

taeric
0 replies
1d2h

This hilariously reminds me of the complaining about X server semantics back in the day.

kristianp
5 replies
21h35m

I would have stopped at the first step, saving almost 1% of cpu time (0.993%). Creating a new fast trie data type for the other 0.287% seems inefficient. Surely there are other parts of the code that take more than 1% of the cpu time elsewhere to be looked at first?

FridgeSeal
4 replies
19h13m

From the article:

At the time of writing, the rate of requests leaving pingora-origin (globally) is 35 million requests per second. Any code that has to be run per-request is in the hottest of hot paths...function consumes more than 1.7% of pingora-origin’s total cpu time. To put that in perspective, the total cpu time consumed by pingora-origin is 40,000 compute-seconds per second. You can think of this as 40,000 saturated CPU cores fully dedicated to running pingora-origin. Of those 40,000, 1.7% (680) are only dedicated to evaluating clear_internal_headers. The function’s heavy usage and simplicity make it seem like a great place to start optimizing.

Given how many requests are going out, and just how hot the function is, every single gain matters. If my read of the article is correct, this is the hottest part of the hot path, so every gain here is super important.

__turbobrew__
3 replies
14h15m

680 cores is peanuts. That is fewer than 10 modern specced compute servers. It's not nothing, but you are probably talking about $20-30k/year in opex costs, which at Cloudflare's scale is probably a drop in the bucket.

Obviously it is an improvement but a lot of time there is lower hanging fruit, like the fact that the cloudflare control plane is/was not globally distributed and could not tolerate data center outages.

FridgeSeal
1 replies
12h33m

Idk man, I’m not Cloudflare.

That said: probably different teams with different priorities.

I also disagree with "it's _only_ $20-30k" as a reason not to do this kind of work. Think of it another way: that's now another 680 cores that can do _net more work_, and another $20-30k we don't have to spend when traffic increases. "I landed a small PR that wiped $30k net off our costs" is a statement that would make a lot of people in a business pretty happy.

__turbobrew__
0 replies
10h58m

I work at a tech company that is a bit bigger than Cloudflare, and anything less than six figures/yr of opex reduction is a rounding error. It is pretty regular to save the company seven figures per year when opex is in the billions.

It is about the opportunity cost; nobody is going to disagree that small optimizations are a net positive, but what else could you have done with your time? Maybe Cloudflare is at the point of maturity where there are no easy wins and only 1% gains left, but I kind of doubt it given their reliability track record. How many dollars do they lose for every hour they are down? Probably more than $20k.

cryptonym
0 replies
8h23m

Marketing content is more valuable than 680 cores. Let the guys have fun developing tries, as long as they produce an article.

Scaevolus
5 replies
1d3h

Did you try a (tiny) bloom filter? Doing a quick convolution of the header key and a test against a bloom filter (SIMD? Using builtin CRC ops? 256-bit bloom filter size?) might avoid walking the trie altogether in most cases at the cost of a few cycles.
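
For scale, a 256-bit filter is just four u64 words. A minimal sketch, with both probe positions derived from a single precomputed hash (everything here is illustrative):

    struct TinyBloom([u64; 4]);

    impl TinyBloom {
        fn insert(&mut self, h: u64) {
            for bit in [h & 255, (h >> 8) & 255] {
                self.0[(bit / 64) as usize] |= 1 << (bit % 64);
            }
        }

        // False positives are possible, false negatives are not, so a
        // "maybe" still has to be confirmed against the real set.
        fn maybe_contains(&self, h: u64) -> bool {
            [h & 255, (h >> 8) & 255]
                .iter()
                .all(|&bit| self.0[(bit / 64) as usize] & (1 << (bit % 64)) != 0)
        }
    }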

vlovich123
0 replies
21h25m

A bloom filter will have the same weaknesses of needing to do a hash. For static filters like this though where you have the set of strings to check against known up front, something like a binary fuse filter could work really well as it's significantly faster than bloom filters when you don't need online construction (you still need to do a hash of course but the evaluation against std::HashMap is flawed since it uses SipHash instead of modern constructions like xxh3 or gxhash).

Scaevolus
0 replies
1d2h

Yeah, for this problem you'd want a bloom filter that can entirely fit in vector registers during the computation, which should work well for a few hundred items.

pragma_x
0 replies
23h43m

Considering that each node's filter in their trie implementation is already bloom-like, the solution is awfully close. I agree that testing wider swaths of the string at once might be a dramatic accelerator.

A more naive approach might also be warranted. Some frequency analysis of hit/miss for trie nodes might allow them to find specific character positions with a higher miss rate than the first one. Testing such special positions first would speed things up. That assumes, of course, that the header data is fairly regular in nature.

jkaptur
0 replies
1d

I think that would require calculating the hash, which was ruled out as too slow.

unusualmonkey
4 replies
18h29m

My 2c:

1) Is this worthwhile? Looks like ~500 CPU cores were saved (are these real cores, or does this include hyperthread cores?). I don't know cloudflare's costs, but this seems like single digit servers and probably savings only in the $XX,XXX range. Not nothing, but do you expect a positive ROI on engineering?

2) If you do want to go into this much detail, did you consider putting the filter at the deserialization step, preventing the headers from being created in the first place?

saagarjha
0 replies
14h25m

If an engineer spends a couple of days on it then the savings seem worthwhile?

cryptonym
0 replies
8h28m

That's such a tiny corner case, yet something you can easily publish: fuel for marketing content that ends with a hundred geeks discussing whether what CF is doing makes sense or how much they care about perf.

That's worth millions.

aetherspawn
0 replies
16h28m

It might only be $XX,XXX but it is forever.

Power savings forever.

Lower carbon emissions.

etc

Nice to see a company that cares about making things 1% faster rather than trying to analyse headers with AI to sell me sandals or something stupid like that.

_zoltan_
0 replies
11h8m

I don't understand your point about the ROI. Let's say it's $40k a year; over 5 years that's $200k. That's a multi-year salary in Hungary/Poland (I have no idea if CF has offices there).

o11c
3 replies
23h34m

That late in the process, does it even need to support efficient lookup? Could it just be a Vec?

menaerus
2 replies
21h51m

They don't explicitly say this in the blog post, but I assume the engineering team behind this code does not run any workload performance experiments themselves; they only run the local (micro)benchmarks.

Because of this, I think there's no way for them to answer your question or to say "hey, this could very well be a vector, since we could not find any statistically significant impact between a vector, hash map, and trie in this code path for the given workload".

This is a common fallacy I've seen in too many places. In other words, investing non-trivial engineering effort into optimizing a code-path that resembles .000001% of your workload is a waste of resources.

And due to Amdahl's law,

Optimizing operations that are already measured in microseconds may seem a little silly, but these small improvements add up.

this is not how things usually work out in the end. Small improvements in insignificant code paths remain insignificant in the overall runtime performance.

vlovich123
1 replies
21h8m

If we compare the sampled performance of the different versions of clear_internal_headers, we can see that the results from the performance sampling closely match what our benchmarks predicted

Look at the end where they validate that the performance wins show up in production in a similar way that the microbenchmarks predict.

menaerus
0 replies
11h14m

They only measure the CPU utilization of clear_internal_headers; they don't show what the impact of the change was on the overall runtime performance. The latter is what I was referring to.

maciek12
3 replies
23h48m

Am I missing something, or is the big-O notation in this article wrong? For example, in "Sorted sets like BTreeSet use comparisons for searching, and that makes them logarithmic over key length O(log(L)), but they are also logarithmic in size too", how is the search done in logarithmic time? You could have a header that differs from an internal one only on the last character and then you have to read the whole thing. Also, space-wise, B-trees are linear, not O(nlogn). Additionally, at the end, when talking about the trie, how do you achieve O(log(L)) for misses? Tries are not balanced, they do not halve the possible set on comparison (as I think the author states), and even if they did, I don't see how that achieves logarithmic time.

vlovich123
2 replies
21h10m

You could have a header that differs from an internal one only on the last character and then you have to read the whole thing

When people talk about big-O they usually mean the average case; worst/best-case big-O is explicitly called out if mentioned, so you can't just think in terms of the worst possible case. Regardless, I think they made several mistakes here, as you correctly note.

BTreeSet is log(N) in number of nodes compared. But for strings that comparison is O(L) since you at least have to compare it to the string you find. So I believe it's at best O(L log(N)) where L is the average string length and N is the number of nodes which is worse than the hash table which is just O(L). It's tricky though if most strings don't match your string. In that case I believe you end up degrading to an O(L) + O(log(N)).

Similarly, you are correct that they can't be logarithmic in size and must be linear since you have to store the data. Tries can be sublinear depending on the input since it's a form of compression as well.

Similarly you are right about trie complexity not being O(log(L)) for trie misses. I think what they're observing is a massive speedup because mismatches error out on the first character usually. But it wouldn't be logarithmic as there's unlikely to be a logarithmic relation between headers that mismatch and the universe of matching words.

maciek12
1 replies
20h46m

Thanks for replying, I really appreciate you taking the time to check my concerns! I was afraid I was going crazy haha.

The part about big-O being usually about the average case was helpful; I'm still at uni, where we mostly talk about worst-case performance.

vlovich123
0 replies
19h25m

I'll note I could very well be wrong; this was a back-of-the-paper, first-principles derivation and I could easily have made a mistake.

If you're in a relevant course, might be a good opportunity to flag this to a professor or TA as a way to review. These kind of analyses can be tricky sometimes, especially if you need to prove it vs just do an answer by feel which is how engineers end up doing it (vs CS students that take the proofs much more seriously).

A way to sometimes double-check is to just simulate performance as input grows and see if it matches the expected log(N).

IncreasePosts
3 replies
1d2h

Off topic: how does 60M requests a second stack up with other big players like google, Facebook, TikTok?

joshuamorton
0 replies
22h19m

I will note that HTTP requests and internal requests are distinct.

The Zanzibar white paper from Google in 2019, for example, lists top-line RPCs to Zanzibar at around 10M QPS, but internal (in-memory) cache hits at 200M QPS and non-cache internal fan-out at >20M QPS. Zanzibar of course isn't HTTP, but it is a good public example of internal RPC fan-out.

I'd expect Cloudflare similarly has some extremely high-QPS internal services.

why_only_15
0 replies
16h40m

People typically quote Google Search as ~300k QPS.

wolf550e
2 replies
1d

I don't know Rust. Can someone explain how this data structure stores whether a node is itself a valid word or whether it only leads to more nodes? For example the node "do" in their (“and”, “ant”, “dad”, “do”, & “dot”) example. I guess it's a discriminated union provided by Rust algebraic data types or something like that, but why not show the full bit pattern in memory?

steveklabnik
0 replies
22h5m

why not show the full bit pattern in memory?

Rust doesn't guarantee a particular memory layout except for types explicitly requesting that you get a specific representation, so showing this would at least require a caveat that it's not guaranteed to be accurate.

Furthermore, this specific data structure's internals are generic, so to show the memory layout, it would have to be for a specific instantiation.

(They do also provide a structure that includes instantiations between u8 and u256, mentioning in a doc comment that the 256 size dominates overall size, and so if you want to slim it down, use the internals directly.)

SkiFire13
0 replies
23h39m

By looking at the source code [1] they have a discriminated union with 3 variants representing the possible combinations of representing a valid word and leading to more nodes (excluding the case where it is neither, which is impossible). So the node for the 'o' in "do" should be a `SearchOrLeaf` storing "do", its corresponding value (the `T`, in a set this would be empty) and the `SearchNode` containing the successors, which should only contain a 't' to "dot".

[1]: https://github.com/cloudflare/trie-hard/blob/3d8163ca2d9208c...

sophacles
2 replies
1d2h

So this trie uses a custom u256 type that is a wrapper around an array of 4 u64s. Is Rust smart enough to vectorize it into AVX instructions for the bit twiddling?

What about for the other integer sizes that the trie supports?

Validark
0 replies
4h14m

If the bit twiddling is independent in each u64, then probably. If it's treated as one giant bitstring, probably not.

FridgeSeal
0 replies
19h20m

A lot of Rust's optimisations come from LLVM, so if that's smart enough to, then I assume Rust also benefits.

pragma_x
2 replies
23h55m

It took me a moment to ponder the effectiveness of mapping utf8 characters into a bitmask. At first, it seemed unwise. Then I realized that 32 bits would get you `a-z` plus six special characters, and 64 bits would get you 'A-Z' (uppercase) plus six more. That's plenty of room for HTTP headers, and for a blazing fast matching algorithm since it just masks and compares a couple integers. Even figuring out which bit corresponds to which character is a simple lookup into a 256-word table.
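
In code, the mapping and the membership test might look like this; the alphabet and names are hypothetical (trie-hard derives its mapping from the actual keys):

    fn bit_for(b: u8) -> Option<u64> {
        // Fewer than 64 distinct bytes appear in header names, so each
        // legal byte gets one bit position.
        const ALPHABET: &[u8] = b"abcdefghijklmnopqrstuvwxyz0123456789-_";
        ALPHABET.iter().position(|&c| c == b).map(|i| 1u64 << i)
    }

    fn node_has_child(node_mask: u64, next_byte: u8) -> bool {
        // One AND answers "does this node have an edge for this byte?"
        bit_for(next_byte).map_or(false, |bit| node_mask & bit != 0)
    }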

One thing the author leaves out is that this technique is technically a Bloom filter. I find these kinds of things fascinating since they came about in an era where computing was far more resource bound than today (1970 in this case). But here we are, still finding real-world corners to use the same old optimizations.

https://en.wikipedia.org/wiki/Bloom_filter

samatman
0 replies
21h37m

There are significant differences between trie-hard and a Bloom filter. Bloom filters are probabilistic, and use hashing. They're great for when rare false positives are acceptable in exchange for no false negatives, which isn't uncommon, but isn't what's needed here: this is exact, and hashing is the step to beat.

Rather, this is a refinement/variation on Liang's algorithm, used in TeX for storing a hyphenation dictionary. The thesis mentions Bloom's algorithm, which it calls "superimposed coding", and very much hearkens back to a time when memory was often the most precious commodity. I think you'll like it. ^_^

https://tug.org/docs/liang/liang-thesis.pdf

gcbirzan
0 replies
6h50m

The problem is that bloom filters are good when your dataset is large and calculating several hashes per item is not as expensive as looking it up in the original dataset. Here, they say that even one hash is expensive, so bloom filters would be much slower than the original hash map.

aidenn0
2 replies
1d2h

Given that the set of items to match is static, I wonder if they tried a perfect hash table; that would reduce it to a few arithmetic operations followed by a single string compare. It would be interesting to see how it compares to the trie.
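
The rust-phf crate mentioned elsewhere in the thread does exactly this at compile time; a minimal sketch (with its `macros` feature, and invented header names):

    use phf::{phf_set, Set};

    static INTERNAL_HEADERS: Set<&'static str> = phf_set! {
        "cf-internal-a",
        "cf-internal-b",
    };

    fn is_internal(name: &str) -> bool {
        // A few arithmetic ops to hash, then a single string comparison.
        INTERNAL_HEADERS.contains(name)
    }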

sjf
1 replies
10h12m

That is what I immediately thought as well. The problem though is that a perfect hash is still going to be O(key length) for the hash function, which they are trying to avoid.

In theory, if they used a regex it would be using a state machine to do the matching, which should have similar performance to the trie: only O(k) in the worst case. But from what I understand, many regex libraries don't actually build the state machine; they use backtracking, so the performance is no longer O(k).

I'm surprised they couldn't find a performant regex library that already existed that uses state machines. It should have similar performance to the trie. But in reality, the performance is affected more deeply by things like memory access patterns and the cost of specific arithmetic operations, so it's really impossible to speculate.

lesuorac
0 replies
3h50m

I wonder how much faster it gets when you shorten the maximum internal key length. If you re-map everything to like "cf-1", "cf-2", .. "cf-;" then you'd only need to hash like 4 characters.

y04nn
1 replies
20h55m

Why add and then remove HTTP headers on an existing request for routing, instead of adding a dedicated (custom) routing protocol above the HTTP stack? This would allow you to just drop the added protocol on the way out, and it would be much more space efficient.

intelVISA
0 replies
14h19m

The fabled tech talent shortage, probably.

xyst
1 replies
23h24m

It’s a bit wild to me that one needs to add so many http headers internally that they need to get stripped off.

The “trie-hard” crate is a solution to the layers of cruft at CF

FridgeSeal
0 replies
19h19m

I don't think it would matter if there's only 1 header to be removed, it's still a problem to be solved.

alexchamberlain
1 replies
22h19m

Jumping on the bandwagon of other ideas... I wonder if it would be quicker to filter out the internal headers when you write the request to the network (buffer)? I.e., something like `request_headers.filter(not_is_internal).for_each(...)`; that way you don't have to remove anything from the data structure at all, though it does require you to break down a layer of abstraction for performance benefits.

vlovich123
0 replies
21h23m

No, you'd just be smearing the work around at best. The expensive part is determining set containment, not updating the data structure. It doesn't matter whether you do it in a loop and update the data structure or do the check before writing.

Buckaroo9
1 replies
11h21m

Shouldn't this line:

> "using regular expressions clocks in as taking about twice as long as the hashmap-based solution."

be written as the opposite? That the RE takes half as long/is twice as fast?

nitnelave
0 replies
10h29m

I think the point is that a naive regex is a very general-purpose tool, but it's still in the same ballpark. Having a custom optimized state machine for this specific use case could bring another 5x improvement on top, leading to 2.5x faster, potentially.

sixo
0 replies
21h36m

this post makes me think I could work at cloudflare, and also that I shouldn't

pianoben
0 replies
16h13m

This solution reminds me of an old blog post, where two members of the F# team at Microsoft battled back and forth on making the best solver for some word game. Cloudflare's solution here looks a lot like the second-to-last solution; the final post involved collapsing the tree nodes into a 2D array of u32s for maximum cache-hits. It was a really neat solution!

...and amazingly, it's still up from 2010: https://lorgonblog.wordpress.com/2010/02/21/dawg-gone/

mwexler
0 replies
5h3m

I know, they aren't the first to use it, but I still appreciate the pun.

mgaunard
0 replies
22h35m

If it takes 3.5 us to remove a substring within a packet-sized string you're doing it wrong.

hencoappel
0 replies
1d

Given it's such a hot part of their code, I'm genuinely surprised they didn't go down the state-machine route for a more optimal solution, or even further and make use of vector instructions. Maybe Rust can autovectorize the code, though.

gwittel
0 replies
18h54m

Neat optimization! Would it have been feasible to spend a bit of CPU/memory and tag headers as internal upon request construction? That way, filtering on output is trivial.

gregsexton
0 replies
3h37m

The presented data for demonstrating a win doesn't have enough power to actually show this - not enough samples were taken.

A very simple analysis in R:

    > prop.test(c(9, 4), c(1103,1171))

    2-sample test for equality of proportions with continuity correction

    data:  c(9, 4) out of c(1103, 1171)
    X-squared = 1.4915, df = 1, p-value = 0.222
    alternative hypothesis: two.sided
    95 percent confidence interval:
    -0.002409831  0.011897193
    sample estimates:
        prop 1      prop 2 
    0.008159565 0.003415884
A p-value of 0.22 isn't below the magic 0.05 and the 95% confidence interval suggests that the trie might actually be slightly worse.

I imagine the trie is better, given the prior analysis, and there is weak evidence for this. But take (a lot) more samples and know with confidence how much better.

bhawks
0 replies
12h2m

Alternatively at request ingress time you could take all original header names and save them, and at egress time you could use that list to drive which keys to emit?

Then you wouldn't need to maintain a list of what is internal and you wouldn't have to do any trie lookups.
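
A sketch of that alternative, which assumes internal hops never need to add a client-visible header (names here are made up):

    use std::collections::HashSet;

    struct Tracked {
        original_names: HashSet<String>, // captured when the request arrived
    }

    impl Tracked {
        // At egress, emit only headers whose names existed at ingress.
        fn egress<'a>(
            &'a self,
            headers: &'a [(String, String)],
        ) -> impl Iterator<Item = &'a (String, String)> + 'a {
            headers
                .iter()
                .filter(move |(name, _)| self.original_names.contains(name.as_str()))
        }
    }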

Validark
0 replies
3h56m

It's silly to use Big-O to describe the number of comparisons when the goal is to analyze performance. Comparisons cost 1 cycle, and you can do many of them per cycle with instruction level parallelism and SIMD. The main bottleneck, the actual source of slowness, is memory. It costs thousands of cycles to go to memory, sometimes tens of thousands or hundreds of thousands (if you need a TLB walk or OS interrupt). If you want to use Big-O, use it to estimate the number of cache misses.

I'd probably go with a custom perfect hash function. And Phil Bagwell's popcount trick. That would be faster than everyone else's solution that involves multiple lookups in memory.
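
The popcount trick in isolation, for the curious: a node stores a bitmap of which children exist plus a dense array of only those children, and the child's index is the count of set bits below its position (illustrative only):

    struct Node {
        child_bitmap: u64,   // bit i set => a child exists for symbol i
        children: Vec<Node>, // dense, in symbol order
    }

    impl Node {
        fn child(&self, symbol: u32) -> Option<&Node> {
            let bit = 1u64 << symbol;
            if self.child_bitmap & bit == 0 {
                return None; // miss decided with one AND, no memory chase
            }
            // Rank query: how many children precede this one.
            let idx = (self.child_bitmap & (bit - 1)).count_ones() as usize;
            self.children.get(idx)
        }
    }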

The CPU is fast, memory is slow.

MehdiHK
0 replies
1d1h

Slightly off topic: I was a bit surprised nobody has ported the space-efficient marisa-trie to Rust, or any other language.

Anybody know why?

https://github.com/s-yata/marisa-trie