If your goal is to prevent DoS attempts from degrading the service of other tenants in a multitenant environment, fair queuing is the optimal approach. Give each client their own queue to which incoming traffic is enqueued, and have a background routine that repeatedly iterates over each queue, dequeuing a single request and servicing it. Any client that spams requests will only congest their own queue and not those of other clients.
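To make that concrete, here's a rough TypeScript sketch of the per-client round-robin; the MAX_PENDING bound and the handle() callback are stand-ins for whatever your service actually does:

    // One bounded queue per client, served round-robin by a background loop.
    type Req = { clientId: string; handle: () => Promise<void> };

    const MAX_PENDING = 100;                     // per-client bound (assumption)
    const queues = new Map<string, Req[]>();     // clientId -> pending requests

    function enqueue(req: Req): boolean {
      const q = queues.get(req.clientId) ?? [];
      if (q.length >= MAX_PENDING) return false; // shed load for this client only
      q.push(req);
      queues.set(req.clientId, q);
      return true;
    }

    // One request per client per pass, so a spammer only congests their own queue.
    async function serveLoop(): Promise<void> {
      while (true) {
        for (const [clientId, q] of queues) {
          const req = q.shift();
          if (req) await req.handle();
          if (q.length === 0) queues.delete(clientId);
        }
        await new Promise((r) => setTimeout(r, 1)); // yield so the loop never spins hot
      }
    }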
A few extra considerations, picked up over many years of hard lessons:
1. Rate limits don't really protect against backend capacity issues, especially if they are statically configured. Consider rate limits to be "policy" limits, meaning the policy of usage will be enforced, rather than protection against overuse of limited backend resources.
2. If the goal is to protect against bad traffic, consider additional steps besides simple rate limits. It may make sense to perform some sort of traffic prioritization based on authentication status, user/session priority, customer priority, etc. This comes in handy if you have a bad actor!
3. Be prepared for what to communicate or what action(s) to perform if and when the rate limits are hit, particularly from valuable customers or internal teams. Rate limits that will be lifted when someone complains might as well be advisory-only and not actually return a 429.
4. If you need to protect against concertina effects (all fixed windows, or many sliding windows expiring at the same time), add a deterministic offset to each user/session window so that no large group of rate limits can expire at the same time.
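For point 4, a sketch of what "deterministic offset" means in practice, assuming fixed windows and some stable per-user/session key (the hashing choice is just an illustration):

    import { createHash } from "node:crypto";

    // Deterministic per-key offset so fixed windows never all reset at once.
    function windowStart(key: string, nowMs: number, windowMs: number): number {
      const digest = createHash("sha256").update(key).digest();
      const offset = digest.readUInt32BE(0) % windowMs;   // stable for a given key
      return Math.floor((nowMs - offset) / windowMs) * windowMs + offset;
    }

    // Two users with the same limit now reset at different moments:
    // windowStart("user-a", Date.now(), 60_000) vs windowStart("user-b", ...)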
Hope that helps someone!
What exactly do you mean by your first point?
A rate limit is most often some arbitrarily configured static value, e.g. each org can make 10 req/s. It's much harder to configure rate limits against a dynamic value like system load, so most go with the static approach.
Like the OP said, this doesn't protect you from going over your system capacity: you can still have 10 million orgs each requesting 10 req/s, which can take down your system while everyone stays within their limits.
Maybe that per-customer rate limits don’t guarantee the whole backend won’t go over capacity? Though I guess many APIs will have global rate limits as well for these cases.
protect against backend capacity issues
That's our primary use case, so I am also curious to hear more.
add a deterministic offset to each user/session window so that no large group of rate limits can expire at the same time
Did you mean non-deterministic (like jitter)?
GP meant adding jitter deterministically.
Long ago I was responsible for implementing a “rate limiting algorithm”, but not for HTTP requests. It was for an ML pipeline, with human technicians in a lab preparing reports for doctors and, in dire cases, calling their phones directly. Well, my algorithm worked great: it reduced a lot of redundant work while preserving sensitivity to critical events. Except some of the most common and benign events had a rate limit of 1 per day.
So every midnight UTC, the rate-limit quotas for all patients would “reset” as the timestamp rolled over. Suddenly the humans in the lab would be overwhelmed with a large amount of work in a very short time, but by the end of the shift there would be hardly anything left to do.
Fortunately it was trivial to add a random but deterministic per patient offset (I hashed the patient id into a numeric offset).
That smoothly distributed the work throughout the day, to the relief of quite a few folks.
Rate limits don't really protect against backend capacity issues
Yes and no, there's a little more nuance here. You're correct that the business signing up X new users, each with new rate ~limits~ allocations, does not in and of itself scale up your backend resources, i.e. it's not naively going to vertically scale a Postgres database you rely on. But having a hardware rate limiter in front is like setting the "max" value on your autoscaler - it prevents autoscaling cost from skyrocketing out of control when the source of the traffic is malicious, the result of a bug, or otherwise "bad"; instead, a human is put in the loop to gauge whether the traffic is "good" and therefore the rate limit should be increased.
Rate limits that will be lifted when someone complains might as well be advisory-only and not actually return a 429
How does one set an "advisory-only" rate limit that's not a 429? You can still return a 429 with a body that includes directions on how to ask for a rate limit increase. I don't think of 4xx as meaning that the URL will never return anything other than a 4xx, rather that it will keep returning 4xx until a human intervenes. For example, if you're going to write blog.example.com/new-blog-entry, it's a 404 before you publish it; after the blog post is published, it returns a 200 instead.
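As a toy illustration (Express-style, with made-up limits and wording), the 429 can carry exactly that guidance:

    import express from "express";

    const app = express();
    const hits = new Map<string, { count: number; resetAt: number }>();
    const LIMIT = 10;             // requests per window (made-up)
    const WINDOW_MS = 60_000;

    app.use((req, res, next) => {
      const now = Date.now();
      const key = req.ip ?? "unknown";
      const entry = hits.get(key) ?? { count: 0, resetAt: now + WINDOW_MS };
      if (now >= entry.resetAt) { entry.count = 0; entry.resetAt = now + WINDOW_MS; }
      entry.count++;
      hits.set(key, entry);
      if (entry.count <= LIMIT) return next();
      // Still a 429, but the response tells the caller what to do about it.
      res.status(429)
        .set("Retry-After", String(Math.ceil((entry.resetAt - now) / 1000)))
        .json({
          error: "rate_limited",
          detail: "Contact support to request a higher limit.",
        });
    });

    app.get("/", (_req, res) => res.send("ok"));
    app.listen(3000);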
Great advice!
Ideally, you can provide isolation between users on the same "tier" so that no one user can crowd out others.
I am such a fanboy for this kind of data viz stuff. Are you using D3?
Just canvas APIs! A lot of fillRect and roundRect in a requestAnimationFrame loop
Interesting, I would have guessed you had used something Jupyter-like.
That can't be real. Awesome.
An interesting idea is rate limiting the client by requiring them to solve a puzzle before their request is handled.
If their last request was recent, make the puzzle harder; if it was longer ago, make it easier.
The puzzle might be like the one in the Bitcoin mining protocol: guessing which bit string hashes to a value with a specific number of zero bits.
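Roughly the hashcash idea. A toy sketch of the server-side check, assuming the client brute-forces a nonce; how you scale the difficulty with request recency is up to you (leading zero bits here, but trailing works the same way):

    import { createHash } from "node:crypto";

    // Client must find a nonce so that sha256(challenge + nonce)
    // starts with at least `difficulty` zero bits.
    function leadingZeroBits(buf: Buffer): number {
      let bits = 0;
      for (const byte of buf) {
        if (byte === 0) { bits += 8; continue; }
        bits += Math.clz32(byte) - 24;   // clz32 counts over 32 bits; a byte uses the low 8
        break;
      }
      return bits;
    }

    function verify(challenge: string, nonce: string, difficulty: number): boolean {
      const digest = createHash("sha256").update(challenge + nonce).digest();
      return leadingZeroBits(digest) >= difficulty;
    }

    // e.g. difficulty = 8 + (secondsSinceLastRequest < 10 ? 8 : 0)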
Alternatively charge them more for lower rate limits.
That would help against Sybil attacks, since a rate limit on a free service could be avoided by making more accounts. It does have the issue of being dependent on the client’s hardware to provide a server-side rate limit, though.
I've implemented a lot of client-handling code and always wondered what the optimal back-off strategy is when I hit a rate limit. It's interesting to read about the trade-offs from the perspective of the service, since that can inform how a client best reacts.
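The baseline I keep coming back to is exponential backoff with full jitter, honoring Retry-After when the server sends one. A rough sketch; the base, cap, and attempt count are arbitrary:

    // Retry 429s with full-jitter exponential backoff, preferring Retry-After.
    async function fetchWithBackoff(url: string, maxAttempts = 5): Promise<Response> {
      const baseMs = 250, capMs = 30_000;              // arbitrary choices
      for (let attempt = 0; ; attempt++) {
        const res = await fetch(url);
        if (res.status !== 429 || attempt === maxAttempts - 1) return res;
        const retryAfter = Number(res.headers.get("Retry-After"));
        const delayMs = Number.isFinite(retryAfter) && retryAfter > 0
          ? retryAfter * 1000                          // server told us when to come back
          : Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
        await new Promise((r) => setTimeout(r, delayMs));
      }
    }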
There is also the opposite problem. I once worked on a larger platform where enabling rate limits caused issues with the service, precisely because so many clients were hitting the API often enough that the limits functionally expired at the same time for all the heavy users. The service would receive a crapload of requests at the exact same moment, followed by a period with very few requests; rinse, repeat.
What was the solution to these? Naively, I would guess that either the expiration time for each limit would need some jitter to spread out the requests, or the rate limits would need to be tiered (e.g. max 10/second, max 60/minute, max 500/hour) so that the shorter timescales help to smooth out traffic over longer timescales.
But I haven’t worked much with network rate limiting, so I’m curious what the actual solutions look like.
Last year I tried very hard to get some rate limiting in our Lambda to work against an upstream target (so that our jobs don't trigger the rate limit of the upstream API). Sadly I could not find much literature specifically about rate limiting in NodeJS. No matter what I tried, it would just not work on AWS Lambda (it would constantly overshoot the target, which made me guess something is wonky with timing), while passing the tests locally. I still don't know if it's because timers on Lambda behave strangely (token buckets need to be refilled) or if every rate-limiting library out there for NodeJS is just broken. But my own attempt wasn't any more reliable either, so... who knows.
You probably need to store the bucket in some kind of shared persistent storage like ElastiCache, since each Lambda instance only sees its own memory?
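Otherwise N concurrent instances each hand out the full rate. A sketch of the simplest shared version (a fixed-window counter rather than a true token bucket, which would need a small Lua script to stay atomic); the RedisLike interface just stands in for whichever client you use:

    // Fixed-window counter kept in Redis/ElastiCache so every Lambda instance
    // draws from the same budget.
    interface RedisLike {
      incr(key: string): Promise<number>;
      pexpire(key: string, ms: number): Promise<number>;
    }

    async function allowed(redis: RedisLike, key: string, limit: number, windowMs: number) {
      const bucket = `rl:${key}:${Math.floor(Date.now() / windowMs)}`;
      const count = await redis.incr(bucket);
      if (count === 1) await redis.pexpire(bucket, windowMs * 2); // let old windows expire
      return count <= limit;
    }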
It is a shame GCRA is not more well known and used for rate limiting. It is, in my view, a better algorithm.
https://medium.com/smarkets/implementing-gcra-in-python-5df1... https://en.m.wikipedia.org/wiki/Generic_cell_rate_algorithm
Isn't GCRA a variant of leaky bucket, i.e. the token bucket described in the article?
As far as I can tell, the behavior should be the same; the difference is just implementation details and what you track.
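Concretely, GCRA stores a single "theoretical arrival time" per key instead of a token count plus a last-refill timestamp, which is what makes it pleasant to keep in one small value. A sketch, in-memory for illustration:

    // GCRA: one timestamp (the theoretical arrival time, TAT) per key.
    // emissionMs = spacing between requests at the sustained rate,
    // burstSize  = roughly how many requests may arrive back-to-back.
    const tat = new Map<string, number>();

    function allow(key: string, emissionMs: number, burstSize: number): boolean {
      const now = Date.now();
      const t = Math.max(tat.get(key) ?? now, now);
      if (t - now > burstSize * emissionMs) return false; // too far ahead: reject
      tat.set(key, t + emissionMs);
      return true;
    }

    // allow("org-123", 100, 10) -> about 10 req/s sustained, bursts of ~10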
Absolutely awesome.
What do you do when even your rate limiting layer gets fully saturated with requests? Does one have any options other than involving CF?
I thankfully was never in the position to experience this, but I always wondered how far, say, nftables rules go in thwarting a DoS attack against a conventional webapp on a tiny VPS.
Congrats on a great post, informative and to the point, with the best visualization I have seen for such short content.
Excellent work on this. You can feel the craft and time you put into this post. Well done.
very well put together article!
I have wanted this resource many times in my career. Glad it finally exists.
the visual performance is great!
I usually encounter rate limiting when trying to scrape websites. I was even rate limited while manually browsing a website that decided I was a bot.
At some point you have to stop clients from enqueuing further requests; you can't grow the queues indefinitely. At that point, isn't it equivalent to rate limiting each client plus a shared queue?
Not quite, because it isn’t a fixed rate limit. Suppose you can handle 120 requests/second across all customers. If you have 3 clients with active requests, they each are being served 40 requests/second, even if one of them has filled up their maximum pending requests. If you have 6 clients, each are being served 20 requests/second.
If you are applying a fixed rate limit to each client, you’d need to adjust the rate limit dynamically based on the current number of clients in order to reproduce the same behavior.
This eliminates the benefits of multitenancy, since you don't get to downsize the server - you're still provisioning for maximum rates from all clients simultaneously.
Something like Inngest's multi-tenancy aware flow control should be table stakes for most complex products now: https://www.inngest.com/docs/guides/flow-control.
Huge disclaimer here: I'm one of the founders. You should most definitely be able to set concurrency levels, throttling, rate limiting, etc. per your own tenant, in code, without having to mess around with creating independent queues or streams for users, and without managing state. We let you do that.
I'm not sure I follow the argument. The scheduling seems like it would be de-coupled from the provisioning. With the static rate limit, if the only additional requests are from the high-rate client, then those requests are ignored and the server is idle. With the fair scheduling, the server is only idle if all requests from all clients are filled.
Which is beneficial depends on how the payment scales. If clients pay for access up to some rate limit, then the rate limit is there to enforce payment, and serving additional requests above that rate limit is an additional cost without a benefit. If clients pay per request, then any rate limits are there to ensure quality of service, and serving additional requests above the rate limit is additional revenue.
Good point.
What would you recommend if requests are highly parameterized and some can be many orders of magnitude more taxing on the system than others?
We usually implement queueing on the route to the specific things that are vulnerable. If you cannot discern what traffic does what, you just need to move the rate limiter closer to the application or the problem hot spot. It's perfectly valid to return an HTTP 429 response from a backend and let your frontend handle that in some graceful way. The same is valid as an exception in code; the nearer you get to the problem spot, the harder it is to get right.
EDIT: clarification.
In the abstract sense, instead of pulling from queues round-robin, you can assign "tokens" to each queue round-robin. When the number of tokens a queue has reaches the cost of the request at its head, reset its tokens and pull that request.
This can also be used to handle priority. Maybe paying customers or customers on the enterprise plan get 2 tokens per round, or their requests only have half the cost.
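A sketch of that token-per-round idea (it's essentially deficit round robin), assuming per-tenant queues; the cost and creditPerRound values are whatever weights make sense for your system:

    // Each queue earns credit every pass and may only dequeue once its credit
    // covers the cost of the request at its head.
    type Job = { cost: number; handle: () => Promise<void> };
    type Tenant = { queue: Job[]; credit: number; creditPerRound: number };

    const tenants = new Map<string, Tenant>();

    async function servePass(): Promise<void> {
      for (const t of tenants.values()) {
        if (t.queue.length === 0) { t.credit = 0; continue; } // no hoarding while idle
        t.credit += t.creditPerRound;      // enterprise plans could earn more per round
        const head = t.queue[0];
        if (t.credit >= head.cost) {
          t.queue.shift();
          t.credit -= head.cost;           // or reset to 0, as suggested above
          await head.handle();
        }
      }
    }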
Isn't this technically a form of token bucket?