One thought that comes up: Wouldn’t it be easier to have each customer in their own (logical) database? I mean, you don’t need transactions across different customers, right? So you’re essentially solving a harder problem than the one you’ve got.
Not sure Postgres (logical) databases would scale that well, but I don't see a reason in principle why they couldn't. Has anyone explored this further?
This seems to be an architecture Cloudflare is aiming to support with their SQLite service. One database per customer, each database located in the customer’s primary region.
I think there are quite a few people chasing similar ideas, like Azure's Durable Entities.
I've been calling it the Lots of Little Databases model vs the Globe Spanning Gorilla.
Like the Spanner paper points out, even if your distributed database semantically appears to be a single giant instance, in practice performance concerns mean developers avoid distributed joins, etc., because these can shuffle very large amounts of intermediate results across the network. So the reality ends up leaking through the illusion of a single giant machine, and people end up writing workarounds for distributed joins, like async materialization.
If we give up the single machine illusion we get a lot of simplification, at the cost of features devs were unlikely to use anyhow. I see having consistent distributed commit but without cross shard joins as a really interesting alternative.
And besides scalability I like the extra security rope of fine grained partitioning from the start.
I'll write a blog post along these lines if I get anything worthwhile done.
An advantage worth noting is that having actually separated databases means you physically can't perform these expensive operations, so a junior dev can't write incredibly inefficient code that would bring down your entire infra.
Also makes it a lot harder for devs to trip over some footgun and leak data across domains.
Bring down the infra, or stick you with a six-figure bill at the end of the month.
Interesting. Durable Entities strikes me as closer to Cloudflare's Durable Objects (both in name and in design as actors backed by persistent storage).
"Lots of Little Databases" reminded me of https://www.actordb.com/ which does lots of server-side sqlite instances, but the project now looks defunct.
That would be fantastic. Unfortunately it's not true. D1 doesn't support the one database per customer approach unless you have just a handful of customers that you can set up manually.
You have to create each database manually using wrangler or the website. Then you have to create a binding for each database in wrangler.toml so that the database becomes accessible as a variable in your Workers code. Then you have to change your Worker source code to do something with that variable. Then you redeploy.
The issue is that Workers cannot create or list databases. There's no API for it.
They have said they're working on some sort of solution (which they intend to have available by the time D1 exits beta) that will allow some sort of dynamic bindings specifically to address this use case.
In the meantime, though it's not a great solution, one can create and query databases through the REST API instead https://developers.cloudflare.com/api/operations/cloudflare-...
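For example, something along these lines from Python (endpoint paths and response fields follow the D1 REST docs linked above at the time of writing; treat them as illustrative, since the API is still evolving):

```python
# Hedged sketch: create a per-tenant D1 database and query it over the REST API.
# Paths/field names are per the docs linked above; verify against the current schema.
import requests

API = "https://api.cloudflare.com/client/v4"
ACCOUNT_ID = "<your-account-id>"
HEADERS = {"Authorization": "Bearer <api-token>"}

def create_tenant_db(tenant: str) -> str:
    """Create a D1 database for one tenant and return its id."""
    resp = requests.post(
        f"{API}/accounts/{ACCOUNT_ID}/d1/database",
        headers=HEADERS,
        json={"name": f"tenant-{tenant}"},
    )
    resp.raise_for_status()
    return resp.json()["result"]["uuid"]   # field name per the docs

def query_tenant_db(db_id: str, sql: str, params: list):
    """Run a parameterized query against one tenant's database."""
    resp = requests.post(
        f"{API}/accounts/{ACCOUNT_ID}/d1/database/{db_id}/query",
        headers=HEADERS,
        json={"sql": sql, "params": params},
    )
    resp.raise_for_status()
    return resp.json()["result"]

db_id = query_db = create_tenant_db("acme")
query_tenant_db(db_id, "CREATE TABLE IF NOT EXISTS projects (id TEXT PRIMARY KEY, name TEXT)", [])
```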
Yes, we've been doing that at my place basically since the start: each tenant is a schema in Postgres. On the one hand it works perfectly fine, i.e. your tables don't grow to 'infinity' just because you're adding more and more tenants. If there's a particular tenant with lots of data, only that tenant's indexes and tables grow huge and become slower, for that particular reason, etc. If a tenant leaves, you keep the schema around for some time so they can come back, and at some point you just drop the schema!
It does mean having to upgrade each schema individually, which makes things both easier and harder. Easier because the tables are smaller, so any schema change that requires, say, a table lock holds it for a shorter amount of time and won't affect more than the one tenant at a given time. It also means you can get into an inconsistent state, of course, where some of your tenants have all the latest DB upgrades while the migration failed on another subset.
At some point Postgres's internal tables become a bit of a "problem", as you want to run as many of these updates in parallel as you can for speed, which could lead to contention on Postgres' administrative side. You'll also still need to shard across multiple actual RDS instances, because you still have many tenants running against a single physical piece of hardware that will show its limitations if too many large or active tenants happen to be on the same shard.
And then you have the problem of keeping a record of which physical RDS instance (i.e. shard) the tenant is on. Your application code will need to look that up (and cache that info for some time ;)) and you have some choice there as well. I.e. do you shard those as well and juggle load balancing as basically a 1:1 mapping to shards or do you have your application layer connect to all database shards and handle any tenants? One is more complicated I would say while the other could run out of connections depending on how you need to scale the application layer and what kind of RDS instance you have.
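To make that concrete, the lookup-plus-routing part boils down to something like this (a stripped-down sketch with made-up table and column names, using psycopg2; real code would use a TTL cache and a connection pool):

```python
# Sketch: route each request to the tenant's shard, then scope it to the tenant's schema.
from functools import lru_cache
import psycopg2

CATALOG_DSN = "dbname=catalog host=catalog.internal"   # the "which shard is this tenant on" DB

@lru_cache(maxsize=100_000)   # "cache that info for some time" - use a TTL cache in practice
def shard_dsn_for(tenant_id: str) -> str:
    conn = psycopg2.connect(CATALOG_DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT shard_dsn FROM tenant_directory WHERE tenant_id = %s",
                (tenant_id,),
            )
            return cur.fetchone()[0]
    finally:
        conn.close()

def tenant_connection(tenant_id: str):
    """Connect to the tenant's shard and confine the session to its schema."""
    conn = psycopg2.connect(shard_dsn_for(tenant_id))
    with conn.cursor() as cur:
        # search_path means unqualified table names resolve inside this tenant's schema only
        cur.execute("SELECT set_config('search_path', %s, false)", (f"tenant_{tenant_id}",))
    return conn
```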
I recall a popular rails gem[1] once upon a time that provided multi-tenancy via postgres schemas.
As it turns out, even the company that initially developed the gem ended up ditching the approach due to some of the issues you outlined above.
Managing separate schemas feels like one of those insidious decisions that makes things simple _initially_ but gets you into a world of hurt when you need to scale. The company is now defunct, but they have an article where they discuss why they ditched the approach [2]; TL;DR, it's too difficult to maintain and scale.
[1] https://github.com/influitive/apartment#tenants-on-different... [2] https://web.archive.org/web/20201108191323/https://influitiv...
Let's address these one by one based on our experience (at least the part of the journey I've been here for; the implementation of the solution predates me, but I live with it).
Migrations
We have more than a thousand schemas per database shard. That is why I said you want things to run in parallel for speed, yes. However, we deploy way more than daily and it's not really an issue in that sense. Schema updates are not that frequent, but you do of course need to take into account that you have to make the schema update a separate PR, wait and check that it worked, and only then deploy the application code that makes use of it. Which honestly isn't much different from ensuring that your BE and FE changes are compatible or made in the right order, so that you don't get failed requests because an old FE happens to call a new BE node or vice versa :shrug:
Database resources
We seem to have m5.large. That has 2 virtual cores. Some r5.larges etc. Their r3.4xl has 16?! So not sure what kind of load pattern they have :shrug:
Client Memory Bloat
Yeah well, we don't have that, so not sure what to say :shrug: Definitely not a generalized reason to say "can't do, is too complicated for scaling".
Record Identification
I respectfully disagree. This is a great advantage, because it means that just because you get more and more tenants (some of which churn and you throw away their data anyway), your identifiers don't grow past limits as easily. In fact, in most cases the identifiers never run out of runway at all. They complain about "what if you need to join this data somewhere else, then you need to also add the tenantId". Yeah, so? We also have a data warehouse where we do just that. Not a problem at all. We also have services other than our main one, which use different database technologies where we use the tenant as part of the key (for loads that actually benefit from being in a NoSQL-type DB), and there we don't have sharding other than what the NoSQL database does by itself, so to speak, by dividing the keyspace.
That's it. End of article. Basically, we have none of these issues. They don't mention their actual scale. The only number they mention is the "hundreds of schemas". We have more than ten times that number per physical shard and have tens of thousands of tenants total. Again :shrug:
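For concreteness, the "run the migration in parallel across schemas and track which ones failed" part looks roughly like this (a simplified sketch, not our actual tooling; the table, column, and naming convention are made up):

```python
# Sketch: apply one DDL change to every tenant schema on a shard, in parallel,
# recording which schemas failed so they can be retried (the "inconsistent state" case).
from concurrent.futures import ThreadPoolExecutor
import psycopg2

SHARD_DSN = "dbname=shard1 host=shard1.internal"
DDL = "ALTER TABLE {schema}.projects ADD COLUMN IF NOT EXISTS archived_at timestamptz"

def tenant_schemas():
    conn = psycopg2.connect(SHARD_DSN)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT nspname FROM pg_namespace WHERE nspname LIKE 'tenant_%'")
            return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()

def migrate_schema(schema):
    conn = psycopg2.connect(SHARD_DSN)
    try:
        with conn, conn.cursor() as cur:      # commits on success, rolls back on error
            cur.execute(DDL.format(schema=schema))
        return schema, None
    except Exception as exc:                  # collect failures instead of aborting the run
        return schema, exc
    finally:
        conn.close()

# Keep parallelism modest: too much concurrent DDL contends on Postgres' catalog tables,
# which is exactly the administrative-side contention mentioned earlier.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(migrate_schema, tenant_schemas()))

failed = [(s, e) for s, e in results if e is not None]
print(f"{len(results) - len(failed)} schemas migrated, {len(failed)} to retry")
```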
That's still 2 orders of magnitude smaller than the scale of Figma—they would need to somehow manage millions of Postgres schemas. I don't think it's a realistic possibility
https://news.ycombinator.com/item?id=39711815 Shopify seems to have millions of customers. Granted, they use MySQL, and it might not be database schemas but something more MySQL-specific. In general, though ...
"We can attach a shop_id to all shop-owned tables and have it serve as a sharding key. Moving a shop from one shard to another involves selecting all records from all tables that have the desired shop_id and copying them to another MySQL shard. For this post, it’s helpful to think of each pod as a MySQL shard."
Sounds like this is just normal sharding. Completely different from having a totally separate schema like GP proposes (with that approach you don't use sharding keys at all).
This is the conclusion I came to when faced with the same quandary.
This is a very common approach and scales quite well. I worked for a company that had thousands of customers, each with their own schema. A single master database kept track of which customer was on which physical DB cluster, and this was globally replicated (EU, ANZ, NA).
Certainly needs a bunch of tooling, but worked well. Some apps were stateless and could connect to any physical cluster. Some others were sticky and only connected to a subset.
Similar architecture at my current company as well: we serve nearly a thousand customer instances across 4 physical clusters.
We do have some basic tools to provision new customers on the emptiest cluster, move customers from one cluster to another etc
Shopify has pretty much mastered this https://shopify.engineering/mysql-database-shard-balancing-t...
This works great until (1) your largest customer outgrows the largest available DB (happens sooner than you'd think for large companies) or (2) you do need transactions across different customers, say to facilitate some kind of sharing. Going all-in on the isolated tenant strategy means when you hit one of these cases it's a nightmare to unwind and rearchitect your entire DB layer.
I'd respond by saying (1) is more rare than you're asserting.
There is a huge long tail of companies with datasets that won't fit on a single machine but can be handled by a dozen or so, and where no single customer or service is near the limits of an individual node.
These customers are poorly served at the moment. Most of the easy to implement SaaS options for what they need would be hugely costly vs a small fleet of db servers they administer. Meanwhile, most of the open source options are cargo culting Google or Facebook style architecture, which is a huge burden for a small company. I mean do you really want to run K8S when you have 10 servers in total?
I think there's a lot of interesting stuff happening in this end of the market that's not trying to be a mini Google, like Fly.io.
As for (2), I think a middle ground is supporting cross-shard transactions but not joins. This works well enough for VoltDB et al.
It's rarely an issue with the number of bytes and more an issue with hot shards. Whatever shard Google is on (that is, Google the Figma customer) is surely going to be one _hell_ of a hot shard. They have more designers/engineers/PMs actively using Figma than most startups have _users total_. You don't need more than one really really hot customer for this to become more than a hypothetical problem.
When you start to think about it that way, suddenly you need to seriously consider your iops (if you're on RDS) or how much redundancy that physical machine's SSDs have (if you're running it on your own boxes).
Google still only has ~180k employees, and obviously not all of them use figma, and obviously not all of their figma users are performing actions simultaneously. I'd be surprised if it broke 10k QPS (would an org like Google even have 10k peak active user sessions? Seems doubtful). Human generated traffic tends to just not reach that large of scales unless you're trying to fit the entire planet on one instance.
RDS can be absurdly limited with IOPS, granted, but a modern laptop for example ought to be up to the task. Realistically you could probably be fine even on RDS but you might need to pay through the nose for extra IOPS.
Figma is more or less a desktop application that happens to run in a web browser.
If I use Photoshop to edit a .psd file I don’t think “man that psd file should really be stored in a single planet-sized database of all psd files in existence”. It’s just a file on my computer.
Figma requires a little bit more intermingling of data than Photoshop, it has multiuser support for one, so a pure local storage based approach wouldn’t work. But, at its heart it’s still based on the document model. When you open a Figma document it’s its own isolated little universe, the connections with resources outside the document are limited, and that matches user expectations.
Agreed - it's not like you'd ever have to do SELECT * FROM shapes WHERE type = 'circle'. Could they have stored each document as a file, stored references to the files in the database, opened the file in the backend when someone opened it in the frontend, and written it back when they stopped editing it?
A Figma customer's needs won't exceed an i3.metal...
Can you give an example of when a single customer has outgrown the largest available DB?
The problem - conceptually - is made much simpler this way; we make use of this at work.
However, you will still have shared-resource problems - a rogue query that destroys IOPS in one tenant ends up bringing down all tenants, etc. There are in theory databases that solve this as well, but my experience has been that at that point what you buy into is a bad version of resource sharing - i.e. what an operating system does - and you're better off using OS mechanisms.
In other words: yes, but you still have noisy neighbours, and may be better off running lots of small fully separated DBMSes than a big logically separated one
If tenants are on separate databases how would that be an issue?
Because database, in Postgres terms, doesn’t mean physical node. It’s more akin to a VM than anything else. The term for an installation of Postgres is database cluster, which can contain N databases, each of which can contain M schemas.
Thanks! Is there a good primer to this terminology that clarifies these terms on various popular database and cloud platforms?
It seems there is good potential for confusion unless we use the same terms consistently when discussing architecture and design across teams.
Even the term RDS (Relational Database Service) is sometimes said to be inaccurate since it is a "Relational Database SERVER as a Service"
A few related terms that cause confusion:
"Schema" could refer to a Database Schema or in some contexts, a single table's logical data structure. (Or a single data set's data structure -- or a standard for one, like JSON Schema)
Data Catalog products like "AWS Glue Data Catalog", which only store the metadata or schemas of the tables they crawl, refer to the entities they store as "Databases" and "Tables" (but no "Schemas"), and the documentation includes guides that talk about "creating a database"[1] and "creating a table"[2] in AWS Glue. There has to be a better way to refer to all these entities without using the word schema with so many meanings -- or calling both physical tables and their metadata "tables". Otherwise this is needlessly confusing and hard for beginners.
---
EDIT: even more madness. This StackOverflow discussion [3] has more examples of confusing usage:
[1]: https://docs.aws.amazon.com/glue/latest/dg/define-database.h...
[2]: https://docs.aws.amazon.com/glue/latest/dg/tables-described....
[3]: https://stackoverflow.com/questions/298739/what-is-the-diffe...
Not AFAIK. In MySQL, the only time “cluster” is used is to refer to NDB Cluster, which is a distributed DB product. Schema means a logical grouping of tables, the same (more or less) as Postgres.
As to schema itself, yes, it's heavily overloaded, and you just have to grok it from context. I can talk about a table's schema or a DB's schema, or, as you mentioned, JSON Schema. Although the latter is helped by simply not using JSON in a relational DB.
You must remember that SQL is an ancient language, and the relational model is even older. There are going to be oddities.
Separate logical databases within the same RDBMS, so sharing CPU, disks, RAM, etc.
I imagine it only gets you so far. What do you do about customers like Walmart or Oracle? Hundreds, if not thousands, of users all leaving hundreds of comments on Figma files every day. If you want good latency without giving up strong consistency (which the article says they want) you'll need to keep sharding.
A single db can handle that load easily
Do you mean an RDBMS running on a single, big-iron, bare-metal server?
They were running their whole business on one RDS instance 4 years ago. Do you think they now have one customer larger than all their customers combined 4 years ago?
The first sentence of the article seems to suggest it's possible?
I bet it gets you further than you imagine. It entirely depends on the backend services and what they touch, but in this scenario you would be deploying/scaling that service based on the customer's seat size. I suspect that even for large enterprise customers, the users actively touching Figma do not reach the thousands, but I am happy to be wrong.
After all, Stackoverflow is running off of a handful of machines.
This, seriously. The long-term maintenance, tribal knowledge & risks associated with this giant hack will be greater than anything they'd ever have expected. Inb4 global outage post-mortem & key-man dependency salaries.
There's virtually no excuse for not spinning up a pg pod (or two) for each tenant - heck, even a namespace with the whole stack.
Embed your 4-phase migrations directly in your releases / deployments, slap a py script on top to manage progressive rollouts, and you're done.
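Something like this, in spirit; the helm/kubectl commands and the inventory format are placeholders for whatever wrappers you already have:

```python
# Sketch of a progressive-rollout driver: read a yaml tenant inventory,
# order by criticality, deploy in waves, stop on the first unhealthy wave.
import subprocess, time, yaml

def deploy(tenant, version):
    # placeholder: one release per tenant, e.g. a helm upgrade
    subprocess.run(
        ["helm", "upgrade", f"app-{tenant}", "chart/", "--set", f"image.tag={version}"],
        check=True,
    )

def healthy(tenant):
    # placeholder health gate: wait for the tenant's rollout to converge
    return subprocess.run(
        ["kubectl", "rollout", "status", f"deploy/app-{tenant}", "--timeout=120s"]
    ).returncode == 0

with open("tenants.yaml") as f:
    tenants = yaml.safe_load(f)               # e.g. [{"name": "acme", "criticality": 3}, ...]

tenants.sort(key=lambda t: t["criticality"])  # least critical / least sensitive first

WAVE = 10
for i in range(0, len(tenants), WAVE):
    wave = tenants[i:i + WAVE]
    for t in wave:
        deploy(t["name"], "v1.2.3")
    if not all(healthy(t["name"]) for t in wave):
        raise SystemExit(f"wave starting at tenant {i} is unhealthy, halting rollout")
    time.sleep(300)                           # bake time before the next wave
```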
Discovery is automated, blast / loss radius is reduced to the smallest denominator, you can now monitor / pin / adjust the stack for each customer individually as necessary, sort the release ordering / schedule based on client criticality / sensitivity, you can now easily geolocate the deployment to the tenant's location, charge by resource usage, and much more.
And you can still query & roll up all of your databases at once for analytics with Trino/DBT, with nothing more than a yaml inventory.
No magic, no proprietary garbage.
Figma has millions of customers. The idea of having a Postgres pod for each one would be nearly impossible without completely overhauling their DB choice.
You are making a major conflation here. While they do have millions of users, they were last reported to have only ~60k tenants.
Decently sized EKS nodes can easily hold nearly 800 pods each (as documented); that'd make it 75 nodes. Each EKS cluster supports up to 13,500 nodes. Spread that across a few regions to improve your customer experience and you're looking at roughly 20 nodes per cluster. This is a nothingburger.
Besides, it's far from rocket science to co-locate tenant schemas on medium-sized pg instances, monitor tenant growth, and re-balance schemas as necessary. Tenants' contracts don't evolve overnight, and certainly don't grow by orders of magnitude week over week - a company using Figma has either 10 seats, 100, 1,000, or 10,000. It's easy to plan ahead for. And I would MUCH rather think about re-balancing a heavy-hitter customer's schema to another instance every now and then (which can be 100% automated too) than face a business-wide SPOF and have to hire L07+ DBAs to maintain a proprietary query parser / planner / router.
Hell, OVH does tenant-based deployments of Ceph clusters, with collocated/coscheduled SSD/HDD hardware and does hot-spot resolution. And running Ceph is significantly more demanding and admin+monitoring heavy.
Reported where? Does that include a monolithic "free tenant" that would be larger than thousands of their other tenants put together? Every Figma user has their own personal workspace along with the workspace of any organizations they may be invited to
Thousands of directories over thousands of installations? It’s not that far fetched.
Actually, Apple does this for iCloud! They use FoundationDB[1] to store billions of databases, one for each user (plus shared or global databases).
See: https://read.engineerscodex.com/p/how-apple-built-icloud-to-...
Discussed on HN at the time: https://news.ycombinator.com/item?id=39028672
[1]: https://github.com/apple/foundationdb https://en.wikipedia.org/wiki/FoundationDB
This is sort of true and sort of false. When you think of a "database", if you're thinking of a Postgres database, you're way off the reality of what "database" means here.
FoundationDB has a concept called "layers": essentially they have created a layer that looks like a separate database, on top of a layer that is really just separately encrypted groups of keys. They don't have billions of FoundationDB clusters or machines, and the infra level, i.e. the instances of the FoundationDB server software, is unaware of individual "databases".
A closer analogy would be like having billions of tables, but even that isn't accurate because in relational databases a table is usually a notably more static concept than data in a table. The closest analogy would be that each of the billions of users has a bunch of rows with a user-id field on them, and there's a proxy that filters everything such that you can view the table as if it only had one user's data in it.
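In FoundationDB terms (using the public Python bindings plus the directory layer; this shows the general pattern, not Apple's actual CloudKit implementation), it looks something like:

```python
# Sketch: each user gets what looks like a private keyspace (a directory prefix),
# but it is all one key-value cluster underneath.
import fdb
fdb.api_version(710)

db = fdb.open()

def user_space(user_id):
    # Allocates (or reopens) a short key prefix for this user; conceptually
    # this prefix is the user's "database".
    return fdb.directory.create_or_open(db, ("user_data", user_id))

@fdb.transactional
def put_record(tr, space, record_id, blob):
    tr[space.pack((record_id,))] = blob

@fdb.transactional
def get_record(tr, space, record_id):
    val = tr[space.pack((record_id,))]
    return bytes(val) if val.present() else None

alice = user_space("alice@example.com")
put_record(db, alice, "doc-1", b"...serialized record...")
print(get_record(db, alice, "doc-1"))
```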
To be clear, FoundationDB is awesome and Apple have done some really cool stuff with it, but it's less crazy/impressive than it sounds.
This sounds like a pretty standard multitenant datastore. Everything has a user/group Id on it, and a logical layer that locks a connection to a specific group.
FoundationDB is really just a set of structures and tools to build any type of database you want on top of a solid foundation.
"FoundationDB decouples its data storage technology from its data model. FoundationDB’s core ordered key-value storage technology can be efficiently adapted and remapped to a broad array of rich data models. Using indexing as an example, FoundationDB’s core provides no indexing and never will. Instead, a layer provides indexing by storing two kinds of key-values, one for the data and one for the index."
https://apple.github.io/foundationdb/layer-concept.html
Then there are existing standard layers, like the Record layer, which is "(very) roughly equivalent to a simple relational database", providing structured types, indexes, complex types, queries, etc.
https://github.com/FoundationDB/fdb-record-layer
Or one for documents, which speaks the MongoDB wire protocol
https://github.com/FoundationDB/fdb-document-layer
Yeah, the advantage or difference here is that these "layers" are a common design pattern with FoundationDB: several ship with FDB by default, and you're encouraged to build more. So the database certainly has better support for this than just adding a column for TenantID, but you're right that it's not too out there.
I have pondered about this for quite some time and came to the conclusion that it would make schema migrations more difficult to handle. I think Shopify is using an approach which is similar to what you are describing. The advantage is that you don't end up with hot shards because you can move around large customers independently.
In practice there isn't a big difference, they just colocate several customers according to their sharding key in the same logical database.
I remember Postgres table spaces being used to separate customers at a previous job - I can't remember how migrations were handled (pretty sure they were applied per table space) but I don't think it was a problem (at our scale anyway).
Seems like the limitation is the logical file system, which will probably work for most users.
I worked in a place that had a database for each tenant, and the schema migrations were a real pain. Every time, everything goes smoothly except for those few databases with an edge case that screws up the whole migration.
I worked at a place that did this with MySQL. Every tiny, trial account got their own database. Every huge, actual customer got their own database. Migrations were kinda painful. I would think carefully about doing it this way.
We've been doing this with 20k databases on MySQL for the last 10+ years. It solves more problems than it creates. Migrations are trickier, but you get sharding, data isolation, and easier backups that way.
I'm not saying it's always a bad idea, you just need to think about what you're doing. This was closer to 15 years ago now. We had to develop a bunch of our own tooling, and make our own modifications to frameworks that are now ancient history.
Ongoing product development with migrations, tweaking indexes etc. becomes really hard. Every small database tweak now has to be deployed over 1000s of databases.
If customers are not of similar size, and you have a lot of customers, managing all the different databases can be a big headache.
Having a more granular partition key makes it easier to shard the data more evenly.
Nile is a Serverless Postgres that virtualizes tenants/customers. It is specifically built for SaaS companies similar to Figma https://www.thenile.dev/. I am the CEO of Nile.
I'm curious, for anyone who has done this: what's the point of going all the way from one DB to one DB per customer? Why not just split the customers across 2 databases, then 3, etc.? It seems like the same level of system complexity, but you avoid the lots-of-databases scaling problem.
You probably don’t even need to be able to migrate people between shards… just put everyone on one db until you hit 50% utilization, then spin up a fresh db and put all new customers on that one. Repeat whenever you run out of databases under 50%.
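In toy form, that placement policy is just the following (utilization numbers and provisioning are stand-ins for your monitoring and infra calls):

```python
# Toy version of "fill the newest database to ~50%, then open a fresh one".
import itertools

_counter = itertools.count(2)
shards = {"shard-001": 0.37}        # shard id -> fraction of capacity used (from monitoring)

def utilization(shard):
    return shards[shard]

def provision_shard():
    new = f"shard-{next(_counter):03d}"   # in real life: spin up a new database instance
    shards[new] = 0.0
    return new

def place_new_customer():
    """Return the shard the next new customer should be created on."""
    newest = list(shards)[-1]
    if utilization(newest) >= 0.5:        # newest shard is "full": open a fresh one
        newest = provision_shard()
    return newest
```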
I believe Physalia [0] explores this concept at production scale quite well.
[0] https://blog.acolyer.org/2020/03/04/millions-of-tiny-databas...
Spanner's interleaved tables seem like a similar solution, i.e. you interleave customer data so it all ends up on the same set of hosts for performance, while still having the ability to create transactions across customers.
It sounds like they actually did something like this. Their shard key selection could be customer, project, folder or something in their data model at a reasonably high logical level in their hierarchy.
I've done something like this before, with each customer getting their own schema within a single Postgres instance.