For example, GitHub had 128 million public repositories in 2020. Even at 20 issues per repository, that would exceed the 32-bit serial range. Also, changing the column's type is expensive.
I expect the majority of those public repositories are forks of other repositories, and those forks only exist so someone could create pull requests against the main repository. As such, they won't ever have any issues, unless someone makes a mistake.
Beyond that, there are probably a lot of small, toy projects that have no issues at all, or at most a few. Quickly-abandoned projects will suffer the same fate.
I suspect that even though there are certainly some projects with hundreds or thousands of issues, the average across all 128M of those repos is likely pretty small, probably keeping things well under the 2B limit.
Having said that, I agree that using a 4-byte type (well, 31-bit, really) for that table is a ticking time bomb for some orgs, github.com included.
> Having said that, I agree that using a 4-byte type (well, 31-bit, really) for that table is a ticking time bomb for some orgs
A bomb defused in a migration that takes eleven seconds
The migration has to rewrite the whole table: bigint needs 8 bytes, so you have to make room for that.
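To make that concrete, here's a minimal sketch (with a hypothetical `issues` table) of the naive in-place change; this single statement forces Postgres to rewrite every row and index entry while holding an exclusive lock:

```sql
-- Naive approach: widens the 4-byte int to an 8-byte bigint by rewriting
-- the entire table (and its indexes) under an ACCESS EXCLUSIVE lock.
ALTER TABLE issues
    ALTER COLUMN id TYPE bigint;
```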
I have done several such primary key migrations on tables with 500M+ records; they took anywhere from 30 to 120 minutes depending on the number of columns and indexes. If you have foreign keys it can take even longer.
Edit: But there is another option, which is logical replication. Change the type on your logical replica, then switch over. This way the downtime can be reduced to minutes.
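Roughly, and assuming a hypothetical `issues` table plus placeholder connection details, the logical-replication route looks like this:

```sql
-- On the current primary: publish the table(s) whose keys need widening.
CREATE PUBLICATION pk_migration FOR TABLE issues;

-- On the replica: create the same table with id already declared bigint,
-- then subscribe. The initial copy plus streamed changes keep it in sync.
CREATE SUBSCRIPTION pk_migration_sub
    CONNECTION 'host=primary.example.com dbname=app user=replicator'
    PUBLICATION pk_migration;

-- Once the replica has caught up: briefly stop writes, let it drain,
-- and point the application at the replica as the new primary.
```

Keep in mind logical replication doesn't carry sequences over, so you have to set the sequence values on the new primary before the switch.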
This largely depends on the disk. I wouldn't expect that to take 30 minutes on a modern NVMe drive, but of course it depends on table size.
Large tables take hours, if not days. I attempted a test case on AWS using souped-up io2 volumes (the fastest, most expensive disks they have) and a beast of a DB server (r5.12xl I think), and it became abundantly clear that at a certain scale you won't be doing any kind of in-place table updates like that on the system. Especially if your allowed downtime is a one-hour maintenance window per week...
I did it on an r6.24xlarge RDS instance and the CPU wasn't doing much during the operation. I/O peaked at 40k IOPS on EBS with provisioned IOPS. I'm not sure a local disk would be any faster, but I already know that rewriting the table and creating the indexes are all single-threaded, so there isn't much you could gain.
Once I got the logical replication setup to work, I changed 20 tables on the replica and made the switch with 15 minutes of downtime. That saved me a lot of long nights.
You can get an idea of how long this could take by running pg_repack; it's basically doing the same thing: copying the data and recreating all the indexes.
Disk wasn't the limit in my case, index creation is single threaded.
Which is still very IO bound... Wonder what kinda IOPS you were observing? Also they make pretty fast CPUs these days :)
In practice the only option that I’ve seen work for very large teams and very large relational databases is online schema change tools like https://github.com/shayonj/pg-osc and https://github.com/github/gh-ost (the latter developed for GitHub’s monolith). It’s just too difficult to model what migrations will cause problems under load. Using a binlog/shadowtable approach for all migrations mostly obviates the problem.
Rails migrations really fail to be viable at scale. And it's not really because of the migrations in Rails; it's because changes in PostgreSQL get very, very expensive. Things have gotten better in PG 15, but it's still not quite there yet.
I've run into an integer PK almost being exhausted twice now. In one of these cases I used pg-osc to rewrite an 8 TB table on disk without downtime over a period of a week to fix the issue. In the other case it was about 6 TB within 2 days. It's doable, but the amount of planning and time it takes is non-trivial.
So far I haven't found any other viable solutions either. I keep hearing about logical replication, but I haven't seen a great real-world solution at scale so far.
You can also migrate it using logical replication.
In JavaScript land, Postgres bigints deserialize as strings. Is your application resilient to this? Are your downstream customers ready to handle that sort of schema change?
Running the db migration is the easy part.
Depends on the lib. The max safe integer size is around 9 quadrillion. In many domains you can safely deserialize serial bigints to plain numbers without ever worrying about hitting that limit.
2^53 - 1, to be precise. If your application involves assigning a unique identifier to every ant on the planet Earth (approx. 10^15 ≈ 2^50), you might need to think about this. Otherwise, I wouldn't worry about it.
Eleven seconds won't fix all your foreign keys, nor all the code written against the column that assumes an int type will accommodate the value.
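To illustrate with hypothetical table names: every referencing column has to be widened as well, and each of those is its own table rewrite (or its own online migration):

```sql
-- Widening the primary key alone isn't enough...
ALTER TABLE issues ALTER COLUMN id TYPE bigint;
-- ...every column that references it is still a 4-byte int.
ALTER TABLE comments     ALTER COLUMN issue_id TYPE bigint;
ALTER TABLE issue_events ALTER COLUMN issue_id TYPE bigint;
```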
It is still under the limit today with 362,107,148 repositories and 818,516,506 unique issues and pull requests:
https://play.clickhouse.com/play?user=play#U0VMRUNUIHVuaXEoc...
Elapsed: 12.618 sec, read 7.13 billion rows, 42.77 GB
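Something in this spirit would produce those numbers; I'm assuming the public github_events dataset and its repo_name/number/event_type columns here, so the actual query behind the link may differ:

```sql
SELECT
    uniq(repo_name)         AS repositories,
    uniq(repo_name, number) AS unique_issues_and_prs
FROM github_events
WHERE event_type IN ('IssuesEvent', 'PullRequestEvent');
```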
This is too long; it seems the ORDER BY is not set up correctly for the table.
Also, this is not low cardinality: 7,133,122,498 = 7.1B. Don't use LowCardinality for such columns!
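For anyone unfamiliar: LowCardinality is ClickHouse's dictionary-encoding wrapper, and it only pays off when a column has relatively few distinct values. A rough sketch of where it does and doesn't belong (hypothetical schema):

```sql
CREATE TABLE events_example
(
    event_type LowCardinality(String),  -- a few dozen distinct values: good fit
    repo_name  String,                  -- hundreds of millions of distinct values: leave it plain
    number     UInt32
)
ENGINE = MergeTree
ORDER BY (event_type, repo_name, number);
```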
I'm guessing this won't include issues & PRs from private repos, which could be substantial.
That query took a long time
Do we know for sure whether GitLab's cloud offering uses a multi-tenant database or a DB per user/customer/org? In my experience, products that offer both a self-hosted and a cloud product tend to prefer a database per customer, as this greatly simplifies the shared parts of the codebase, which can use the same queries regardless of the hosting type.
If they use a DB per customer, then no one will ever approach those usage limits, and if they do, they would be better suited to a self-hosted solution.
Unless something has substantially changed since I last checked, gitlab.com is essentially self-hosted GitLab Ultimate with a few feature flags to enable some marginally different behaviour. That is, it uses one multi-tenant DB for the whole platform.
Not according to [1] where the author said
1. https://yorickpeterse.com/articles/what-it-was-like-working-...
I've toyed with various SaaS designs, and a multi-tenanted database always comes to the forefront of my mind. It seems to simplify the architecture a lot.
I'm convinced that GitHub's decision to move away from Rails was partly influenced by a significant flaw in ActiveRecord: its lack of support for composite primary keys. The need for something as basic as PRIMARY KEY(repo_id, issue_id) becomes unnecessarily complex within ActiveRecord, forcing developers to use workarounds that involve a unique key alongside a singular primary key column to meet ActiveRecord's requirements—a less than ideal solution.
Moreover, the use of UUIDs as primary keys, while seemingly a workaround, introduces its own set of problems. Despite adopting UUIDs, the necessity for a unique constraint on the (repo_id, issue_id) pair persists to ensure data integrity, but this significantly increases the database size, leading to substantial overhead. This is a major trade-off with potential repercussions on your application's performance and scalability.
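To make the trade-off concrete, here's a rough sketch (hypothetical schema) of the natural composite key versus the ActiveRecord-friendly workaround:

```sql
CREATE TABLE repos (
    id   bigserial PRIMARY KEY,
    name text NOT NULL
);

-- What the data model wants: the natural composite key.
CREATE TABLE issues (
    repo_id  bigint NOT NULL REFERENCES repos (id),
    issue_id bigint NOT NULL,            -- per-repo issue number
    title    text   NOT NULL,
    PRIMARY KEY (repo_id, issue_id)
);

-- The ActiveRecord-friendly workaround (shown as a second table for
-- illustration): a surrogate single-column PK plus a unique constraint,
-- i.e. two indexes where one would do.
CREATE TABLE issues_surrogate (
    id       bigserial PRIMARY KEY,
    repo_id  bigint NOT NULL REFERENCES repos (id),
    issue_id bigint NOT NULL,
    title    text   NOT NULL,
    UNIQUE (repo_id, issue_id)
);
```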
This brings us to a broader architectural concern with Ruby on Rails. Despite its appeal for rapid development cycles, Rails' application-level enforcement of the Model-View-Controller (MVC) pattern, where there is a singular model layer, a singular controller layer, and a singular view layer, is fundamentally flawed. This monolithic approach to MVC will inevitably lead to scalability and maintainability issues as the application grows. The MVC pattern would be more effectively applied within modular or component-based architectures, allowing for better separation of concerns and flexibility. The inherent limitations of Rails, especially in terms of its rigid MVC architecture and database management constraints, are significant barriers for any project beyond the simplest MVPs, and these are critical factors to consider before choosing Rails for more complex applications.
I can't really comment on GitHub, but Rails supports composite primary keys as of Rails 7.1, the latest released version [1].
About modularity, there are projects like Mongoid which can completely replace ActiveRecord. And there are plugins for the view layer, like "jbuilder" and "haml", and we can bypass the view layer completely by generating/sending data inside controller actions. But fair, I don't know if we can completely replace the view and controller layers.
I know I'm missing your larger point about architecture! I don't have so much to say, but I agree I've definitely worked on some hard-to-maintain systems. I wonder if that's an inevitability of Rails or an inevitability of software systems—though I'm sure there are exceptional codebases out there somewhere!
[1] https://guides.rubyonrails.org/7_1_release_notes.html#compos...
Do you have any sources on GitHub moving away from Rails? This is the first I've heard of it, and my google-fu has returned zero results on this. Just last year they had a blog post about Building GitHub with Ruby and Rails[0], so your remark caught me off guard.
[0]: https://github.blog/2023-04-06-building-github-with-ruby-and...
Perhaps too late, but Rails 7.1[1] introduced composite primary key support, and there's been a third-party gem[2] offering the functionality for earlier versions of ActiveRecord.
[1] https://guides.rubyonrails.org/7_1_release_notes.html#compos...
[2] https://github.com/composite-primary-keys/composite_primary_...
Typo?
Migrating primary keys from int to bigint is feasible. It requires some preparation and custom code, but it can be done with zero downtime.
I'm managing a big migration mostly following this recipe, with a few tweaks: http://zemanta.github.io/2021/08/25/column-migration-from-in...
FKs, indexes and constraints in general make the process more difficult, but it's still possible. The data migration took some hours in my case, but there's no need for it to be fast.
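For anyone curious, the general shape of that recipe, heavily simplified with a hypothetical `issues` table and skipping the sequence/FK handling the article goes into, is roughly:

```sql
-- 1. Add the new column alongside the old one (cheap, no table rewrite).
ALTER TABLE issues ADD COLUMN new_id bigint;

-- 2. Keep new and updated rows in sync via a trigger.
CREATE OR REPLACE FUNCTION issues_copy_id() RETURNS trigger AS $$
BEGIN
    NEW.new_id := NEW.id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER issues_copy_id_trg
    BEFORE INSERT OR UPDATE ON issues
    FOR EACH ROW EXECUTE FUNCTION issues_copy_id();

-- 3. Backfill existing rows in small batches from a script, e.g.:
UPDATE issues SET new_id = id
WHERE id BETWEEN 1 AND 100000 AND new_id IS NULL;

-- 4. Build the replacement index without blocking writes, then swap
--    everything in one short transaction. (The SET NOT NULL is done the
--    slow way here; a NOT VALID check constraint avoids the full scan.)
CREATE UNIQUE INDEX CONCURRENTLY issues_new_id_idx ON issues (new_id);

BEGIN;
ALTER TABLE issues ALTER COLUMN new_id SET NOT NULL;
ALTER TABLE issues DROP CONSTRAINT issues_pkey;
ALTER TABLE issues ADD CONSTRAINT issues_pkey
    PRIMARY KEY USING INDEX issues_new_id_idx;
DROP TRIGGER issues_copy_id_trg ON issues;
ALTER TABLE issues RENAME COLUMN id TO old_id;
ALTER TABLE issues RENAME COLUMN new_id TO id;
COMMIT;
```

The sequence default and every referencing foreign key column need the same treatment, which is where most of the hours go.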
AFAIK GitLab has tooling to run such tasks after an upgrade, so this works for any installation as part of a version upgrade.
Being two orders of magnitude away from running out of ids is too close for comfort anyway.