You think your surrogate key will save you? It will not. The world has an external reality that needs to be reflected in your database. If the unique identifier for your object — VIN, CUSIP, whatever — changes, the world will henceforth refer to it by both, and you will need to track both. Adding a synthetic key only means you have to track all three. Plus you have to generate a meaningless number, which is actually a choke point in your data throughput.
The natural key forces you to think about what makes the row unique. What identifies it. Sometimes, it makes you go back to the SME and ask them what they mean. Sometimes it makes you reconsider time: it’s unique now, but does it change over time, and does the database need to capture that? In short, what are the boundaries of the Closed World Assumption? You need to know that too, to answer any "not exists" question.
To use the professor's car example, we actually do not know the database design. It could well be that the original identifier remained the primary key, and the "new id" is entered as an alias. The ID is unique in the Car table, identifying the vehicle, and is non-unique in the CarAlias table, where the aliases themselves are unique.
Oh, you say, but what if the bad old ID gets reused? Good question. Better question: how will the surrogate key protect you? It will not. The reused ID will be used to query the system. Without some distinguishing feature, perhaps date, it will serve up duplicates. The problem has to be handled, and the surrogate key is no defense.
Model your data on the real world. Do not depend on spherical horses.
This is bad advice.
Say I have a user table, and the email is unique and required, and we don’t let users update their email, and we don’t have user deletion. If I’m going natural PK, I make email the primary key.
But … then we add the ability for users to update their email. But it should still be the same user! This is trivial if we have a surrogate primary key, a nightmare if we made email the natural primary key.
Or building on that example, maybe at first we always require an email from our users. But later we also allow phone auth, and you just need an email OR a phone number. And later we add user name auth, SSO, etc. Again, all good with surrogate primary keys, a nightmare with natural primary keys.
There are countless examples like this. You brought up cars; same thing with licence plates, for example. Or even Social Security Numbers/Social Insurance Numbers - in Canada SINs are generally permanent, but temporary residents can have their SIN change if they later become permanent residents, yet they're still the same person.
You want your entities to have stable identity, even if things you at one time thought gave them identity change. Surrogate primary keys do that, natural primary keys do not. Don’t use natural primary keys, use surrogate primary keys with unique constraints/indexes.
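A minimal sketch of what I mean (Postgres-flavored, hypothetical table and column names):

  CREATE TABLE users (
    id    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email TEXT UNIQUE,  -- nullable once phone auth arrives; uniqueness still enforced
    phone TEXT UNIQUE   -- in Postgres, multiple NULLs don't collide on a unique constraint
  );

  -- The email change that was a nightmare with a natural PK is one row:
  UPDATE users SET email = 'new@example.com' WHERE id = 42;

Every foreign key in the system keeps pointing at id, untouched.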
I challenge you to come up with a single plausible example where you’re screwing yourself by choosing surrogate PK + unique constraints/indexes. Meanwhile there are endless examples where you’re screwing yourself by choosing natural PK.
You require three fields (or four): email at registration, a date for that entry (together these create a natural key), and current email (this one not part of the key and editable).
We're almost all the way to a Tag URI[0], so you could combine it with the user's name or username or any other identifier that fits the spec[1] (you could even use the website's own name) and you have a (definitely two thirds, probably 100%) natural key.
It's stable over time and unique, easy to mint, and has a standard behind it. The user also gets to change their contact details without any problem related to the key.
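For example (hypothetical values), the minted key could look like:

  tag:mysocialmedia.com,2024-06-05:email@example.com

following RFC 4151's structure of authority name, comma, date, colon, specific part.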
[0] https://taguri.org/
[1] http://www.faqs.org/rfcs/rfc4151.html
Except you're encoding PII in the ID, which makes them plainly visible to people who should not have access to user data, and hard or impossible to change. Sure, I could e.g. change my e-mail and the contact data would be updated, but you still have the old e-mail associated with my account via ID. I'm not sure this would fly under GDPR.
Erm, don't show the ID to people who don't need it.
Aside from that, it's not a violation of GDPR to keep personal information (that they consented to you having) in order to process business for that person. Using an email address as a unique identifier is not a violation, using it to spam them would be. If they're willing to give you their current email why not an old one?
How do you communicate with other people in your company about a customer without sending around PII if the customer's ID is PII?
Maybe we could create a field that uniquely identifies the customer that isn't PII. Then that could be used to uniquely identify a customer in places where we don't want to expose their PII. But then... why not just use this unique ID as the key?
Do you often send the auto-incremented int (which would be the default substitute for this) when communicating with others? Then why would you send this?
It's so strange an argument. Right now you have my username but not my email address, yet you can still query the website database and get certain data that you're allowed to see. There are so many ways to query a particular user's data, and they all depend on what you're trying to do. Needing the specific key would mean you should have access to it anyway, and it could be granted on a per-case basis.
It's not an int, but yes, we have a unique synthetic identifier that serves as the database PK and as a means of communicating about a customer in insecure channels without exposing PII. "Customer ID ### is having an issue with such-and-such."
To turn your second part back around: why a natural key? What is the function of minting a natural key if humans are meant to use something else?
Because non-natural keys are unnecessary in the presence of a natural key, and unnecessary things bring in complexity.
Then you need access to the customer's ID, but the devil here is in the detail you didn't add, the such-and-such.
Use secure channels…
None of the things you've presented are actually "natural" keys, they are pieces of information that you've made assumptions about to shoehorn them into being usable as a "natural key".
No channel is perfectly secure. As channels become more secure, they become harder to use and add complexity. The more places you store customer data, the more risk you create. The attempt to force semantic data to serve as "natural key" has now added risk and complexity to your entire communication infrastructure.
When it comes to PII at my company, secure channels means "encrypted email only". No Slack, no Jira, no chat in video calls.
That's just not feasible for 100% of communications.
Frequently yes. It is extremely common to communicate about specific records using the ID for that record. The fact that this sort of behavior is extremely common is pretty clearly indicated by the question itself.
A responsible organization at scale will limit and log access to customer data. I should be able to determine if two people are talking about the same customer record without needing access to that record's PII.
It is much better to have an artificial key that is linked to this data. There is no upside to the natural key and many, many downsides.
But then if you want to expose a detail page for that user, the id identifying that page has to include all this potentially personal information about them?
e.g. instead of mysocialmedia.com/users/2374927
you would be showing
mysocialmedia.com/users/email@example.com-2024-06-05-mysocialmedia.com
Thereby exposing a lot of information that you may not have wanted to expose.
You don’t have to use the PK as the URL slug. Even if you want to route that way, you can have an internal ID and external ID. This is one way to use something random like a UUIDv4 for display without incurring the costs (at least, some of them) of having it as a PK.
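A sketch of that internal/external split (hypothetical names, Postgres-flavored; gen_random_uuid() is built in as of Postgres 13):

  CREATE TABLE users (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- internal, used by all FKs
    external_id UUID NOT NULL UNIQUE DEFAULT gen_random_uuid(),   -- the only id URLs ever see
    email       TEXT NOT NULL UNIQUE
  );

  -- Resolve once at the routing boundary, then use the internal id everywhere:
  SELECT id FROM users WHERE external_id = $1;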
And then if you want to list other entities belonging to that user, you have to map between the external id and the internal foreign-key relationships every time?
Also, if you are doing exception logging, ids/primary keys have higher odds of being written out to logs, including your own logs and external platforms.
It feels like having a primary key set up like this will just complicate everything unnecessarily in the future, including many edge cases that you don't foresee.
Just have the main ID not have any meaning.
It shouldn't contain information about the date, it shouldn't be auto increment, it should really be just random.
The solution I outlined is the one GitLab and PlanetScale both use internally, so it has been tested at scale and works well, for both Postgres (the former) and MySQL (the latter).
That’s a great way to tank performance. You want your PK to be k-sortable.
If we're talking about relational database engines, that's what they do, relate things. One join statement is much the same as another.
ON UPDATE CASCADE is not the nightmare that you are making it out to be. (My impression from the article is that this is a single SQL database being discussed.)
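For the record, the sketch (hypothetical schema and sample VINs):

  CREATE TABLE cars (
    vin CHAR(17) PRIMARY KEY
  );

  CREATE TABLE service_records (
    vin  CHAR(17) NOT NULL REFERENCES cars (vin) ON UPDATE CASCADE,
    note TEXT
  );

  -- Correcting the chassis number fixes every referencing row in one statement:
  UPDATE cars SET vin = '1HGCM82633A004353' WHERE vin = '1HGCM82633A004352';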
Even if it's a single database initially, it's bad to assume that it will be so forever and that you are not going to use third-party providers in the future.
How well does ON UPDATE CASCADE work if there's millions of existing relations to that entity?
YAGNI for 99% of projects and databases. When you get to global sharded nosql etc. you need to use UUIDs for everything, and incrementing IDs fall over too.
I'm using UUIDs by default for everything. Main point being that I don't have to worry about future restrictions.
And incrementing IDs are also problematic, yes, since they embed business information within them.
And I do think that I need it for much more than 1% of projects and DBs.
You’ll have to worry about performance tanking instead. If you’re using UUIDv7 then less so, but it’s still (at best) 16 bytes, which is double that of even a BIGINT.
Anyone who says UUIDs aren’t a problem hasn’t dealt with them at scale (or doesn’t know what they’re looking at, and just upsizes the hardware).
Most databases with a UUID type store them as 128-bit integers, typically the same as a BIGINT. It's not like 378562875682765 is the bit representation of a bigint either. And if you're not using uuidv7 or some other kind of cluster-friendly id, you'd best be using a hash index, and if you're doing neither, you probably don't care about their size or performance anyway. You don't pick UUIDs blindly, but on balance, they solve a lot more problems than they cause.
Postgres’ UUID type is 16 bytes. MySQL can store them as BINARY(16) once encoded. Conversely, a BIGINT for either is 8 bytes. Not sure about SQL Server or Oracle.
IME, this is precisely the problem – devs choose them blindly, because then you don’t have to think about proper modeling, you can arbitrarily create a key in your app and be nearly guaranteed of its uniqueness, etc.
This is a myth. Planetscale [0] uses integers. They are assuredly at scale.
As for auto-incrementing, there are plenty of solutions for distributed systems.
[0]: https://github.com/planetscale/discussion/discussions/366
It's never a single database in the real world. As soon as you integrate something or have an API to something the keys are out there. Unless you add a translation layer, but then you could just as well use surrogate keys directly.
Using a surrogate UUID for communicating with the outside world is often very useful.
This is true for an internal PK that's an auto-inc id as well as for natural keys, though.
Using a natural PK -inside- your own database can still be a lot more pleasant to work with, even if you don't let it escape.
Until you need to do anything like described above. The advantage of artificial keys is that they have no semantic content. Anything with semantic content carries the risk that the role that semantic content plays in your system can change and cause problems. Having a non-semantic identifier protects you from that.
This is not to say that you should never use a semantic identifier as a key. However, you should always have a non-semantic artificial key as the identifier and use the semantic identifiers only when necessary.
Sure, if all data is in a single DB. But in the real world you’ve generally got some/all of:
- 1 or more data warehouses
- Other services storing said data (e.g. the user id will live in many databases in a service oriented architecture)
- External API integrators who have their own data stores totally out of your control that also have copies of parts of your data
- Job queues. It’s common to have jobs scheduled for the future on some other system (Redis, etc.) that say “do X for user with id Y in Z days”. If the “id” changes these fail
- Caches. Say I cache data by user email instead of a surrogate key, user changes their email, and another user signs up with the old email. Cache hits for the wrong user!
- etc.
Changing the primary key becomes an absolute nightmare project once data lives in multiple places like this, and in my experience it mostly does.
Having truly stable identity for your entities just solves so many future problems, and it’s SO easy to do. In almost all cases, natural PKs are really all downside, virtually zero upside, except slightly less storage.
I was going to say: this is a perfect use case for a cascading FK.
So I come from academia, but generally if you use a natural key as PK in a foreign key constraint it may be possible to express additional consistency criteria as CHECK-constraints in the referencing table.
So this is a bad example, but say you have Name and Birthdate as your PK, and you have a second table where you have certain special offers sold to your customers and there is this special offer just for Virgos and Pisces, you could enforce that the birth date matches this special offer. Some modern systems also technically allow FK on alternate keys, so you could still do it that way, but database theory often ignores that.
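A sketch of that contrived example in SQL (cusp dates approximate, hypothetical names):

  CREATE TABLE customers (
    name      VARCHAR(100),
    birthdate DATE,
    PRIMARY KEY (name, birthdate)
  );

  CREATE TABLE zodiac_offers (
    name      VARCHAR(100),
    birthdate DATE,
    offer     VARCHAR(100),
    FOREIGN KEY (name, birthdate) REFERENCES customers (name, birthdate),
    -- Only possible because the natural key carries birthdate into this table:
    CHECK (
         (EXTRACT(MONTH FROM birthdate) = 2 AND EXTRACT(DAY FROM birthdate) >= 19)  -- Pisces
      OR (EXTRACT(MONTH FROM birthdate) = 3 AND EXTRACT(DAY FROM birthdate) <= 20)
      OR (EXTRACT(MONTH FROM birthdate) = 8 AND EXTRACT(DAY FROM birthdate) >= 23)  -- Virgo
      OR (EXTRACT(MONTH FROM birthdate) = 9 AND EXTRACT(DAY FROM birthdate) <= 22)
    )
  );

With a surrogate FK, birthdate wouldn't exist in the referencing table and the CHECK would have nothing to check.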
But second, while I agree that surrogate keys are often a good idea, I find your argument, that you must design for every conceivable change, not convincing.
Another example is where you use a service that provides you with a stable id. It makes little sense to add a surrogate id and a fk on that surrogate id. It violates data quality and integrity just for a hypothetical situation.
Data integrity/quality matters. Adding friction to prevent accidents also matters. I don't want something accidentally and trivially updating a field that's used to reference thing externally.
Something about the nat key is about to change? Fine, we can write migrations. Even if it affects millions of rows, it's not a big deal.
I understand people have been burnt by bad design decisions involving nat keys, but they're not some devil's key everyone here dogmatically makes them out to be. You can mess up using anything.
I think there's the important point. Is your key actually natural or is it someone else's surrogate key anyway? Going back to the vehicle identification number: that's already a surrogate key. You just did not assign it yourself.
A VIN is not a surrogate key. A surrogate key must, by definition, have no semantic meaning, and not be composed of parts from multiple domains (among other requirements).
A VIN encodes the following:
* Country of origin
* Manufacturer
* Vehicle type
* Vehicle model
* Engine type
* Transmission type
* Model year
* Manufacturing plant
* Serial number
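As an illustration only (positions per the ISO 3779 layout, assuming a 17-character vin column):

  SELECT
    substring(vin FROM 1  FOR 3) AS wmi,          -- region, country, manufacturer
    substring(vin FROM 4  FOR 5) AS descriptor,   -- model, body, engine, transmission
    substring(vin FROM 9  FOR 1) AS check_digit,
    substring(vin FROM 10 FOR 1) AS model_year,
    substring(vin FROM 11 FOR 1) AS plant,
    substring(vin FROM 12 FOR 6) AS serial_number
  FROM cars;

A true surrogate key has no internal structure to decode like this.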
I would still almost always use an internal artificial key on top of the external id. If you want to enforce data integrity, you can still enforce uniqueness on the external id. "Stability" of an external identifier is almost always an assumption and one that I've seen fail enough times to want that internal id by default.
Why does anything need to be a primary key anywhere in order to enforce some constraint? At least in the ORMs I know, I can mark, for example, any group of attributes as unique. Other constraints can be implemented in some general method that is called when persisting to the actual database. Even with no ORM, you can write a wrapper around your persisting procedure.
It doesn’t, you’re right. However, indexes aren’t free, so if your data is such that a natural PK (composite or otherwise) makes sense, you’ll save RAM. Also, for clustering RDBMS like MySQL, data is stored around the PK, so you can get some locality boosts depending on query patterns.
As far as I can tell, all modern systems allow it, as it is part of the SQL standard that foreign keys can be either primary keys or unique indexes. Here's a brief quotation from a copy of ISO/IEC 9075-2:1999 (not the latest version) that I randomly found online:
So it mentions unique constraints first. Then afterward it says:
If I'm reading this right, it means that in the base case, where you specify the column to reference, it can be any unique constraint, where a primary key is just another possible unique constraint (as all primary keys are by definition unique). And only if you don't specify the fields to reference does it then fall back to the primary key instead of a named unique constraint.
I'm not disagreeing with you entirely - it's true that often there's an assumption in database theory that primary keys are natural and foreign keys are primary keys. But this isn't a hard requirement in practice or in theory, and it partly depends on the foreign key's purpose, why you need it in the first place. This StackOverflow answer also explains it well: https://softwareengineering.stackexchange.com/a/254566
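Concretely, standard SQL lets a foreign key target any unique column set, e.g. (hypothetical tables):

  CREATE TABLE products (
    id  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    sku TEXT NOT NULL UNIQUE
  );

  -- This FK targets the UNIQUE constraint on sku, not the primary key:
  CREATE TABLE order_lines (
    order_id BIGINT NOT NULL,
    sku      TEXT NOT NULL REFERENCES products (sku)
  );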
I should add that there is also a set of database design wisdom that suggests you should never use database constraints such as foreign keys, only app/api constraints, but that's a whole different tangent.
That’s less a DB design thought and more of a “devs with little formal training in RDBMS who only want to use it as a dumb store” thought.
Use the DB to its strengths. CHECK constraints add an infinitesimal amount of overhead for writes, and guarantee that there will never be invalid data written. A bad code change could allow, however briefly, for that to occur.
"for every conceivable change".
That is not what he is arguing at all. He is showing that there are very many highly plausible changes that are problematic with natural keys. And he is totally correct about that. Frankly, the fact that a post arguing for natural keys made it to the top of an HN comment thread is extremely weird. The original article is correct that natural keys are bad.
At this point, you should verify the new email. At least until it is verified, you must track the old email. That's when you realize you can just introduce a synthetic key, and you're fine.
Let's say you have a duplicate customer entry and the customer demands their accounts be merged. Now you can't identify the user by their key alone, since by definition they can't be the same (yet).
Except for having to update all foreign references. Some of those may be external, further complicating issues.
Emails are often among the worst keys because they are not terribly stable and they are reusable often enough to burn you.
Also, are emails case-sensitive or not? In some systems (that you don't control, mind you) they are, and in others they are not...
Per RFC5321, the local part (before @) _may_ be case-sensitive, but in practice, it almost never is, and relying on case sensitivity is a recipe for disaster.
The domain must always be case-insensitive.
My point being, if you're using email as a key you "have to" treat it as case-sensitive even though for most systems it's not. And yes, I agree it will be a recipe for disaster.
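In Postgres, for instance, one defensive pattern is to store the address as entered but enforce uniqueness case-insensitively (sketch, assuming a users.email column; the citext extension is the other common route):

  CREATE UNIQUE INDEX users_email_ci ON users (lower(email));

  -- Lookups then normalize the same way:
  SELECT * FROM users WHERE lower(email) = lower($1);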
If I’m going natural PK, I make email the primary key.
Welcome to the Mr. Cooper mortgage provider website.
Your logon is your email and you can't change it.
If you used your cable provider email you're stuck with them for the life of your 30 year mortgage.
Oh, also, we've been breached and your information is available for purchase on the dark web. Fun!
Funny you should say that...
I've been very slooowly degoogling myself, and that includes changing all logins that have a gmail address to a different non google email.
I'd say only like 1/3 of the sites I made logins for have the option of changing the email.
My first job in the late 2000's was at a small university with a home-grown ERP system originally written in the 80s (Informix-4GL). Student records, employee records, financials, asset tracking - everything. It used natural compound keys.
Even worse than the verbose, repetitive, and error-prone conditions/joins was the few times when something big in the schema changed, requiring a new column be added to the compound key. We'd have to trawl through the codebase and add the new column to every query condition/join that used the compound key. It sucked.
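For anyone who hasn't lived it, here's the shape of the pain (hypothetical columns):

  -- Compound natural key: every join repeats every part, and grows with the schema.
  SELECT g.grade
  FROM enrollment e
  JOIN grades g
    ON  g.student_no = e.student_no
    AND g.term_year  = e.term_year
    AND g.term_code  = e.term_code
    AND g.course_no  = e.course_no;

  -- Surrogate key: one predicate, stable across schema changes.
  SELECT g.grade
  FROM enrollment e
  JOIN grades g ON g.enrollment_id = e.id;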
Another benefit of having stable identities / surrogate primary keys is that any relations (FKs) will be much simpler.
Sure, like the poster you replied to is pointing out, you _can_ use natural keys, and then also rely on dates or other parts of the data - but creating a relation for that can end up being extremely cumbersome.
- Indexes generally become larger
- Relationships become harder to define and maintain
- Harder for other developers to get up to speed on a project
The surrogate key uniquely identifies a row in your database, which is an entity just as real and significant as the car or the employee or what-have-you. Don't confuse the two!
I agree with you that having a surrogate key isn't going to save you from the reasons why natural keys can be difficult. The complexity has to go somewhere. But not having a unique identifier for each row is going to make things extra difficult.
The main thing is that the synthetic key should never leave the database and never be displayed in the app. If you want another key that represents the human-oriented key, do it, but it should be a separate field (an indexed field, even), one without the monotonic, sequential properties inherent to synthetic database identifiers.
You want to change that human key? Sure. You want to complain that the keys are not sequential? Sure. You want to actually make them weird strings that are harmful to my brain? Why not? You want to update primary keys in the database? No. Absolutely not.
I assume you hint at the security aspect of monotonic keys?
I've found this issue a bit overblown. It's basically security by obscurity, which is a nice bonus, but not something your security model can be based on.
I mean, it is a good practice to expose some kind of non-sequential key (e.g. UUIDv7), but it doesn't seem to me like a dealbreaker.
Security by obscurity makes it more difficult for bad actors, as an additional layer that they need to break. I don't see a reason why that's a bad thing, and it doesn't take a lot of effort to implement in this case.
I have been in a startup where competitors used our sequential keys to scrape a list of customers. Lesson learned the hard way with actual business consequences!
Sequential keys also leak information (German tank problem)
Your competitors can estimate number of customers, your growth rate and other stats that you often don't want them to know.
https://en.m.wikipedia.org/wiki/German_tank_problem
If your system allows customers to see each other (or worse: unauthenticated users to see customers) in this fashion in the first place then whether you're using a sequential integer v. a random UUID is the least of your problems.
The 'customers' could be free tier users - a social media type system where everyone has a public profile - intended for the public - would still be scrapable by /profile/1, profile/2, etc. Doesn't necessarily require 'authentication' for the exposing of sequential integers to have a bad outcome.
You're right. The urls were public to be shared (think of marketing material / ecommerce), so there was not a security incident.
But it did give our competitor free highly qualified leads that they could use to poach customers. This product was new to our customers, and we had spent a lot of time selling and convincing them that it was useful.
I'm not saying it's a bad practice, the opposite actually.
True, if you start your application from scratch. Like if I started designing a new app today, I'd just choose the UUIDv7 for the primary key.
It's not an easy thing to add into an existing application, though. I see applications leaking their internal IDs all the time, but usually it's not worth the effort to fix that, because it's a comparatively minor problem.
I'm sorry that I didn't make that more clear. I saw that you mentioned it as a best practice and are aware of the advantages. It's just that there are so many others that don't have the balanced view as you seem to have.
I have been involved in many discussions at my work place where "security by obscurity" is used as a way to shut down discussions. They changed their minds about sequential keys after the incident I mentioned, but it still has the power to "win" other discussions. Sure, we need to have rate limiting on ip-addresses, auth and other mechanisms, but they are not perfect and bugs happen all the time. An "unguessable" id is an additional security layer
I agree, but there are ways to reduce the attack surface. You could add an extra "public id" field that can be used for lookup in addition to the existing id. In this way you can have a gradual migration where you go through each endpoint and migrate them individually without changing the foreign keys and other relations in the database (they would still use the sequential key). Maybe you end up not having time to migrate them all, but at least you can reduce the attack surface on the most sensitive endpoints.
If you have low volume endpoints you could perhaps even simply add a mapping layer where you do a simple db lookup to replace the public key with the internal without changing the existing queries. You could even cache this easily to reduce the db load. (both ids are permanently fixed and can be cached forever).
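Sketch of that (Postgres-flavored, hypothetical names):

  -- Unguessable public id alongside the existing sequential PK;
  -- internal FKs keep using the integer id untouched.
  ALTER TABLE customers
    ADD COLUMN public_id UUID NOT NULL UNIQUE DEFAULT gen_random_uuid();

  -- Per-endpoint translation at the boundary; since both ids are
  -- permanently fixed, this lookup can be cached forever.
  SELECT id FROM customers WHERE public_id = $1;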
UUIDv7 may not be generated sequentially, but it is still temporally ordered and discloses a timestamp, which may be an undesirable leakage for some applications/environments. When obfuscation matters that much, use a UUIDv4 and eat the index performance hit.
Some might suggest, "encrypt your object identifiers on the way in/out", but there's a ton of pitfalls involved since for most applications they are now rolling their own crypto, and it also makes identifiers much longer.
Yes, you can go to great depths, but they each have trade-offs - in performance, increased complexity etc. and it's necessary to make a judgment for each particular app instead of applying the most overengineered solution everywhere.
I love it when my competitors use sequential integer IDs.
But the DBMS already maintains a row identifier (called rowid or ctid or whatever depending on the DBMS). Why do you need an explicit one?
It may be useful if you have data that originates from another source or if something outside of our system references your entity. In that case you need to keep some form of an externalRef, so it's usually easier to just use an id that you can control, for referencing both internally and externally.
Be careful with those. SQLite has a rowid concept, but it's not guaranteed to be stable - running a VACUUM against a table can reassign the rowids for every row!
https://sqlite.org/rowidtable.html says:
Because every DB can and will shift those as needed. They reference the physical location on disk for a given tuple. They are not meant for general consumption.
In the DB we use[1] the internal row id is not stable:
The value returned by the function is not necessarily constant between queries as various operations performed on the database may result in changes to the row identifiers of a table.
So, users should refrain from using the ROWID function in ordinary situations; retrieval by primary key value should be used instead.
[1]: https://infocenter.sybase.com/help/index.jsp?topic=/com.syba...
Precisely. And if I have a surrogate key to identify “a car”, I get to define what makes a car unique for my purposes. Maybe that really is the VIN. Maybe today it’s the VIN, but tomorrow it’s whatever vehicle is currently being driven by its owner. Maybe it’s something else.
Some day I may need to track multiple VINs for a vehicle (maybe it’s got parts from multiple VINs and I want to track that). I can still always decompose that table and have an n-to-1 relationship between cars and VINs without migrating the rest of my data model.
An entity can exist over more than one row in your database, but it is useful to uniquely identify each row as the lowest common denominator.
Terrible advice.
Surrogate keys are a layer of indirection.
They don't fix all problems, but they fix some problems.
Not least of which is performance. Often natural keys are character strings, whereas surrogate keys can be fixed size integers, saving index sizes on your FKs.
This launches into the other debate about PKs: using UUIDs rather than sequential keys.
This is less of a debate, and more of an indicator of who has had to work with a DB at scale using UUIDv4 everywhere.
Don’t blow up your B+trees.
I guess.
Either your system is happy enough to route every new entity through "one DB at scale" so it can let your "one DB at scale" be in charge of an auto-incrementing long, or it isn't.
A common method is to have a small app (which can quite easily be HA) that hands out sequential chunks to each shard, interleaving and/or gapping as necessary.
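A sketch of such an allocator (Postgres-flavored; the HA app is just a thin wrapper around something like this):

  CREATE TABLE id_blocks (
    shard      INT PRIMARY KEY,
    next_start BIGINT NOT NULL
  );

  -- A shard atomically claims 1000 ids and hands them out locally,
  -- so there is no allocator round trip per insert:
  UPDATE id_blocks
  SET next_start = next_start + 1000
  WHERE shard = 3
  RETURNING next_start - 1000 AS block_start;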
Conversely, certain queries can be much faster by using natural keys when the FK is all that you need in the result rather than additional fields in the primary table. In this case, the primary table doesn't need to be queried at all. This doesn't generally overcome the benefits of synthetic keys, but it is an optimization sometimes put into practice.
Aren't you literally describing an index?
No. Maybe an example helps. You have a users table with username as natural PK. You have an access table with timestamps of users hitting a service, with FK of username. If you query the access table for a range of timestamps and just want the usernames, they're right there in the results of the access table query. If you had instead used a synthetic user_id key, the db would have to do an additional lookup in the users table via a join for each user_id to get the usernames.
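Sketched out (hypothetical schema):

  -- Natural FK: the answer is already in the access table.
  SELECT DISTINCT username
  FROM access
  WHERE accessed_at BETWEEN '2024-06-01' AND '2024-06-05';

  -- Synthetic FK: the same question now needs a join.
  SELECT DISTINCT u.username
  FROM access a
  JOIN users u ON u.id = a.user_id
  WHERE a.accessed_at BETWEEN '2024-06-01' AND '2024-06-05';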
Sorry, this is very bad advice. I just had to fight tooth and nail to make my lead turn around from this disastrous decision: using a lot of external IDs as our own row primary keys, which then get propagated to all other tables as foreign keys and whatnot. One day the foreign key changes or, God forbid, the formatting changes in the external systems, and now we need to fix our whole database and all the code instead of one small isolated place.
Generate your own unique keys for everything; add a few more unique constraints if needed. A bit more work but never a regret.
Yeesh. I once made the mistake of using an external ID as a primary key. What a day it was when they were changed on me.
I've kept with this advice for the most part, but I'm tempted in some cases to use the external id when there's some guarantee of stability and universality. Like 2 and 3 digit ISO country codes.
Note that I'd have gotten about 5 different ISO country code changes (with some flip-flopping) just by sitting in this very same spot for a couple of decades. "Stability" in country codes, bah humbug.
Your geographical location might not be stable, but I'm referring to the ISO codes that name the country, which should be relatively stable.
Can you share more about this? Wouldn't you run into the same problems if you used a surrogate pk? Without the nat/external pk and fk, you run the risk of having validity issues.
Conversely, if the ID changes, isn't the friction what you want?
I feel like optimising for unlikely edge cases instead of integrity because of a single incident is too reactionary.
Writing a query or script to update a string, even for millions of rows, isn't that big of a deal?
Obviously there _are_ cases where nat pks/fks are bad, but not all of them.
Instead of migrating one column, now you have that and every table with a foreign key to the original. Requiring transactions and possibly locking etc.
Surrogate keys do mirror reality though. As I once read in a Terry Pratchett book: if you replace the handle of an axe and then replace the head, is it still the same axe?
For me, the answer is yes - since we imbue the axe with an identity outside of its integral parts.
That is what a surrogate key is. An identity. Which is an abstract concept that exists in the real world.
And to pile on. The top comment is bad advice! Surrogate keys provide sanity - god save you if you have to work in a database solely using natural keys.
Yes and no. It is the Axe of Theseus ;)
https://en.m.wikipedia.org/wiki/Ship_of_Theseus
Or Trigger's Broom:
https://www.youtube.com/watch?v=BUl6PooveJE
Thanks - I think somewhere in the back of my mind I was also aware of the Ship. But Pratchett is too good. The quote:
Yes. Ultimately, "are these the same" and "are these different" are philosophical questions. Or to be more precise, teleological questions.
Because those questions have no meaning except with an "for our purposes here" added. And it's up to us to decide what we care about, if anything.
If we care about what other people want with the data though, or suspect that our own wants might not be set in stone, then we should also care to model identity independently in the system (that is, use surrogate keys).
Typically for the purposes of ownership. So it's really part of a hierarchical identity scheme.
In the wise words of Patsy, "it's only a model".
The real world is resistant to clean abstractions, and abstractions are distressingly subject to change. What made your row unique today is quite likely to become non-unique in the days/months/years to come.
Always use surrogate keys. Your future self will thank you.
And always kids, write code for one person, and one person only: your future self.
Yeah, fsck your co-workers and your replacement when you quit.
If you’re good to your future self you’re also good to them.
There is a model of a thing and there is a row that stores a representation of that model of a thing.
They both are things. Ignoring the last one might be tempting, but it’s not practical.
Interestingly, your own line of thought applies here too, just one level deeper.
How do you model a row? What makes it unique? A surrogate ID is the only sensible unique identifier for such a thing as there is no “natural key” that would make sense for instances of something so general as “Row”.
What you were saying amounts to “don’t model the thing holding the model”, but experience shows the thing holding the model is itself an (often unwilling) active part of systems.
Someone here gave the example of wrongly entered PKs by administrative personnel handling live customers. That's IMO a good example of why you need an extra layer on top of your Actual Model(c). I can think of more.
Corollary: your app is part of the real world.
Furthermore: bugs in your app are part of the real world, so woe to those who used keys from your app as natural/external keys in their app.
How do you deal with the Ship of Theseus/Trigger's broom? There's literally nothing that defines said object apart from its history.
Your database starts to look like git...
Think about a jet engine. Let's say you figure out that a part is defective and will cause a failure. You want to identify every plane that has one of those parts in it. What if you find that some of them had a bad oil pump that shortened the life of some bearings. You want to know every engine that had one of those pumps, so you can replace other parts.
I don't know if they do this with jets, but there are quite a few places that take the concept much further than this.
No, your table archives start to look like git. Which is not a bad thing; version control on data in a database is very difficult, if not impractical, to realize. I would take any history that I can get...
"Synthetic" keys are just that, keys for tables that only have meaning within the context of that data. You can of course relate your records to as many other tables as you wish, but you identify records via those keys, nothing more.
In short, you advise us to foresee the future, explore unknown unknowns and expect high-precision true answers from the outside. Good advice, not for this universe. You can only get false negatives in this one.
A synthetic key means "we think this exists". There exists a contract, a medical record, a person, in the real world, in our opinion. We record these existence identities into our model by using an always-unique key. Then there's a date, an appointment #, a name, etc. You can refer to an entity by its identity, or search by its attributes. If you use searches in place of identity references, you eventually get non-singletons and your singleton-based logic breaks.
This is why it's hard to be a DBA... Everyone thinks you're a Cassandra user the way other developers ignore your prophecies.
In terms of the Danish CPR that is mentioned here, the way we actually solved the challenge in the national architecture strategy was to define your social security number(s) as what we call an address on your person. I’m not sure why it was called an address, but it’s basically a UUID and the information. Maybe it’s because it was first used to store addresses?
Anyway. Unless your system has not yet implemented the national strategies (which by now are approaching 24 years old), changing a Danish Social Security number won't actually matter, because it's just an added address on your person "object". So basically you'll now simply have two, and only one is marked as active. Similar to how you've got an array of addresses at which you have lived, but only one main address active.
It did indeed cause a lot of issues historically because it was used as a natural key… though with it being based on dates, it was never really a good candidate for a key since you couldn’t exactly make it into a number unless you had some way to deal with all the ones beginning with 0 being shorter than the rest. Anyway… it was used as a key, and it was stupid.
Anyway I both agree and disagree with you. Because we’ve successfully modelled the real world with virtual keys, but you access it by giving any number of natural keys. Like… you can find the digital “me” by giving a system my CPR number, but you can also find me by giving a system my current address. Technically you could find me by giving a name, but good luck if you’re dealing with a common name. There is a whole range of natural keys you can use to identify a “digital” person, but all of it leads from a natural key into a web of connected virtual keys.
All of it is behind some serious gatekeeping security if you’re worried. Or at least it’s supposed to be.
It's been years since I worked on systems with CPR numbers, but I seem to recall that there is also a policy in place, stating that you are not allowed to use the CPR number as a "primary key". Many companies did anyway, because they never actually bothered to read the guidelines and regulations.
All the systems I've seen always had the CPR as a lookup for a UUID, which as you say is the "address" of the actual person object.
Yes, and normalize it. https://en.wikipedia.org/wiki/Database_normalization
This is true up to a point. You can add more data to the system to continue to generate natural, composite keys. However, at some point you move from a database to an event stream, or you have to track events that aren't really needed for what you're doing...
Denormalization then takes precedence and a generated key makes sense again. https://en.wikipedia.org/wiki/Denormalization
It isn't about protection, it's about not collecting the natural data to identify the event that caused the issue. It's denormalization by omission, in effect.
Been vehemently against surrogates my whole life, glad to find a kindred soul.
When I integrate systems, I use that system's natural key (love it when it's a unique ID, but in the systems I work in - it almost never is).
That said, I use that natural key as the "link" to my internally managed, normalized database.
There's nothing that says I cannot add unique identifiers that would replicate the natural key. In fact, that's good design.
In that case, using this natural datum as an ID is a contradiction, because we are now removing the uniqueness constraint. It's fine to model the real world, but the real world also includes your data, and you may want to identify unique _records_ as it is not a universal truth that data can be corrected on entry.
Not really, because your natural ID also has to account for the problem of garbage data, and for the SMEs not actually being experts. And I can give a real-world example of this happening: the Canadian Long Gun Registry.
For anyone that doesn't know, prior to the early-to-mid 90s or so, Canadian gun laws only required registration of pistols. IIRC the records were not handled at a national level either, but I could be wrong. Around that time new laws were introduced that, among other things, required registration of long guns as well.
Most guns by then had serial numbers so all you had to do was tie a gun to a serial number with some characteristics and voila, you've got a natural identifier, right? That's what the experts say.
Well, as it turns out, reality isn't quite so kind. A lot of firearms makers, such as Cooey, produced guns without a serial number from around the 1940s to the 1970s. In other cases the serial number was present from the factory but was damaged, or the part that bore the number had been replaced without replacing the number, or had been replaced with a wrong number. In rare cases the serial number from the factory was wrong because of a mistake when the worker manually stamped in the number.
So the idea of uniquely identifying guns with some sort of simple classifier was already flawed. They attempted to solve the issue of guns without serial numbers with stickers, but those stickers were cheaply made and readily fell off, and owners that were already peeved about the program didn't bother trying to correct the paperwork.
Which segues into the next problem. There was an extremely high rate of errors in the registration forms being submitted. The most famous example I'm aware of was someone registering a Black and Decker soldering gun as a firearm, something he had done in protest. As humorous as it was, the revelation that a soldering gun had been classified as a firearm unveiled another fundamental problem.
The error rate was so high, and the pressure to show progress so great, that someone in leadership (I can't remember if it was the government or the RCMP) directed the data entry clerks to just plug the data in as is, with no validation. It didn't matter if the data was wrong, or made no sense, or contradicted other entries already in the database; the intent was to get all the data in as is so they could fix it later. So all that wrong information? It went straight into the database.
Like I said, this was a real-world mess that lasted from 1995 until 2012, when a new government dropped the requirement for non-restricted firearms to be registered, with little fanfare and only squeaking protests.
It's not to say that you shouldn't think about a 'natural key' per se. But assuming that there is a 'natural key' requires that you or your subject matter expert actually be an expert who can identify a good enough model for one to exist.
But what happens when your SME is just plain wrong and it in turn introduces fundamental flaws in your model? Or outside influences forces garbage data in? How is a database designed to model only the real world supposed to cope with that?
Your post brings up a critical difference I’ve noticed when working with devs (I’m a DBRE): those who actually do rigorous data modeling, and those who view it as an annoyance impeding their project.
Spend time modeling your schema. Ask yourself, “does every attribute in this table directly relate to the primary key? And is every attribute reliant upon the primary key?” Those two alone will get you most of the way through normalization.
Yes it will. Your changes will be confined to only the table(s) where the natural key is present, not spread across every table where there's a foreign key.
Of course you will still have to deal with the reality that the natural key is now not unique, and model reality, but your implementation work in doing so is far simpler.
In more years than I care to count I've regretted someone using natural keys as a primary key for a table many times, and surrogates never.
Of course they won’t save you from external data. The whole point is for your system to have a way to identify rows internally so you can deal with external systems getting wonky without corrupting your own data.
All of your concerns are easily solved by a unique index.
Using external keys as a forcing function to prevent people from representing data wrong is not great.
This took me years to realize, and once I did things became much, much simpler.
No it isn't. Working with natural keys in general involves using compound primary keys since it is unlikely that any lone field is suitable as primary key. Comparing an integer is quick. Comparing three string fields in a join is not.
In my experience, this won't end well. Some examples:
Belgium has the RNR/INSZ identifying each person. But it can change for a lot of reasons: It gets reused if someone dies. It encodes birth date, sex, asylum state, so if something changes (which happens about every day), you need to adapt your unique key.
Belgium also has a number identifying medical organizations. Until they ran out of numbers. Then someone decided to change the checksum algorithm, so the same number with a different checksum meant a different organization. And of course they encode things in the number and reuse them, so the number isn't stable.
An internal COBOL system had a number as unique key, and this being COBOL, they also ran out of numbers. This being COBOL, it was easier to put characters in the fixed-width record than to expand it. And this being French COBOL, those characters can have accents. So everyone using the 'number' as a unique key now had to change their datatype to text and add unicode normalization. Not fun.
In my experience: don't use anything as an ID that you didn't generate yourself. Make it an integer or UUID. Do not put any meaning in the ID. Then add a table (internal ID, registering entity, start date, end date or null, their key as text). You'll still sometimes have duplicates and external ID updates, as the real world is messy, but at least you have a chance to fix them. The overhead of that 1 lookup is negligible at any scale.
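Spelled out as DDL (sketch; "entities" and the valid_from/valid_to names stand in for whatever you're identifying):

  CREATE TABLE external_ids (
    internal_id   BIGINT NOT NULL REFERENCES entities (id),
    registered_by TEXT   NOT NULL,  -- which entity issued the identifier
    valid_from    DATE   NOT NULL,
    valid_to      DATE,             -- NULL = still current
    external_key  TEXT   NOT NULL,  -- their key, always stored as text
    UNIQUE (registered_by, external_key, valid_from)
  );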
Adding to the list of comments damning this post to ensure none of my future colleagues follow this advice.
It definitely will, say my 20+ years of experience.
Natural keys can change, and synthetic keys never have to. That alone is reason enough to use synthetic keys.
Performance is another major argument for synthetic keys, as they can be made sequential, which is rarely the case for natural keys.
Yes, it will. It is precisely because of a messy external reality that you need an unchanging internal ID that is unaffected by changes to the external ID. If the software designer in the article had followed your advice, changing the chassis number would have likely resulted in broken car ownership records.
Whoever is reading this in the future, please don't follow the parent comment's advice. Use surrogate/synthetic keys for your primary key.
Every time I used a natural key I have to come to regret it.
Sorry, by your logic an ISSN would be a good key for a database of scientific journals. It's exactly what ISSN is invented for! Right? Right?
Been there, done that. Journals that changed their names (and identities) but not the ISSN. Journals that changed the ISSN but not the name/identity. Journal mergers which, instead of obtaining a new ISSN, kept one of the old ones. "Predatory journals" that "borrow" an ISSN (you may not consider them real journals, but you've got to track them anyway, even if only to keep them from being added to the "main" database). The list may go on and on.
And don't even get me started on using an even more natural ID, the journal name, perhaps in combination with some other pieces of data, like the year publication started, country of origin, language, etc... Any scheme based on this will need caveat after caveat.
(A fun fact: there were journals that ceased publication but later on "returned from the dead". Such resurrected journals are supposed to be new journals and to get a new ISSN. Sometimes this rule is followed...)
At the end, a "meaningless number" you assign yourself is the only ID that reliably works (in combination with fields representing relationships between journals).
The problem with keys that "have meaning" is that they appear to carry information about your entity. And in vast majority of cases this is correct information! So it's almost impossible to resist "extracting" this information from a key without doing actual database lookup at least mentally, and often in your software too. Hidden assumptions like this lead to bugs that are really hard to eliminate. A meaningless number on the other hand does not tempt one :-)
When designing a table - you should always be clear about what the natural key is and make sure it has uniqueness constraint. Mindlessly having a surrogate key without thinking about what the natural key is, is an anti-pattern. So totally agree here.
That doesn't stop you also having a surrogate key though.
Another aspect of natural versus surrogate keys is joins, as the key often ends up in both tables.
Using natural keys can mean in some circumstances you can avoid a join to get the information you want - as it's replicated between tables.
There is also the question of whether you surface the surrogate key in the application layer or not - some of the problems of surrogate keys can be avoided by keeping them contained within the app server logic and not part of the UI.
So via the UI - you'll search by car registration number, not surrogate key, but in terms of your database schema you don't join the tables on registration number - you use a surrogate key to make registration numbers easier to update.
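In query terms (hypothetical schema):

  -- The UI searches by the natural identifier...
  SELECT c.id
  FROM cars c
  WHERE c.registration = 'AB12CDE';

  -- ...but relationships ride on the surrogate, so changing a
  -- registration updates exactly one row in cars:
  SELECT o.owner_name
  FROM owners o
  JOIN cars c ON c.id = o.car_id
  WHERE c.registration = 'AB12CDE';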