
You'll regret using natural keys

jklowden
120 replies
15h4m

You think your surrogate key will save you? It will not. The world has an external reality that needs to be reflected in your database. If the unique identifier for your object — VIN, CUSIP, whatever — changes, the world will henceforth refer to it by both. You will need to track both. Adding a synthetic key only means you have to track all three. Plus you have to generate a meaningless number, which becomes a choke point in your data throughput.

The natural key forces you to think about what makes the row unique. What identifies it. Sometimes, it makes you go back to the SME and ask them what they mean. Sometimes it makes you reconsider time: it’s unique now, but does it change over time, and does the database need to capture that? In short, what are the boundaries of the Closed World Assumption? You need to know that too, to answer any "not exists" question.

To use our professor’s car example, we actually do not know the database design. It could well be that the original identifier remained the primary key, and the "new id" is entered as an alias. The original ID is unique in the Car table, identifying the vehicle; the aliases live in the CarAlias table, where each alias is unique.
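A rough sketch of that design in SQLite (all table and column names are invented for illustration): the original identifier stays the primary key of Car, and any later identifier is just a unique row in CarAlias pointing back at it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Car (
        vin   TEXT PRIMARY KEY,   -- original identifier, never changes
        model TEXT NOT NULL
    );
    CREATE TABLE CarAlias (
        alias TEXT PRIMARY KEY,   -- each alias is unique
        vin   TEXT NOT NULL REFERENCES Car(vin)
    );
""")
con.execute("INSERT INTO Car VALUES ('OLD-123', 'Sedan')")
con.execute("INSERT INTO CarAlias VALUES ('NEW-456', 'OLD-123')")

# The vehicle can be looked up by either identifier.
row = con.execute("""
    SELECT c.vin, c.model FROM Car c
    JOIN CarAlias a ON a.vin = c.vin
    WHERE a.alias = 'NEW-456'
""").fetchone()
print(row)  # ('OLD-123', 'Sedan')
```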

Oh, you say, but what if the bad old ID gets reused? Good question. Better question: how will the surrogate key protect you? It will not. The reused ID will be used to query the system. Without some distinguishing feature, perhaps date, it will serve up duplicates. The problem has to be handled, and the surrogate key is no defense.

Model your data on the real world. Do not depend on spherical horses.

yashap
48 replies
12h39m

This is bad advice.

Say I have a user table, and the email is unique and required, and we don’t let users update their email, and we don’t have user deletion. If I’m going natural PK, I make email the primary key.

But … then we add the ability for users to update their email. But it should still be the same user! This is trivial if we have a surrogate primary key, a nightmare if we made email the natural primary key.

Or building on that example, maybe at first we always require an email from our users. But later we also allow phone auth, and you just need an email OR a phone number. And later we add user name auth, SSO, etc. Again, all good with surrogate primary keys, a nightmare with natural primary keys.

There are countless examples like this. You brought up cars, same thing with licence plates, for example. Or even Social Security Numbers/Social Insurance Numbers - in Canada SINs are generally permanent, but temporary residents can have their SIN change if they later become permanent residents, but they’re still the same person.

You want your entities to have stable identity, even if things you at one time thought gave them identity change. Surrogate primary keys do that, natural primary keys do not. Don’t use natural primary keys, use surrogate primary keys with unique constraints/indexes.
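A minimal SQLite sketch of the scenario above (table and column names are made up): the surrogate key gives the row a stable identity, while email and phone are merely unique, nullable attributes that can change or be absent.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,  -- surrogate key: identity never changes
        email TEXT UNIQUE,          -- natural identifier, but nullable and editable
        phone TEXT UNIQUE
    )
""")
con.execute("INSERT INTO users (email) VALUES ('old@example.com')")

# The user changes their email; their identity (id) is untouched,
# so every foreign key pointing at id=1 stays valid.
con.execute("UPDATE users SET email = 'new@example.com' WHERE id = 1")

# Later, phone-only signup becomes possible without touching the key.
con.execute("INSERT INTO users (phone) VALUES ('+15550100')")
print(con.execute("SELECT id, email FROM users ORDER BY id").fetchall())
# [(1, 'new@example.com'), (2, None)]
```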

I challenge you to come up with a single plausible example where you’re screwing yourself by choosing surrogate PK + unique constraints/indexes. Meanwhile there are endless examples where you’re screwing yourself by choosing natural PK.

brigandish
14 replies
11h10m

You require three fields (or four): email at registration, a date for that entry (together these create a natural key), and current email (this one not part of the key and editable).

We're almost all the way to a Tag URI[0], so you could combine it with the user's name or username or any other identifier that fits the spec[1] (you could even use the website's own name) and you have a (definitely two thirds, probably 100%) natural key.

It's stable over time and unique, easy to mint, and has a standard behind it. The user also gets to change their contact details without any problem related to the key.
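For concreteness, a loose sketch of minting such a key (this is just string assembly following the "tag:" authority "," date ":" specific shape from RFC 4151, not a full validating implementation; the inputs are hypothetical):

```python
# Mint a Tag URI from the site's domain, the registration date, and the
# email at registration. Once minted, the key never changes, even if the
# user's current contact email does.
def mint_tag_uri(site: str, reg_date: str, reg_email: str) -> str:
    return f"tag:{site},{reg_date}:{reg_email}"

print(mint_tag_uri("example.com", "2024-06-05", "user@example.com"))
# tag:example.com,2024-06-05:user@example.com
```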

[0] https://taguri.org/

[1] http://www.faqs.org/rfcs/rfc4151.html

TeMPOraL
8 replies
6h45m

Except you're encoding PII in the ID, which makes them plainly visible to people who should not have access to user data, and hard or impossible to change. Sure, I could e.g. change my e-mail and the contact data would be updated, but you still have the old e-mail associated with my account via ID. I'm not sure this would fly under GDPR.

brigandish
7 replies
4h41m

Erm, don't show the ID to people who don't need it.

Aside from that, it's not a violation of GDPR to keep personal information (that they consented to you having) in order to process business for that person. Using an email address as a unique identifier is not a violation, using it to spam them would be. If they're willing to give you their current email why not an old one?

lolinder
6 replies
4h34m

Erm, don't show the ID to people who don't need it.

How do you communicate with other people in your company about a customer without sending around PII if the customer's ID is PII?

Maybe we could create a field that uniquely identifies the customer that isn't PII. Then that could be used to uniquely identify a customer in places where we don't want to expose their PII. But then... why not just use this unique ID as the key?

brigandish
5 replies
4h25m

Do you often send the auto-incremented int (that would be the default substitute to this) when communicating with others? Then why would you send this?

It's so strange an argument. Right now you have my username but not my email address, yet you can still query the website database and get certain data that you're allowed to see. There are so many ways to query a particular user's data, and they would all depend on what you're trying to do, needing the specific key would mean you should have access to it anyway, and it could be granted on a per-case basis.

lolinder
3 replies
4h21m

Do you often send the auto-incremented int (that would be the default substitute to this) when communicating with others?

It's not an int, but yes, we have a unique synthetic identifier that serves as the database PK and as a means of communicating about a customer in insecure channels without exposing PII. "Customer ID ### is having an issue with such-and-such."

To turn your second part back around: why a natural key? What is the function of minting a natural key if humans are meant to use something else?

brigandish
2 replies
3h37m

To turn your second part back around: why a natural key? What is the function of minting a natural key if humans are meant to use something else?

Because non-natural keys are unnecessary in the presence of a natural key, and unnecessary things bring in complexity.

"Customer ID ### is having an issue with such-and-such."

Then you need access to the customer's ID, but the devil here is in the detail you didn't add, the such-and-such.

communicating about a customer in insecure channels

Use secure channels…

shkkmo
0 replies
1h40m

Because non-natural keys are unnecessary in the presence of a natural key, and unnecessary things bring in complexity.

None of the things you've presented are actually "natural" keys, they are pieces of information that you've made assumptions about to shoehorn them into being usable as a "natural key".

Use secure channels…

No channel is perfectly secure. As channels become more secure, they become harder to use and add complexity. The more places you store customer data, the more risk you create. The attempt to force semantic data to serve as "natural key" has now added risk and complexity to your entire communication infrastructure.

lolinder
0 replies
1h41m

Use secure channels…

When it comes to PII at my company, secure channels means "encrypted email only". No Slack, no Jira, no chat in video calls.

That's just not feasible for 100% of communications.

shkkmo
0 replies
1h45m

Do you often send the auto-incremented int (that would be the default substitute to this) when communicating with others?

Frequently yes. It is extremely common to communicate about specific records using the ID for that record. The fact that this sort of behavior is extremely common is pretty clearly indicated by the question itself.

There are so many ways to query a particular user's data, and they would all depend on what you're trying to do, needing the specific key would mean you should have access to it anyway

A responsible organization at scale will limit and log access to customer data. I should be able to determine if two people are talking about the same customer record without needing access to that record's PII.

It is much better to have an artificial key that is linked to this data. There is no upside to the natural key and many, many downsides.

mewpmewp2
4 replies
6h30m

But then if you want to expose a detail page for that user, the id identifying that page has to include all this potentially personal information about them?

e.g. instead of mysocialmedia.com/users/2374927

you would be showing

mysocialmedia.com/users/email@example.com-2024-06-05-mysocialmedia.com

Then exposing a lot of information that you may have not wanted to expose.

sgarland
3 replies
6h21m

You don’t have to use the PK as the URL slug. Even if you want to route that way, you can have an internal ID and external ID. This is one way to use something random like a UUIDv4 for display without incurring the costs (at least, some of them) of having it as a PK.
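A sketch of that internal/external ID split (the schema is invented for illustration, not GitLab's or PlanetScale's actual one): the compact integer stays the PK for joins, while a random UUID is the only identifier that ever appears in a URL.

```python
import sqlite3, uuid

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE customer (
        id          INTEGER PRIMARY KEY,   -- internal surrogate key, used for joins
        external_id TEXT NOT NULL UNIQUE,  -- random UUID, safe to expose in URLs
        email       TEXT NOT NULL UNIQUE
    )
""")
ext = str(uuid.uuid4())
con.execute("INSERT INTO customer (external_id, email) VALUES (?, ?)",
            (ext, "user@example.com"))

# URL routing looks the row up by the non-guessable external id;
# internal joins keep using the compact integer key.
row = con.execute("SELECT id FROM customer WHERE external_id = ?", (ext,)).fetchone()
print(row)  # (1,)
```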

mewpmewp2
2 replies
6h11m

And then if you want to list other entities to that user you will have to start mapping the external id and foreign relationships every time to external users?

And also if you are doing exception logging, for ids/primary keys there's higher odds of them being logged out, including your own logs and also external platforms.

It feels like having primary key set up like this just will complicate everything unnecessarily for the future including many edge cases that you don't foresee.

Just have the main ID not have any meaning.

It shouldn't contain information about the date, it shouldn't be auto increment, it should really be just random.

sgarland
0 replies
5h47m

The solution I outlined is the one GitLab and PlanetScale both use internally, so it has been tested at scale and works well, for both Postgres (the former) and MySQL (the latter).

It shouldn't contain information about the date, it shouldn't be auto increment, it should really be just random.

That’s a great way to tank performance. You want your PK to be k-sortable.
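To illustrate what k-sortable means, here is a toy identifier similar in spirit to UUIDv7 (this is a hand-rolled sketch, not the actual UUIDv7 format): a millisecond timestamp prefix keeps inserts roughly append-only in a B-tree index, while a random suffix keeps the IDs unguessable.

```python
import os, time

def k_sortable_id() -> bytes:
    # 48-bit big-endian millisecond timestamp, then 80 random bits.
    ts = int(time.time() * 1000).to_bytes(6, "big")
    return ts + os.urandom(10)

a = k_sortable_id()
time.sleep(0.01)
b = k_sortable_id()
assert a < b  # IDs generated later sort after earlier ones
```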

brigandish
0 replies
4h39m

And then if you want to list other entities to that user you will have to start mapping the external id and foreign relationships every time to external users?

If we're talking about relational database engines, that's what they do, relate things. One join statement is much the same as another.

twobitshifter
12 replies
6h43m

ON UPDATE CASCADE is not the nightmare that you are making it out to be. (My impression from the article is that this is a single SQL database being discussed.)
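For anyone who hasn't used it, a small SQLite demo of ON UPDATE CASCADE (table names invented; note SQLite only enforces foreign keys with the pragma enabled): changing the natural key in the parent table propagates automatically to every referencing row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite requires this per connection
con.executescript("""
    CREATE TABLE users (email TEXT PRIMARY KEY);
    CREATE TABLE orders (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL
            REFERENCES users(email) ON UPDATE CASCADE
    );
    INSERT INTO users VALUES ('old@example.com');
    INSERT INTO orders (email) VALUES ('old@example.com');
""")
# Updating the natural key cascades to the referencing table.
con.execute("UPDATE users SET email = 'new@example.com'")
print(con.execute("SELECT email FROM orders").fetchone())  # ('new@example.com',)
```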

mewpmewp2
6 replies
6h29m

My impression from the article is that this is a single SQL database being discussed.

Even if it's initially single, it's bad to assume that it will be so forever and that you are not going to use third party providers in the future.

How well does ON UPDATE CASCADE work if there's millions of existing relations to that entity?

twobitshifter
5 replies
5h57m

YAGNI for 99% of projects and databases. When you get to global sharded nosql etc. you need to use UUIDs for anything and incrementing IDs falls over too.

mewpmewp2
3 replies
5h29m

I'm using UUIDs by default for everything. Main point being that I don't have to worry about future restrictions.

And incrementing IDs are also problematic, yes, since they embed business information within them.

And I do think that I need it for much more than 1% of projects and DBs.

sgarland
2 replies
5h20m

You’ll have to worry about performance tanking instead. If you’re using UUIDv7 then less so, but it’s still (at best) 16 bytes, which is double that of even a BIGINT.

Anyone who says UUIDs aren’t a problem hasn’t dealt with them at scale (or doesn’t know what they’re looking at, and just upsizes the hardware).

chuckadams
1 replies
3h44m

Most databases with a UUID type store them as 128-bit integers, typically the same as a BIGINT. It's not like 378562875682765 is the bit representation of a bigint either. And if you're not using uuidv7 or some other kind of cluster-friendly id, you'd best be using a hash index, and if you're doing neither, you probably don't care about their size or performance anyway. You don't pick UUIDs blindly, but on balance, they solve a lot more problems than they cause.

sgarland
0 replies
1h23m

Postgres’ UUID type is 16 bytes. MySQL can store them as BINARY(16) once encoded. Conversely, a BIGINT for either is 8 bytes. Not sure about SQL Server or Oracle.

You don't pick UUIDs blindly, but on balance, they solve a lot more problems than they cause.

IME, this is precisely the problem – devs choose them blindly, because then you don’t have to think about proper modeling, you can arbitrarily create a key in your app and be nearly guaranteed of its uniqueness, etc.

sgarland
0 replies
5h24m

incrementing IDs falls over too

This is a myth. Planetscale [0] uses integers. They are assuredly at scale.

As for auto-incrementing, there are plenty of solutions for distributed systems.

[0]: https://github.com/planetscale/discussion/discussions/366

jval43
2 replies
6h23m

It's never a single database in the real world. As soon as you integrate something or have an API to something the keys are out there. Unless you add a translation layer, but then you could just as well use surrogate keys directly.

mst
1 replies
5h57m

Using a surrogate UUID for communicating with the outside world is often very useful.

This is true for an internal PK that's an auto-inc id as well as for natural keys, though.

Using a natural PK -inside- your own database can still be a lot more pleasant to work with, even if you don't let it escape.

shkkmo
0 replies
1h55m

Using a natural PK -inside- your own database can still be a lot more pleasant to work with

Until you need to do anything like described above. The advantage of artificial keys is that they have no semantic content. Anything with semantic content carries the risk that the role that semantic content plays in your system can change and cause problems. Having a non-semantic identifier protects you from that.

This is not to say that you should never use a semantic identifier as a key. However, you should always have a non-semantic artificial key as the identifier and use the semantic identifiers only when necessary.

yashap
0 replies
4h37m

Sure, if all data is in a single DB. But in the real world you’ve generally got some/all of:

- 1 or more data warehouses

- Other services storing said data (e.g. the user id will live in many databases in a service oriented architecture)

- External API integrators who have their own data stores totally out of your control that also have copies of parts of your data

- Job queues. It’s common to have jobs scheduled for the future on some other system (Redis, etc.) that say “do X for user with id Y in Z days”. If the “id” changes these fail

- Caches. Say I cache data by user email instead of a surrogate key, user changes their email, and another user signs up with the old email. Cache hits for the wrong user!

- etc.

Changing the primary key becomes an absolute nightmare project once data lives in multiple places like this, and in my experience it mostly does.

Having truly stable identity for your entities just solves so many future problems, and it’s SO easy to do. In almost all cases, natural PKs are really all downside, virtually zero upside, except slightly less storage.

sgarland
0 replies
6h28m

I was going to say; this is a perfect use case for a cascading FK.

maweki
9 replies
11h31m

I challenge you to come up with a single plausible example

So I come from academia, but generally if you use a natural key as PK in a foreign key constraint it may be possible to express additional consistency criteria as CHECK-constraints in the referencing table.

So this is a bad example, but say you have Name and Birthdate as your PK, and you have a second table recording certain special offers sold to your customers, including one just for Virgos and Pisces: you could enforce that the birth date matches the special offer. Some modern systems also technically allow FK on alternate keys, so you could still do it that way, but database theory often ignores that.
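A SQLite sketch of that idea, simplified to Virgos only (table names, the date encoding, and the Aug 23 to Sep 22 range are all my assumptions): because the composite natural key carries the birth date into the referencing table, the zodiac rule can be enforced there with a local CHECK.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE customer (
        name TEXT, birthdate TEXT,       -- ISO dates: YYYY-MM-DD
        PRIMARY KEY (name, birthdate)
    );
    CREATE TABLE virgo_offer (
        name TEXT, birthdate TEXT,
        FOREIGN KEY (name, birthdate) REFERENCES customer(name, birthdate),
        -- The natural key includes the birth date, so the referencing
        -- table can check the "Virgos only" rule without a join.
        CHECK (substr(birthdate, 6, 5) BETWEEN '08-23' AND '09-22')
    );
    INSERT INTO customer VALUES ('Ada', '1990-09-01');
    INSERT INTO customer VALUES ('Bob', '1990-01-15');
    INSERT INTO virgo_offer VALUES ('Ada', '1990-09-01');  -- passes the CHECK
""")
try:
    con.execute("INSERT INTO virgo_offer VALUES ('Bob', '1990-01-15')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # Bob is not a Virgo
```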

But second, while I agree that surrogate keys are often a good idea, I find your argument, that you must design for every conceivable change, not convincing.

spoiler
3 replies
11h8m

Another example is where you use a service that provides you with a stable id. It makes little sense to add a surrogate id and a fk on that surrogate id. It violates data quality and integrity just for a hypothetical situation.

Data integrity/quality matters. Adding friction to prevent accidents also matters. I don't want something accidentally and trivially updating a field that's used to reference thing externally.

Something about the nat key is about to change? Fine, we can write migrations. Even if it affects millions of rows, it's not a big deal.

I understand people have been burnt by bad design decisions involving nat keys, but they're not some devil's key everyone here dogmatically makes them out to be. You can mess up using anything.

maweki
1 replies
11h2m

a service that provides you with a stable id

I think there's the important point. Is your key actually natural or is it someone else's surrogate key anyway? Going back to the vehicle identification number: that's already a surrogate key. You just did not assign it yourself.

sgarland
0 replies
6h6m

A VIN is not a surrogate key. A surrogate key must, by definition, have no semantic meaning, and not be composed of parts from multiple domains (among other requirements).

A VIN encodes the following:

* Country of origin

* Manufacturer

* Vehicle type

* Vehicle model

* Engine type

* Transmission type

* Model year

* Manufacturing plant

* Serial number

shkkmo
0 replies
1h32m

It makes little sense to add a surrogate id and a fk on that surrogate id. It violates data quality and integrity just for a hypothetical situation.

I would still almost always use an internal artificial key on top of the external id. If you want to enforce data integrity, you can still enforce uniqueness on the external id. "Stability" of an external identifier is almost always an assumption and one that I've seen fail enough times to want that internal id by default.

zelphirkalt
1 replies
11h14m

Why does anything need to be a primary key anywhere in order to enforce some constraint? At least from ORMs I know I can set for example any group of attributes unique. Other constraints can be implemented in some general method that is called when persisting in the actual database. Even if no ORM, you can write a wrapper around your persisting procedure.

sgarland
0 replies
6h15m

It doesn’t, you’re right. However, indexes aren’t free, so if your data is such that a natural PK (composite or otherwise) makes sense, you’ll save RAM. Also, for clustering RDBMS like MySQL, data is stored around the PK, so you can get some locality boosts depending on query patterns.

lstamour
1 replies
11h9m

Some modern systems also technically allow FK on alternate keys

As far as I can tell, all modern systems allow it, as it is part of the SQL standard that foreign keys can be either primary keys or unique indexes. Here's a brief quotation from a copy of ISO/IEC 9075-2:1999 (not the latest version) that I randomly found online:

If the <referenced table and columns> specifies a <reference column list>, then the set of <column name>s contained in that <reference column list> shall be equal to the set of <column name>s contained in the <unique column list> of a unique constraint of the referenced table.

So it mentions unique constraints first. Then afterward it says:

If the <referenced table and columns> does not specify a <reference column list>, then the table descriptor of the referenced table shall include a unique constraint that specifies PRIMARY KEY.

If I'm reading this right, it means that in the base case, where you specify the column to reference, it can be any unique constraint, where a primary key is just another possible unique constraint (as all primary keys are by definition unique). And only if you don't specify the fields to reference does it then fall back to the primary key instead of a named unique constraint.

I'm not disagreeing with you entirely - it's true that often there's an assumption in database theory that primary keys are natural and foreign keys are primary keys. But this isn't a hard requirement in practice or in theory, and it partly depends on the foreign key's purpose, why you need it in the first place. This StackOverflow answer also explains it well: https://softwareengineering.stackexchange.com/a/254566
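A quick SQLite demonstration of that point (table names invented): the foreign key below targets a UNIQUE column that is not the primary key, and it is enforced just the same.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE product (
        id  INTEGER PRIMARY KEY,
        sku TEXT NOT NULL UNIQUE        -- alternate key
    );
    CREATE TABLE line_item (
        id  INTEGER PRIMARY KEY,
        -- The FK references the UNIQUE column, not the primary key.
        sku TEXT NOT NULL REFERENCES product(sku)
    );
    INSERT INTO product (sku) VALUES ('ABC-1');
    INSERT INTO line_item (sku) VALUES ('ABC-1');
""")
try:
    con.execute("INSERT INTO line_item (sku) VALUES ('NOPE')")
except sqlite3.IntegrityError:
    print("FK on the alternate key is enforced")
```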

I should add that there is also a set of database design wisdom that suggests you should never use database constraints such as foreign keys, only app/api constraints, but that's a whole different tangent.

sgarland
0 replies
6h17m

I should add that there is also a set of database design wisdom that suggests you should never use database constraints such as foreign keys, only app/api constraints, but that's a whole different tangent.

That’s less a DB design thought and more of a “devs with little formal training in RDBMS who only want to use it as a dumb store” thought.

Use the DB to its strengths. CHECK constraints add an infinitesimal amount of overhead for writes, and guarantee that there will never be invalid data written. A bad code change could allow, however briefly, for that to occur.

cjfd
0 replies
11h19m

"for every conceivable change".

That is not what he is arguing at all. He is showing that there are very many highly plausible changes that are problematic with natural keys. And he is totally correct about that. Frankly, the fact that a post arguing for natural keys makes it to the top of an HN comment thread is extremely weird. The original article is correct that natural keys are bad.

knallfrosch
4 replies
11h13m

then we add the ability for users to update their email.

At this point, you should verify the new email. At least until it is verified, you must track the old email. At this point, you realize you can now introduce a synthetic key and you're fine.

Let's say you have a duplicate customer entry and the customer demands their accounts be merged. Now you can't identify the user by their key alone, since by definition the keys can't be the same (yet).

epcoa
3 replies
9h42m

At this point, you realize you can now introduce a synthetic key and you're fine.

Except for having to update all foreign references. Some of those may be external further complicating issues.

Emails are often among the worst keys because they are not terribly stable and they are reusable often enough to burn you.

jrs235
2 replies
7h24m

Also are emails case sensitive or not? In some systems (that you don't control mind you) they are and others they are not...

sgarland
1 replies
6h23m

Per RFC5321, the local part (before @) _may_ be case-sensitive, but in practice, it almost never is, and relying on case sensitivity is a recipe for disaster.

The domain must always be case-insensitive.
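A hedged normalization sketch following that reading of the RFC: lowercase only the domain, which is guaranteed case-insensitive, and leave the local part intact. (Most real systems also fold the local part, but that is a policy choice, not something the RFC guarantees.)

```python
def normalize_email(addr: str) -> str:
    # Split on the last "@" so local parts containing "@" in quotes survive.
    local, _, domain = addr.rpartition("@")
    return f"{local}@{domain.lower()}"

print(normalize_email("John.Doe@EXAMPLE.com"))  # John.Doe@example.com
```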

jrs235
0 replies
1h11m

My point is that if you're using email as a key, you "have to" treat it as case sensitive even though for most systems it's not. And yes, I agree it will be a recipe for disaster.

ourmandave
1 replies
7h45m

If I’m going natural PK, I make email the primary key.

Welcome to the Mr. Cooper mortgage provider website.

Your logon is your email and you can't change it.

If you used your cable provider email you're stuck with them for the life of your 30 year mortgage.

bradleyankrom
0 replies
6h47m

Oh, also, we've been breached and your information is available for purchase on the dark web. Fun!

nottorp
0 replies
7h4m

Funny you should say that...

I've been very slooowly degoogling myself, and that includes changing all logins that have a gmail address to a different non google email.

I'd say only like 1/3 of the sites I made logins for have the option of changing the email.

bioneuralnet
0 replies
5h58m

My first job in the late 2000's was at a small university with a home-grown ERP system originally written in the 80s (Informix-4GL). Student records, employee records, financials, asset tracking - everything. It used natural compound keys.

Even worse than the verbose, repetitive, and error-prone conditions/joins was the few times when something big in the schema changed, requiring a new column be added to the compound key. We'd have to trawl through the codebase and add the new column to every query condition/join that used the compound key. It sucked.

arnorhs
0 replies
9h10m

Another benefit of having stable identies / surrogate primary keys is that any relations (FKs) will be much simpler.

Sure, like the poster you replied to is pointing out, you _can_ use natural keys, and then also rely on dates or other parts of the data - but creating a relation for that can end up being extremely cumbersome.

- Indexes generally become larger

- Relationships become harder to define and maintain

- Harder for other developers to get up to speed on a project

crabmusket
18 replies
14h40m

The surrogate key uniquely identifies a row in your database, which is an entity just as real and significant as the car or the employee or what-have-you. Don't confuse the two!

I agree with you that having a surrogate key isn't going to save you from the reasons why natural keys can be difficult. The complexity has to go somewhere. But not having a unique identifier for each row is going to make things extra difficult.

hobs
10 replies
14h30m

The main thing is that the synthetic key should never leave the database and never be displayed in the app - if you want to have another key that represents the human oriented key, do it, but it should be another field, an indexed field even, but one without the monotonic, sequential properties inherent to synthetic database identifiers.

You want to change that human key? Sure. You want to complain that the keys are not sequential? Sure. You want to actually make them weird strings that are harmful to my brain? Why not? You want to update primary keys in the database? No. Absolutely not.

The_Colonel
9 replies
13h23m

I assume you hint at the security aspect of monotonic keys?

I've found this issue a bit overblown. It's basically security by obscurity, which is a nice bonus, but not something your security model can be based on.

I mean, it is a good practice to expose some kind of non-sequential key (e.g. UUIDv7), but it doesn't seem to me like a dealbreaker.

throwaway8486k
5 replies
12h6m

Security by obscurity makes it more difficult for bad actors as an additional layer that they need to break. I don't see a reason why that's a bad thing and doesn't take a lot of effort to implement in this case.

I have been in a startup where competitors used our sequential keys to scrape a list of customers. Lesson learned the hard way with actual business consequences!

Sequential keys also leak information (German tank problem)

Your competitors can estimate number of customers, your growth rate and other stats that you often don't want them to know.

https://en.m.wikipedia.org/wiki/German_tank_problem
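To make the leak concrete, the classic minimum-variance unbiased estimator from the German tank problem is N ≈ m + m/k − 1, where m is the largest ID seen among k observations (the sample IDs below are invented):

```python
def estimate_total(observed_ids):
    # m: highest sequential ID observed; k: number of observations.
    k, m = len(observed_ids), max(observed_ids)
    return m + m / k - 1

# A scraper that stumbles on customer IDs 19, 40, 42 and 60 can estimate
# roughly how many customers exist in total:
print(estimate_total([19, 40, 42, 60]))  # 74.0
```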

yellowapple
2 replies
9h7m

I have been in a startup where competitors used our sequential keys to scrape a list of customers.

If your system allows customers to see each other (or worse: unauthenticated users to see customers) in this fashion in the first place then whether you're using a sequential integer v. a random UUID is the least of your problems.

mgkimsal
1 replies
6h30m

The 'customers' could be free tier users - a social media type system where everyone has a public profile - intended for the public - would still be scrapable by /profile/1, profile/2, etc. Doesn't necessarily require 'authentication' for the exposing of sequential integers to have a bad outcome.

throwaway8486k
0 replies
1h34m

You're right. The urls were public to be shared (think of marketing material / ecommerce), so there was not a security incident.

But it did give our competitor free highly qualified leads that they could use to poach customers. This product was new to our customers, and we had spent a lot of time selling and convincing them that it was useful.

The_Colonel
1 replies
11h45m

I'm not saying it's a bad practice, the opposite actually.

I don't see a reason why that's a bad thing and doesn't take a lot of effort to implement in this case.

True, if you start your application from scratch. Like if I started designing a new app today, I'd just choose the UUIDv7 for the primary key.

It's not an easy thing to add into an existing application, though. I see applications leaking their internal IDs all the time, but usually it's not worth the effort to fix that, because it's a comparatively minor problem.

throwaway8486k
0 replies
10h56m

I'm not saying it's a bad practice, the opposite actually

I'm sorry that I didn't make that more clear. I saw that you mentioned it as a best practice and are aware of the advantages. It's just that there are so many others that don't have the balanced view as you seem to have.

I have been involved in many discussions at my workplace where "security by obscurity" is used as a way to shut down discussions. They changed their minds about sequential keys after the incident I mentioned, but it still has the power to "win" other discussions. Sure, we need to have rate limiting on IP addresses, auth and other mechanisms, but they are not perfect and bugs happen all the time. An "unguessable" id is an additional security layer.

It's not an easy thing to add into an existing application, though

I agree, but there are ways to reduce the attack surface. You could add an extra "public id" field that can be used for lookup in addition to the existing id. In this way you can have a gradual migration where you go through each endpoint and migrate them individually without changing the foreign keys and other relations in the database (they would still use the sequential key). Maybe you end up not having time to migrate them all, but at least you can reduce the attack surface on the most sensitive endpoints.

If you have low volume endpoints you could perhaps even simply add a mapping layer where you do a simple db lookup to replace the public key with the internal without changing the existing queries. You could even cache this easily to reduce the db load. (both ids are permanently fixed and can be cached forever).

inopinatus
1 replies
11h17m

UUIDv7 may not be generated sequentially, but it is still temporally ordered and discloses a timestamp, which may be an undesirable leakage for some applications/environments. When obfuscation matters that much, use a UUIDv4 and eat the index performance hit.
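The timestamp disclosure is easy to show: the first 48 bits of a UUIDv7 are milliseconds since the Unix epoch. Since the stdlib `uuid` module can't generate v7 values (as of Python 3.12), the value below is hand-constructed so that its timestamp field encodes 2024-06-05 00:00:00 UTC:

```python
import datetime, uuid

def uuid7_timestamp(u: uuid.UUID) -> datetime.datetime:
    # The top 48 bits of a UUIDv7 are a millisecond Unix timestamp.
    ms = u.int >> 80
    return datetime.datetime.fromtimestamp(ms / 1000, tz=datetime.timezone.utc)

# Hand-built v7-shaped value; anyone holding it can recover when it was minted.
u = uuid.UUID("018fe5b2-0400-7000-8000-000000000000")
print(uuid7_timestamp(u))  # 2024-06-05 00:00:00+00:00
```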

Some might suggest, "encrypt your object identifiers on the way in/out", but there's a ton of pitfalls involved since for most applications they are now rolling their own crypto, and it also makes identifiers much longer.

The_Colonel
0 replies
11h4m

Yes, you can go to great depths, but they each have trade-offs - in performance, increased complexity etc. and it's necessary to make a judgment for each particular app instead of applying the most overengineered solution everywhere.

crabmusket
0 replies
12h7m

I love it when my competitors use sequential integer IDs.

mkleczek
4 replies
10h53m

The surrogate key uniquely identifies a row in your database, which is an entity just as real and significant as the car or the employee or what-have-you. Don't confuse the two!

But the DBMS already maintains a row identifier (called rowid or ctid or whatever depending on the DBMS). Why do you need an explicit one?

unchar1
0 replies
10h44m

It may be useful if you have data that originates from another source or if something outside of your system references your entity. In that case you need to keep some form of an externalRef, so it's usually easier to just use an ID that you can control, for referencing both internally and externally.

simonw
0 replies
9h31m

Be careful with those. SQLite has a rowid concept, but it's not guaranteed to be stable - running a VACUUM against a table can reassign the rowids for every row!

https://sqlite.org/rowidtable.html says:

If the rowid is not aliased by INTEGER PRIMARY KEY then it is not persistent and might change. In particular the VACUUM command will change rowids for tables that do not declare an INTEGER PRIMARY KEY. Therefore, applications should not normally access the rowid directly, but instead use an INTEGER PRIMARY KEY.
sgarland
0 replies
6h1m

Because every DB can and will shift those as needed. They reference the physical location on disk for a given tuple. They are not meant for general consumption.

magicalhippo
0 replies
10h31m

In the DB we use[1] the internal row id is not stable:

The value returned by the function is not necessarily constant between queries as various operations performed on the database may result in changes to the row identifiers of a table.

So, users should refrain from using the ROWID function in ordinary situations; retrieval by primary key value should be used instead.

[1]: https://infocenter.sybase.com/help/index.jsp?topic=/com.syba...

stouset
0 replies
14h21m

Precisely. And if I have a surrogate key to identify “a car”, I get to define what makes a car unique for my purposes. Maybe that really is the VIN. Maybe today it’s the VIN, but tomorrow it’s whatever vehicle is currently being driven by its owner. Maybe it’s something else.

Some day I may need to track multiple VINs for a vehicle (maybe it’s got parts from multiple VINs and I want to track that). I can still always decompose that table and have an n-to-1 relationship between cars and VINs without migrating the rest of my data model.

hackit2
0 replies
11h49m

An entity can exist over more than one row in your database, but it is useful to uniquely identify each row as the lowest common denominator.

paulddraper
7 replies
12h52m

Terrible advice.

Surrogate keys are a layer of indirection.

They don't fix all problems, but they fix some problems.

Not least of which is performance. Often natural keys are character strings, whereas surrogate keys can be fixed size integers, saving index sizes on your FKs.

mrkeen
3 replies
11h2m

surrogate keys can be fixed size integers

This launches into the other debate about PKs: using UUIDs rather than sequential keys.

sgarland
2 replies
5h58m

This is less of a debate, and more of an indicator of who has had to work with a DB at scale using UUIDv4 everywhere.

Don’t blow up your B+trees.

mrkeen
1 replies
4h1m

I guess.

Either your system is happy enough to route every new entity through "one DB at scale" so it can let your "one DB at scale" be in charge of an auto-incrementing long, or it isn't.

sgarland
0 replies
1h19m

A common method is to have a small app (which can quite easily be HA) that hands out sequential chunks to each shard, interleaving and/or gapping as necessary.
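In code, the range-handout core of such a service might look like this (a sketch; `BlockAllocator` is a hypothetical name, and persistence, HA and interleaving are elided):

```typescript
// Sketch of the range-handout logic: the central service keeps a cursor and
// gives each shard a contiguous block of ids to consume locally.
class BlockAllocator {
  private next = 1;

  constructor(private readonly blockSize: number) {}

  // Hand out the next contiguous block, e.g. [1..1000], then [1001..2000].
  allocate(): { start: number; end: number } {
    const start = this.next;
    this.next += this.blockSize;
    return { start, end: this.next - 1 };
  }
}
```

Each shard only contacts the allocator when its block runs out, so the "one DB at scale" bottleneck is amortized across thousands of inserts.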

mypalmike
2 replies
12h13m

Conversely, certain queries can be much faster by using natural keys when the FK is all that you need in the result rather than additional fields in the primary table. In this case, the primary table doesn't need to be queried at all. This doesn't generally overcome the benefits of synthetic keys, but it is an optimization sometimes put into practice.

crabmusket
1 replies
12h5m

Aren't you literally describing an index?

mypalmike
0 replies
26m

No. Maybe an example helps. You have a users table with username as natural PK. You have an access table with timestamps of users hitting a service, with FK of username. If you query the access table for a range of timestamps and just want the usernames, they're right there in the results of the access table query. If you had instead used a synthetic user_id key, the db would have to do an additional lookup in the users table via a join for each user_id to get the usernames.

CrimsonRain
6 replies
13h2m

Sorry, this is very bad advice. I just had to fight tooth and nail to make my lead turn around from this disastrous decision: using a lot of external IDs as our own row primary keys, which then get propagated to all other tables as foreign keys and what not. One day the foreign key changes or, God forbid, the formatting changes in external systems, and now we need to fix our whole database and all the code instead of a small isolated place.

Generate your own unique keys for everything; add a few more unique constraints if needed. A bit more work but never a regret.

aidos
5 replies
12h18m

Yeesh. I once made the mistake of using an external ID as a primary key. What a day it was when they were changed on me.

naasking
2 replies
6h4m

I once made the mistake of using an external ID as a primary key. What a day it was when they were changed on me.

I've kept with this advice for the most part, but I'm tempted in some cases to use the external ID when there's some guarantee of stability and universality. Like 2- and 3-letter ISO country codes.

Piskvorrr
1 replies
1h4m

Note that I'd have gone through about 5 different ISO country code changes (with some flip-flopping) just by sitting in this very same spot for a couple of decades. "Stability" in country codes, bah humbug.

naasking
0 replies
22m

Your geographical location might not be stable, but I'm referring to the ISO codes that name the country, which should be relatively stable.

spoiler
1 replies
11h40m

Can you share more about this? Wouldn't you run into the same problems if you used a surrogate pk? Without the nat/external pk and fk, you run the risk of having validity issues.

Conversely, if the ID changes, isn't the friction what you want?

I feel like optimising for unlikely edge cases instead of integrity because of a single incident is too reactionary.

Writing a query or script to update a string, even for millions of rows, isn't that big of a deal?

Obviously there _are_ cases where nat pks/fks are bad, but not all of them.

mixmastamyk
0 replies
11h5m

Instead of migrating one column, now you have that and every table with a foreign key to the original. Requiring transactions and possibly locking etc.

aswerty
5 replies
12h25m

Surrogate keys do mirror reality though. As I once read in a Terry Pratchett book: if you replace the handle of an axe and then replace the head, is it still the same axe?

For me, the answer is yes - since we imbue the axe with an identity outside of its integral parts.

That is what a surrogate key is. An identity. Which is an abstract concept that exists in the real world.

And to pile on. The top comment is bad advice! Surrogate keys provide sanity - god save you if you have to work in a database solely using natural keys.

fuzztester
2 replies
9h37m

As I once read in a Terry Pratchett book; if you replace the handle of an axe and then replace the head, is it still the same axe?

Yes and no. It is the Axe of Theseus ;)

https://en.m.wikipedia.org/wiki/Ship_of_Theseus

aswerty
0 replies
8h45m

Thanks - I think somewhere in the back of my mind I was also aware of the Ship. But Pratchett is too good. The quote:

This, milord, is my family's axe. We have owned it for almost nine hundred years, see. Of course, sometimes it needed a new blade. And sometimes it has required a new handle, new designs on the metalwork, a little refreshing of the ornamentation . . . but is this not the... axe of my family?
vintermann
0 replies
12h10m

Yes. Ultimately, "are these the same" and "are these different" are philosophical questions. Or to be more precise, teleological questions.

Because those questions have no meaning except with an "for our purposes here" added. And it's up to us to decide what we care about, if anything.

If we care about what other people want with the data though, or suspect that our own wants might not be set in stone, then we should also care to model identity independently in the system (that is, use surrogate keys).

akira2501
0 replies
8h40m

since we imbue the axe with an identity outside of it's integral parts.

Typically for the purposes of ownership. So it's really part of a hierarchical identity scheme.

polemic
3 replies
14h38m

In the wise words of Patsy, "it's only a model".

The real world is resistant to clean abstractions, and abstractions are distressingly subject to change. What made your row unique today is quite likely to become non-unique in the days/months/years to come.

Always use surrogate keys. Your future self will thank you.

hehdhdjehehegwv
2 replies
14h8m

And always, kids, write code for one person and one person only: your future self.

M95D
1 replies
11h18m

Yeah, fsck your co-workers and your replacement when you quit.

hehdhdjehehegwv
0 replies
1h24m

If you’re good to your future self you’re also good to them.

huygens6363
2 replies
13h33m

There is a model of a thing and there is a row that stores a representation of that model of a thing.

They both are things. Ignoring the last one might be tempting, but it’s not practical.

Interestingly your own way of thought is applied, but now a level deeper again.

How do you model a row? What makes it unique? A surrogate ID is the only sensible unique identifier for such a thing as there is no “natural key” that would make sense for instances of something so general as “Row”.

What you were saying amounts to “don’t model the thing holding the model”, but experience shows the thing holding the model is itself an (often unwilling) active part of systems.

Someone here gave the example of wrongly entered PKs by administrative personnel handling live customers. That's IMO a good example of why you need an extra layer on top of your Actual Model(c). I can think of more.

crabmusket
1 replies
12h7m

They both are things.

Corollary: your app is part of the real world.

TeMPOraL
0 replies
6h38m

Furthermore: bugs in your app are part of the real world, so woe to those who used keys from your app as natural/external keys in their app.

globular-toast
2 replies
11h36m

How do you deal with the Ship of Theseus/Trigger's broom? There's literally nothing that defines said object apart from its history.

zer00eyz
1 replies
10h45m

Your database starts to look like git...

Think about a jet engine. Let's say you figure out that a part is defective and will cause a failure. You want to identify every plane that has one of those parts in it. What if you find that some of them had a bad oil pump that shortened the life of some bearings. You want to know every engine that had one of those pumps, so you can replace other parts.

I don't know if they do this with jets, but there are quite a few places that take the concept much further than this.

smaudet
0 replies
7h40m

Your database starts to look like git...

No, your table archives start to look like git. Which is not a bad thing, version control on data in a database is very difficult, if not impractical to realize, I would take any history that I can get...

"Synthetic" keys are just that, keys for tables that only have meaning within the context of that data. You can of course relate your records to as many other tables as you wish, but you identify records via those keys, nothing more.

wruza
1 replies
8h3m

In short, you advise us to foresee the future, explore unknown unknowns and expect high-precision true answers from the outside. Good advice, not for this universe. You can only get false negatives in this one.

A synthetic key means “we think this exists”. There exists a contract, a medical record, a person, in the real world, in our opinion. We record these existence identities into our model by using an always-unique key. Then there’s a date, an appointment #, a name, etc. You can refer to an entity by its identity, or search by its attributes. If you use searches in place of identity references, you eventually get non-singletons and your singleton-based logic breaks.

datadrivenangel
0 replies
5h38m

This is why it's hard to be a DBA... Everyone thinks you're a Cassandra user the way other developers ignore your prophecies.

devjab
1 replies
12h59m

In terms of the Danish CPR that is mentioned here, the way we actually solved the challenge in the national architecture strategy was to define your social security number(s) as what we call an address on your person. I’m not sure why it was called an address, but it’s basically a UUID plus the associated information. Maybe it’s because it was first used to store addresses?

Anyway. Assuming your system has implemented the national strategies (which by now are approaching 24 years old), changing a Danish Social Security number won’t actually matter, because it’s just an added address on your person “object”. So basically you’ll now simply have two, and only one is marked as active. Similar to how you’ve got an array of addresses at which you have lived, but only one main address active.

It did indeed cause a lot of issues historically because it was used as a natural key… though with it being based on dates, it was never really a good candidate for a key since you couldn’t exactly make it into a number unless you had some way to deal with all the ones beginning with 0 being shorter than the rest. Anyway… it was used as a key, and it was stupid.

Anyway I both agree and disagree with you. Because we’ve successfully modelled the real world with virtual keys, but you access it by giving any number of natural keys. Like… you can find the digital “me” by giving a system my CPR number, but you can also find me by giving a system my current address. Technically you could find me by giving a name, but good luck if you’re dealing with a common name. There is a whole range of natural keys you can use to identify a “digital” person, but all of it leads from a natural key into a web of connected virtual keys.

All of it is behind some serious gatekeeping security if you’re worried. Or at least it’s supposed to be.

mrweasel
0 replies
7h1m

It's been years since I worked on systems with CPR numbers, but I seem to recall that there is also a policy in place, stating that you are not allowed to use the CPR number as a "primary key". Many companies did anyway, because they never actually bothered to read the guidelines and regulations.

All the systems I've seen always had the CPR as a lookup for a UUID, which as you say is the the "address" of the actual person object.

zer00eyz
0 replies
10h12m

Model your data on the real world. Do not depend on spherical horses.

Yes, and normalize it. https://en.wikipedia.org/wiki/Database_normalization

Adding a synthetic key only means you have to track all three. Plus you have to generate a meaningless number, which is actually a choke point in your data throughput.

This is true up to a point. You can add more data to the system to continue to generate natural, composite keys. However at some point you move from a database to an event stream, or you have to track events that aren't really needed for what you're doing...

Denormalization then takes precedence and a generated key makes sense again. https://en.wikipedia.org/wiki/Denormalization

how will the surrogate key protect you?

It isn't about protection, it's about not collecting the natural data to identify the event that caused the issue. It's denormalization by omission, in effect.

thefz
0 replies
7h14m

Been vehemently against surrogates my whole life, glad to find a kindred soul.

th3byrdm4n
0 replies
11h24m

When I integrate systems, I use that system's natural key (love it when it's a unique ID, but in the systems I work in - it almost never is).

That said, I use that natural key as the "link" to my internally managed, normalized database.

There's nothing that says I cannot add unique identifiers that would replicate the natural key. In fact, that's good design.

tadfisher
0 replies
14h28m

In that case, using this natural datum as an ID is a contradiction, because we are now removing the uniqueness constraint. It's fine to model the real world, but the real world also includes your data, and you may want to identify unique _records_ as it is not a universal truth that data can be corrected on entry.

somerandomqaguy
0 replies
9h16m

The natural key forces you to think about what makes the row unique. What identifies it. Sometimes, it makes you go back to the SME and ask them what they mean. Sometimes it makes you reconsider time: it’s unique now, but does it change over time, and does the database need to capture that? In short, what are the boundaries of the Closed World Assumption? You need to know that too, to answer any "not exists" question.

Not really, because your natural ID also has to account for the problem of garbage data and for SMEs not actually being experts. And I can give a real world example of this happening: the Canadian Long Gun registry.

For anyone that doesn't know, prior to the early-to-mid 90's Canadian gun laws only required registration of pistols. IIRC the records were not handled at a national level either, but I could be wrong. Around that time new laws were introduced that, among other things, required registration of long guns.

Most guns by then had serial numbers so all you had to do was tie a gun to a serial number with some characteristics and voila, you've got a natural identifier, right? That's what the experts say.

Well as it turns out, reality isn't quite so kind. A lot of firearms makers such as Cooey from around the 1940's to the 1970's produced guns without a serial number. In other cases the serial number was present from the factory but was damaged, or the part that had the number had been replaced without replacing the number, or had been replaced with a wrong number. In rare cases the serial number from the factory was wrong because of a mistake when the worker manually stamped in the number.

So already the idea to uniquely identify guns using some sort of simple classifier was flawed. They attempted to solve the issue of guns without serial numbers using stickers, but those stickers were cheaply made and readily fell off, and owners who were already peeved about the program didn't bother trying to correct the paperwork.

Which segues into the next problem. There was an extremely high rate of errors in registration forms being submitted. The most famous example I'm aware of was someone registering a Black and Decker soldering gun as a firearm, something he had done in protest. As humorous as it was, the revelation that a soldering gun had been classified as a firearm unveiled another fundamental problem.

The error rate was so high, and the pressure to show progress so great, that someone in leadership (I can't remember if it was the government or RCMP) directed the data entry clerks to just plug the data in as is, with no validation. Didn't matter if the data was wrong, or made no sense, or contradicted other entries already in the database. The intent being to just get all the data in as is so that they could fix it later. So all that wrong information? Got pushed straight into the database.

Like I said, this was a real world mess that occurred from 1995 until 2012, when a new government dropped the requirement for non-restricted firearms to be registered, with little fanfare and only squeaking protests.

It's not to say that you shouldn't think about a 'natural key' per se. But assuming that a 'natural key' exists requires that you or your subject matter expert is actually an expert who can identify a good enough model for one to exist.

But what happens when your SME is just plain wrong and in turn introduces fundamental flaws into your model? Or outside influences force garbage data in? How is a database designed to model only the real world supposed to cope with that?

sgarland
0 replies
5h35m

Your post brings up a critical difference I’ve noticed when working with devs (I’m a DBRE): those who actually do rigorous data modeling, and those who view it as an annoyance impeding their project.

Spend time modeling your schema. Ask yourself, “does every attribute in this table directly relate to the primary key? And is every attribute reliant upon the primary key?” Those two alone will get you most of the way through normalization.

ratherbefuddled
0 replies
8h52m

how will the surrogate key protect you? It will not.

Yes it will. Your changes will be confined to only the table(s) where the natural key is present, not spread across every table where there's a foreign key.

Of course you will still have to deal with the reality that the natural key is now not unique, and model reality, but your implementation work in doing so is far simpler.

In more years than I care to count I've regretted someone using natural keys as a primary key for a table many times, and surrogates never.

kayodelycaon
0 replies
14h41m

Of course they won’t save you from external data. The whole point is for your system to have a way to identify rows internally so you can deal with external systems getting wonky without corrupting your own data.

All of your concerns are easily solved by a unique index.

Using external keys as a forcing function to prevent people from representing data wrong is not great.

jtriangle
0 replies
13h3m

Model your data on the real world. Do not depend on spherical horses.

This took me years to realize, and once I did things became much, much simpler.

imtringued
0 replies
6h46m

Plus you have to generate a meaningless number, which is actually a choke point in your data throughput.

No it isn't. Working with natural keys in general involves using compound primary keys since it is unlikely that any lone field is suitable as primary key. Comparing an integer is quick. Comparing three string fields in a join is not.

hyperman1
0 replies
10h57m

In my experience, this won't end well. Some examples:

Belgium has the RNR/INSZ identifying each person. But it can change for a lot of reasons: It gets reused if someone dies. It encodes birth date, sex, asylum state, so if something changes (which happens about every day), you need to adapt your unique key.

Belgium also has a number identifying medical organizations. Until they ran out of numbers. Then someone decided to change the checksum algorithm, so the same number with a different checksum meant a different organization. And of course they encode things in the number and reuse them, so the number isn't stable.

An internal COBOL system had a number as unique key, and this being COBOL, they also ran out of numbers. This being COBOL, it was easier to put characters in the fixed width record than to expand it. And this being French COBOL, those characters can have accents. So everyone using the 'number' as a unique key now had to change their datatype to text and add Unicode normalization. Not fun.

In my experience: don't use anything as an ID that you didn't generate yourself. Make it an integer or UUID. Do not put any meaning in the ID. Then add a table (internal ID, registering entity, start date, end date or null, their key as text). You'll still sometimes have duplicates and external ID updates as the real world is messy, but at least you have a chance to fix them. The overhead of that 1 lookup is negligible on any scale.
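A sketch of that alias-table layout in TypeScript (all names are illustrative):

```typescript
// A meaningless internal id, plus an alias table holding every external id
// as opaque text together with its validity window.
interface ExternalIdRecord {
  internalId: number;        // our own generated key (integer or UUID)
  registeringEntity: string; // who issued the external id
  externalId: string;        // their key, kept as text
  startDate: string;         // ISO date the mapping became valid
  endDate: string | null;    // null while the mapping is current
}

// Resolve an external id to our internal one, considering only current rows.
function resolveExternalId(
  aliases: ExternalIdRecord[],
  entity: string,
  externalId: string
): number | undefined {
  return aliases.find(
    (a) =>
      a.registeringEntity === entity &&
      a.externalId === externalId &&
      a.endDate === null
  )?.internalId;
}
```

Superseded external IDs stay in the table for history, but no longer resolve as current.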

groestl
0 replies
12h13m

Adding to the list of comments damning this post to ensure none of my future colleagues follow this advice.

You think your surrogate key will save you? It will not.

It definitely will, say my 20+ years of experience.

abujazar
0 replies
6h33m

Natural keys can change, and synthetic keys never have to. That alone is reason enough to use synthetic keys.

Performance is another major argument for synthetic keys, as they can be made sequential, which is rarely the case for natural keys.

SPBS
0 replies
12h40m

You think your surrogate key will save you? It will not. The world has an external reality that needs to be reflected in your database.

Yes, it will. It is precisely because of a messy external reality that you need an unchanging internal ID that is unaffected by changes to the external ID. If the software designer in the article had followed your advice, changing the chassis number would have likely resulted in broken car ownership records.

Whoever is reading this in the future, please don't follow the parent comment's advice. Use surrogate/synthetic keys for your primary key.

RedShift1
0 replies
11h25m

Every time I used a natural key I have to come to regret it.

IlliOnato
0 replies
6h21m

Sorry, by your logic an ISSN would be a good key for a database of scientific journals. It's exactly what ISSN is invented for! Right? Right?

Been there, done that. Journals that changed their names (and identities) but not ISSN. That changed the ISSN but not the name/identity. Journal mergers which instead of obtaining a new ISSN kept one of the old ones. "Predatory journals" that "borrow" an ISSN (you may not consider them real journals, but you've got to track them anyway, even if only to keep them from being added to the "main" database). The list may go on and on.

And don't even start me on using even more natural ID, the journal name, perhaps in combination with some other pieces of data, like year the publication started, country of origin, language, etc... Any scheme based on this will need to have caveats after caveats.

(A fun fact: there were journals that ceased publication but later on "returned from the dead". Such resurrected journals are supposed to be new journals and to get a new ISSN. Sometimes this rule is followed...)

At the end, a "meaningless number" you assign yourself is the only ID that reliably works (in combination with fields representing relationships between journals).

The problem with keys that "have meaning" is that they appear to carry information about your entity. And in vast majority of cases this is correct information! So it's almost impossible to resist "extracting" this information from a key without doing actual database lookup at least mentally, and often in your software too. Hidden assumptions like this lead to bugs that are really hard to eliminate. A meaningless number on the other hand does not tempt one :-)

DrScientist
0 replies
7h28m

The natural key forces you to think about what makes the row unique. What identifies it.

When designing a table - you should always be clear about what the natural key is and make sure it has uniqueness constraint. Mindlessly having a surrogate key without thinking about what the natural key is, is an anti-pattern. So totally agree here.

That doesn't stop you also having a surrogate key though.

Another aspect of natural versus surrogate keys is joins as the key often ends up in both tables.

Using natural keys can mean in some circumstances you can avoid a join to get the information you want - as it's replicated between tables.

There is also the question of whether you surface the surrogate key in the application layer or not - some of the problems of surrogate keys can be avoided by keeping them contained within the app server logic and not part of the UI.

So via the UI - you'll search by car registration number, not surrogate key, but in terms of your database schema you don't join the tables on registration number - you use a surrogate key to make registration numbers easier to update.

strnisa
99 replies
10h48m

I've become a fan of unique, relatively short and "human-readable" IDs, such as the ones used by Stripe, e.g. `cus_MJA953cFzEuO1z` for the ID of a customer. Here's a Stripe dev article on the topic: https://dev.to/stripe/designing-apis-for-humans-object-ids-3...

If you use JavaScript/TypeScript, you can make them like this:

    import crypto from "node:crypto";
    function makeSlug(length: number): string {
        const validChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
        const randomBytes = crypto.randomBytes(length);
        let result = "";
        for (let i = 0; i < length; i++) {
            result += validChars[randomBytes[i] % validChars.length];
        }
        return result;
    }

    function makeId(tableName: string): string {
        const idSpec = TABLE_NAME_TO_ID_SPEC[tableName];
        const prefix = idSpec.prefix;
        const slugLength = idSpec.length - prefix.length - 1;
        return `${prefix}_${makeSlug(slugLength)}`;
    }

strnisa
31 replies
8h20m

Thanks for all the improvement suggestions! Taking them into account, the `makeSlug` function becomes:

    import crypto from "node:crypto";
    function makeSlug(length: number): string {
        const alphabet = "0123456789abcdefghjkmnpqrstvwxyz";
        let result = "";
        for (let i = 0; i < length; i++) {
            result += alphabet[crypto.randomInt(alphabet.length)];
        }
        return result;
    }

biztos
22 replies
7h48m

I think that 0 and 1 are likely to cause problems when customers end up reading their "user ID" back to your employees in Customer Support Country over the phone.

"It's one-three-oh-dee-ee-el. Yes, I'm sure, EL as in elephant."

kstenerud
6 replies
6h10m

I developed safe32 for this reason.

https://github.com/kstenerud/safe-encoding/blob/master/safe3...

Notably, confusable characters are interchangeable when being ingested (although a machine encoder MUST always produce canonical output). https://github.com/kstenerud/safe-encoding/blob/master/safe3...

So a user can confuse 1 for l, 0 for o, I for l, u for v, uppercase, lowercase etc, or the agent can say any of those over the phone, and it won't matter.
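A sketch of that ingest-side folding in TypeScript (the authoritative substitution table is the one in the safe32 spec; this only covers the confusions listed above):

```typescript
// Fold case and map confusable characters onto their canonical forms before
// decoding, so "1"/"l"/"I", "0"/"o"/"O" and "u"/"v" are interchangeable.
function normalizeConfusables(input: string): string {
  const canonical: Record<string, string> = {
    o: "0", // letter o read as zero
    i: "1", // i, l and 1 all collapse to "1"
    l: "1",
    u: "v", // u confused with v
  };
  return input
    .toLowerCase()
    .split("")
    .map((c) => canonical[c] ?? c)
    .join("");
}
```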

what
1 replies
2h13m

Isn’t this just crockford encoding?

kstenerud
0 replies
1h8m

Every base-32 style encoding is Crockford at the core. The difference is in the alphabet, and also whether it requires padding or not (safe32 does not).

Crockford also incorporates error correction, which is unnecessary in modern systems since the underlying protocols do that already.

vanderZwan
1 replies
5h35m

Oh that looks well-thought out and is probably the sanest way to solve this particular problem!

It's obvious why the safe64/safe80/safe85 cannot do this, but is there a reason why the safe16 version doesn't have the same features?

kstenerud
0 replies
5h22m

Oh whoops that's an oversight! I'll fix that up tonight.

kstenerud
0 replies
1h10m

You still have the problem of 1 vs l. And also it doesn't support user-error (reading 0 as O, or reading V as U).

smaudet
5 replies
7h25m

I wonder if Unicode could be used to alter the characters such that these mistakes would be less possible, e.g. using ⓪.

fhars
3 replies
6h16m

That is "AT", isn't it?

(No, it isn't, if you look closely enough.)

Piskvorrr
1 replies
1h26m

It's obviously a �. Or perhaps a □ . Maybe an ¾, on odd Wednesdays?

(Unicode has its strengths. Making up replacement characters isn't one.)

codr7
0 replies
14m

I happen to know that the biggest ski resort reservation system in Scandinavia contains a function called MaybeOnATuesday(), but to my knowledge it's never called.

smaudet
0 replies
4h40m

Yeah...what I'd really like to do would be to give a character a "natural" background color, e.g.

Then it's simple for support to say "red is one, green is ell". But you can't just add a color to a character, because copy paste/rich formatting don't work everywhere, or even transfer well...

Alternatively, if you use ⓪ and ⒈ it matters less if the user says "at" or "zero", and more that they didn't say "oh" or "one".

WorldMaker
0 replies
49m

If you are trying for URL safe, Unicode is problematic because of Punycode conversions and differing browser behavior with Unicode URLs. (Some browsers always show Unicode as Unicode in URLs. Some browsers always show Unicode as Punycode in URLs. Some browsers switch between the two based on a huge number of variables such as gTLD, user preference, phase of the moon, etc.)

gvx
4 replies
7h11m

Anything that needs to be read over the phone should probably be written out using something like the NATO phonetic alphabet, split into smaller chunks if needed: "The code? It's kilo eight niner; one three mike; delta echo lima."

8372049
3 replies
7h2m

Having come from a military background where using that is second nature, I'm constantly surprised how rarely I meet civilians who understand it effortlessly. When picking up a package I say "the code is Oscar Foxtrot three-fife" and you see the person processing for a long time to extract the first letter of the word. I've started saying "OF, that's Oscar Foxtrot, 3-5" to help them out.

In other words, asking a customer/consumer to be able to recite something in phonetics is not realistic in most cases.

Fortunately the code already takes this into consideration and removes ambiguous characters.

hprotagonist
1 replies
6h47m

digikey phone personnel all speak NATO. it’s wonderful.

hobs
0 replies
6h4m

First thing I drilled into Apple phone support folks.

mannykannot
0 replies
5h44m

My experience in the USA is that if I don't include the phrase "as in" (as in "X as in Xray") most people still will not realize what I am doing (the alternative "for" can be confused with the digit.)

I also ask them to check my readback of key information they have given me and vice-versa; usually that works well.

jgalt212
0 replies
6h53m

Or just drop the L and O, like CUSIP does.

jagged-chisel
0 replies
7h29m

One can convert those to the correct characters. Since “oh” and “el” are excluded from the alphabetic range, they become “zero” and “one” - deciding whether that is done in software for the help desk, or in the brain of the help desk staff is left as an exercise to management.

inopinatus
0 replies
5h58m

My favourite ambiguous readback is to say "M for Movember".

I have a note from a few years ago that 367CDFGHJKMNPRTWX may be a sufficiently unambiguous alphabet. Drop the one you like the least (probably N) to obtain a faux hex encoding.

8372049
0 replies
7h8m

i, L and o are left out of the alphabet in the snippet, so there's not really any ambiguity.

fnordsensei
2 replies
6h42m

One thing I’ve been looking for in an ID generator is a way to supply a blocklist. There are a number of character combinations I’d like to avoid in IDs, because they might be offensive or get stuck in filters when copy-pasted (e.g. in a URI).

This can be solved in user space by regenerating if the character sequences are detected, but this a) skews the distribution, and b) potentially takes time, especially when the ID generator is made to not be “too fast”. I want to generate a single ID that passes the blocklist in a timeframe that is not too fast, if that makes sense.

Is there an ID generator that takes this into consideration?

ceejayoz
1 replies
6h9m

Just take out the vowels and numbers that can look like vowels. Nixing 0 means no b00bs IDs, and avoids 0/O; I usually take out 1/I as well.

fnordsensei
0 replies
37m

That goes some of the way, but I can think of a few problematic sequences that are only consonants and/or numbers.

ceejayoz
2 replies
6h10m

I like to nix vowels and things that look like them, i.e. 0, to avoid random b00bs sort of tokens.

travisgriggs
0 replies
5h20m

Let’s not get sexist here! It also avoids random d1ck and c0ck sorts of tokens.

madcaptenor
0 replies
25m

Sure, but I can see people still getting offended if "fck" showed up.

inopinatus
0 replies
6h1m

Having a zero in your alphabet can be problematic, because leading zeros are often stripped (e.g. Excel notoriously mangles phone numbers thinking they are integers).

Multiple calls to a randomness generator can be expensive, and a waste of entropy; production-scale random string generators should still respect this and ask for a block of bytes, then encode them, but with bias correction. You're off the hook in this case, I think node's implementation of randomInt is doing exactly that for you and conserving remaining entropy in a cache.

cogman10
16 replies
6h29m

A problem with this approach is that it's not monotonic.

Especially if you want to use this as a database index key, you'll run into problems when you frequently insert into the middle of the index, which causes fragmentation.

The solution to this problem is to make the higher-order characters time-sorted [1]. You don't need to go all out like UUID; you can use a pretty low resolution. It's more important that new insertions tend to land on the same page. If you have low-frequency insertions then minute resolution is probably good enough. (Minutes since 2000 is an easy calculation.)

To implement that here, I'd suggest looking at how base64 or 85 encoders work and use that instead of repeated mods. You can then dedicate the upper bits to a time component and the lower bits can remain random. [2]

[1] https://vladmihalcea.com/uuid-database-primary-key/

[2] https://github.com/mklemm/base-n-codec-java/blob/master/src/...

reissbaker
8 replies
5h18m

IMO it's nice to have two keys:

1. An auto-incremented 64-bit (unless you have a good reason, in which case 32-bit is fine) primary key, used internally for foreign key relations. This will generally result in less index bloat on associated tables, and fast initial inserts.

2. A public-facing random string ID. Don't use this internally (other than in an index on the table it's defined for), since it's large. But this should be the only key you expose to end-users, to prevent leaking data via the German Tank Problem: https://en.wikipedia.org/wiki/German_tank_problem

Only create the second key if this is data you're exposing to users, of course — for data that's only used internally, just use the 64-bit auto-incremented PK and skip the added index bloat entirely.

rav
2 replies
1h18m

Instead of a random string ID, you can devise a fixed secret key and expose, as the public-facing ID, the auto-incremented ID XORed with that secret. This saves you the separate index but still avoids the German tank problem. It gives you a new problem, though: a secret that's hard or impossible to rotate.
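A sketch of that scheme (the secret value and function names here are made up; and note, as a sibling comment says, XOR is obfuscation rather than real encryption):

```javascript
// Hypothetical fixed secret; in reality this must never leave the server.
const SECRET = 0x1f3a9c4e2b7dn;

// Public-facing ID: internal auto-increment XORed with the secret.
function toPublic(id) {
  return (BigInt(id) ^ SECRET).toString(36);
}

function fromPublic(pub) {
  // BigInt has no base-36 parser, so decode digit by digit.
  let n = 0n;
  for (const c of pub) n = n * 36n + BigInt(parseInt(c, 36));
  return Number(n ^ SECRET);
}
```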

senderista
0 replies
4m

XOR isn't secure enough, but you're on the right track. Instead, use an actual block cipher.

cogman10
2 replies
3h15m

For number 2, one issue is that you'll be semi-frequently hitting the db to map that random string id back to the real id. OK for smaller entities, but it might be a pain if there are a lot of those ids to wrangle. You can throw a secondary index on it, but that will still have some minor fragmentation issues.

One benefit of a random id is if you are working with more complex data models it can make creating those easier/faster. Instead of having a centralized location to get new ids from (the DB) you can create ids on the fly from the application which can turn the write into a single action from the application rather than a dance of inserting the main table, getting the new id, then inserting to the normalized tables.

caeril
1 replies
1h51m

That's why the Good Lord invented caching. In most applications, 90% of your workload will be over ids less than a week old, so your hit rate is likely to be pretty high for this sort of mapping.

cogman10
0 replies
1h37m

First hit can be a beast. Whether caching is enough depends on the workload and the entity.

Not great if you are spending 1 minute on the first lookup just to do the mapping.

senderista
0 replies
6m

I don't understand why you need to maintain two separate keys: instead of generating a random key, why not just encrypt the auto-increment key using a secret key? This is the approach used by e.g. cloud providers that use auto-increment keys internally but don't want them to be guessable.

WorldMaker
0 replies
53m

A different approach to solve both 1 and 2 is timestamp-oriented IDs. You can get useful cache locality/less "index bloat"/fast initial inserts if your keys can be easily ordered in time. Sorted by timestamp means very similar behavior to B-Tree appends of a monotonic integer, even sometimes in the worst cases where "same moment" IDs aren't monotonic and rely more on random entropy.

I got some great DB cache/index performance from ULIDs with a bit of work to order the ULID timestamp bits in the way the DB's 128-bit column "uuid" sort best supported.

Now that UUIDv7 is standardized we should hopefully see good out-of-the-box collation for UUIDv7 in databases sooner rather than later.

groestl
3 replies
2h41m

A problem with this approach is it's not monotonic

Whether or not that's bad depends entirely on your platform and the number of writes you do. If you're using a massively distributed database like Datastore, Spanner, etc., you want random keys so as to avoid write hot spots, which produce contention.

cogman10
2 replies
1h52m

Well, you'd still likely want pseudo-random keys. You'd rather not have the underlying database doing extra work to shuffle records around as the pages get jumbled.

One solution to that is having more complex keys. For example, in one of our more contentious tables the index includes an account id (32bit int) and then the id of the entity being inserted. This causes inserts for a given account to still be contiguous (resulting in less fragmentation) while not creating a writing hotspot since those writes are distributed across various clients.

groestl
1 replies
1h27m

Not disagreeing. Point is, you need to know your domain, your technology, your write patterns, your downstream systems, etc to decide if a specific key scheme works to your advantage or not. All the more reason not to use natural keys, as they lock you in in that regard.

cogman10
0 replies
1h11m

Absolutely agree.

I don't know how you can successfully maintain or develop software without developing an understanding of the underlying domain. I've seen devs try that route and the quality of their work has never been high.

jordanthoms
1 replies
2h42m

This is dependent on the database you are using - if it's a key-sharded distributed database, you want to have insertions evenly spread across the key space in order to avoid having all the inserts go into a single shard (which could overload it)

cogman10
0 replies
2h4m

This is the great thing about using random bits for the lower bits. Because you are unlikely to use more than say 2^64 database nodes, any sharding algorithm will have to figure out how to spread a key with 64 bits (or however many bits are in your key) across n nodes.

Because of the random portion of the key, that means you'll get good distribution so long as the distribution algorithm isn't something stupid like relying solely on the highest order bits.

echelon
0 replies
4h46m

This is a great technical modification that can be made to work with "Stripe"-alike IDs or tokens.

Another hack for advanced active-active situations where you may need to route events before replication completes: encoding the author shard / region in the lower order bytes.

There are lots of interesting primary key hacks for dealing with physical or algorithmic complications.

garblegarble
13 replies
9h16m

This approach will randomly generate profanity. If the ID is visible to users it can cause some to get upset (and best-case simply looks unprofessional). On a purely technical level if visible in URLs it can cause links to be blocked/altered by e-mail filters / filtering proxies.

It's generally a good idea to drop vowels for this reason.

Knufferlbert
6 replies
8h30m

We once had a rather angry Irish customer calling our support complaining that we called him a pikey (slur for gypsy). After some back and forth it turns out we just gave him an apikey.

We never had a similar issue with our random numbers/letters/reset passwords or anything like that which don't have any kind of "dont return profanity" protections. Though I agree, someone getting a randomly generated customer portal url or something containing fuck or similar would look bad. Our cloudfront or something (or was it main public facing s3 bucket? can't remember) starts with "gay" and was never picked up on.

isaacremuant
3 replies
6h17m

I'd say those articles show Wikipedia's political bias and a tendency to be overly politically correct instead of portraying reality.

Many in real life do refer to themselves as gitano or Gypsy and would ask for others to refer to them as such.

Of course it's very easy to find an article saying otherwise and then using that as the end of discussion for Wikipedia editors.

coooolbear
2 replies
2h26m

By ‘find an article’ you mean find ~10 real citations including the resolution of an authority a long time ago and to tell the reader it is not clear or definitive?

Better to be careful and let any individuals or communities tell you what they want. I have Roma connections in my family, and at one point the word we'd use was 'gypsy'. But because I'm not Roma myself, if I came across some other group I wouldn't assume I'm allowed to just say it to them.

vsuperpower2020
1 replies
2h14m

I don't care what they want. A lot of people are tired of playing these language games.

bigstrat2003
0 replies
42m

Also, "do what people want" is fine for your interactions with an individual. But it's not a viable general rule for language, where we need one single approach. I think saying gypsy unless someone personally tells you they would rather you don't call them a gypsy is perfectly reasonable.

mst
0 replies
6h4m

It's in the annoying category where it can be used as a slur but also gets used as not-a-slur, including but not limited to by the people it describes.

Locally (north west england) people generally use "traveller" as a description ... but there are definitely people who use that as a slur.

Language be like that sometimes.

ThePowerOfFuet
3 replies
8h29m

I hear Scunthorpe is lovely this time of year.

brazzy
2 replies
6h47m

I hear someone got buttbuttinated there recently.

mst
0 replies
6h3m

A problem we've been having since the medireview age.

bigstrat2003
0 replies
44m

At first I thought that this was the result of my cloud-to-butt extension, and I was trying to figure out what "cloudcloudinated" meant.

fregonics
0 replies
7h59m

True! For instance the example itself. "cus" is profanity in Portuguese. If you were to localize the application, this would be a factor.

account42
0 replies
8h55m

fvck cvnt b1tch

mettamage
8 replies
9h0m

Why is such a thing called a slug?

corobo
6 replies
7h42m

Ooh I had this wonder a while back and jotted it down just in case anyone else ever wondered about it:

Comes from the ye olde paper-based blogs they call newspapers. When an article is being put together it’s given a short name, sort of like a project name. This name would remain the same throughout the article’s life - from reporter through to editor - it left its trail through the process. Like a slug.

meindnoch
5 replies
7h23m

That's not what Wikipedia says about its etymology though.

The origin of the term slug derives from the days of hot-metal printing, when printers set type by hand in a small form called a stick. Later huge Linotype machines turned molten lead into casts of letters, lines, sentences and paragraphs. A line of lead in both eras was known as a slug.

corobo
4 replies
7h14m

Ooh interesting. It looks like I have either misinterpreted or found a source that misinterpreted (it was a few years back, unsure if I came to the conclusion or found it). I'll have to update my notes, cheers!

Apologies for the wetbrain hallucination, HN!

8372049
3 replies
6h54m

I'm not sure you're hallucinating. The dictionary I checked lists the printing and journalism terms separately. It's quite possible they have diverging etymologies, meaning both can be correct:

5. Print. a. a thick strip of type metal less than type-high. b. such a strip containing a type-high number or other character for temporary use. c. a line of type in one piece, as produced by a Linotype. 8. Journalism. a. a short phrase or title used to indicate the story content of a piece of copy. b. the line of type carrying this information.

hprotagonist
2 replies
6h40m

or the journalism term itself diverged from the typographic one.

corobo
1 replies
6h36m

Aye, this is what it seems to be, having double-checked the reply's claim

Got to the Wikipedia page https://en.wikipedia.org/wiki/Slug_(publishing) which could possibly support the slimy conclusion of "it's a trail through the process" but that article has an etymology section that refers to the metal slug

I guess it could mean both depending on whether you're looking for the meaning of the word or the meaning of the concept but I didn't find any other slimy grub references (via an admittedly limited double check)

hprotagonist
0 replies
5h22m

another fun etymological rabbit hole for you: stereotype and cliché both probably originated as typographer jargon.

vincentdm
4 replies
9h38m

I don't like the prefix idea: besides the duplication of information, it also becomes a liability if you ever rename things.

Imagine you prefix all customer IDs with `cus_`, but at some point decide to rename Customer to Organization in your codebase (e.g. because it turns out some of the entities you are storing are not actually customers). Now you have some legacy prefix that cannot be changed, is permanently out of sync with the code and will confuse every new developer.

wongarsu
0 replies
8h32m

Reddit prefixes their IDs with t1_ for comments, t2_ for accounts, etc. That sidesteps the renaming issue.

Though I believe they mostly do it because their IDs are sequential, so without a prefix you wouldn't easily notice if you used the wrong kind of id. They also only apply prefixes at the API boundary, when they base36-encode the IDs; the database stores integers

shawabawa3
0 replies
9h2m

Now you have some legacy prefix that cannot be changed

Yes you can.

You can support the old cus_<ID> prefix as well as the new org_<ID> prefix, but always return org_<ID> from now on

codeulike
0 replies
9h17m

I wouldn't worry about that - I still think it's worth it. I've had systems where during development we thought the Contact page was going to be called 'Contact' in the UI, but at the end it got re-labelled to 'Individual'; in all the code it was still called Contact and the IDs all started with a C. But you know what? It was still useful to look at an ID, see the C, and know that it was an Individual.

chuckadams
0 replies
4h0m

I have that exact issue with a couple different identifiers, and it's not a big deal. Usually it goes along with some data model change you already have to write compatibility code for, the new and old names tend to be related, and the old name tends to stick around other parts of the code anyway. Opaque IDs don't reduce the confusion there, documentation in appropriate places does.

vanderZwan
3 replies
10h32m

Nice!

For the record: the valid chars string is 62 characters, so naively using a modulo on a random byte will technically introduce a bias (since dividing 256 values by 62 leaves a remainder). I don't expect it to really matter here, but since you're putting in the effort of using crypto.randomBytes I figured you might appreciate the nitpick ;).

Melissa E. O'Neill has a nice article explaining what the problem is, and includes a large number of ways to remove the bias as well:

https://www.pcg-random.org/posts/bounded-rands.html

(in this case adding two more characters to the validChars string would be the easiest and most efficient fix, but I'm not sure if that is a possibility here)

naasking
2 replies
6h39m

For the record: the valid chars string is 62 characters, so naively using a modulo on a random byte will technically introduce a bias

Indeed, there's no reason you couldn't just add "_" and "-" or "." as well to complete the set. Your identifier will still be URL-safe. I've been using this type of encoding for years [1] for these kinds of ids to use in URLs, and encoding/decoding is super-fast with some bit shifts. And unlike Base64, you don't need padding characters.

[1] https://sourceforge.net/p/sasa/code/ci/default/tree/Sasa.Web...

notfed
1 replies
1h54m

- and _ tend to break text selection.

naasking
0 replies
20m

I'm not sure what you mean by "break". If you mean that touching or double-clicking on a block of text only extends up to the nearest symbols, that's true. But if your text selection UX is not terrible then it should be simple to extend that further.

That said, iOS and Android text selection have gotten worse recently, IMO.

psychoslave
2 replies
3h28m

To my mind, it always felt saddening that adoption of a truly straightforwardly readable notation for numbers never took off. I mean, it's so easy to do. You can start, for example, with a single syllable per digit, targeting only CV syllables.

From this there are many possibilities, but for example, let's consider only base ten. Start with the vowels in the order o, i, e, a, u (mnemonic: o and i are graphically close to 0 and 1; then cyclically continue the reverse alphabetical order: <-a, <-e, <-i, <-o*, |-u). We now only need two consonants for the two series of 5 digits in our base ten, let's say k and n.

So in a quick and dirty ruby implementation that could be something like:

    $digits = %w{k n}.product(%w{o i e a u}).map{it.join('')}
    def euphonize(number) = number.to_s.split('').map{$digits[it.to_i]}.join('-')
    euphonize(1234567890) # => "ki-ke-ka-ku-no-ni-ne-na-nu-ko"

That’s just one simple example, of course; there are plenty of other options in the same vein. It's easy to create "syllabo-digit" sets for larger bases by just adding more consonants, or go with some CVC or even up to C₀C₁VC₀C₁ if the sets for C₀ and C₁ are carefully picked.

psychoslave
0 replies
3h20m

And of course a naive implementation of the reverse is also trivial:

   def numerize(euphonism) = euphonism.split(?-).map{$digits.find_index(it)}.map{it.to_s}.join.to_i

jmcphers
0 replies
2h8m

There is an old system for making numbers pronounceable as words, using a mapping from each digit to a consonant sound. It's typically used to help memorize numbers:

https://en.wikipedia.org/wiki/Mnemonic_major_system

inopinatus
2 replies
8h41m

I'm not convinced that all Stripe IDs are wholly random strings. I just decoded the base62 part of four genuine "acct_" objects, which at 16 characters are just shy of representing a 12 byte value (log2 62^16 =~ 95.3), and they all have a leading byte of "00000011" and two of them even have a leading 32 bits that is very suspiciously close to an epoch timestamp of a couple of years ago.

There is a similarly suspicious pattern in some of their longer identifiers, invoices for example at 24 characters (ca.143 bits) all seem to have a first byte of "00000010".

Even in the article you've linked to, look closely at the IDs:

      pi_3LKQhvGUcADgqoEM3bh6pslE
      pm_1LaXpKGUcADgqoEMl0Cx0Ygg
     cus_1KrJdMGUcADgqoEM
    card_1LaRQ7GUcADgqoEMV11wEUxU
Notice the consistently leading low-integer values (a 3, then three 1s)? and how it's almost always followed by a K or an L? That isn't random. In typical base62 encoding of an octet string, that means the first five or six bits are zero and the next few bits have integer adjacency as well. It also looks like part of the customer ID (substring here, "GUcADgqoEM", which is close to a 64-bit value) is embedded inside all of the other IDs and then followed by 8 base62 characters, which might correspond to 48 bits of actual randomness (this is still plenty, of course).

Based on these values it seems there's a metadata preamble in the upper bits of the supposedly "random" value, and it's quite possible that some have an embedded timestamp, possibly timeshifted, and a customer reference, as well as a random part, and who knows maybe there's a check digit as well.

It's possible - albeit this is not analytical but more of a guess - that the customer ID includes an epoch-ish timestamp followed by randomness or (worst case, left field) is actually a sequence ID that's been encrypted with a 64-bit block cipher and 32 bits of timestamp as the salt, or something similar (pro tip: don't try that at home).

My view is that either Stripe's engineering blog is being disingenuous about their ID format, or they're using a really broken random value generator. If the latter, I hope it's only in scope of their test/example data.
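If you want to reproduce this kind of check yourself, decoding the base62 payload is only a few lines (a sketch; it assumes the conventional 0-9A-Za-z alphabet, which may not be the one Stripe actually uses):

```javascript
const B62 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

// Decode a base62 string into a BigInt so the leading bits can be inspected,
// e.g. base62ToBigInt('3LKQhvGUcADgqoEM3bh6pslE').toString(2).
function base62ToBigInt(s) {
  let n = 0n;
  for (const c of s) {
    const v = B62.indexOf(c);
    if (v < 0) throw new Error('not base62: ' + c);
    n = n * 62n + BigInt(v);
  }
  return n;
}
```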

dividuum
0 replies
8h13m

I took a look at a bunch of stripe customer ids I have stored and at least mine look very random on first glance. I assume their blog post uses demo keys or something similar.

giancarlostoro
2 replies
5h52m

If I'm going to do that, I think I'd use Bitcoin's Base58, which avoids letters that could be confused for each other. The number of times I see an O and a 0 and wonder which is which, because the font does not make it clear, really annoys me.

Edit:

Other honorable mentions: ObjectIDs, as used by MongoDB, which contain the creation timestamp. Also Discord's snowflakes (inspired by Twitter's, iirc), which also contain the creation timestamp.

kibwen
1 replies
5h8m

If you want random IDs to be human-readable (and human-communicatable), I'd just recommend base 32 or even base16. You don't actually save that many bytes from base58 or base64 when it comes to short IDs.

Case in point, the parent poster's base64 ID is 14 characters long. When encoded as base32 that's still only 17 characters (or 19 in base16), and now you have completely gotten rid of all notion of casing, which is annoying to communicate verbally.

giancarlostoro
0 replies
4h38m

I think the more I think of it, the more I favor Discord's snowflake IDs, because they're just integers and they can be generated on the fly. I think new messages generated them on the client, if I'm not mistaken.

phkahler
1 replies
4h43m

Something like that should have a built in check "digit" if people are going to see it and possibly type it in.

For numeric values, making them all a multiple of 11 is a simple way to catch all single digit errors or single transpositions.
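A sketch of that scheme (the function names are made up): since 10 ≡ -1 (mod 11), each digit position contributes ±1 times its digit, so any single-digit change or adjacent transposition alters the value mod 11 and fails the check.

```javascript
// Issue only IDs that are multiples of 11.
function issue(seq) {
  return seq * 11;
}

// A single wrong digit or an adjacent transposition changes the
// value mod 11, so this validation catches both error classes.
function isValid(id) {
  return id % 11 === 0;
}
```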

chuckadams
0 replies
4h7m

That's why VINs make a decent natural key, because they do have a check digit. Plus they're not opaque: if you look up the VIN and the make/model/year is completely different than the car in front of you, you know you either have the wrong VIN or the wrong car.

codeulike
1 replies
9h19m

Except the Stripe ones are case sensitive which can be annoying with some databases

datavirtue
0 replies
4h26m

You know what's annoying? People storing GUIDs in a case-insensitive database.

whyever
0 replies
10h8m

Please don't use % to generate integers from a range, it's not uniform, which can be disastrous if you rely on your numbers not being predictable. You can use crypto.randomInt instead.

scrollaway
0 replies
6h56m

I don’t recommend using random bytes for this. What I have done in previous projects is take a UUIDv6, reserve a couple of bytes inside it for an object type ID, and convert that to obj_XXXXXXX.

This means you can store them in the db not as a string but as a uuid, which is a lot more performant. You also get time stamping for free.

cimnine
0 replies
5h18m

Yes, and I like to combine two established concepts instead of rolling my own: URI and UUIDv7. So my IDs become `uri:customer_shortname:product_or_project_name:entity_type:uuid`. An example ID could be `uri:cust:super_duper_erp:invoice:018fe87b-b1fc-7b6f-a09c-74b9ef7f4196`. It's even possible to cascade such IDs, for example: `uri:cust:super_duper_erp:invoice:018fe87b-b1fc-7b6f-a09c-74b9ef7f4196:line_item:018fe882-43b2-77bb-8050-a1139303bb65`.

It's immediately clear, when I see an ID in a log somewhere or when a customer sends me an ID to debug something, to which customer, system and entity such an ID belongs.

UUIDv7 is monotonic, so it's nice for the database. Those IDs are not as 'human-readable' for the average Joe, but for me as an engineer it's bliss.

Often I also encode IDs I retrieve from external systems this way: `uri:3rd_party_vendor:system_name:entity_type:external_id` (e.g. `uri:ycombinator:hackernews:item:40580549:comment:40582365` might refer to this comment).

bambax
0 replies
8h47m

Nice, but better to remove ambiguous / hard to read letters: ijlo, IJLO, and 01 (and maybe 7 as well?)

Kon-Peki
35 replies
15h0m

how about a personal identification number? In Denmark we have the CPR number, and I understand that the US Social Security Number is vaguely analogous.

The US SSN is not guaranteed to be unique, the SSN assigned to a person could change, there is no guarantee that a person with an SSN assigned to them is a US citizen, and there is no guarantee that a US citizen has an SSN - they must be requested, and you don’t need one unless you do something that requires having one. There are also things called ITINs and ATINs that look like an SSN but are not, yet can be used in place of an SSN in a huge range of SSN-required situations!

(Please don’t use the SSN as a database key!)

delusional
9 replies
11h41m

We have the same problem in Denmark, most people just don't realize it. At my dayjob we get at least one person every year who changes gender and consequently gets a new SSN (the final digit is supposed to signify gender). Most people don't store SSNs so they never realize, but it does happen fairly frequently.

winternewt
7 replies
10h41m

If Denmark is anything like Sweden there's also:

- SSN from different ID space gets assigned to immigrants; when they become citizens they are assigned a new permanent SSN.

- SSNs have a long and a short form; the short form, which cuts off century information, can be the same for someone who is 5 years old and someone who is 105 years old.

- When an unconscious patient comes into the E.R. you don't know their SSN, so a temporary one is assigned for use in patient records. Such temporary SSNs are not coordinated nation-wide, so multiple patients may have the same one. In some hospitals they don't even have a local standard for these IDs; the staff just make something up on the spot. It happens that the SSN they made up collides with a valid SSN for another person.

kassner
4 replies
7h22m

Also: the personnummer carries a date that usually is the birth date, but there are cases where it isn't, and I've seen a few systems that just assume it's the same.

SSN from different ID space gets assigned to immigrants; when they become citizens they are assigned a new permanent SSN

Having gone through that, my personnummer didn't change. Maybe that doesn't happen anymore?

tapland
3 replies
5h51m

It does, but if you already have a Personnummer or get a residence permit right away so that you are eligible for one you don't get the temporary Samordningsnummer.

kalleboo
2 replies
5h30m

And the samordningsnummer isn't only for immigrants - it's also for Swedes born abroad who have never been folkbokförd.

tapland
1 replies
5h9m

They're still an immigrant. The numbers are for residents (at some point in time), and citizenship isn't reflected by them.

kalleboo
0 replies
2h37m

Interesting, I've never heard of anyone referring to a citizen as an immigrant, but looking up the legal definition, they certainly are. TIL.

mrweasel
0 replies
6h45m

Immigrants and refugees will get a "replacement CPR number", which is outside the normal space/range of CPR numbers. Once registered as living in Denmark, they'll get a real CPR number.

Danish CPR numbers don't have a long or short form; you use the seventh digit to do a table lookup to see whether the person was born in 18XY, 19XY or 20XY. The date of birth is always ddmmyy; there is no long form. So if the seventh digit is 9 and digits 5-6 are between 00 and 36, you were born between 2000 and 2036; if digits 5-6 are 37 to 99, you were born in the 1900s. But you need the published table to figure that out.

Last point: there is a backup system for unconscious patients, but it should be the same across all medical records, as these are somewhat standardized.

delusional
0 replies
5h56m

- SSN:s have a long and a short form; the short form which cuts off century information can be the same for someone who is 5 years old and someone who is 105 years old.

We don't do that. Instead, we shove that extra bit of information into the digit following the last two digits, through a table: https://da.wikipedia.org/wiki/CPR-nummer#Under_eller_over_10...

rsynnott
0 replies
6h41m

In Ireland, until the 90s, if a woman got married, she gave up her PPS number (essentially a social security number) and took her husband's, with a 'W' appended (this fact tells you a lot about pre-90s Ireland...) If she subsequently divorced or the husband died, she got a _new_ PPS number.

While this was abolished in the 90s, _some people still have these 'W' numbers_, ticking timebombs for anyone relying on them as a key.

mikedelfino
8 replies
6h48m

How to uniquely identify an American citizen?

Longhanks
2 replies
6h21m

How about not doing that?

mikedelfino
1 replies
5h40m

Then I guess it could lead to duplicate records. Now, whether this is a problem or not depends on the business and how much one cares about data integrity. I work with higher education, so uniquely identifying students and keeping all their records organized is somewhat important. Granted, I don't work in the US, and here we have a unique, national number. So this is covered, except for foreign students. I was curious how this problem is solved in other countries.

Kon-Peki
0 replies
5h2m

The US government doesn't assume that it can uniquely identify a citizen with 100% certainty. When things get tricky, we rely on the judicial system to weigh the evidence and make a decision. Which could later get changed.

If you want to design your system to assume that you can do it, that's your problem. Literally, it is now your problem, and nobody is going to step in to help.

vharuck
0 replies
5h55m

The population health database I work with uses generated IDs for people and relies on health systems to link records before submission (they put a lot of effort into data used for billing). But we do check for problems by looking at duplicates of SSN with date of birth. We don't consider names, because they could be transcribed differently. Even though the SSN+DOB pair should be unique, there have been cases where a widow provides her deceased husband's SSN. Likely because she used it for Social Security benefits and forgot her own long ago.

polygotdomain
0 replies
5h25m

To give a slightly more technical answer, at a large insurance company I used to work for, the legal department had provided definitions on what conditions we would consider various components of PII a match. So between SSN, DoB, First Name, Last Name, and a couple others, there were potential combinations that our system would say, "yes this is the same person". Note that we didn't necessarily need exact matches on things like names, but "close enough" matches were sometimes sufficient.

j16sdiz
0 replies
6h17m

You can't.

That's why everybody uses the SSN.

hprotagonist
0 replies
6h37m

ha ha only serious; ask them for their papers and if they say “who the fuck are you?”, they’re americans!

a slightly more serious answer: you largely don’t. The US doesn’t have a national ID, proof of birth is not even close to standardized, etc.

gwbas1c
0 replies
5h16m

You can get kind-of close with driver's license numbers. They will change when US citizens move between states (and we do that quite frequently).

(People who don't drive have to get state-issued IDs that have the same kind of number. At least, they have to if they want to do anything that requires proof of identification.)

selcuka
6 replies
14h22m

The US SSN is not guaranteed to be unique

The cases you listed do not mean SSNs are not unique, unless there are people who share the same SSN. You can still define a unique index on the SSN column. A column can be both nullable and unique, since each NULL is treated as distinct in SQL.
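A minimal demonstration of the nullable-plus-unique behavior, using SQLite with made-up table names. Note this matches SQLite/PostgreSQL/MySQL semantics; some engines (e.g. SQL Server's unique indexes) treat NULLs as equal, so the behavior is engine-dependent:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, ssn TEXT UNIQUE)")

# Two rows with NULL SSNs coexist under the UNIQUE constraint.
con.execute("INSERT INTO person (ssn) VALUES (NULL)")
con.execute("INSERT INTO person (ssn) VALUES (NULL)")

# Duplicate non-NULL values are rejected.
con.execute("INSERT INTO person (ssn) VALUES ('123-45-6789')")
try:
    con.execute("INSERT INTO person (ssn) VALUES ('123-45-6789')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

assert duplicate_allowed is False
assert con.execute("SELECT COUNT(*) FROM person WHERE ssn IS NULL").fetchone()[0] == 2
```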

pseudosavant
0 replies
12h56m

Funny enough, I used to work at IDA.

I’ll add that there is a huge difference between the SSN database that the Social Security Administration maintains, and the list of SSNs that have been associated with a person. Especially because it is very common to change a single digit of your SSN when performing credit fraud - because they’ve already burned their real one. Some people will have dozens of SSNs attached to them.

IDA was very good at determining who a person is through the graphs that represent identities in our world (names, DOBs, phone numbers, addresses, SSNs, etc.)

Kon-Peki
1 replies
14h18m

EDIT - SSA claims that numbers are not recycled. But there are known cases where the same number has been assigned to multiple people.

Note that in less than 100 years, more than half of all possible SSNs have already been used…

bloopernova
0 replies
5h45m

So we need SSNv6? ;)

cmiles74
0 replies
5h57m

People also type them in wrong, so there's another thing to consider: where is the data coming from?

ars
0 replies
14h18m

unless there are people who share the same SSN

That happens all the time. First of all, people steal SSNs and use them (and you are not the police, so it's not your responsibility to do anything about that). Second, people make up fake SSNs because they don't want to give you their SSN.

People also make typos, and you can end up with the same SSN.

An SSN is not unique in the real world.

otherme123
6 replies
12h40m

In Spain we have the DNI number, which a lot of people assume is unique, even database designers who use it as a natural key.

Turns out the DNI can have, and actually has, a lot of duplicates. The police have a page explaining it (https://citapreviadnipasaporte.es/dni/dni-duplicados-espana/), and how it's not a primary key in their databases, but a number entered manually from a pool of possible numbers. And number re-use is a possibility. They estimate the number of duplicates at 200,000 for a population of 50,000,000.

The point is that if you assume DNIs are unique and use them as a PK, your database is exposed to the bad design of the DNI database. There are some stores that use the DNI as the "unique" identifier for loyalty cards.

prmoustache
3 replies
9h54m

I've seen banks or insurers use DNI as user login.

benhurmarcel
0 replies
8h36m

All of them do I think. But from experience, they can change it for your account.

SoftTalker
0 replies
1h23m

The user login (hopefully) is not the internal primary key for the user. It should be unique at a given point in time, obviously, but certainly there are reasons it might need to change.

Piskvorrr
0 replies
1h6m

And I've been bitten by this design numerous times. The most common bug: literally the birthday paradox, i.e. "these two people have been assigned the same number, and now we need to distinguish them in our database".

steinsgatezero
0 replies
6h2m

I remember having to deal with this and other identifiers like NIF and NIE while working on the academic titles homologation platform for the MCIN. I didn't understand why it wasn't used as, not necessarily PK, but an identifier logging in. Thanks for letting me know that the hellish time spent integrating Cl@ve wasn't in vain.

benhurmarcel
0 replies
8h35m

Also your DNI can change. Typically foreigners get a NIE (used for the same thing but a different format), and get a new DNI if they ever get Spanish nationality.

pradn
0 replies
6h28m

Pernicious assumptions!

Google Cloud projects have three attributes: user-friendly names, system numbers, and system names. System names are alphanumeric. They can be chosen by the user, or derived from the friendly name if there's no collision.

But! There's some system names from the olden days that are actually all numbers - so not actually alpha-and-numeric. Thankfully we don't run into those often.

ajuc
0 replies
1h59m

Yeah, these things are almost never as simple as they are supposed to be.

Poland has had PESEL numbers since the 70s. They were supposed to be unique, apply only to Polish citizens, never change, and carry a checksum digit. Every Polish citizen gets one at 18 when they get their national ID document, and you can request it earlier if you want to.

Turns out there are duplicated PESEL numbers. A LOT of non-Polish citizens have them assigned (mostly Ukrainian refugees, but not only). The checksums are sometimes wrong. And some people have several PESEL numbers.

If you used PESEL as database key you're fucked.

The system works perfectly, but it interfaces with the external world through a computer-human-paper-human-computer interface. And at some point a mistake propagates so far that it becomes the truth, assumptions be damned.

nostrademons
19 replies
5h53m

There's a better solution for many of the exceptional cases that the author describes: aliases & audit logs.

Take for example the Danish CPR number. That's perfectly fine as a natural key; its definition is the first CPR number assigned. If a person's CPR number changes because they've changed their gender, you will want a separate table recording a) the date of the change (the new CPR number is not valid before that time), b) the new gender, and c) probably the reason for the CPR number change, since if the policy now is that it can change because of a gender change, there's a decent chance there will be some other policy in the future that results in a new CPR issuance.

Or the chassis number. Also fine as a natural key. If it's changed because of a data-entry error, you also want to record a) the date of change and b) who changed it. This opens up a whole host of auditing, monitoring, and reporting functionality that e.g. lets you catch fraud, determine if a single person is being sloppy, identify mass changes in policy, notify and update external records of owners, etc.

URLs are another good natural key: they are defined to be unique (otherwise your webserver won't work), they make for very easy lookups when you're fetching from a web request, and if they change, they break the web. Except that they do change. But when a URL changes, you don't want to just update them in the database everywhere, because again, that will break the web. You want to leave a redirect from the old to the new one. So you create a redirects table of all the other aliases that point to a given page, use it to generate server redirects, and you can throw in other data like the time of change or hit counts on each individual alias.
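A sketch of the alias/change-table idea described above, using SQLite and invented table and column names: the first CPR stays the primary key, later CPRs go into a change table, and lookups fall back to it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE person (
    cpr  TEXT PRIMARY KEY,   -- first CPR ever assigned
    name TEXT NOT NULL
);
CREATE TABLE cpr_change (
    new_cpr    TEXT PRIMARY KEY,
    person_cpr TEXT NOT NULL REFERENCES person(cpr),
    changed_on TEXT NOT NULL,
    reason     TEXT NOT NULL
);
""")
con.execute("INSERT INTO person VALUES ('0101501234', 'Alice')")
con.execute("""INSERT INTO cpr_change
               VALUES ('0101505678', '0101501234', '2024-01-15', 'gender change')""")

def find_person(cpr):
    """Try the primary table first, then fall back to the change table."""
    row = con.execute("SELECT name FROM person WHERE cpr = ?", (cpr,)).fetchone()
    if row is None:
        row = con.execute("""
            SELECT p.name FROM cpr_change c
            JOIN person p ON p.cpr = c.person_cpr
            WHERE c.new_cpr = ?""", (cpr,)).fetchone()
    return row and row[0]

assert find_person("0101505678") == "Alice"   # found via the change table
```

The CPR values here are fabricated. Whether this fallback lookup is simpler than a surrogate key is exactly what the replies below dispute.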

phkahler
6 replies
4h46m

> If a person's CPR number changes because they've changed their gender, you will want a separate table recording a.) the date of the change. The new CPR number is not valid before that time b.) the new gender c.) probably the reason for the CPR number change, since if the policy now is that they can change because of a gender change, there's a decent chance they'll be some other policy in the future that results in a new CPR issuance.

But what if you have a bunch of records in some other table - billing information for example, and it's indexed by the CPR number (a foreign key). When they change the CPR number you can no longer query for their entire billing history based on CPR. None of your proposed extra complexity does anything about this problem. The only good way to solve it is to use a synthetic key with NO meaning. It would still be good to do as you say and track all the CPR number changes for a given person, but they will still need a unique key. So a sort of "identity table" used to figure out what unique key you're dealing with.

wild_egg
2 replies
4h40m

Why could you not join with the audit table and find historical billing information from the old CPRs?

phkahler
0 replies
3h11m

> Why could you not join with the audit table and find historical billing information from the old CPRs?

1) complexity and 2) what is actually tying the CPRs together? We're not going to have a CPR table per-customer, so all the CPRs of every customer are in the same table. Presumably the CPR table has a unique key for the customer that can be used to associate multiple CPRs with that person, so we have come full circle - just use that unique key in the audit table.

lolinder
0 replies
4h29m

You could, but that introduces a lot of complexity that would be saved by just giving the customer a synthetic ID that uniquely and stably identifies them in your system.

The grandparent's proposal basically turns the first-entered CPR into a meaningless ID field that should not generally be used in a piece of business logic (unless for some reason you need to display "first CPR they entered"). Once you've declared that you should look elsewhere if you actually need the CPR, why did we even bother using it as the primary key?

nostrademons
2 replies
2h22m

You need this complexity anyway - if you had surrogate keys, the primary key would be an opaque identifier, and to do anything involving the CPR, you'd need to join against your CPR table (which, again, needs to be 1:many because CPRs themselves are not a 1:1 relationship).

The first CPR in this case becomes identical to your surrogate key - it's an opaque ID that you use to reference other tables in the DB - but with the added benefit that for the common case, you don't need any additional lookups. You only need to lookup CPR changes if you don't find the CPR that the user gave you. And then it has other added benefits in that you have a record of CPR changes, you can understand how common a case this is, the CPR change table itself has semantic meaning and you can query a wide variety of properties without joining your primary user table, etc.

shkkmo
0 replies
2h5m

but with the added benefit that for the common case, you don't need any additional lookups. You only need to lookup CPR changes if you don't find the CPR that the user gave you

Doing the CPR lookup ONLY when you don't find a record is really, really stupid. All you need is one bad cache, or one database update that got rolled back, and suddenly you've lost all recent updates to the user and will only see the bad record, with no indication you've failed.

Trying to force changeable data into being a 'natural key' means that the number of edge cases you have to predict, write around, and test for rises significantly, for no real benefit.

lolinder
0 replies
1h15m

The first CPR in this case becomes identical to your surrogate key - it's an opaque ID that you use to reference other tables in the DB - but with the added benefit that for the common case, you don't need any additional lookups. You only need to lookup CPR changes if you don't find the CPR that the user gave you.

I'm not buying that this is a meaningful gain in either performance or code complexity.

In the world of synthetic keys I look up the CPR in the CPR table and join it to the user table using the synthetic ID. If I find a record for the CPR+user join then I'm set, if I don't then the customer doesn't exist.

In the world of natural keys, you're advocating that I first query the user table directly by CPR. Then if I turn nothing up I run a separate query with a join on the original CPR. Then if I still don't turn something up the customer doesn't exist.

The code in the second instance is obviously more complicated than the code in the first instance. It has increased risk of someone writing a bug because now there's a very tempting CPR field that will be right most of the time but wrong in some edge cases. Depending on the database and usage patterns, indexing on the CPR may be much less efficient than indexing on an autoincrementing integer.

The only thing it has going for it is that I might be able to avoid a single join on a very specific path where a user is looking up a customer by typing in the CPR. That seems like the wrong thing to optimize for in the face of all the downsides.

trevor-e
2 replies
4h15m

URLs can be tricky and have plenty of gotchas depending on what you're trying to do. For example, the order of query params is free to change but it's still the same URL. Nothing that can't be worked around with a little normalization.

Karellen
1 replies
1h57m

For example, the order of query params is free to change but it's still the same URL.

Are you sure that's guaranteed by any spec? I thought an end-point would be free to treat `?a=1&b=2` and `?b=2&a=1` differently. I mean, it would be a nightmarish implementation, but I don't think it would be non-conforming?

Piskvorrr
0 replies
1h14m

It's a mere convention: "a querystring of this type is an array of parameters, the order of which is irrelevant". This is one of those tautological "valid except when it's not" non-rules (of which there are a surprising number).

Example: ?a=1&b=2&a=3 - will the server treat this as equivalent to ?a=3&b=2&a=1, or ?b=2&a=1, or ?b=2&a=3, or something else entirely? You'd need to check the server-side parsing implementation to be sure - those could even be valid (and distinct) filenames, FWIW.

(And that's before you get to caching - "?b=2&a=1 is not to be served as a cached version of ?a=1&b=2")
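Python's standard-library parser illustrates the ambiguity: duplicate keys survive parsing, and which value "wins" depends entirely on how the consumer uses the result.

```python
from urllib.parse import parse_qs, parse_qsl

# Duplicate keys are preserved by both parsers...
assert parse_qs("a=1&b=2&a=3") == {"a": ["1", "3"], "b": ["2"]}
assert parse_qsl("a=1&b=2&a=3") == [("a", "1"), ("b", "2"), ("a", "3")]

# ...but naively converting pairs to a dict silently implements "last wins".
assert dict(parse_qsl("a=1&b=2&a=3")) == {"a": "3", "b": "2"}

# Reordered parameters parse to the same mapping, yet the raw strings differ,
# which is exactly the caching problem mentioned above.
assert parse_qs("a=1&b=2") == parse_qs("b=2&a=1")
assert "a=1&b=2" != "b=2&a=1"
```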

tapland
2 replies
5h22m

It's not really fine. I work on a Healthcare system in the nordics.

Who you billed, who visited what doctor, who your primary care provider is, all the people a doctors office has as patients, refers to the SSN.

You don't want to lose that connection or to have to update everything for any change.

You store a unique identifier for the person in the system, and you can then pull the actual personal identification number when needed.

You do not keep individual private lists of people changing genders.

tapland
0 replies
4h48m

You do not keep individual private lists of people changing genders.

On this note: Some organization in Sweden actually does store changes like suggested in the top comment. Both the tax agency and police were interested in knowing who did, and who had access, since it's the cause of recent mass doxxing events.

nikita2206
0 replies
2h22m

But there’s no natural key for a person, not even the SSN. This article also starts off by showing an example where no natural key exists (Restaurants) and acknowledges it.

shkkmo
0 replies
2h26m

Take for example the Danish CPR number. That's perfectly fine as a natural key; its definition is the first CPR number assigned.

This is a horrible idea. You now have two different pieces of information that are identical in form and indistinguishable in the majority of cases: "first CPR number" and "current CPR number". Every place you enter or use a CPR number, you must now track which piece of information you have. If you make a mistake doing this, it will be hard to catch. You've decided that the first piece of information is a natural key, so it will appear every single place the record is used, even where nothing specific is done with the CPR number. Every time a CPR number is ingested from an external source, you need to do a lookup to make sure it is the original CPR number, and then track it. Every place you don't do this lookup is a place where an error can creep in if something changes in your pipeline or source.

Even in a context where external sources are using something like a CPR number as a key, I would still use a different key internally, since I see only downsides to using a "natural key".

URLs are another good natural key: they are defined to be unique (otherwise your webserver won't work), they make for very easy lookups when you're fetching from a web request, and if they change, they break the web

URLs are also horrible natural keys. They are not defined to be unique and provide no guarantee that the content has not changed or even that the same content is sent to different users.

If the location for content changes, you may or may not get a redirect. If you do get a redirect, you'd have to now go update every single place that uses the key to the new value. It is much better to map URLs to an artificial key and update that mapping in a single place.

lolinder
0 replies
5h13m

I'm a fan of audit logs, but what you're describing is more event sourcing than audit logs. From what I understand what you're suggesting is that you don't change the key in the original table, you just record the change in a second table. Presumably, then, when you need to read a row you must also look up the list of diffs for that row to make sure that the natural key hasn't changed.

Two things strike me about this proposed model:

First, under your proposal the version of the model that you work with at the application layer must have two copies of the key field—one is the database key for when you need to make changes or look up more data and one is the meaningful business-model field that's actually up to date. That's exactly the same extra mental overhead that natural keys are supposed to have solved, only worse because the fields will have similar names and will contain the same content most of the time, making it easy to accidentally use one where the other was expected.

Second, if you're going to introduce effectively an event-sourced data model then you've introduced a ton of new database records already, so why not just give everything a proper unique key while you're at it? Once you've done that you can modify the original row after all (while retaining the audit logs!) or, if you're serious about event sourcing, cache the latest derived value and look it up by its database key instead of carrying around an out of date natural key that's just waiting to be used incorrectly.

crote
0 replies
1h36m

Take for example the Danish CPR number. That's perfectly fine as a natural key; its definition is the first CPR number assigned.

The problem is that the new CPR number is now the one you want to use for all display purposes and future interactions with external systems. In other words, you can't use the "original CPR" field for anything except as key. It's no longer a CPR field, because it no longer has any relation to the person's CPR!

And at that point it'd be better to just use a GUID or something as key and avoid any potential confusion between the "real" CPR and "fake" CPR, because when the two are the same 99% of the time it is guaranteed to cause a shitton of bugs.

The only solution is to essentially rewrite all your records with the new CPR as key, and leave a redirect entry at the old CPR. That's pretty much what happens in Sweden when you change your gender: your old identity ceases to be and you're issued a completely new one.

Pengtuzi
0 replies
2h32m

URLs are another good natural key: they are defined to be unique

No, see more info here: http://localhost:8080/

IggleSniggle
0 replies
2h4m

URLs are only really uniform, not necessarily unique. The most obvious case is something like `http://localhost/file.txt`, where the resource being located is almost sure to be different on every single "localhost", but this is true of any server. The pointer is neither unique nor carries any true guarantee that the resource being pointed to can or should be considered unique in any given context. It is merely unique within its calling context, which, when dealing with URLs, is often anywhere on the web where DNS responds.

Even if you presume that the resource being located is in fact a unique resource in the general case, it may not be unique in the way that you presume. Some URLs at a given location are intended to be idempotent and cacheable, others are not, and many are time-limited forwarders. And there's no guarantee or even expectation that two identical forwarding URLs will resolve to the same location; it may well be network-topography dependent.

ray_v
12 replies
15h42m

Feels like this could have used a few more solid examples up-front. I think another good one would be PlayStation Network using the natural key of "gamer tags" as the primary key to identify players. This effectively locks players into keeping a gamertag in order to stay uniquely identified in the service -- instead of having a synthetic key that carries no meaning or data other than "uniqueness".

mths
4 replies
15h15m

Last I checked, Steam still has me logging in with my two decade old hotmail address as my account name. At least it's not something that shows publicly, I think.

petepete
2 replies
13h51m

Mine too, but with a British ISP that hasn't existed for fifteen years.

t90fan
1 replies
13h25m

ntl?

me too!

petepete
0 replies
11h1m

That's the one! myinitials@ntlworld.com

And NTL merged with Virgin in 2006, making it 18 years. Oh dear.

nottorp
0 replies
6h59m

I wouldn't swear it, but I think when I changed my Steam email the process didn't require access to the old email, just knowing the user and pass.

robomc
3 replies
15h23m

yeah, the way he described it it's like... well, who would ever do that? But foreign keying to emails or usernames is much easier to "accidentally" do, and is a classic source of long-term headaches.

jaggederest
2 replies
13h52m

As a great example: My steam account name is the email I was using in 2003. I have largely not used it since then. The email on the account has been updated, but the account name? Stuck.

jimbobthrowawy
1 replies
9h35m

I was about to ask how you got past steamguard without email access. Makes sense if you were able to navigate off it.

jaggederest
0 replies
2h38m

Email was changed about 5 years before steamguard release :)

saghm
0 replies
11h42m

On the other hand, Discord used to allow arbitrary usernames and would add a suffix for when you needed to disambiguate (e.g. if you used the username JohnDoe, your "id" would be something like JohnDoe#12345) but over the past year or so forced everyone to pick a fully unique username. In this case, the decision seemed to potentially be financially motivated, given that people with subscriptions were given priority for claiming usernames, and historically there really hasn't been much reason to pay for it in the first place.

bux93
0 replies
6h42m

A phone company I was a customer of used my phone number as the customer ID. Which doesn't work great for people, or companies, with more than one phone number. They could also only provision one SIM per phone number, couldn't use a phone number tied to a SIM as a virtual number, and the primary (customer ID) phone number had to be tied to a SIM.

I ported my phone number away from them, but what if I ever go back? Will my old data be there? Including my old address? What if my phone number gets recycled, and some-one else gets that phone number and ports it to them? So many questions.

bruce511
0 replies
12h41m

This is one of those cases where examples of natural keys failing are so ubiquitous that they are almost redundant.

For the group who have forged a career with natural keys, and never regretted it, more power to you. Great.

However to the rest of us, myself included, where ill-considered natural keys have caused endless opinions and suffering, my commiserations.

If I could send back one piece of advice to junior-me it would be to avoid natural primary keys. (Ideally with the corollary to avoid sequences, but that's another thread for another day.)

kstenerud
12 replies
10h2m

In databases, never rely on data you don't control. "Natural" keys are an example of this.

Names can be natural keys, but you don't control them. You don't control when or how a name changes, or even what makes a valid name.

Addresses change. Or disappear. Or somehow can't be ingested by your system suddenly.

Official registration numbers (SSNs, license plate numbers, business numbers etc) seem attractive, but once again you don't control them. So if the license plate numbering scheme changes in some way that breaks your system, too bad. Or people without an SSN. Or people in transition because an SSN needs to be changed in a government system somewhere. Or any other number of things that happen in a government office that affect you, yet you have no control over.

Phone numbers? Well, we've already seen that mess with many messenger platforms.

Fingerprints? Guess what? They evolve over time, and your system will eventually break.

Retrofitting a system that relies on "natural" keys that have broken SUCKS.

Use a generated unique key system that YOU ALONE control.

The first rule of software design is: Don't try to be clever. You're not clever enough to see all of the edge cases that will eventually bite you.

jimbobthrowawy
5 replies
8h52m

Licence plate numbers are an interesting one, since what those mean varies from country to country. Here (Ireland), they are assigned to the car itself via VIN, are never meant to change once assigned, and are backdated based on information about the vehicle itself (e.g. year of first registration even if first registered in another country, following an older format if applicable), but in other countries they can be reassigned to other vehicles.

e.g. 06-LK-12345 might be your car's plate if you move your foreign 2006 car here and register it while living in Limerick, but buying a new car might give you a plate like 242-L-12345 since the format of the first two fields changed since. If you leave and later return with that car, re-registering it gives you the exact same number.

https://en.wikipedia.org/wiki/Vehicle_registration_plates_of...
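As a sketch of how such a format change bites: a validator written against the older YY-CC-NNNNN pattern silently rejects the newer YYS-C-NNNNN plates. The regex is my own, purely illustrative, and deliberately simplistic:

```python
import re

# Hypothetical validator hard-coded to the pre-2013 Irish format:
# two-digit year, one/two-letter county code, sequence number.
OLD_FORMAT = re.compile(r"^\d{2}-[A-Z]{1,2}-\d{1,6}$")

assert OLD_FORMAT.match("06-LK-12345")           # 2006 Limerick plate: accepted
assert OLD_FORMAT.match("242-L-12345") is None   # post-2013 three-digit year field: rejected
```

Any system keyed on a validated plate string would have needed fixing the day the format changed, through no decision of its own.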

akira2501
4 replies
8h44m

You don't have vanity plates in Ireland?

ThePowerOfFuet
2 replies
8h31m

Many, many countries do not.

cheema33
1 replies
8h4m

Vanity plates are essentially free money for the govt. I am surprised every govt. body doesn't go for it. If you really don't like them, make them super expensive.

rsynnott
0 replies
6h38m

Tbh when you factor in the administration, I'd question whether they actually make any money on these.

Macha
0 replies
8h7m

There aren't any, the closest is that some authorities reserve low sequence number plates for e.g. the local mayor

andrelaszlo
2 replies
8h44m

Official registration numbers, such as the Swedish personal identification number, or "personnummer" (date of birth + serial + checksum [Luhn], where even serials are used for females and odd for males):

- It can take a few days before a newborn is assigned a number

- Non-citizens don't have one, but they can get a coordination number in the same format but with the day part of the date incremented by 60.

- Citizens can have both a coordination number and a personal identification number in certain cases.

- They can be changed if the wrong birth date or gender was registered at birth or during immigration, for protected identities, or for gender transitions.
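The rules above can be sketched as follows. The Luhn variant (weights 2,1,2,1,... over the first nine digits) and the day+60 coordination-number rule are as described in the comment; the sample numbers are test values, not real people:

```python
def luhn_check(pnr: str) -> bool:
    """Validate a ten-digit personnummer 'yymmddsssc' (hyphen stripped)."""
    digits = [int(c) for c in pnr]
    total = 0
    for i, d in enumerate(digits[:9]):
        d *= 2 if i % 2 == 0 else 1
        total += d - 9 if d > 9 else d       # digit sum of products > 9
    return (10 - total % 10) % 10 == digits[9]

def is_coordination_number(pnr: str) -> bool:
    """Coordination numbers have the day-of-month incremented by 60."""
    return int(pnr[4:6]) > 60

def is_female(pnr: str) -> bool:
    """Even serial digit = female, odd = male, per the convention above."""
    return int(pnr[8]) % 2 == 0

assert luhn_check("8112189876")          # a commonly cited test number
assert not luhn_check("8112189875")      # wrong check digit
assert is_coordination_number("8112789876")  # day 18 + 60 = 78
```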

wasmitnetzen
0 replies
3h23m

Nitpick: long-term residents get a personnummer as well, not just citizens.

Aaargh20318
0 replies
8h26m

In the Netherlands we have something similar: the BSN (translated: citizen service number). This used to be the social security number, but because it was used by an increasing number of government agencies unrelated to social security, it was renamed to BSN.

One of the major problems with this number is that it has a special status under the law. There are very strict rules as to who can process and/or store this number and for which purpose. For example: a bank can process this when opening a bank account, under anti money laundering regulations, but they cannot use it to identify an existing customer.

If you originally set up your database to use SSNs, you now have a problem. This actually happened with our chamber of commerce: if you registered a one-man business, they used the SSN as the business ID, and as a business you're required to publish this. Now it's suddenly a number subject to strict privacy rules, and they have to renumber all one-man businesses.

So that’s another problem with data you don’t control: the legal status of this data can change.

wvh
1 replies
8h57m

I agree with your general point, but it might still be cheaper to redesign your system if a rare breaking change happens to usually reliably stable external registration systems, rather than to pay the cost of a thousand paper cuts by indirection, computational cost or human confusion.

One edge case and good indicator is if your system is testable by itself without third party cooperation and the ability to, for instance, create license plate numbers.

kstenerud
0 replies
7h3m

Every system that uses SSN as the primary key is fully testable by itself without third party cooperation. And yet every one of these systems has required workarounds because what the designers thought to be invariants, weren't.

And this is ALWAYS the case with identifiers that you don't control. You don't make policy decisions about them, but SOMEBODY does. And that somebody isn't aware of - and wouldn't even care - about the invariants that you ASSUMED they followed (and maybe they really did follow them, but they sure don't anymore! Oops...)

Fixing broken invariants after the fact is always a nightmare, because it only comes up once you get stuck - either you can't enter something into the database that you absolutely MUST by the weekend, or you can't change something in the database that you absolutely MUST by the weekend. So you do some last minute hacks to get things kind of working, and then it works for awhile until the next problem (usually involving your hack).

It's hardly any extra work or complexity to just use an ID generator for the primary key. You'll still have the same indices, the same foreign key linkages etc. You have no reason not to do this.

And yet somehow people always seem to fall for the "Oh cool! This existing ID does everything we want! Let's just use that instead of adding one more field to the table! I'm so clever!"

nailer
0 replies
8h13m

I hit this recently. I moved to the US in November and it was February before I had an SSN. A bunch of companies had to put fake SSNs into their systems, which they have a standard for.

wavemode
11 replies
14h38m

Another massive annoyance with natural keys - privacy. If your table's primary key contains personal information, that PII now infects every other table that holds a foreign key to that table.

gregw2
6 replies
8h52m

Thank you for your comment. It spurred me to think about this issue more thoroughly in a way I hadn't, even though this comment may disagree -- to some extent -- with your perspective.

I have not seen clear guidelines about whether an organization's surrogate keys for persons are considered PII. (And this ambiguity has frustrated me for some time as I am unclear whether to take an aggressive or conservative view on labelling PII where I work.) When I have read the guidelines, it seems ambiguous but on balance I think it disagrees with your implied claim (that PII is not present in a table that uses a surrogate foreign key to a person/user.)

The definition of PII per NIST https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpubli... is:

"Any information about an individual maintained by an agency, including (1) any information that can be used to distinguish or trace an individual‘s identity, such as name, social security number, date and place of birth, mother‘s maiden name, or biometric records; and (2) any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information."

A surrogate key associated with a person is arguably "any other information that is linked or linkable to an individual" meaning that all tables containing the surrogate key remain PII and remain "infected". It's true the surrogate key only allows linkability within the context of the data ecosystem in which it resides, but such distinctions (of "internal" to the system vs "external from the system") are not made in the language of the definition. Additionally, from a pure risk and PII disclosure impact, usually the whole database gets dumped, not just the "non-person" tables. If you have financial/medical transactions in table "A" and a personal numeric ID linking data to a person table "B", both tables contain PII, right?

From a privacy standpoint, if you can SQL JOIN the data to trace the person involved either within your system or even data reasonably obtainable outside your system, (or if an attacker can), it's PII.

Intranet IP addresses are called "linked PII" in section 3.2.2 of the above NIST guidelines for example, and NIST does define some related terms that would seem to apply to surrogate keys like: * Distinguishable Information: Information that can be used to identify an individual. * Linkable Information: Information about or related to an individual for which there is a possibility of logical association with other information about the individual. * Linked Information: Information about or related to an individual that is logically associated with other information about the individual.

As a data engineer, the above interpretation means PII is in a zillion tables and labeling a table with a boolean indicator yes/no isn't that helpful. But as a policy person, NIST seems to be recommending gauging PII more at the system (not table) level and with a PII Confidentiality Impact Level of low/medium/high that takes into account the context and overall risk and that seems sensible. From a data cataloging standpoint (gauging what's "infected" to use your term), I think it's probably helpful to identify particular transactional tables or personal table as having "high" PII disclosure impact vs "low"; the presence of "infection" from a virus ("PII") is mostly irrelevant in a sufficiently large system where viruses/some PII is inevitable but what matters is the severity/impact. A zillion tables will be "low" (e.g. if most tables have audit column saying which user last changed a particular record) but certain transactional or user tables may be "high" and should be recognized as such and the focus of any risk discussions with the business or legal or breach notifications to customers or what have you.

Going back to your original point, I don't think the choice of natural vs surrogate key impacts the PII risk of the system or even its individual tables. I would slightly concede that a surrogate key (which in general I am in favor of) would make it easier to reduce a particular individual's PII from a system by concentrating it in one or a small number of tables with names, etc. which might be helpful for enabling GDPR right to be forgotten or something. But the degree of that PII elimination from a system by blanking out or archiving a particular user/person record is not necessarily reducing the PII for them to 0 at least definitionally unless the transactional records themselves are also removed as the AOL 2006 search data scandal demonstrated (where a woman identified solely by a surrogate key was able to be identified from her search term transactions alone.) (Legally there would appear to be carveouts around transactional deletion for some financial transaction records and backups, but IANAL...)

wavemode
5 replies
7h21m

Your perspective is very thorough and interesting, but I can simplify the matter a lot - yes, from my personal experience it absolutely does matter for privacy regulation whether you use natural or surrogate keys.

I've worked on systems in the biotech space where certain tables and certain columns in our dataset were considered PHI (personal health information), and others weren't. An auto-incrementing user ID does not qualify as PHI under HIPAA. Whereas a person's name or email address most certainly does (since, the very fact that a particular person had received lab testing at a particular lab, is itself considered PHI.)

Those PHI bits of data were scrubbed from the dataset that most regular employees had access to. Casually allowing all employees to have access to PHI would not have been compliant with the law.

The restrictions on what can even be logged (when there's a system error, for example) was very much controlled. "User 12345 experienced <some exception>" was perfectly fine (and greatly aided operational investigation, and customer support.)

gregw2
4 replies
6h56m

Thanks for the reply. I will concede or defer to you in regards to PHI and HIPAA... it seems the philosophy behind HIPAA/PHI is very different from PII or GDPR. HIPAA is prescriptive; PII/GDPR are principle-based. HIPAA, it seems, has some text that data is not PHI if someone with statistical expertise documents that the risk of re-identification is "very small", OR if one avoids an explicit list of 18 things that it cares about (see items (A) through (R) on page 96 of https://www.hhs.gov/sites/default/files/ocr/privacy/hipaa/ad... ).

One of the first 17 things might be a surrogate key ("account number") in one's system but if you look through the others, the rest are things like name, SSN, biometrics, IP#s, etc, which are definitely not surrogate keys.

The "OR" language makes the statistical expertise (and "principles" of privacy) irrelevant if you avoid the 18 things; that avoidance forms a "safe harbor" of sorts so you don't have to do any heavy thinking/lifting.

The 18th ("(R)") element of what is considered PHI does seem to refer to surrogate keys but in a manner which creates a clear carve-out/safe-harbor for them not being PHI. That 18th form of PHI is "Any other unique identifying number, characteristic, or code, except as permitted by paragraph (c) of this section;"

But that paragraph (c) seems to indicate that identifiers such as surrogate integer/GUID keys kept within a system, as long as they are 1) not derived from an individual's information (i.e. integer or UUID surrogate keys) and 2) maintained solely in the system, are not considered element-18 ("R") PHI:

"(c) Implementation specifications: Re-identification. A covered entity may assign a code or other means of record identification to allow information de-identified under this section to be re-identified by the covered entity, provided that: (1) Derivation. The code or other means of record identification is not derived from or related to information about the individual and is not otherwise capable of being translated so as to identify the individual; and (2) Security. The covered entity does not use or disclose the code or other means of record identification for any other purpose, and does not disclose the mechanism for re- identification."

By my reading, a surrogate key maintained within a system is thus clearly not PHI under HIPAA. I never looked at the details of HIPAA until today since it hasn't applied much to my data and have been focused more on PII/GDPR. I appreciate you describing the context of your remarks.

randomdata
2 replies
1h38m

SSN ... which are definitely not surrogate keys.

Surely SSN is a surrogate key? They are not naturally derived. The early ones were serial (i.e. an auto-incrementing field) and more recent ones are randomly generated (i.e. a UUID).

shkkmo
1 replies
1h8m

SSN is absolutely not a surrogate key. If you received a piece of information from an external source, it is data, not a surrogate key.

If you use data as a key, then that is a natural key, if you invent a value to use as an identifier, that is an artificial or surrogate key.

If an API provides you an ID for a record, that is data. If you use it as a key, that is also a natural key in your system.

randomdata
0 replies
35m

> If you received a piece of information from an external source, it is data, not a surrogate key.

It may not be your surrogate key, but it is someone's!

wavemode
0 replies
5h21m

The other difference between HIPAA and GDPR is compliance rules. Under GDPR it is not a problem if all your employees who have access to your database have access to PII (since, from the perspective of GDPR, the customer has given consent to share their information with your organization as a whole). Sharing with third parties outside your organization is where you get into trouble.

But under HIPAA, even your own employees need to have a specific, documented justification for accessing customer PHI. If they can do their job just as well by only accessing a non-PHI dataset, you're required by law to only allow them access to a scrubbed, non-PHI dataset.

twodave
1 replies
6h9m

Agreed, and also they tend to be more guessable. Make a page available with an email address as the id and just watch the hackers use it to discover users of your service and attempt to log in as them.

sgarland
0 replies
5h55m

That is an AuthZ problem, orthogonal to DB key debates.

whyever
0 replies
10h5m

AFAIK, even using a UUID is still considered PII if it uniquely identifies the user.

hehdhdjehehegwv
0 replies
14h6m

Extremely good point.

ltbarcly3
10 replies
15h55m

Sometimes you will regret it, other times you won't.

Notice that even if you believe that you should 'never use natural keys', that is NOT the same thing as 'always add a generated synthetic key to every table as the primary key'. You should NOT always add a generated key to every table, even if you use a trash ORM that really wants you to do this.

aaronbrethorst
9 replies
15h50m

That’s quite the absolute statement to make without even one example.

hansvm
5 replies
15h19m

Maybe? Their statement was that you should "not always add," as opposed to the much stronger "always not add."

selcuka
4 replies
14h9m

At least the latter provided some examples.

Twisell
3 replies
13h56m

Here's a real-world example: if you run a script once every 5 minutes that launches sub-tasks, you might be tempted to add a non-natural auto-increment number identifying each occurrence, to create a link between the script run and its sub-tasks.

However, it will be far less painful to use the timestamp of the script's starting point as a natural key.

This way, when shit happens, you have a relevant and stable way to identify each occurrence. And a timestamp is easily indexable and ordered.

aaronbrethorst
1 replies
13h44m

Fast-forward a couple years: now I have scripts that launch sub-tasks more than once a second.

yellowapple
0 replies
8h48m

So use timestamps with sub-second precision, which virtually every SQL database supports (even ones like SQLite that don't have built-in date/time types).
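For instance, SQLite stores datetimes as text/real/integer, and its `strftime` with `%f` yields seconds with millisecond precision; the resulting ISO-8601 strings still sort correctly. A quick sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# %f = fractional seconds (SS.SSS), so the text timestamp has
# millisecond precision and sorts lexicographically by time.
ts = conn.execute(
    "SELECT strftime('%Y-%m-%dT%H:%M:%f', 'now')"
).fetchone()[0]
print(ts)  # e.g. 2024-06-05T12:34:56.789
```

If milliseconds still collide, appending a per-run sequence number to the key disambiguates without giving up ordering.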

bruce511
0 replies
13h35m

If you want a timestamp, by all means add a timestamp field. And of course add a (non primary) key on the field.

I would add a different surrogate field though for the primary key. Because I've -always- discovered edge cases which break my original model.

In this situation I can think of 2. Firstly if the process triggers twice at the same time you have a problem (which happened to me in a case where a test suite fired up thousands of instances on a machine with many cores). Secondly twice a year with daylight savings (and when the owner of the system decided to change from local time to utc).

Is there an actual natural key out there that will never fail me? Probably. But I'm tired of looking for it.

ltbarcly3
1 replies
5h55m

I don't think you read it carefully.

aaronbrethorst
0 replies
3h24m

I did. I just hate encountering databases designed by people who take your advice to heart.

_flux
0 replies
13h22m

How about:

- Internet scanner that would use host IP as the primary key

- A table connecting two other tables using their primary keys as the composite primary key (when there can only be zero to one such connections)

bjourne
8 replies
7h13m

Hard disagree. You create tables for them to be queried by applications. For the Restaurants example, the application clearly wasn't designed to handle Restaurant entries whose only differing field is rank. Consequently, you shouldn't allow data the application can't handle. Whenever the time comes and you think "ah, wouldn't it be great if multiple people could review the same restaurant in the same year?" you change the natural keys. You can always relax the rules. Tightening the rules is much harder because garbage data may already be present.

Keys are about the integrity of your application(s) and preventing corner cases by making them impossible.

konschubert
4 replies
7h11m

You can use a UUID as a primary key and still enforce uniqueness on a combined index of city and name.
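A minimal SQLite sketch of that pattern (table and column names are illustrative):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE restaurant (
        id   TEXT PRIMARY KEY,   -- surrogate key: a UUID string
        city TEXT NOT NULL,
        name TEXT NOT NULL,
        UNIQUE (city, name)      -- the "natural" uniqueness, separate from the PK
    )
""")
conn.execute("INSERT INTO restaurant VALUES (?, ?, ?)",
             (str(uuid.uuid4()), "Chicago", "Chipotle"))

duplicate_rejected = False
try:
    conn.execute("INSERT INTO restaurant VALUES (?, ?, ?)",
                 (str(uuid.uuid4()), "Chicago", "Chipotle"))
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Dropping or changing the UNIQUE constraint later is a single DDL statement; nothing referencing `id` via foreign keys needs to change.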

nkozyra
2 replies
7h1m

If they conflict and are user facing identifiers aren't you then forced to add uniqueness to the city and name, ala /chicago/chipotle_s4, potentially bleeding some business details?

I don't love uuids as public identifiers for a number of reasons but not hinting details about your data is one nice thing about them.

konschubert
1 replies
6h49m

Hmm, I am not sure if I get you.

The cool thing about enforcing uniqueness on a secondary index is that you can just change or remove the uniqueness constraint anytime without breaking foreign key relations.

nkozyra
0 replies
3h0m

I'm not sure why you want a uniqueness index then in this case. If it's not publicly visible (effectively a secondary index / reference), then you can't use it as an identifier.

This means you need something else (a slug, as discussed herein) or just use the ID (bad practice in general) or UUID (bad for humans). If you use - in this example - city and company name - you still have to enforce uniqueness, so you have a pseudo-slug anyway.

bjourne
0 replies
6h37m

But what's the point? You're complicating the data for no apparent gain. Only terrible RDBMSes which don't support multi-column keys require surrogate keys.

zarzavat
2 replies
7h3m

You can’t change primary keys, that’s the point, because you don’t know where they are.

For example if an old key is in a URL, and that URL is in a browser bookmark, now you need redirects, so you need to keep all the old keys around forever. Keys should be random or sequential, never contain information.

If you want to enforce uniqueness then use a unique index/constraint.

bjourne
1 replies
6h46m

I think you are confused about the terminology because a primary key is a uniqueness constraint. They can be changed in any RDBMS worth its salt. Keys should not be random and your URL example is a case in point. The url /germany/berlin/2023/mcdonalds contains no surrogate key and is immensely more useful than the url /review?uuid=1kjksdhh3244ygdvvgdd2345.

zarzavat
0 replies
5h52m

A primary key implies a uniqueness constraint. But you can have a uniqueness constraint without a primary key.

You can change primary keys within a database but you can’t change them outside the database. A database is a map but keys are part of the territory.

The second URL with the UUID is more secure and is preferred in many situations where information leakage is a concern.

The first URL is more descriptive but more prone to breakage. It depends on the situation.

Spivak
8 replies
15h49m

Natural keys are great when they're someone else's synthetic keys. My Discord bots make liberal use of this.

   model: DiscordMessage
     key = discord_message_id
     key = discord_user_id
     key = discord_channel_id
     key = message_type

   model: UserData
     key = discord_user_id
     key = field_name

   model: GuildFeature
     key = discord_guild_id
     key = well_known_feature_name

   model: DiscordEvent
     key = discord_event_id
     key = discord_guild_id
This is how I was always taught to use them so I'm kinda confused what the author is getting at. If the thing you're using as a key can change and that doesn't define the row then you don't have a key. If you're at a conference where everyone has a badge number then that's a great natural key when scanning people into your workshop. If you're at Disney and everyone has a magicband with ids then you can use that as a natural key all over the place when to you that's a visitor.
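Discord's snowflakes are a good illustration of "someone else's synthetic key": per Discord's API documentation, the high 42 bits are milliseconds since 2015-01-01 UTC, so the creation time can be recovered (a sketch):

```python
from datetime import datetime, timezone

DISCORD_EPOCH_MS = 1_420_070_400_000  # 2015-01-01T00:00:00Z

def snowflake_created_at(snowflake: int) -> datetime:
    """Decode the creation timestamp embedded in a Discord snowflake:
    the high 42 bits count milliseconds since the Discord epoch."""
    ms = (snowflake >> 22) + DISCORD_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```

For example, the snowflake 1247755691426542433 quoted elsewhere in this thread decodes to a June 2024 creation time.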

fn-mote
6 replies
15h42m

I think this is an excellent example of one of the pitfalls that the author is getting at.

(Disclaimer: not a Discord bot programmer.)

Suppose discord_user_id is an identifier like @Spivak#2024 that uniquely identifies the user. When Discord forces all users to change to unique ID's and eliminates the disambiguating numbers (or allows users to change them), then where does that leave you?

(For the unfamiliar: something like this happened. Not sure if those are really the "discord user id" used in bot service though, but suppose that they are for the sake of the argument.)

tylersmith
1 replies
14h52m

Which only reinforces the point that using what is exposed as the ID is not safe as a natural key.

Spivak
0 replies
14h48m

You're reading this wrong. The real ID (the Snowflake) is exposed in the API, and the UI if you turn dev tools on. That's the one I use. You don't use @username#1209 but 1247755691426542433.

iaabtpbtpnn
1 replies
15h38m

It probably leaves you in the same place you’d be with a synthetic key of your own: unable to track users across this key change without additional Discord data.

foota
0 replies
15h2m

Yes, but if you then reference that as a foreign key in some other table or system, then whatever migration you need to do will encompass those systems as well, whereas if you use a synthetic key from the start you wouldn't need to change them.

rsynnott
0 replies
6h26m

Does Discord guarantee, _absolutely guarantee_, that they will never change their ids? This might be a safe practice if they explicitly claim that, and you trust them absolutely. Not otherwise.

parpfish
7 replies
15h12m

There's another reason not mentioned: if your key is something like a UUID, it's very easy to define the logic for joining and filtering based on that key.

If you were using some sort of string like an email address or username, you have to think about case sensitivity and trimming white space and all sorts of preprocessing and then make sure you do it consistently EVERYWHERE

Cyph0n
5 replies
14h53m

Ideally, you should aim to sanitize/normalize strings on the write side rather than resanitizing on every read.
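A sketch of a write-side normalizer (with the caveat, raised elsewhere in this thread, that the local part of an email is technically case-sensitive, so lowercasing is a policy decision, not a universal rule):

```python
def normalize_email(raw: str) -> str:
    """Trim surrounding whitespace and lowercase before storing.
    Lowercasing the local part is lossy per RFC 5321, but matches
    how most providers actually treat addresses."""
    return raw.strip().lower()
```

Apply it once, in the write path, so every stored key is already canonical and reads never need to re-sanitize.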

parpfish
2 replies
14h26m

That helps part of it, but there are still places for problems to pop up.

- you need to do an adhoc query to look something up so you have to type in the key in a where clause

- Or you used different sanitation methods in two different databases and you need to join things now

- you try to join the table that had the sanitized email key to a different table that just so happens to have email but it wasn’t sanitized because it was an optional field and not the key

sgarland
0 replies
1h10m

you need to do an adhoc query to look something up so you have to type in the key in a where clause

Having a surrogate key here doesn’t help. A WHERE predicate can be rewritten as an INNER JOIN quite easily. Or you could use a subquery, or a CTE. Many options.

The other problems discussed are an engineering culture problem. Either you value correctness and consistency or you don’t.

phatskat
0 replies
1h38m

And then you get the other stuff like one-off migrations that are done by someone unaware of the business constraints, or a contractor who completely misses the ORM method, etc.

These of course should be caught by checks and balances, and you can’t count on anyone knowing what they need to know ten years after an implementation is done

sgarland
0 replies
1h6m

You could also have CHECK constraints on those columns to ensure that nothing is written incorrectly, and/or a pre-write trigger that casts them to lowercase.

Or you could make the email column case-insensitive, since it’s generally accepted to be CI anyway.
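In SQLite, for instance, both ideas fit in the schema itself (a sketch; `users`/`email` are illustrative names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        email TEXT NOT NULL COLLATE NOCASE UNIQUE,
        CHECK (email = trim(email))   -- reject un-trimmed writes
    )
""")
conn.execute("INSERT INTO users VALUES ('alice@example.com')")

dupe_rejected = ws_rejected = False
try:
    conn.execute("INSERT INTO users VALUES ('ALICE@EXAMPLE.COM')")
except sqlite3.IntegrityError:
    dupe_rejected = True   # NOCASE makes the uniqueness case-insensitive
try:
    conn.execute("INSERT INTO users VALUES (' bob@example.com')")
except sqlite3.IntegrityError:
    ws_rejected = True     # CHECK constraint fires
```

Note that SQLite's NOCASE only folds ASCII letters; full case-insensitive handling of international addresses needs more care.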

SoftTalker
0 replies
14h48m

Yes, but any kind of bug could later introduce tainted values. A foreign key constraint might save you, but not always.

shkkmo
0 replies
51m

Side note about using email addresses.

It is pretty common to assume that email addresses aren't case sensitive since many email providers treat them that way. Email addresses absolutely are case sensitive so you need to preserve case when storing them and be careful about case when using them.

This makes email addresses particularly unsuited to being used as a natural key. Most people treat them as case insensitive and you need to work around that, but you can't safely just treat them as case insensitive yourself.

mythas
6 replies
15h3m

Another example that happens surprisingly often in healthcare. A registration clerk will incorrectly enter a personal health number (PHN) into the system. Then the actuall holder of that PHN shows up. If this were the PK then the system just wouldn’t handle this case and the reg clerk would have a huge mess to sort out on the spot. A surrogate key on the Person table allows this registration to be made where 2 people have the same PHN in the system. Then cleanup can be handled after the fact to track down the first person, determine their correct PHN and update the record.

jen20
2 replies
14h48m

This sounds like it should probably be a workflow instead. Modelling the intent here is actually important.

mythas
1 replies
14h42m

Care to expand on this thought? Curious what you have in mind!

ttyprintk
0 replies
8h15m

(not parent)

Think about these different intentions, each deserving an audit trail with different attached metadata:

- brand new MRN for a newborn. The system should be able to provide info on the mother, at all points of care for the newborn.

- unknown person arrives in the emergency department unconscious. You’ll see imaging named John Doe in this case. The system should be able to retrieve info if and when identity is established.

- the machine-readable bracelet given at admission needs to be replaced. This is really two different cases: you intend to create a duplicate bracelet, or to correct an error in which all current info in the system needs to abruptly update.

romdev
1 replies
14h38m

Finding and eliminating duplicates is a very common software problem that is rarely solved in a reusable, user-friendly way that preserves history while eliminating redundant data. In fact, in 40 years of working with computers I can't think of a single UI that I'd want to emulate.

Terr_
0 replies
10h17m

Ultimately the key is the business-modeling that captures "duplicates" as Things That Happen.

That's the precondition for any sane UI, and sometimes it's not even obvious because the "duplication" has been transformed, reified into its own concept.

ethbr1
0 replies
14h39m

One of the most surprising things I learned about the US healthcare system is how often everything is mutating or being retroactively updated (assume it applies to other countries too).

Things you'd think would be constant after Step 1 often aren't, and processes are tolerant of their being corrected / re-entered after Step 52 has already been completed.

davidhyde
6 replies
15h41m

A natural key is what I would normally call a unique index. Say first name, surname and date of birth in an employee table, as a bad (poor design) example. As opposed to a surrogate key like personId, an auto-incrementing ID, which is the norm since it can be easier to use when joining tables. I wish articles like these would explain terminology up front. I found it irritating to wade into the anecdotal trivia and not know what I was reading about, and I'm very familiar with how databases work!

bongodongobob
5 replies
15h28m

One reason not to do that is that you could then figure out other users' IDs from their hiring date. This ended up being a problem for me once with the auxiliary systems that relied on that ID, namely when it was used in links in other applications.

eddd-ddde
3 replies
14h34m

Why is knowing someone's ID an issue?

They are meant for identification, if anything it should be a benefit that they are easily guessable.

selcuka
2 replies
14h12m

Why is knowing someone's ID an issue?

It increases the attack surface as an authorization vulnerability will allow an attacker to enumerate and access all records. Yes, it is security through obscurity, but a random (e.g. UUID) scheme makes it harder.

kubanczyk
0 replies
13h7m

~128 bits worth of obscurity is considered real security for the time being. Assuming a cryptographically secure PRNG.

That's like guessing a random password roughly 20 printable-ASCII characters long.

bongodongobob
0 replies
12h54m

Exactly, it's not that my systems use security through obscurity, it's the other ones mine ties into.

This was years ago and you don't see it as much anymore, but think autogenerated links to shitty CRM, ticketing, and project management software where the link is the query aka - Blahsoftware.local/info/bunchofgarbage?=userid+garbage+view+sensitiveinfo.html type stuff.

tadfisher
0 replies
14h35m

Another reason is sharding, in which case any non-random (to be more precise, non-uniformly distributed) bits are going to skew the partitions.

begueradj
5 replies
10h33m

One of the most interesting parts to me is the fact that he is a university professor in a rich country but drives a used/second-hand car... in an era where status and success are signalled by displaying what one buys and owns on social media.

vanderZwan
0 replies
10h30m

Highlighted by whom? University professors?

rsynnott
0 replies
6h31m

in an era where status and success are highlighted by displaying what one buys and owns on social media.

I mean, not for everyone? These things, pretty much regardless of era, matter a lot to some people and not at all to others; for still others it may be a negative signal of sorts.

Anecdotally, I know a lot of very well-off people; well under half would have fancy cars.

mobiuscog
0 replies
10h19m

in an era where status and success are highlighted by displaying what one buys and owns on social media

This is no different from the people who used to show off their rolex in the pub, or park their BMW on display thinking it was impressive.

The 'show' may have moved, but it's still the same people who gain the same amount of 'respect' that they ever did (not much).

Besides, I would guess that most professors are not so worried about their car (and in fact many may prefer a more sustainable mode of transport) as they are about their research and citations, and the advancement of their students.

lordnacho
0 replies
6h46m

Cars in Denmark carry a huge tax, it's not unusual that you can only afford a used car. Plus public transport in the cities is pretty ok.

fuzzfactor
0 replies
6h30m

This is an educated person who shows higher-class tendencies by riding a bicycle, compared to those who seek mere automotive eminence.

Even if the bicycle is older than the car, perhaps even more so.

mootothemax
4 replies
10h10m

You can get performance benefits from using natural keys, as many databases store rows in the order of a table's primary key (sometimes called the clustered index, though it may or may not have a unique constraint requirement depending on the DBMS and a few other bits).

In the author's example, if the first column in the natural index was the city name (or city ID!), and locations are often pulled from the database by city, you'll see a read time performance benefit because each cityName's restaurants will be stored together.

This is why UUID-based systems can suffer worse read + write performance; their rows will be stored in the order of its UUIDs (that is, randomly spread around), making read, insert, and update performance lower.

What to do? I favour a mixed approach: have a unique integer ID column used internally, expose a unique UUID to the public where necessary and - with a BIG DO NOT OPTIMIZE PREMATURELY warning - really think about how the data is going to be queried, updated, and inserted. Ask - does it makes sense to create a clustered index based on the table data? Is there enough data to make it worthwhile? Where can the index fail if changes need to be made? Under some circumstances, it might even make sense to use a natural key with the integer column included right at the end!

The only hard rule I have is to never use UUIDs for clustered indexes. Unless the tables are teeny-tiny, the system is most likely suffering without anyone being aware of it.
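A minimal sqlite3 sketch of that mixed approach (table and column names are made up for illustration): the compact integer `id` stays internal and is what joins would use, while the UUID in `public_id` is the only identifier exposed outside.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE restaurant (
        id INTEGER PRIMARY KEY,          -- compact internal key, used for joins
        public_id TEXT NOT NULL UNIQUE,  -- opaque UUID exposed to the outside
        name TEXT NOT NULL,
        city TEXT NOT NULL
    )
""")

public_id = str(uuid.uuid4())
conn.execute(
    "INSERT INTO restaurant (public_id, name, city) VALUES (?, ?, ?)",
    (public_id, "Noma", "Copenhagen"),
)

# External callers look up by UUID; internal joins use the small integer id.
row = conn.execute(
    "SELECT id, name FROM restaurant WHERE public_id = ?", (public_id,)
).fetchone()
```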

williamdclt
2 replies
9h56m

many databases store rows in the order of a table's primary key (sometimes called the clustered index [...])

Note for readers: Postgres doesn't do that

sgarland
0 replies
5h51m

Nope, but it does have a Visibility Map due to its MVCC implementation. Go ahead and do an index-only lookup on a table using a UUIDv4. Then repeat with something k-sortable. See how many buffer hits each one has.

I assure you, those add up. Nothing like explaining to a dev that they have hit the theoretical memory transfer limit of the hardware, and it cannot possibly go any faster unless they refactor their schema (or a faster DDR version comes out).

mootothemax
0 replies
9h27m

> many databases store rows in the order of a table's primary key (sometimes called the clustered index [...])

Note for readers: Postgres doesn't do that

And I am climbing out of the rabbit hole that is Postgres' CLUSTER keyword :) Really funky, looks like it's a one-time operation that isn't persisted?

Looks like Postgres stores its index entries in ascending order by default (NULLs at the end), and if so, the point's worth keeping in mind on that front too.

I really need to do a deep dive on Postgres internals one of these days, it's an amazing - and different! - system its developers have created.

nicce
0 replies
7h25m

This is why UUID-based systems can suffer worse read + write performance; their rows will be stored in the order of its UUIDs (that is, randomly spread around), making read, insert, and update performance lower.

This isn't always the case anymore. Newer UUID standards were developed with that in mind, and they are no longer completely random. UUIDs can give a hint, for example, about the time when they were created, which gives them some order.
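For example, UUIDv7 (RFC 9562) leads with a 48-bit millisecond timestamp. A hand-rolled sketch of that bit layout (older Python stdlibs don't ship a `uuid7` function, so this builds one from the spec; not a production generator):

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Minimal UUIDv7-style value: 48-bit ms timestamp in the top bits."""
    ts_ms = time.time_ns() // 1_000_000
    value = (ts_ms & (2**48 - 1)) << 80             # timestamp, bits 127..80
    value |= int.from_bytes(os.urandom(10), "big")  # 80 random bits below
    value &= ~(0xF << 76); value |= 0x7 << 76       # version nibble = 7
    value &= ~(0x3 << 62); value |= 0x2 << 62       # RFC 4122/9562 variant
    return uuid.UUID(int=value)

a = uuid7()
time.sleep(0.002)
b = uuid7()
assert a < b  # created later => sorts later, keeping b-tree inserts local
```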

the_gipsy
3 replies
12h58m

In Spain the ID numbers are assigned at birth, carry no information, and cannot be changed. Each police comissary that registers birth gets a unique set of IDs to assign (per year or whatever).

However. Mistakes still happen. A colleague had the same ID as someone else. He said he tried to change it, but it was impossible because it was such an impossible concept to any public servant involved. In the end he gave up and just lived with the fact.

shkkmo
0 replies
47m

Even if this is true now (which it isn't entirely), that is no guarantee that it will be true in the future. Any time you use an external value as a key, you run the risk of a policy change. An internally generated artificial key is the only safe way to ensure immutability and lack of semantic content.

fcatalan
0 replies
11h2m

Not really assigned at birth, but at the time you first request the card. It's a very very very crappy database ID number:

- There's a significant number of duplicates, enough that if your database is big enough you will find one sooner or later.

- Format is variable: there are older 7-digit numbers, modern 8-digit ones, some people pad the 7-digit with a 0, some don't, some consider the CRC letter part of the key, some don't, some just append the letter, some hyphenate it. If you deal with manual data entry anywhere, you will suffer.

- Foreign people exist: you start using their passport number, and the format of your key is now completely arbitrary; you can't validate it. If you were using 8-digit NIFs as keys, now some countries use 8-digit passports, increasing your chances of duplicates.

- Foreign people stay: they get a resident card and they use that. The format is almost like a NIF but not really, so you need to account for that. Someone you registered initially with a passport now has the card and registers for something else with it. Have fun cleaning their duplicated identity from the whole system.

- Foreign people become Spanish: now they get a NIF. If you have dealt with them for a long enough time, have fun again fixing their records for the second time.

benhurmarcel
0 replies
8h33m

Foreigners change from a NIE to a new DNI if they get Spanish nationality. So their ID number changes then.

mobiuscog
3 replies
11h49m

It's also important to put as much thought into any synthetic key.

It's easy to just use an auto-generated sequence... but then you start having to export/import or otherwise merge data, and other manipulations that often use the primary key and find there are collisions everywhere. There can also be problems when needing to support multiple databases, or update versions.

UUIDs (or equivalent synthetic keys that are independent of the database itself) are often the best answer for this reason.

Having been bitten by sequences so many times in the past, I find them often to be more trouble than natural keys in the first place - just a lazy approach.

globular-toast
0 replies
11h27m

It's pretty easy to migrate to UUIDs later if required, as long as you don't leak your keys.

My work made their own uuid-like scheme, similar to UUIDv1 which incorporates several elements like machine ID and timestamp. The mistake was twofold: first they exposed them (so people started using them) but, worse, they made them easily reversible, so people started decoding the information in them. People would naturally see one and think "oh this is a record from place X". Of course that might not be true following subsequent data corrections, but the key can't change.

akvadrako
0 replies
11h4m

UUIDs are only good when you don't care about ergonomics or performance.

Snowflake IDs are much more reasonable. 41 bits timestamp (ms) + 10 bit machine id + 12 bit serial.

But if you care about ergonomics and privacy the most then short and random IDs are really the best.
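The Snowflake layout described above can be sketched as plain bit-packing (the epoch constant here is Twitter's original custom epoch; any fixed epoch works):

```python
EPOCH_MS = 1_288_834_974_657  # Twitter's custom epoch; any fixed epoch works

def make_snowflake(machine_id: int, serial: int, ts_ms: int) -> int:
    """Pack 41-bit ms timestamp | 10-bit machine id | 12-bit serial."""
    assert 0 <= machine_id < 2**10 and 0 <= serial < 2**12
    return ((ts_ms - EPOCH_MS) << 22) | (machine_id << 12) | serial

def split_snowflake(sf: int) -> tuple:
    """Recover (timestamp_ms, machine_id, serial) from a packed id."""
    return ((sf >> 22) + EPOCH_MS, (sf >> 12) & 0x3FF, sf & 0xFFF)

sf = make_snowflake(machine_id=7, serial=42, ts_ms=EPOCH_MS + 1000)

# Because the timestamp occupies the high bits, ids sort by creation time:
assert make_snowflake(0, 0, EPOCH_MS + 2) > make_snowflake(1023, 4095, EPOCH_MS + 1)
```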

meekaaku
3 replies
7h37m

I am firmly in the "control your primary key" camp, preferably with a UUID.

Quick question for database gurus here. Is it ok to have a currency table using the currency code as primary key?

zarzavat
0 replies
6h48m

This is interesting but it doesn’t really seem like a separate currency. It’s more like Scottish pound vs English pound, it’s the same currency but different banknotes for different regions in the same state.

brabel
0 replies
7h29m

I think long term, that is a bad idea.

In Brazil in the 80's and early 90's, the currency changed name several times. It was called "cruzado", then "cruzeiro", then "cruzado novo" (yep, "new" <old-name>, very creative) and then I think it went back to just "cruzado" :D before finally becoming Real (which I believe was the name of the currency also during Monarchy, 100 years earlier).

rhelz
2 replies
14h55m

Natural keys are (quite literally) essential to defining entities. Quine had a great slogan for this: "No entity without identity!" Natural keys are how you determine the identity of an object, which is to say, if you have two different referring expressions, how can you tell whether they are referring to the same object, or different objects?

Suppose you take the advice of this article, and use, say, social security numbers to identify people. (Let's ignore the fact that this only works for the U.S.) Suppose one fine day, somebody tries to enter in a record about a new person--but some typo has happened somewhere, and the new person's SSN clashes with somebody's SSN who is already in the system.

Sure, the database will notify you that something went wrong. But how do you know which social security number is correct, and which is incorrect? You will have to find a set of fields which uniquely identifies each person ANYWAYS. I.e. you'll have to differentiate them by name, birthday, place of birth, etc etc.

Now even worse!!! What if somebody attempts to add the same person TWICE to your database, but mistypes their social security number. Now your database can't even tell you that something has gone wrong. It will happily record duplicate or contradictory information about the same person--and in order to resolve the mess, again, you have to find out what uniquely identifies the people ANYWAYS.

Now, even worser than worse--what if you have taken the advice of this article, and you haven't bothered to identify a set of fields which are genuinely unique to each person. You just have a social security number and a name. How are you even going to correct the fact that John Smith is in your system twice, when you have ten John Smiths? How can you possibly tell which two John Smiths are the same John Smith?

Yeah, it takes some time and careful thinking to properly come up with natural keys for your entities. But unless you do, you haven't actually specified your entities at all. This is one area where long years of experience really pay off. Expert data modelers spend years and decades honing their craft, observing the work of others, etc etc. Eventually they acquire the wisdom needed to be able to know what kind of information is really needed to nail down what kind of entity the database needs to know about.

yuriks
1 replies
14h31m

Suppose you take the advice of this article, and use, say, social security numbers to identify people.

You seen to have misunderstood the point of the article: the author is recommending NOT using the SSN (a natural key) for primary keys, and instead to use an artificial, automatically generated key, so that the SSN is decoupled from the record and can potentially be updated.

rhelz
0 replies
5h20m

But SSN's are artificial, automatically generated keys. They are not natural keys, they cannot be natural keys for persons, because not every person has one.

A natural key is a set of attributes which an entity has even if it is not in your database.

projektfu
2 replies
15h18m

The devil's advocate cases:

1. When duplication occurs and goes unnoticed because the natural key isn't being used, and then perhaps the problem is "corrected" by an administrator in a way that doesn't make sense.

For example, someone sets up an account with street address A, then forgets they had it and sets up an account with street address B. They call and complain that they can't find an old order, or whatever, and the duplication is discovered. The administrator later clobbers one of the addresses but both are in use. A natural key (behaviorally speaking) may have prevented the duplication, assuming address isn't part of the natural key. This can be satisfied by uniqueness constraints, etc.

2. When you are browsing the database, you see a bunch of synthetic keys and have to perform various joins to be able to see the relevant data. The synthetic keys make joins easier to write, but may make ad-hoc queries more time-consuming and difficult.

roenxi
1 replies
15h7m

A synthetic key is adding an additional datum - so any problem that could be solved with the original data can be solved by the new data too (worst case, ignore the primary key). That is one of the best reasons to add a proper index - there is almost literally no downside. We're talking maybe a few bytes/record and arguably a trivial amount of administrative overhead.

Bearing that in mind:

1. Add a unique constraint on the natural key.

2. If the accuracy of joining by natural key is acceptable, join based on it. If that doesn't work, then a table design without a synthetic key wouldn't be fit for purpose either.
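Point 1 is a one-liner in most databases. A sqlite3 sketch with an illustrative `car` table, showing the surrogate key and the natural-key uniqueness living side by side:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE car (
        id INTEGER PRIMARY KEY,    -- surrogate key, used by foreign keys
        vin TEXT NOT NULL UNIQUE,  -- natural key, still enforced
        owner TEXT
    )
""")
conn.execute("INSERT INTO car (vin, owner) VALUES ('VIN123', 'alice')")

# The unique constraint still catches duplicates of the natural key:
try:
    conn.execute("INSERT INTO car (vin, owner) VALUES ('VIN123', 'bob')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```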

kubanczyk
0 replies
12h59m

In other words, don't mistake a primary key for a key or a constraint.

wodenokoto
1 replies
14h23m

Just to add on to the CPR number. As mentioned it contains a birthdate and a gender identification as well as a checksum.

Since many refugees don’t know (or can’t prove) their birthdate, they are given first of January. And enough first of Januaries are handed out that for some years there aren’t enough valid numbers, so numbers that fail the checksum are handed out too.

knallfrosch
0 replies
9h56m

Shows how the assumption that everyone has, or reveals, a known or at least approximate birth date cascades through the system.

stiiv
1 replies
5h42m

There is still something to be said for "real world uniqueness" (GIS coordinates) or deferring to a third party to establish identity (license plate numbers).

Identifiers like these aren't always available, but within many domains will be sufficient.

The idea here is not that these keys can't be somehow "invalid," but rather that it isn't our system's problem -- it belongs to some other authority.

jonstewart
0 replies
5h33m

TFA author cites many examples from human-oriented systems, which are perhaps the more common and traditional domains for RDBMS design, and where there are often several layers of exceptions to the rules - his argument that synthetic keys work well for these domains lands.

I personally work far more with computer-oriented systems and their data, and natural keys work well for me. When well-chosen they allow me to do an initial load of the source data for analysis, and then aggregate such databases together later on for historical analysis without fear of conflict. The data are often immutable in these domains, too.

shireboy
1 replies
6h33m

In the example from the article, I think the table should be named RestaurantRank(year, rank, restaurantId), with the natural key (year, rank). It would have an FK to Restaurant(id, name, etc). Choose a natural key if there is some naturally unique field or combination; in this case, unless restaurants can tie in their ranking, (year, rank) is unique.

noisy_boy
0 replies
6h16m

You'll have to modify the natural key if later it is decided to add rankCategory ("best overall", "best ambience", "best value" etc) via rankCategoryId foreign key to the RestaurantRank table because a given restaurant can have the same rank in the same year under multiple categories. A synthetic key avoids you having to mess with the key to handle such cases.
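A sketch of that shape in sqlite3 (names are illustrative): the surrogate `id` is what other tables would reference, while the (year, rank, category) uniqueness is just a constraint that can grow columns without touching any foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restaurant (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE restaurant_rank (
        id INTEGER PRIMARY KEY,
        year INTEGER NOT NULL,
        rank INTEGER NOT NULL,
        category TEXT NOT NULL DEFAULT 'best overall',
        restaurant_id INTEGER NOT NULL REFERENCES restaurant(id),
        UNIQUE (year, rank, category)  -- the natural uniqueness, as a constraint
    );
    INSERT INTO restaurant (name) VALUES ('Noma');
    INSERT INTO restaurant_rank (year, rank, restaurant_id) VALUES (2021, 1, 1);
    -- Same year and rank is fine under a different category:
    INSERT INTO restaurant_rank (year, rank, category, restaurant_id)
        VALUES (2021, 1, 'best value', 1);
""")
n = conn.execute("SELECT COUNT(*) FROM restaurant_rank").fetchone()[0]
```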

perlgeek
1 replies
12h42m

Natural keys have a tendency to change over time in what is considered unique.

For example, you have a company, and every employee has a unique employee number generated by HR... until the company merges with another, that also has unique employee numbers, and suddenly the identifier becomes the tuple (organization, employee number) that becomes unique. If you've used the employee number as a foreign key in other tables, you have to change those too.

This is a somewhat contrived example, but I've had enough real, annoying examples happen to me in my career that I avoid natural keys.

knallfrosch
0 replies
10h8m

Ah, how I remember changing my name. People think it only happens now and only to 0.1% of people who decide to change their gender in a leftist Western society, but somehow they forget that people have been changing family names for centuries. And of course, my work email address also changed. Now THAT is fun.

mkleczek
1 replies
10h46m

The question is: if you have two restaurants having different scores but distinguishable only by their surrogate keys, how do you know which one to go to?

In other words - using surrogate key is an attempt (and the wrong one!) to fix the problem of missing important information in the database.

big_whack
0 replies
8h1m

If you use a surrogate key, you still need a unique constraint in the table (probably the same columns you would otherwise call your natural PK). If your unique constraint isn't sufficient to capture the difference you mention, you need to add more columns.

However, that's strictly better than the natural PK situation, where you would need to not only add new columns to the key, but also add those columns to all referencing tables.

iso8859-1
1 replies
10h22m

How is it possible for an article from 2024 to have a comment from 2009?

gpvos
1 replies
12h14m

In the example given, the most logical natural key to me would be (year, rank). Why otherwise would they have the year in there?

mathgorges
0 replies
11h42m

That was my first thought too. But like TFA says, that still wouldn't protect you from clerical errors.

ccppurcell
1 replies
13h50m

Here in Czech republic, everyone has an id number (rodné číslo) but as a foreigner, I have multiple id numbers. Further complicating matters, I was given a rč with the wrong gender (the gender is encoded in the number) and my nationality was at some point incorrectly listed as "Ireland". I do wonder if the Czech government thinks I am just two completely different people.

Culonavirus
0 replies
12h43m

seems like a problem, you should get that czeched out

banish-m4
1 replies
3h46m

Use db-generated UUID "id" primary keys. Add a bigserial "order" field to tables that require stable temporal ordering, and then either create a view that orders by it, or always add "order by" to statements that require stable ordering.

TexanFeller
0 replies
3h18m

One downside is UUIDs take twice the space of a BIGINT, which is minor on the surface but makes a huge difference when doing lots of joins on that key. I used to see 3-4x differences in some queries even using Postgres' uuid type (not the string column of fools). Doubling the space used by your keys also means less of your indexes stays locked in RAM.

azlev
1 replies
15h18m

I think somewhat differently from the article in some ways.

The first example is why `UPDATE CASCADE` was implemented. So it's possible to use natural keys as identity without the fear of children table. At least in most databases it works.

The drawback of enumeration is real, so if you expose this key you'll need some authentication/authorization mechanism.

Another good thing about natural keys is that you can eliminate some of the joins: you don't have to join the parent table because the key in the child table is already known.

I think the biggest challenge is how to map logins to people. It's very common to interpret both as the same, but they are not.
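For reference, `ON UPDATE CASCADE` in a sqlite3 sketch (sqlite needs the foreign_keys pragma turned on; the country-code example is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite requires opting in
conn.executescript("""
    CREATE TABLE country (code TEXT PRIMARY KEY);
    CREATE TABLE city (
        name TEXT PRIMARY KEY,
        country_code TEXT NOT NULL
            REFERENCES country(code) ON UPDATE CASCADE
    );
    INSERT INTO country VALUES ('BU');         -- Burma's pre-1989 code
    INSERT INTO city VALUES ('Yangon', 'BU');
""")
# Renaming the natural key in the parent ripples to the child row:
conn.execute("UPDATE country SET code = 'MM'")
row = conn.execute("SELECT country_code FROM city").fetchone()
```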

arthens
0 replies
14h29m

The first example is why `UPDATE CASCADE` was implemented. So it's possible to use natural keys as identity without the fear of children table. At least in most databases it works.

This is only true in a system using a single database, not replicating data in external services, and not offering APIs.

And while it might work today when the service/website is still fairly small and self contained, the requirements might change at any time. So the title (You will [future] regret it) still applies to this solution

RecycledEle
1 replies
15h17m

For those who do not know, natural keys are when you have a primary key in a database table that is derived from the data itself.

An example is using a car's chassis number as the key for the record describing that vehicle.

rhelz
0 replies
14h54m

Perhaps a better example would be a fingerprint or a retinal scan instead of social security numbers as natural keys for a person.

MikePlacid
1 replies
13h4m

I already predicted that changing the number would prove to be either impossible, or have all sorts of cascading effects, ultimately terminating in official records no longer recognizing that the car is mine.

Oh my, I can feel that pain. Here’s what happened last week, caused by a change in… SSN? no - in our home address.

My wife earned some unemployment benefits three years ago, which were put on a plastic card issued by The Bank. She finally found time to access the funds (she’s a middle school teacher), but when she went to The Bank, they said they didn’t have the money anymore—they’d sent it back to California. So, she called California. They were like, “No problem, we’ll send you a check. Oh, you have a new address? Let’s change it. Wait, what is happening… oh, now your account is locked, it says: potential fraud”… They needed a supervisor to unlock it, which took 20 minutes. The supervisor unlocked the account, but because it was marked as potential fraud, they couldn’t mail a check anymore. Instead, they linked the account back to The Bank (20 more minutes), and she had to go there in person with her ID to get it checked.

So, she went to The Bank. But you can’t just walk in The Bank and show your ID to get it checked; you need an appointment. And to get an appointment, you need an account with The Bank. But her California benefits account? Oh, it is marked as potential fraud - it didn’t count. So, they spent 20 minutes to open a new account for her. She got an appointment for later that day, in 4 hours, went back to The Bank, and had her ID checked (yes, the second time in one day - they need to check your ID to open an account too).

Did she get the money then? Of course not—the account is still flagged as potential fraud. No cash possible. Call California again. California agreed to mail the check to the new address, probably by some oversight.

So my question is: do these “potential fraud” flags in databases ever die a natural death?

With some hope, sincerely, a Husband of a Potential Fraudster.

knallfrosch
0 replies
10h11m

Mailing a check to transfer money

Real 1800 vibes here.

JackSlateur
1 replies
2h7m

He is right: natural keys are difficult to handle

I'd just say this is not so right: As it turned out, though, whoever made that piece of software knew what they were doing, because the mechanic just changed the chassis number, and that was that.

Because, yes, everybody can just reprogram your car and change its VIN and stuff.

The manufacturer's software has protection against that. But nothing prevents you from using other software. At the end, you can directly program whatever you want into the car : change the VIN, change the odometer etc.

duped
0 replies
2h4m

At the end, you can directly program whatever you want into the car : change the VIN, change the odometer etc.

And you can wind up with an undriveable car this way. Particularly the odometer if you roll it back and try and sell it, your state's DMV will probably reject the title transfer.

EVa5I7bHFq9mnYK
1 replies
12h58m

How is name and city a key for a restaurant? There are a dozen McDonald's in my city.

layman51
0 replies
12h32m

That combination of columns being a candidate key depends on the world or domain that your dataset is modeling. In this example, I think it was a list of the top 50 restaurants in the world and so it would make a fine key. But if that database was ever expanded then definitely you would run into collision issues.

zer0tonin
0 replies
9h38m

Yes, except sometimes a whole industry has already come together and developed a system of unique identifiers, and I can just use that as a natural key. For example, the Nasdaq stock exchange is identified perfectly well by the string 'XNAS'; I'm not going to generate an ID on top of it.

webprofusion
0 replies
12h15m

As a junior dev decades ago I was once asked to change a perfectly good database (with integer IDs) to use natural keys, so to look up (or SQL join to) a particular office for instance your query would filter on Country, City and Street with the expectation that you get one result.

This was the dumbest technical decision I have ever been asked to be a part of.

usrbinbash
0 replies
10h2m

The problem is: As soon as a synthetic key is used by an EXTERNAL system, it stops being synthetic; for all intents and purposes it is now an integral part of the record, because if you change it, the external system may not be able to find the record any more.

tigerBL00D
0 replies
6h18m

This is a bit contrived to illustrate a problematic case. Every business has a unique ID that's assigned to it before it can operate. EIN or even a state id together with state would be a fine natural key for most situations.

theelous3
0 replies
4h6m

Another counter example is when the thing you are recording literally _is_ the natural key. Phone numbers are a good example of this. There is no situation in which some data entry error will break a number. That is the creation of a new distinct (possibly) broken number. A phone number doesn't care about any externality in the world. It is what it is.

thatjoeoverthr
0 replies
4h6m

One may notice this has something to do with mutability. If there isn’t a surrogate key, the record isn’t mutable. The database may let you change it, but the new record has a new identity. Mutability as a concept requires a common identity across time. Languages permitting mutability are using a pointer or reference as an implicit surrogate identity. A typical database can’t offer this, hence the need to put explicit surrogate keys into the schema. You cannot say “this changed” unless you can refer to both samples as a common “this”.

tedunangst
0 replies
13h20m

Every blog post about database schema should use as its example the most obvious example, a blog. People are pretty familiar with the domain, and it even has many of the same challenges. You probably have a postid field in there somewhere, but lots of blog software also uses url as a natural key. Which is why sometimes the old title appears in the url. Can't change that. Except you can. Some software will do redirects, by creating stub records for old URLs that map to new ones. So that's not an intractable problem. Etc.

teaearlgraycold
0 replies
10h52m

What I’m doing right now is putting synthetic (auto-incremented) ids on every table, but also adding lots of compound unique indices. We have a lot of long background processes that might need to be restarted in the middle. The natural key definition allows for easy idempotency in subsequent runs (the jobs only upsert, with collisions handled by a simple timestamp update).

Most entities are tied to a version record (more or less a God object). But some span across versions. I’m able to have the entities that aren’t tied to a version float around because they reference the natural key of the versioned entities, minus the version’s synthetic key.
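That upsert pattern, sketched in sqlite3 with made-up names: the compound unique index is the conflict target, so a restarted job rewrites the same row instead of duplicating it (requires SQLite 3.24+ for `ON CONFLICT ... DO UPDATE`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job_result (
        id INTEGER PRIMARY KEY,        -- surrogate key
        version_id INTEGER NOT NULL,
        entity TEXT NOT NULL,
        payload TEXT,
        updated_at INTEGER NOT NULL,
        UNIQUE (version_id, entity)    -- the natural key of a result
    )
""")

def upsert(version_id, entity, payload, now):
    conn.execute("""
        INSERT INTO job_result (version_id, entity, payload, updated_at)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (version_id, entity)
        DO UPDATE SET payload = excluded.payload, updated_at = excluded.updated_at
    """, (version_id, entity, payload, now))

upsert(1, "acme", "v1", 100)
upsert(1, "acme", "v2", 200)  # restarted job: same row, updated in place
rows = conn.execute("SELECT payload, updated_at FROM job_result").fetchall()
```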

sigtstp
0 replies
10h53m

If anyone works in bioinformatics, please, please, for the love of god, generate your own unique IDs. Database identifiers are not generally unique (the same id might get reused for e.g. protein variants). Even sequences are problematic: I've found the same seq with different ids (can't remember the db now), and they can change (sequencing or human error might've occurred and they get updated).

sashokbg
0 replies
6h7m

I disagree with this article. The author takes some very edge cases such as "undergraduate student" and "transgender CPR number".

Even the example for the restaurant and "time based id number" are bad because they all indicate that you just badly identified the entities (in DDD terms) and their identity.

So being bad at DDD doesn't mean that you can't use natural keys (although I myself have arguments against them)

samtheprogram
0 replies
12h6m

I technically use a natural key, if you could call it that, when using Supabase. Having a `public.users` table with a primary key that’s the same as the key for the internal user auth table (from Supabase).

Arguably not a natural key, or at least a contrived example, but: any downsides?

raverbashing
0 replies
12h31m

Many were the times, earlier in my career, when I decided to use a 'natural key' as a key in my own database. As far as I recall, I've regretted it every single time.

Correct. Infamously correct

Every "natural key" will have some way of fumbling things down the line.

Nothing is unique, not even joining your natural key with other deduplication info. Just save yourself the trouble.

praptak
0 replies
8h15m

Modeling identity doesn't end with synthetic keys though. "Generate own keys" only solves the problem of identity ownership. You now own the identity but it doesn't mean your identities are as they should be.

Say a customer used a different government-issued ID to re-register with your bank. A year down the line you notice that you have two identities for the same person. It might be a meh for an online game but if you are a bank this can make you run afoul of regulations. Can you handle the merge of all the relevant data? And merging is usually the easier of the two glitches - can you handle a split?

The point is that identity just like security requires thought from the start of the design.

For a domain where identity is really hairy (although admittedly with less consequences for screwing up) see https://news.ycombinator.com/item?id=4493959 "The music classifying nightmare". Also https://en.wikipedia.org/wiki/Identity_(philosophy)#Metaphys... for some philosophical perspective.

pphysch
0 replies
13h19m

The article should be renamed "natural primary keys". Natural primary keys have fatal flaws, but natural "secondary" keys are all sugar. Best of both worlds.

porker
0 replies
10h13m

As with everything, natural keys are sometimes useful or even good. Where business rules fit, I like using them. But you have to know that the value will never change or be reused.

otikik
0 replies
10h58m

One of my first jobs involved an account management software which created an ID for each user based on first name and surname. So John Smith would get SMIJOH. It was the user natural key and it could not be changed once created.

One woman got married and changed her name; she became really upset when she found out that we couldn’t change her login.

nuancebydefault
0 replies
8h43m

I'm a bit confused. I've never programmed databases, though i have a lot of programming experience.

The student made a structure that can be seen as a unit or table in a database. Is the database not clever enough to synthesize a unique id? Why does the programmer need to care about keys?

It is clear that the end user will not know anything about a key, it might be a byte offset in a file or a memory pointer. Why does the programmer need a key?

moi2388
0 replies
12h9m

I’ve just always considered natural keys public keys, and added a surrogate key as private key.

From the outside you won’t see it, but internally it saves a lot of headaches.

Space, speed, migrations, I have just never seen an actual downside to using a surrogate.

And if you are absolutely sure, add some unique constraints. Easier to change when you inevitably have to.

Mainly; I’ve learned that my initial assumptions are never correct.

mikl
0 replies
1h13m

If you use foreign keys in PostgreSQL, it is actually possible to update natural keys with little trouble via the “ON UPDATE CASCADE” mechanism on such keys.

lysecret
0 replies
10h21m

Again, every time I have seen natural keys in a big, long-lived system, I have seen duplication. This is something you will have to deal with.

kijin
0 replies
13h31m

"Natural keys" just means that someone else can change them under your feet.

People change names, their phone numbers, passport numbers. Governments change their numbering schemes all the time. Corporations merge and split. Heck, even governments merge and split, surprisingly often, at the municipal level. All without your knowledge or consent. And you're left wondering why you suddenly have duplicate key errors in your database.

The only key that you can trust is one that you, and only you, control. That's the point of the surrogate key. Whether it's BIGINT or UUID v7 is beside the point.

keepamovin
0 replies
15h9m

The way I think of it is a key, in a database, is a record key. It is intended to identify a particular logical record, not the logical thing which is referred to by the record.

As a footnote, I'll add this doesn't mean that you need to have complexities like record versioning, record history, or anything. But couched in those conceptual terms, where those things are possible, is a happy and safe space to be. In this space, a database records entries as if each were a single paper form with boxes where you fill out, in pencil, to erase if you like or not, the particulars of the thing you're recording. This form comes pre-stamped with a number: the record key.

In this cozy little world, you can be imperfect, and mistakenly (or deliberately, as your use case requires), file multiple slips that refer to the same logical thing, and yet all have different file ("record") numbers - or keys.

karmakaze
0 replies
13h34m

A simpler and more compelling reason in my experience is that relating tables is so much easier with synthetic keys. You can size your synthetic keys based on uniqueness requirements; natural keys have their own size and format, which would then be copied into every relation. And if uniqueness depends on several values, dealing with compound foreign keys isn't fun or efficient. At this point, there would have to be a very compelling reason to use natural keys to offset this, and I've rarely seen one. A case that may come up is a miscellaneous table that combines the synthetic key from another table with a natural value that uniquely identifies a row. I wouldn't choose this route if you ever think you'll want to relate anything to rows in this table; they should be leaves, related only through the synthetic key.

jpnc
0 replies
13h45m

The title could probably be extended to 'You'll regret using natural keys as primary keys' and it would be right in some ways. Personally, I've come to the conclusion that it's probably best to use both surrogate and natural keys: surrogates (IDENTITYs, UUIDs, what-have-you) as the PK from a technical perspective, and the natural key as the 'PK' from a business/data-modeling perspective.

jbverschoor
0 replies
9h13m

The amount of inexperience is staggering

iblaine
0 replies
14h38m

It’s funny that indexing used to be unavailable when processing big data on Hadoop. Now that we have the option, we care. As tech evolves, we adapt.

hosh
0 replies
53m

There's another gotcha I found at my workplace: in Postgres, tables without a primary key have trouble when you want to use logical replication. The times you might want logical replication include using AWS's Database Migration Service to shrink allocated storage.

hateful
0 replies
5h30m

After reading many threads here, I think the final ruling may be:

- If your record represents a physical being or object: use a surrogate key.

People change.

Unrelated - if you have a list of emails or SSNs or license plates or VINs - we can think of these as foreign keys to a database we don't control.

fl0ki
0 replies
4h40m

I inherited a project where a dozen different entity kinds were all given UUIDv4s. Users mostly see namespaced strings, but actually searching by those strings is janky and unreliable for other reasons (including the fact that they can change), so whenever I'm asked to debug something I insist on reducing to UUIDs first so at least we're definitely looking at the same entities.

I didn't like the UUIDs at first, but it ended up being an unexpected boon for generic code to use the same key type for different entity kinds. What was less of a boon was that the string identifiers also have to be unique, but can change at any time, and depend on three-level namespacing (yes really) so there's far more that has to be tracked and enforced. The names are very important for UI use cases, but can never be the way that records reference other records, because that would just make them much harder to change with confidence.

The essay seems to assume you'll have either unique natural keys or unique synthetic keys, but having now worked on a project that does both at the same time for many entities, I think it's a third option worthy of its own analysis. My experience was negative but I can't deny that the end result ticks a lot of functional boxes.

fiatjaf
0 replies
7h4m

Such a long article, no convincing examples. Why did he waste so long on the stupid restaurant case that clearly makes no sense? Natural keys are good as long as you think one minute before deciding to use them.

dwheeler
0 replies
3h40m

I've found that using a simple unique integer id as the key usually works well. Integers are easy to explain and simpler to enter than UUIDs. There are always trade-offs, but if there are no other compelling issues, simplicity is best.

duxup
0 replies
15h58m

I just have some form of unique id field / synthetic key, everywhere.

Even if just for documenting issues it makes life easier. “Table whatever id 12345 is the record in question.”

I’ve just seen data / relationships change too much too often in new and interesting ways to believe in using a natural key.

daft_pink
0 replies
15h23m

Anyone that’s spent large amounts of time using vlookup in excel can totally relate. Mashing some fields together as a key works great until it doesn’t.

d--b
0 replies
6h8m

what if we had a restaurant named China Wok in Springfield, USA? Hardly unique. Add the state, you say? Probably still not unique.

Ok sure, but then you have 2 restaurants which are indistinguishable from one another in your database. It doesn't matter that the thing has a unique id next to it. You can't know which is which. That's not useful.

contingencies
0 replies
8h21m

My traditional schema was always to have id in every table and to make it an integer, then for pretty much anything it'd be two tables, nouns for the thing and a one-to-many relationship with noun_names for the names, where noun_names was comprised of id/noun/language/name, keyed on noun and language, with the language column FK'd to languages.id. I'd also have a languages.code column for ISO or IANA code (sometimes both columns or aliases), and a language_names table with language_used/language_named. So you could name English in Frisian, and Chinese in Tibetan. Haven't needed to build SQL in ages, though that schema works great. Certainly never had an issue with it once UTF-8 became standard. If you build schemas without i18n, you're asking for trouble.

One good thing about using a standard integer key is the cost of indexes is low (low memory use). One good thing about using id everywhere is that it's short and self-explanatory to programmers from any culture.

Lots of casual queries built of subqueries like...

select * from noun_names where noun=(select id from nouns where ...) and language=(select id from languages where code='en');

Always felt this was the most readable. Always felt that LEFT/RIGHT JOIN stuff was bonkers. Onboarded a lot of serious junior devs, never had an issue.
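A minimal sketch of the schema described above, with column names inferred from the comment (SQLite used here so the example is self-contained):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE languages (
        id   INTEGER PRIMARY KEY,
        code TEXT UNIQUE            -- ISO/IANA code, e.g. 'en'
    );
    CREATE TABLE nouns (
        id INTEGER PRIMARY KEY
    );
    CREATE TABLE noun_names (
        id       INTEGER PRIMARY KEY,
        noun     INTEGER REFERENCES nouns(id),
        language INTEGER REFERENCES languages(id),
        name     TEXT,
        UNIQUE (noun, language)     -- "keyed on noun and language"
    );
""")
conn.execute("INSERT INTO languages (id, code) VALUES (1, 'en')")
conn.execute("INSERT INTO nouns (id) VALUES (10)")
conn.execute("INSERT INTO noun_names (noun, language, name) VALUES (10, 1, 'widget')")

# The subquery style the comment prefers over JOINs
row = conn.execute("""
    SELECT name FROM noun_names
    WHERE noun = (SELECT id FROM nouns WHERE id = 10)
      AND language = (SELECT id FROM languages WHERE code = 'en')
""").fetchone()
print(row[0])  # widget
```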

canterburry
0 replies
9h26m

I don't think I have seen natural keys used in a database since the 90s.

While I too was taught natural keys, or combined keys, were the "intended" way to identify data, I was corrected very quickly that artificial primary keys were much more reliable and more convenient due to all the auto increment features etc. I am surprised this is even talked about anymore.

canary-hill
0 replies
10h57m

What has worked best for us is to have a sequential integer as the primary key plus a UUIDv4 for every row. The applications expose objects only by the UUID, never the primary key. The primary key remains internal at the database level and never gets referenced at the application layer. This way the security and privacy of objects are maintained, while the sequential integer primary key keeps the database design robust. Essentially, it separates the key for the database from the key for the application. Is there any potential issue with this approach?
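A sketch of that split, assuming the application resolves rows only by the UUID (table and column names are illustrative; SQLite used for brevity):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE items (
        id   INTEGER PRIMARY KEY,   -- internal: compact indexes and FK targets
        uuid TEXT NOT NULL UNIQUE,  -- external: what the application and URLs expose
        name TEXT
    )
""")

public_id = str(uuid.uuid4())
conn.execute("INSERT INTO items (uuid, name) VALUES (?, ?)", (public_id, "example"))

# The application layer never sees the integer id, only the UUID
row = conn.execute("SELECT name FROM items WHERE uuid = ?", (public_id,)).fetchone()
print(row[0])  # example
```

The UNIQUE index on the UUID column is what makes the external lookups fast; without it every application-level fetch would be a table scan.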

branko_d
0 replies
3h9m

The author seems to suggest that the choice is between natural and surrogate key.

In fact, the choice is between natural and natural+surrogate key.

- If you have a natural key, you have to enforce it, otherwise you risk data corruption. The question is: do you also need a surrogate key? Sometimes you do, sometimes you don’t.

- If you don’t have an obvious natural key, then your surrogate becomes meaningful. You have to use something to distinguish between two “equal but not identical” rows, so you end up showing the surrogate in the UI, etc. In other words, it is no longer a “pure” surrogate.

bdcravens
0 replies
6h44m

Another surprising type of data that people think are unique are tracking numbers. Some carriers (like UPS) do recycle them.

alanhaha
0 replies
8h22m

~~You'll regret~~ I regretted

__MatrixMan__
0 replies
13h35m

Natural keys are often the best you can do if you're trying to be partition tolerant.

With surrogate keys you end up with duplicate entities whenever a partition heals (supposing the entity appeared in both partitions while they were still separate).

DontchaKnowit
0 replies
57m

Yep. My last company had unique id columns for every table in the DB. Everything worked flawlessly as far as associating data.

My current company uses natural keys all over the place, and it's a total clusterfuck trying to associate data at times.

DeathArrow
0 replies
12h22m

A synthetic key is the easiest choice, because the primary key in one table is going to be a foreign key in another table. If I have an employee table, I can use the email address as the key, but then I have to repeat the email in every other table that references the employee table, such as a salary table, a days-off table, and so on.

Most of the time, using an auto-incrementing integer or UUID is just fine.

In NoSQL there's usually a synthetic ID used to uniquely identify a document; I've never seen people use natural or compound keys there.
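The employee example above could be sketched like this with a surrogate key (illustrative schema; SQLite used for a runnable demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id    INTEGER PRIMARY KEY,   -- surrogate key used by every referencing table
        email TEXT NOT NULL UNIQUE   -- natural identifier: still enforced, free to change
    );
    CREATE TABLE salaries (
        employee_id INTEGER REFERENCES employees(id),  -- small, stable FK
        amount      INTEGER
    );
""")
conn.execute("INSERT INTO employees (id, email) VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO salaries VALUES (1, 50000)")

# The email can change without touching any referencing table
conn.execute("UPDATE employees SET email = 'b@example.com' WHERE id = 1")
row = conn.execute("""
    SELECT e.email, s.amount
    FROM salaries s JOIN employees e ON e.id = s.employee_id
""").fetchone()
print(row)  # ('b@example.com', 50000)
```

Had `salaries` referenced the email directly, the rename would have required either a cascading update or a manual rewrite of every referencing row.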