HN comments for: PostgreSQL and UUID as Primary Key

inopinatus

145 replies

21h29m

2024-07-05 20:59:05 UTC

The best advice I can give you is to use bigserial for B-tree friendly primary keys and consider a string-encoded UUID as one of your external record locator options. Consider other simple options like PNR-style (airline booking) locators first, especially if nontechnical users will quote them. It may even be OK if they’re reused every few years. Do not mix PK types within the schema for a service or application, especially a line-of-business application. Use UUIDv7 only as an identifier for data that is inherently timecoded, otherwise it leaks information (even if timeshifted). Do not use hashids - they have no cryptographic qualities and are less friendly to everyday humans than the integers they represent; you may as well just use the sequence ID. As for the encoding, do not use base64 or other hyphenated alphabets, nor any identifier scheme that can produce a leading ‘0’ (zero) or ‘+’ (plus) when encoded (for the day your stuff is pasted via Excel).

Generally, the principles of separation of concerns and mechanical sympathy should be top of mind when designing a lasting and purposeful database schema.

Finally, since folks often say “I like stripe’s typed random IDs” in these kind of threads: Stripe are lying when they say their IDs are random. They have some random parts but when analyzed in sets, a large chunk of the binary layout is clearly metadata, including embedded timestamps, shard and reference keys, and versioning, in varying combinations depending on the service. I estimate they typically have 48-64 bits of randomness. That’s still plenty for most systems; you can do the same. Personally I am very fond of base58-encoded AES-encrypted bigserial+HMAC locators with a leading type prefix and a trailing metadata digit, and you can in a pinch even do this inside the database with plv8.

vbezhenar

57 replies

20h54m

2024-07-05 21:33:41 UTC

IMO using bigserial by default is wrong. Use whatever data type is appropriate. Not every table will grow to 4 billion rows and not every table will grow to even 60k rows. ID data type leaks to every foreign key referencing given table. Many foreign key usually will be indexed, so this further degrades performance. There are multiple data types for a reason.

riku_iki

22 replies

20h33m

2024-07-05 21:54:49 UTC

there is unlikely significant performance degradation for int vs big int, but it will be huge PITA, if 10 years later and tons of legacy code written that table will grow over 4B rows..

sgarland

21 replies

18h23m

2024-07-06 00:04:56 UTC

It’s double the size. 4 bytes * 2^31 (because Postgres doesn’t allow have unsigned ints, unlike MySQL) is 8.6 GB. That is quite a difference for an index, not to mention the table overhead.

You’re going to know well in advance before hitting this limit becomes an issue, and you’ll have plenty of time to either take a bit of downtime and do a column conversion, or do an online migration.

riku_iki

16 replies

18h13m

2024-07-06 00:15:32 UTC

4 bytes * 2^31 (because Postgres doesn’t allow have unsigned ints, unlike MySQL) is 8.6 GB

I didn't get your point. When it is 2^31, you definitely need bigint already. When it is much smaller, it will be much smaller overhead.

Also, per docs (https://www.postgresql.org/docs/current/storage-page-layout....), each postgres row has 23 bytes overhead, so your transition from 8->4 bytes will bring marginal improvement.

arp242

8 replies

14h18m

2024-07-06 04:10:07 UTC

With multiple tables and the same IDs being references more than once, this kind of stuff can really add up.

For example I have a table that has about a billion rows and uses bigserial, but that table references about 6 other much smaller tables that use serial. I'm saving 48 bytes per row, or 90GB in total. It's a fairly significant save, and that's just on this one table: I have a bunch of tables like this. If I had bigint'd everything my disk space would be roughly double. And all of that is not even considering the extra index size and memory requirements.

Using bigint here would add absolutely nothing. I'm never going to have billions of users. I'm never going to have billions of different operating systems. There will never be billions of countries. I can probably get away with smallint for some of those, but I felt the extra headroom for that was worth the extra bytes.

This is why "bigint by default" is just bad advice IMHO. You will always need to think about it. Of course you shouldn't prematurely optimize integer sizes, but it's also not true that it doesn't matter, because it does. "Better safe than sorry" also applies in the other direction: "zomg we're wasting tons of disk space and spend much more money on servers than we need to" just because someone unthinkingly applied some stupid dogma they read on HN and didn't spend 4 seconds thinking about it.

riku_iki

7 replies

13h32m

2024-07-06 04:56:11 UTC

I'm saving 48 bytes per row

you saving 24 bytes per row: downsizing 6 columns from 8 bytes to 4, which is fraction of your table size. If your system is sensitive to such change, you likely should optimize something else.

Using bigint here would add absolutely nothing. I'm never going to have billions of users. I'm never going to have billions of different operating systems.

I think you cherry picked some fictional examples which you won't combine in one table, and your real entities are different. It adds process safety: you don't need to play mind quizzes to estimate how system can grow/change in a N years.

arp242

4 replies

13h14m

2024-07-06 05:14:33 UTC

cherry picked some fictional example

What an incredibly rude and dismissive accusation. Here's my table: https://github.com/arp242/goatcounter/blob/master/db/schema.... – number of IDs is actually 7, not 6.

I can give a lot more details and context on all of that and why it works the way it works and the savings are certainly not insignificant and theoretical, but save me real money in server costs every month. But I have little interest in talking to you further since you're just going to insult everyone who tries to inject the slightest nuance into "always use bigserial"-extremism. So good day to you.

riku_iki

2 replies

13h7m

2024-07-06 05:20:51 UTC

https://github.com/arp242/goatcounter/blob/master/db/schema.... - number of IDs is actually 7, not 6.

the point is that some/many of those ids are not timezone and country, and potentially can grow high in cardinality.

arp242

1 replies

9h54m

2024-07-06 08:34:08 UTC

No, my point is you accused me of lying at the drop of a hat for no good reason, and that this is demonstrably not true. Don't try and spinelessly slime your way out of that. And no, they I won't run out of IDs for them. "Potentially" anything can happen, but it won't happen here. After five year the furthest sequence is 0.86% on its way of being full.

riku_iki

0 replies

2h57m

2024-07-06 15:31:01 UTC

my point is you accused me of lying at the drop of a hat for no good reason

I said you described fictional scenario, and it appears to be true, your table ID fields are very different than timezone, country and user_id.

sgarland

0 replies

4h10m

2024-07-06 14:18:37 UTC

I’d just like to congratulate you on perhaps the nicest “fuck off” I’ve ever read.

Unrelated, I quite enjoyed reading your blog posts. Cheers!

sgarland

1 replies

4h17m

2024-07-06 14:10:55 UTC

If your system is sensitive to such change, you likely should optimize something else.

This isn’t even optimization, it’s just understanding your tools and needs. It’s akin to profiling your app under load, seeing that at worst it needs 1 GiB of RAM allocated, and then giving it 8 EiB just in case.

By all means, if you can reasonably predict that a given table will near or pass 2^31 rows in the near future, just set it to BIGINT now. But as has been pointed out, most apps are never, ever going to have close to 2 billion users, or for that matter 2 billion anything. The only thing I can think of that might reasonably run up against that would be invoices for large scales, or object metadata for something like Notion.

riku_iki

0 replies

2h55m

2024-07-06 15:33:02 UTC

But as has been pointed out, most apps are never, ever going to have close to 2 billion users, or for that matter 2 billion anything. T

all computers will be fine with 640kb of ram

sgarland

6 replies

15h26m

2024-07-06 03:01:38 UTC

Not all tables need even the capacity for 2^30 rows, much less 2^31, or 2^63. If you have a reference table with things like timezone information, color schemes, etc. and are using anything other than a SMALLINT (2^15), you're probably wasting space.

As to the maximal 8.6 GB mentioned, that's not nothing, _especially_ for RAM. Disk is cheap, but RAM isn't. If you have a smaller instance – say, an r6i.xlarge on AWS (4 vCPU, 32 GB RAM), that's 1/4 of the entire DB's memory.

riku_iki

5 replies

15h19m

2024-07-06 03:09:13 UTC

so, what about my argument that PG has 23 bytes overhead per row and your space win is very small compared to that overhead?

nirvdrum

2 replies

12h21m

2024-07-06 06:06:56 UTC

Is that an innate property or a current implementation detail?

sgarland

0 replies

4h37m

2024-07-06 13:51:11 UTC

It’s the current implementation as of PG 8.2, I think. It’s [0] been there for a long time, in any case, and is unlikely to change in the near future.

[0]: https://www.postgresql.org/docs/current/storage-page-layout....

riku_iki

0 replies

12h16m

2024-07-06 06:11:39 UTC

price for various features, like ACID.

sgarland

1 replies

4h40m

2024-07-06 13:48:29 UTC

It’s orthogonal, and also technically incorrect – the row overhead itself is indeed 23 bytes (less NULL bitmap), but it gets aligned to 24 bytes. Run pg_column_size(ROW()) if you’d like to check.

The fact that this overhead exists has nothing to do with the 8.6 GB of wasted space on the INTs.

riku_iki

0 replies

2h46m

2024-07-06 15:42:09 UTC

It’s orthogonal

I disagree its orthogonal and explained why. I guess lets agree on disagree.

big_whack

3 replies

18h11m

2024-07-06 00:16:44 UTC

Postgres pads tuples to 8 bytes alignment so an indexed single-column int takes the same space as an indexed bigint. That's the usual case for indexed foreign keys.

Differences can appear in multicolumn indexes because two ints takes 8 bytes while two bigints takes 16, however the right layout of columns for an index is not always the layout that minimizes padding.

sgarland

2 replies

15h30m

2024-07-06 02:57:46 UTC

Postgres doesn't necessarily pad to 8 bytes; it depends on the next column's type. EDB has a good writeup on this (https://www.2ndquadrant.com/en/blog/on-rocks-and-sand/), but also here's a small example:

  CREATE TABLE foo
    (id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, iid INT NOT NULL);

  CREATE TABLE bar
    (id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, iid BIGINT NOT NULL);

  CREATE TABLE baz
    (id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, iid BIGINT NOT NULL);

  -- fill each with 1,000,000 rows, then index

  CREATE INDEX {foo,bar,baz}_iid_idx ON {foo,bar,baz}(iid);

  SELECT table_name,
         pg_size_pretty(pg_table_size(quote_ident(table_name))) "table_size",
         pg_size_pretty(pg_indexes_size(quote_ident(table_name))) "index_size"
  FROM information_schema.tables
  WHERE table_schema = 'public';

   table_name | table_size | index_size
  ------------+------------+------------
   foo        | 35 MB      | 43 MB
   bar        | 42 MB      | 43 MB
   baz        | 42 MB      | 43 MB

`foo` has an INT followed by an INT, and its table size is 35 MB. `bar` has an INT followed by a BIGINT, and its table size is 43 MB; this is the same size for `baz`, despite `baz` being a BIGINT followed by a BIGINT.

big_whack

1 replies

14h39m

2024-07-06 03:49:20 UTC

You seem to think you're disagreeing with me but afaict you're just demonstrating my point, unless your point is just about how (int, int) will get packed. That's what I meant about the column order of indexes. If you have two ints and a bigint, but you need to index it like (int, bigint, int), then you aren't gaining anything there either.

As your example shows, there is no benefit in index size (e.g for supporting FKs) in going from int to bigint for a single key. You end up with the same index size no matter what, not twice the size which was what I took your original post to mean.

sgarland

0 replies

3h30m

2024-07-06 14:58:28 UTC

I misunderstood your post, I think. I re-ran some experiments with a single-column index on SMALLINT, INT, and BIGINT. I'm still not sure why, but there is a significant difference in index size on SMALLINT (7.3 MB for 1E6 rows) vs. INT and BIGINT (21 MB for each), while the latter two are the exact same size. I could get them to differ if I ran large UPDATEs on the table, but that was it.

njtransit

19 replies

20h7m

2024-07-05 22:20:58 UTC

Using 32 bit ints for IDs is insane in today’s world. If an attacker can control record generation, e.g. creating a record via API, then they can easily exhaust your ID space. A lot of kernel vulnerabilities stem from using incrementing 32 bit integers as an identifier. If you’re considering using 32 bits for an ID, don’t do it!

incrudible

18 replies

19h32m

2024-07-05 22:56:30 UTC

If an attacker can create billions of records through your API, maybe that is a problem you need to address either way.

afavour

9 replies

18h58m

2024-07-05 23:30:30 UTC

Conversely: if using bigserial for a primary key introduces crippling performance problems to your system, maybe that is a problem you need to address either way.

DEADMINCE

8 replies

18h35m

2024-07-05 23:53:27 UTC

Conversely? Who mentioned there being a performance impact?

afavour

7 replies

18h5m

2024-07-06 00:23:20 UTC

It’s the only reason I could think of for why you wouldn’t use a bigserial column. Maybe there is another?

lazide

6 replies

17h41m

2024-07-06 00:47:23 UTC

Information leakage - it leaks how many records have been created [https://en.m.wikipedia.org/wiki/German_tank_problem]

afavour

5 replies

17h13m

2024-07-06 01:14:39 UTC

That’s why the original comment suggested both bigserial and a separate UUID for public exposure. More to the point the person I was replying to said:

IMO using bigserial by default is wrong. Use whatever data type is appropriate. Not every table will grow to 4 billion rows and not every table will grow to even 60k rows

The implication I took from that was that they were suggesting using serial over bigserial. My comment was pushing back on that.

jerf

4 replies

16h46m

2024-07-06 01:42:03 UTC

My Dream Web Framework, which for a variety of reasons was never and never will be built, has built-in functionality for obscuring IDs in some session-level map, so you can indicate through some sort of type that something is an ID and it automatically allocates some sort of randomized identifier on the way out and converts it back transparently on the way back in.

Thus, not only would DB ids in principle never show in the UI, they wouldn't even be stable between sessions.

This would protect against a number of security attacks. (I don't know what I'd do about bookmarking, though I suspect most apps have a 80/20 solution they could deploy.) DB ids going out to a user really ought to be considered at least a smell, but it's pretty ingrained into our culture.

gregoryl

1 replies

16h15m

2024-07-06 02:12:57 UTC

I wrote that style of session mapping for a project long ago. It was fairly easy, but a massive pain in the ass to debug. Ended up needing to record the mappings in the backend for a period of time.

bruce511

0 replies

14h14m

2024-07-06 04:13:39 UTC

I too coded that in my server. It's not hard to do, and debugging can be harder at times (but not excessively so.) As you say it's just a case of logging when debugging.

When anchors are needed I use a different column for that- not the primary index. (Usually some other unique, already-present, value like the sku or whatever.

The security implications though are substantial. I don't leak primary key info even if the PK is serialized. (These days I'm all-in on UIDS but I have older legacy systems to deal with too.)

ako

1 replies

10h54m

2024-07-06 07:34:23 UTC

That wouldn't work for integration scenarios where the other system need to store the ID so it can refer to it later. How would you provide IDs for integration purposes?

SpaghettiCthulu

0 replies

3h51m

2024-07-06 14:36:54 UTC

Not just integration with other services. It would completely break brower history too!

njtransit

5 replies

15h18m

2024-07-06 03:09:39 UTC

It’s about 100 records per second for a year and a half, or 10,000 records per second for 5 days. Both are easily achievable. As an engineer, why would you ever knowingly design such a system when it’s trivial to not have this vulnerability in the first place.

It’s like hosting an internal app at a company that contains a SQL injection. “Well, if a hacker can access this app, then that’s a problem that needs addressing either way.” Sure, that may be true, but it’s also true that you’re not a good software engineer.

arp242

2 replies

14h41m

2024-07-06 03:47:02 UTC

It’s like hosting an internal app at a company that contains a SQL injection

It's nothing like that at all because the wrong SQL injection can completely ruin people's lives due to leaking stuff it shouldn't whereas the worst an int exhaustion can do is bring some app offline. Whoopdie-doo. Okay, that's not brilliant, but it's not comparable at all.

And there's a reason there aren't tons of "int exhaustion attacks": because there's little point in doing so.

lazide

1 replies

14h31m

2024-07-06 03:56:40 UTC

So taking down the apps ability to insert any rows into the table (and hence breaking the app) isn’t going to impact anyone? Including the apps ability to make money?

This does happen and break people. You usually don’t hear about it (except on the SRE side) because it is so obvious when it happens to someone they really don’t like talking about it.

arp242

0 replies

9h26m

2024-07-06 09:02:27 UTC

I never said it's "not going to impact anyone", I said it's not comparable.

"Denial of service" is just not a security bug on the same level as "zomg my credit card is stolen". I have argued this many times before, e.g. https://news.ycombinator.com/item?id=39377795 – It's easy to knock most sites offline regardless just by flooding it with traffic, don't need long-term plans like this (which are likely to be detected before it even has effect).

This does happen and break people

I have never heard about a deliberate attack using this method.

rattray

0 replies

2h52m

2024-07-06 15:36:33 UTC

Either of those situations should be very easy to catch and prevent with basic rate limiting and monitoring.

fauigerzigerk

0 replies

9h44m

2024-07-06 08:44:37 UTC

That's not a great analogy. SQL injection is a completely binary issue. A single malicious statement that gets through can do all the damage.

Defending against denial of service attacks is a gradual, multifaceted problem that is pretty unlikely to hinge on the size of database keys.

If your system is dimensioned to serve a few hundered users then it's not going to be a whole lot more robust if it can theoretically create thousands of trillions of user records rather than just hundereds of millions.

In fact, infinite scalability is a risk in its own right because it can bankrupt you, which is arguably worse than a temporary outage.

That said, I tend to use 64 bit IDs by default as well unless there is a good reason not to (because, you know, what if I'm the next Google?!?).

lazide

0 replies

14h33m

2024-07-06 03:55:35 UTC

It doesn’t require an attacker to create billions of rows.

All it requires is for there to be billions of rows. 32 bits is nutty small considering how little overhead there is in storing 64 bits instead.

klooney

0 replies

13h41m

2024-07-06 04:46:48 UTC

If you expose an API, your customer's CI systems will probably get there eventually.

inopinatus

5 replies

20h34m

2024-07-05 21:53:47 UTC

Defaulting to 64-bit integers internally is to me a matter of mechanical sympathy, it has little to do with row capacity. It’s just a word size that current CPUs and memory architectures like working with.

vbezhenar

4 replies

19h42m

2024-07-05 22:46:25 UTC

What architecture? Both amd64 and ARM64 can work with 32-bit integers just fine.

DEADMINCE

3 replies

18h34m

2024-07-05 23:54:33 UTC

Not optimally.

vbezhenar

2 replies

18h1m

2024-07-06 00:26:44 UTC

It is not optimal to use 8-byte integers instead of 4-bytes.

CPU works just as fast with both, however your CPU cache is limited and you'll put more 4-byte integers into your L1.

I don't really understand what you want to convey. CPU is very fast with any kind of integer size. There's no performance penalty to use 1-byte integer compared to 8-byte integer. And there's performance penalty when your L1 or L2 or L3 cache is busted and you need to go to the next memory layer.

marcosdumay

0 replies

17h41m

2024-07-06 00:47:29 UTC

CPU works just as fast with both

They don't. CPUs have as many issues with data alignment as with cache sizes.

adrian_b

0 replies

11h28m

2024-07-06 07:00:01 UTC

1-byte integers or any other integers smaller than 8 bytes (64 bits) save space only when they are parts of arrays or structures.

As individual variables, they take exactly the same space as 64-bit integers, both in registers and in memory (i.e. in the stack used for automatic variables or in the heap used for dynamically-allocated variables), because of the alignment requirements.

Therefore it never makes sense to use other integers than 64-bit, except as members of aggregate data types.

A database table is an aggregate data type, so the use of small integers in some columns may be justified, but such types shall not be used for variables in the programs that process data from the tables, where misuse of the small integers may cause overflows and there is no advantage from using smaller integer sizes.

rangerelf

2 replies

19h53m

2024-07-05 22:34:47 UTC

I read your post and hear echoes of "Who would ever need more than 2 digits for the year in this timestamp column?"

Never again.

vbezhenar

1 replies

19h42m

2024-07-05 22:45:38 UTC

Using 2 digits for year is as wrong as using 8 bytes for year.

riku_iki

0 replies

12h39m

2024-07-06 05:49:22 UTC

scale of damage is very different, so "as" is up to discussion.

mixmastamyk

1 replies

10h46m

2024-07-06 07:41:49 UTC

Postgres doesn't support unsigned ints last I looked, so it's actually a ~2 billion limit.

Secondly, it's not XXXk rows currently-you have to consider the complete lifetime of the table. When rows are deleted/created and how often.

So what you've said is true, but the set of appropriateness for smallint is a much smaller than expected.

blackenedgem

0 replies

52m

2024-07-06 17:35:54 UTC

You can start the sequence at -2b, or wrap it around when it gets close to the signed limit. Hopefully you haven't depended on it not wrapping around by that point.

For queue tables you can even use `CYCLE` to do that automatically.

paulddraper

0 replies

11h14m

2024-07-06 07:14:10 UTC

The problems from unnecessary bigserial is nothing compared to the pain of insufficient serial.

This is especially true for PostgreSQL which increments sequences for upserts (even if no records are inserted).

That's how I've hit 32-bit limits on tables that had only a couple million rows.

---

I would only use 32-bit for very selective items that is used in a lot of FKs, like a tenant ID.

orf

0 replies

6h21m

2024-07-06 12:06:43 UTC

This is simply bad advice. There are very, very few cases where you shouldn’t use a bigserial/bigint as a primary key. Perhaps the only case is where it is impossible for the rows to grow beyond a specific bound: an “AWS region” table or a “countries” table. These are almost always lookup tables of some kind, and the domain makes it impossible for it to grow past a certain bound.

Outside of that, put absolutely no thought into it and pick a bigint/bigserial.

Trying to be clever about this ignores the fact that tables and their usage change, and what you believe now will not be correct later.

And when you realise you’ve been too clever, it’s 3am on a Sunday, your production database has exploded and nothing works. You now need to rewrite 30 tables and their foreign keys ASAP with upper management breathing down your neck.

Merad

0 replies

2h26m

2024-07-06 16:01:52 UTC

An int PK does not give you 4 billion rows. For example, in Postgres every time an insert transaction has to rollback, the id's that would have been used by that insert are discarded and gone forever. Likewise people often don't realize that Postgres' very convenient upsert syntax (insert on conflict do update) will consume id's from the sequence _every time it runs_. If you do an upsert with 100 rows, it will peel off 100 new id's just in case they're needed... if they aren't used, they're discarded. This can chew through 4 billion id's really, really fast.

Personally I would only use an int PK in a table where you know the bounds of the data with very high confidence. Like, at my last job the software involved working with our client's physical store locations. If we managed to capture the entire North American market for the type of business that we worked with, we'd be looking at about 50k stores.

GGO

50 replies

20h20m

2024-07-05 22:07:54 UTC

I dont understand the recommendation of using bigserial with uuid column when you can use UUIDv7. I get that it made sense years ago when there was no UUIDv7, but why do people keep recommending it over UUIDv7 now beats me.

whynotmaybe

44 replies

18h33m

2024-07-05 23:55:23 UTC

As uuid v7 hold time information, they can help bad actors for timing attacks or pattern recognition because they contain a time information linked to the record.

You can guess the time the system took between 2 uuid v7 id's.

They can only be used if they're not shown to the user. (so not in the form mysite.com/mypage? id=0190854d-7f9f-78fc-b9bc-598867ebf39a)

A big serial starting at a high number can't provide the time information.

TeeWEE

21 replies

15h4m

2024-07-06 03:24:31 UTC

Big serial is sequential and it’s very easy to guess the next number. So you got the problem of sequential key attack…

If you use only uuid in your outwards facing api then you still have the problem of slow queries. Since you need them to find the object (as mentioned below)

UUIDv7 has a random part, can be created distributedly, and indexes well.

It’s the best choice for modern application that support distributed data creation.

badestrand

20 replies

11h20m

2024-07-06 07:08:10 UTC

You never expose the bigserial, you generate a ID (like UUID) for external use/identification and simply have an index over that column for fast selects.

kroolik

15 replies

8h23m

2024-07-06 10:05:07 UTC

Having an index over the uuid is equivalent to it being a PK, so why would you bother having both?

blackenedgem

14 replies

8h13m

2024-07-06 10:15:28 UTC

Because it's much better for range queries and joins. When you inevitably need to take a snapshot of the table or migrate the schema somehow you'll be wishing you had something else other than a UUID as the PK.

groestl

12 replies

7h16m

2024-07-06 11:11:56 UTC

Ha? Please elaborate.

kroolik

11 replies

7h0m

2024-07-06 11:28:26 UTC

When running a batched migration it is important to batch using a strictly monotonic field so that new rows wont get inserted in already processed range

blackenedgem

3 replies

4h43m

2024-07-06 13:44:46 UTC

It's not even necessarily it being strictly monotonic. That part does help though as you don't need to skip rows.

For me the bigger thing is the randomness. A uid being random for a given row means the opposite is true; any given index entry points to a completely random heap entry.

When backfilling this leads to massive write amplification. Consider a table with rows taking up 40 bytes, so roughly 200 entries per page. If I backfill 1k rows sorted by the id then under normal circumstances I'd expect to update 6-7 pages which is ~50kiB of heap writes.

Whereas if I do that sort of backfill with a uid then I'd expect to encounter each page on a separate row. That means 1k rows backfilled is going to be around 8MB of writes to the heap.

valenterry

1 replies

4h32m

2024-07-06 13:55:38 UTC

Isn't that solved because UUIDv7 can be ordered by time?

blackenedgem

0 replies

3h46m

2024-07-06 14:42:29 UTC

Yeah pretty much, although ids can still be a little better. The big problem for us is that we need the security of UUIDs not leaking information and so v7 isn't appropriate.

We do use a custom uuid generator that uses the timestamp as a prefix that rotates on a medium term scale. That ensures we get some degree of clustering for records based on insertion time, but you can't go backwards to figure out the actual time. It's still a problem when backfilling and is more about helping with live reads.

kroolik

0 replies

3h17m

2024-07-06 15:10:55 UTC

Are page misses still a thing in the age of SSDs?

x3al

2 replies

4h8m

2024-07-06 14:20:24 UTC

Strictly monotonic fields are quite expensive and the bigserial PK alone won't give you that.

kroolik

1 replies

3h22m

2024-07-06 15:06:22 UTC

PG bigserial is already strictly monotonic

blackenedgem

0 replies

1h23m

2024-07-06 17:05:03 UTC

No they're not, even with a `cache` value of 1. Sequence values are issued at insert rather than commit. A transaction that commits later (which makes all updates visible) can have an earlier value than a previous transaction.

This is problematic if you try to depend on the ordering. Nothing is stopping some batch process that started an hour ago from committing a value 100k lower than where you thought the sequence was at. That's an extreme example but the consideration is the same when dealing with millisecond timeframes.

groestl

1 replies

4h40m

2024-07-06 13:48:10 UTC

Okay, but in a live DB, typically you won't have only inserts while migrating, won't you?

kroolik

0 replies

3h21m

2024-07-06 15:06:51 UTC

Yes, but updates are covered by updated app code

asah

1 replies

5h42m

2024-07-06 12:46:13 UTC

would creation/lastmod timestamps cover this requirement?

kroolik

0 replies

3h18m

2024-07-06 15:09:48 UTC

Yes, although timestamps may have collisions depending on resolution and traffic, no? Bigserials (at least in PG), are strictly monotonic (with holes).

rezonant

0 replies

6h41m

2024-07-06 11:47:10 UTC

This. Highly recommend using a numeric primary key + UUID. Using UUID relations internally can have some strategic advantages, but when UUIDv4 is used as the only primary key, you completely lose the ability to reliably iterate all records across multiple independent queries.

Also, the external thing isn't just for exposing it out to your own apps via APIs, but way more importantly for providing an unmistakable ID to store within external related systems. For example, in your Stripe metadata.

Doing this ensures that ID either exists in your own database or does not, regardless of database rollbacks, database inconsistencies etc. In those situations a numeric ID is a big question mark: Does this record correspond with the external system or was there a reuse of that ID?

I've been burnt taking over poorly managed systems that saved numeric IDs externally, and in trying to heal and migrate that data, ran into tons of problems because of ill-considered rollbacks of the database. At least after I leave the systems I build won't be subtly broken by such bad practices in the future.

iterateoften

2 replies

8h20m

2024-07-06 10:07:51 UTC

If you have an index on the uuid anyways having a separate big serial field for PK doesn’t help that much.

rezonant

1 replies

6h33m

2024-07-06 11:55:19 UTC

As mentioned elsewhere, it ensures the ability to perform resumable and consistent batching queries across the data set without missing records.

Ordering over an insertion timestamp is not enough if two records may have the same timestamp: You may miss a record (or visit a record twice) across multiple queries.

disneycember

0 replies

5h7m

2024-07-06 13:21:08 UTC

This is solved sorting by timestamp first then by random PK UUID. Don't think a little simpler batch queries justify leaking time and quantity information or complexity of handling two types of IDs.

stoperaticless

0 replies

8h57m

2024-07-06 09:30:54 UTC

Amen (or similar)

arp242

7 replies

14h48m

2024-07-06 03:40:36 UTC

For almost all use cases just showing a UUIDv7 or sequential ID is fine. There are a few exceptions, but it's not the common case.

mewpmewp2

6 replies

14h28m

2024-07-06 03:59:52 UTC

How would it be fine, e.g. for e commerce which is arguably very large portion of the use cases?

You would be immediately leaking how many orders a day your business is getting with sequential id.

inopinatus

3 replies

13h23m

2024-07-06 05:05:16 UTC

There are jurisdictions e.g. Germany in which a consecutive sequence for invoice numbers is a mandatory, legislated requirement (mercifully, gaps are generally permitted, with caveats)

For extra spice, in some places this is legislated as a per-seller sequence, and in others as a per-customer sequence, so there’s no policy you can apply globally, and this once again highlights the separation of concerns between a primary key and a record locator/identifier.

wodenokoto

1 replies

12h43m

2024-07-06 05:45:00 UTC

consecutive sequence for invoice numbers is a mandatory, legislated requirement (mercifully, gaps are generally permitted, with caveats)

That’s surprising. In Denmark gaps are not allowed. You have to account for all invoices and if you have an invoice numbered 50, then you have at least 50 invoices to account for.

withinboredom

0 replies

5h31m

2024-07-06 12:56:55 UTC

It's nice when you change invoicing software, to be able to have gaps. For example, before Stripe is invoice <500 and Stripe invoices have >500. This makes it simple for humans to determine where an invoice may be located during the transition year. Further, it means we can plan the entire switch-over in advance, vs. only knowing the invoice number AFTER the switch-over. This makes a huge difference in internal communications to customer support, because you can let them know how things will operate once the switch is done. If you can't have gaps, you won't know how to tell customer support where to find new/old invoices until after the switch.

JodieBenitez

0 replies

12h45m

2024-07-06 05:43:30 UTC

There are jurisdictions e.g. Germany in which a consecutive sequence for invoice numbers is a mandatory

Same in France. I thought it was a pretty common requirement.

arp242

1 replies

13h33m

2024-07-06 04:54:55 UTC

You would be immediately leaking how many orders a day your business is getting with sequential id.

Which is fine for almost all of them. All brick and mortar stores "leak" this too; it's really not that hard to guess number of orders for most businesses, and it's not really a problem for the overwhelming majority.

And "Hi, this is Martin, I'd like to ask a question about order 2bf8aa01-6f4e-42ae-8635-9648f70a9a05" doesn't really work. Neither does "John, did you already pay order 2bf8aa01-6f4e-42ae-8635-9648f70a9a05" or "Alice, isn't 2bf8aa01-6f4e-42ae-8635-9648f70a9a05 the same as what we ordered with 7bb027c3-83ea-481a-bb1e-861be18d21ea?"

Especially for order IDs UUIDs are huge PITA because unlike user IDs and other more "internal" IDs, people can and do want to talk about them. You will need some secondary human-friendly unique ID regardless (possibly obfuscated, if you really want to), and if you have that, then why bother giving UUIDs to people?

badestrand

0 replies

11h17m

2024-07-06 07:10:44 UTC

Best solution is to have a serial identifier internally and a generated ID for external. And yes it shouldn't be a UUID as they are user-hostile, it should be something like 6-10 letters+digits.

OskarS

6 replies

11h32m

2024-07-06 06:56:34 UTC

Can I ask (as a humble application developer, not a backend/database person), if the two requirements are:

1. The UUIDs should be ordered internally, for B-tree performance

2. The UUIDs should not be ordered externally, for security reasons

Why not use encryption? The unencrypted ID is a sequential id, but as soon as it leaves the database, it's always encrypted. Like, when getting it out:

    SELECT encrypt(id) FROM table WHERE something = whatever;

and when putting stuff in:

    UPDATE table SET something = whatever WHERE id = decrypt(<encrypted-key>)

Seems like the best of both worlds, and you don't need to store separate things.

spencerap

4 replies

10h51m

2024-07-06 07:36:56 UTC

If the key and encryption mechanism are ever leaked, those opaque external IDs can be converted easily back to sequence numbers, and vice versa, which might pose a risk for you or your users. You won't be able to rotate the encryption key without breaking anything external that tracks those encrypted IDs... third party services, SEO, user bookmarks, etc.

OskarS

3 replies

10h42m

2024-07-06 07:45:59 UTC

You store the key in the database, right? Like, if the database leaks, it doesn’t matter if your ids are sequeneced or unsequenced, because all data has leaked anyway. The key leaking doesn’t seem like a realistic security issue.

zxexz

1 replies

9h50m

2024-07-06 08:37:52 UTC

Ideally if you do this, you store the key in a separate schema with proper roles so that you can call encrypt() with the database role, which can't select the key. Even then, the decrypted metadata should not be particularly sensitive - and should immutably reference a point in time so you can validate against some known key revocation retroactively.

My take is it's rarely necessary to have a token, that you give to an external entity, that has any embedded metadata all - 99.9% of apps aren't operating at a scale where even a million-key hashmap sitting in ram and syncing changes to disk on update would cause any performance difference.

OskarS

0 replies

7h19m

2024-07-06 11:09:17 UTC

This is a very weird thread: half the people are arguing that having these timestamps is not a realistic security problem at all, and the other half is arguing that any fix to it has to have Fort Knox level security policies.

It seems to me: the actual value of knowing these ids/timestamps to a hacker is tiny, but it's not nothing (German tank problem and all that). Like, if a hacker was able to decode the timestamps, it's not ideal, but it's not like a catastrophe either (especially given that half the people in this thread thinks it has no security value at all). Given that threat model, a simple scheme like I suggested seems fine to me.

stoperaticless

0 replies

8h53m

2024-07-06 09:34:49 UTC

The key leaking doesn’t seem like a realistic security issue.

But it is.

If you have a password in a system, you want to rotate it regularly or at least have that ability (for example, when angry colleague leaves).

fauigerzigerk

0 replies

10h42m

2024-07-06 07:46:35 UTC

>Why not use encryption?

Because then you have a key management problem, which adds complexity.

thiht

5 replies

10h32m

2024-07-06 07:55:47 UTC

I don’t understand how that’s an issue. Do you have an example of a possible attack using UUIDv7 timestamp? Is there evidence of this being a real security flaw?

whynotmaybe

1 replies

5h11m

2024-07-06 13:17:14 UTC

The draft spec for uuid v7 has details about the security considerations : https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...

The way I see it is that uuid v7 in itself is great for some use but not for all uses.

You always have to remember that a v7 always carries the id's creation time as metadata with it, whether you want it or not. And if you let external users get the v7, they can get that metadata.

I'm not a security expert but I know enough to know that you should only give the minimal data to a user.

My only guess is that v7 being so new, attacks aren't widespread for now, and I know why the author decided not to focus on "if UUID is the right format for a key", because the answer is no 99% of the time.

thiht

0 replies

3h37m

2024-07-06 14:51:15 UTC

That just seems overly cautious. I’d rather use UUIDv7 unless I have a reason not to. The convenience of sortable ids and increased index locality are very much worth the security issues associated with UUIDv7. Maybe I wouldn’t use UUIDv7 for tokens or stuff like that, but DB IDs seem pretty safe.

cqqxo4zV46cp

1 replies

5h34m

2024-07-06 12:53:43 UTC

I don’t understand this thinking. If you understand what’s at play, you can infer the potential security implications. What you’re advocating for is being entirely reactive instead of also being proactive.

thiht

0 replies

3h44m

2024-07-06 14:44:09 UTC

No, I don’t. Even with a timestamp uuids are not enumerable, and honestly I don’t care that the timestamp they were created at is public. Is the version of uuid used being a part of the uuid considered a leak too?

timmytokyo

0 replies

1h32m

2024-07-06 16:55:55 UTC

I don't get it either. If UUIDv7 lacks security due to its revelation of a timestamp, why don't bigserials also lack security? After all, given a bigserial ID, you can tell whether it was generated before or after some other bigserial ID and thereby infer something about the time it was generated.

sebazzz

0 replies

11h49m

2024-07-06 06:39:07 UTC

As uuid v7 hold time information, they can help bad actors for timing attacks or pattern recognition because they contain a time information linked to the record.

Are you then not doing security by randomness if that is the thing that worries you?

mixmastamyk

2 replies

14h2m

2024-07-06 04:25:43 UTC

You're saving storage space but potentially leaking details. Is that ok for your application? No one can answer but your org.

hoffs

1 replies

10h37m

2024-07-06 07:50:41 UTC

The details part is so miniscule that I doubt it even matters. You'd have difficult time trying to enumerate uuidv7s anyways.

mixmastamyk

0 replies

8h11m

2024-07-06 10:17:36 UTC

Leaking time leaks information about customer growth and usage. It may matter to your competitors.

dajtxx

1 replies

19h15m

2024-07-05 23:12:42 UTC

The comment above warns against it due to the embedded timestamp info as a info leak risk. Perhaps that was a problem for them in some circumstance.

inopinatus

0 replies

13h31m

2024-07-06 04:57:33 UTC

It wasn’t a problem for me directly but was observed and related by a colleague: an identifier for an acquired entity embedded the record’s creation timestamp and effectively leaked the date of acquisition despite it being commercial-in-confidence information. Cue post-M&A ruckus at board level.

Just goes to show you can’t inadvertently disclose anything these days.

solidsnack9000

9 replies

19h17m

2024-07-05 23:11:22 UTC

One challenge with PNR is actually restricting the alphabet appropriately. They sure are easy to say aloud -- just five or six letters in many cases -- but how do you ensure you have (a) enough letters to get a reasonable bitwidth and (b) not form ridiculous words?

dmurray

3 replies

19h4m

2024-07-05 23:23:48 UTC

Take all the Roman alphabet apart from the vowels - 21 characters and length 6 gives you 100 million possibilities which is plenty for most applications.

You can still get vaguely offensive sequences like FKNNGR or BLKCNT, but at some point you have to put this down not to your software being offensive or hateful but to humans finding patterns in randomness.

solidsnack9000

1 replies

16h31m

2024-07-06 01:56:56 UTC

Removing Y turns out to be important, as well...

alex_duf

0 replies

11h19m

2024-07-06 07:09:19 UTC

It's still bothering me that Y is a vowel in french but not in English. Can we all agree on something for once?

thwarted

0 replies

18h31m

2024-07-05 23:56:45 UTC

Use Crockford encoding to reduce this. It has readability and dictation considerations too which makes it ideal for things humans will interface with.

vbezhenar

2 replies

17h53m

2024-07-06 00:35:03 UTC

Do you ensure that your software does not form ridiculous numbers? Imagine that some christian person gets "666" number. What a scandal.

Do you ensure that your software does not form ridiculous words in every language? Or just another US-centric thing?

The idea of avoiding identifiers to be ridiculous is ridiculous to me, honestly...

solidsnack9000

0 replies

16h32m

2024-07-06 01:56:08 UTC

This is just not going work in a customer centric organization. If SABRE gave out PNRs like, I don't know, SEXGOD, it would make customers angry, and that is not even with any curse words. They are heavily filtered.

https://onemileatatime.com/insights/funny-airline-pnr/

alex_duf

0 replies

11h18m

2024-07-06 07:10:07 UTC

You can also get 666 in a quid for what it's worth

codetrotter

1 replies

19h6m

2024-07-05 23:21:40 UTC

not form ridiculous words

Depends on what you mean by ridiculous.

For example https://sqids.org/ ensures that there are no profanities in the generated ids. And it allows you to add additional words that you want to avoid.

solidsnack9000

0 replies

16h30m

2024-07-06 01:57:40 UTC

It does but it allows common names, the name of the deity, common words, &c. I suspect you have to do something like an earlier poster suggested -- strip out all the vowels to start with...

saaspirant

6 replies

16h37m

2024-07-06 01:50:56 UTC

Naive question. Above comment suggests using bigserial as internal identifier and uuid as public facing ID. Now let's say there's a user table and post table. Both will have only uuid available in the APIs. So every time API requests a post of the user or user of the post, we will find the the relevant row using uuid right? Since uuid will be sent by the public facing APIs? How would bigserial be used here?

I don't know much databases in general BTW..

Edit: Can I use uuidv4 as PK since it doesn't leak timestamp data? Why bigserial + uuid7 is better?

jpollock

5 replies

16h23m

2024-07-06 02:04:56 UTC

Each object has an external key and an internal key. This separation allows you to migrate to other layouts, technologies, etc. without breaking your customer's links or records.

Internally, your database looks like:

  User
    ID - uint128
    external_id - UUID (of some sort)
    name - string

  Post
    ID - uint128
    UserId - uint128 (User.ID)
    external_id - UUID
    ...

Then you have secondary indices on the external_id columns in both tables, and the UserId column on Post. You can then join from one to the other using the ID column.

e.g.

  SELECT count(*) FROM
   Post JOIN User 
   ON User.ID = Post.UserID
  WHERE 
   User.external_id = <%USERID_FROM_REQUEST%>;

  // Don't forget to bind USER_ID_FROM_REQUEST instead of 
  // concating the string, no Johny-Tables here!

There should be significant performance benefits from using int's for the joining keys (at least compared to strings), but my experience might be old.

TeeWEE

2 replies

15h2m

2024-07-06 03:26:37 UTC

So it can’t use the internal id index, result: slow lookups for external ids.

bfung

1 replies

14h37m

2024-07-06 03:51:10 UTC

secondary indices on the external_id columns in both tables

pick your poison, slower lookup or more disk usage

masklinn

0 replies

11h30m

2024-07-06 06:57:46 UTC

There are other fun ideas to explore e.g. if your external id is a hash you can rebuild, it might be an option to scavenge the full hash but keep the first few bytes of it, indexed. That increases lookup costs if the hash does not exist or has been scavenged but lowers storage.

In fact if your external ids are evenly distributed you can do that anyway, only index the few leading / trailing bytes of the id. That will increase lookup costs a bit but significantly decrease storage.

teddyh

0 replies

14h8m

2024-07-06 04:20:05 UTC

no Johny-Tables here!

It’s “Bobby tables”: <https://xkcd.com/327/>

dgb23

0 replies

12h48m

2024-07-06 05:40:03 UTC

A uuid is just an uint128, or a byte array of the same size. A DB can use them as fixed size keys just fine, and they convert quickly from/into their string representaion.

The problem is indexing and generation.

For example sqlite already generates sequential ids whether you use them or not.

Non sequential uuids suck for indexing as the table grows. There are some benchmarks and explanations to look up.

The benefit is you can merge aggregates from different DB instances without a fuss. And you can generate uuids without asking your DB.

brycelarkin

6 replies

19h12m

2024-07-05 23:16:07 UTC

For postgres, you want to use “bigint generated always as identity” instead of bigserial.

westurner

3 replies

16h52m

2024-07-06 01:36:26 UTC

Why “bigint generated always as identity” instead of bigserial, instead of Postgres' uuid data type?

Postgres' UUID datatype: https://www.postgresql.org/docs/current/datatype-uuid.html#D...

django.db.models.fields.UUIDField: https://docs.djangoproject.com/en/5.0/ref/models/fields/#uui... :

class UUIDField: A field for storing universally unique identifiers. Uses Python’s UUID class. When used on PostgreSQL and MariaDB 10.7+, this stores in a uuid datatype, otherwise in a char(32)

[...] Lookups on PostgreSQL and MariaDB 10.7+: Using iexact, contains, icontains, startswith, istartswith, endswith, or iendswith lookups on PostgreSQL don’t work for values without hyphens, because PostgreSQL and MariaDB 10.7+ store them in a hyphenated uuid datatype type.

From the sqlalachemy.types.Uuid docs: https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqla... :

Represent a database agnostic UUID datatype.

For backends that have no “native” UUID datatype, the value will make use of CHAR(32) and store the UUID as a 32-character alphanumeric hex string.

For backends which are known to support UUID directly or a similar uuid-storing datatype such as SQL Server’s UNIQUEIDENTIFIER, a “native” mode enabled by default allows these types will be used on those backends.

In its default mode of use, the Uuid datatype expects Python uuid objects, from the Python uuid module

From the docs for the uuid Python module: https://docs.python.org/3/library/uuid.html :

class uuid.SafeUUID: Added in version 3.7.

safe: The UUID was generated by the platform in a multiprocessing-safe way

And there's not yet a uuid.uuid7() in the uuid Python module.

UUIDv7 leaks timing information ( https://news.ycombinator.com/item?id=40886496 ); which is ironic because uuids are usually used to avoid the "guess an autoincrement integer key" issue

avg_dev

2 replies

16h15m

2024-07-06 02:13:05 UTC

Just noting, the commenter you replied to said:

use “bigint generated always as identity” instead of bigserial.

The commenter you are replying to was not saying anything about whether to use UUIDs or not; they just said "if you are going to use bigserial, you should use bigint generated always as identity instead".

westurner

1 replies

3h18m

2024-07-06 15:10:05 UTC

The question is the same; why would you use bigint instead of the native UUID type?

Why does OT compare text and UUID instead of char(32) and UUID?

What advantage would there be for database abstraction libraries like SQLalchemy and Django to implement the UUID type with bigint or bigserial instead of the native pg UUID type?

Merad

0 replies

2h15m

2024-07-06 16:13:06 UTC

Best practice in Postgres is to use always use the text data type and combine it with check constraints when you need an exact length or max length.

See: https://wiki.postgresql.org/wiki/Don't_Do_This#Text_storage

Also, I think you're misunderstanding the article. They aren't talking about storing a uuid in a bigint. They're talking about have two different id's. An incrementing bigint is used internally within the db for PK and FK's. A separate uuid is used as an external identifier that's exposed by your API.

thiht

0 replies

10h30m

2024-07-06 07:58:20 UTC

Why?

inopinatus

0 replies

14h2m

2024-07-06 04:26:34 UTC

I agree and apologise for writing bigserial out of ossified habit. Identity columns hew to the SQL standard, bigserial is a pg-ism. I actually do still use it when the ORM’s DDL generator prefers as much (looking at you Rails), there’s no gain from fighting the middleware.

For other readers: the two are almost the same in behaviour, they differ primarily in that identity columns don’t need a USAGE grant on their associated sequence, and the ALWAYS makes it harder (but despite the term not actually impossible) to override the default value on row insert.

TeeWEE

3 replies

15h14m

2024-07-06 03:14:12 UTC

If you have distributed data creation. (Creating data on the client). And a CRDT style mechanism for syncing, then you can’t use bigserial because of the simple fact that it is sequential. The best solution here is uuidv7. Since you can generate these at the client even when offline.

arp242

2 replies

14h49m

2024-07-06 03:38:43 UTC

That's not true, you can increment by 2, 10, 100, or any number. I'm not saying that's necessarily the best solution, but it's not true that you can't use it.

azinman2

1 replies

14h34m

2024-07-06 03:54:04 UTC

But if you’re distributed or offline incrementing by an arbitrary amount can still create collisions unless you’re willing to increment by very very large amounts at random, in which case you’ve effectively reinvented uuid

arp242

0 replies

14h4m

2024-07-06 04:23:59 UTC

No, every server will have its own series:

  Server 1: [1 4 7]
  Server 2: [2 5 8]
  Server 3: [3 6 9]

Or whatever is appropriate. You can even add/change this later on.

Again, I'm not saying it's necessarily the best solution, I'm just saying it's possible. I'm not really interested in a long discussion about uuid vs. serial because it's been done to death a million times. Previous poster said it's impossible and that's just flat-out wrong.

necovek

1 replies

14h40m

2024-07-06 03:48:03 UTC

I've never played enough with UUIDs in Postgres, but I wonder if you could publicly expose only the random bits (i.e. the second half if an UUIDv7) and have another non-b-tree index on SUBSTR(id, 16) for quick lookups based on that value. Similar is done for "short commit hashes" too.

Though I would wonder why go with UUIDs in that case at all?

mjevans

0 replies

14h27m

2024-07-06 04:00:48 UTC

Offhand, I remember _reading_ about that but haven't ever used it in practice so please test and confirm...

# Postgres can Index Function Results (including what you described)

# Postgres does not magically know to USE those results. To use them you must E.G. JOIN (probably works for WHERE?) or ORDER or LIMIT by the same parameter that went into the Index (which is how it's matched up).

Generally, the Primary Key should either be 'small' (BigInt can count as such) and only used internally or be something strongly relevant to how the records in that table can be limited for most (or the most) queries.

darby_nine

1 replies

14h54m

2024-07-06 03:33:57 UTC

You really have to understand the usage patterns of a program to give a recommendation. Anyone who doesn't is just a grifter.

I don't trust anyone who gives recommendations sight unseen.

bruce511

0 replies

13h59m

2024-07-06 04:29:19 UTC

100% this. Context is everything, and advice without understanding context should be suspect at best.

That said, I've found that contexts can change. Personally I started with the database being local (send single). It the migrated to remote (web), after that to distributed and from there to mobile.

As contexts gave changed I've found we needed to switch to UUID (and the switch was a pain.) A lot of our problems would simply not exist if we'd used uuid from the start.

So now my "default advice" is uuid, unless there's a -specific- reason not to. For me, performance is not that reason (your database outlives multiple generations of hardware, and faster hardware will dwarf optimisation-via-data-type.)

And don't get me started on natural versus surrogate identifiers- that ship sailed decades ago.)

rattray

0 replies

2h44m

2024-07-06 15:43:56 UTC

Stripe are lying when they say their IDs are random

Where does Stripe make that claim?

I think most of the value of Stripe-style IDs is in their other properties, like the information they bear, readability, copy-pasteability (which uuids lack - double-click to select often stops at hyphens), etc.

pin24

0 replies

8h16m

2024-07-06 10:12:19 UTC

Just out of curiosity, why string-encoded UUIDs? A native/binary UUID column type should not take more than 16 bytes, whereas its hexadecimal string representation requires 36 bytes (if we assume 1 byte per character). That would result in an index size more than twice as big as required.

olalonde

0 replies

14h3m

2024-07-06 04:25:29 UTC

They have some random parts but when analyzed in sets, a large chunk of the binary layout is clearly metadata, including embedded timestamps, shard and reference keys, and versioning, in varying combinations depending on the service.

Could you share this analysis? Seems interesting.

nextaccountic

0 replies

19h45m

2024-07-05 22:43:27 UTC

Why string encoded column? Is it just to make the table bigger?

Why not just use the UUID type??

fabian2k

23 replies

22h1m

2024-07-05 20:27:27 UTC

My somewhat naive understanding was that random UUIDs were not that big of a deal in Postgres because it does not cluster by primary key. And of course a UUID (16 bytes) is larger than a serial (4 bytes) or bigserial (8 bytes) by a factor of 2-4 . This certainly might matter for an index, but on a whole table level where you have 20+ bytes overhead per row this doesn't seem that big of a deal for anything except very narrow tables with large row counts.

So far my impression is that there are a whole lot of other things I need to worry about in Postgres before I spend time considering serial vs. random UUID vs. ordered UUID. Am I wrong here and this is something that really matters and you should invest more time in?

oppositelock

10 replies

21h47m

2024-07-05 20:40:50 UTC

Random UUID's are super useful when you have distributed creation of UUID's, because you avoid conflicts with very high probability and don't rely on your DB to generate them for you, and they also leak no information about when or where the UUID was created.

Postgres is happier with sequence ID's, but keeping Postgres happy isn't the only design goal. It does well enough for all practical purposes if you need randomness.

sgarland

7 replies

16h37m

2024-07-06 01:51:25 UTC

Postgres is happier with sequence ID's, but keeping Postgres happy isn't the only design goal.

It literally is the one thing in the entire stack that must always be happy. Every stateful service likely depends on it. Sad DBs means higher latency for everyone, and grumpy DBREs getting paged.

matharmin

4 replies

8h59m

2024-07-06 09:28:39 UTC

Postgres is usually completely happy enough with UUIDv4. Overall architecture (such as allowing distributed id generation, if relevant) is more important than squeezing out that last bit of performance, especially for the majority of web applications who don't work with 10 million+ rows.

sgarland

3 replies

4h56m

2024-07-06 13:32:35 UTC

If your app isn’t working with billions of rows, you really don’t need to be worrying about distributed anything. Even then, I’d be suspicious.

I don’t think people grasp how far a single RDBMS server can take you. Hundreds of thousands of queries per second are well in reach of a well-configured MySQL or Postgres instance on modern hardware. This also has the terrific benefit of making reasoning about state and transactions much, much simpler.

Re: last bit of performance, it’s more than that. If you’re using Aurora, where you pay for every disk op, using UUIDv4 as PK in Postgres will approximately 7x your IOPS for SELECTs using them, and massively (I can’t quantify it on a general basis; it depends on the rest of the table, and your workload split) increase them for writes. That’s not free. On RDS, where you pay for disk performance upfront, you’re cutting into your available performance.

About the only place it effectively doesn’t matter except at insane scale is on native NVMe drives. If you saturate IOPS for one of those without first saturating the NIC, I would love to see your schema and queries.

wibblewobble125

1 replies

4h19m

2024-07-06 14:09:27 UTC

Sometimes distribution is not for performance but tenant isolation for regulatory or general isolation purposes. I work in such an industry.

sgarland

0 replies

4h7m

2024-07-06 14:21:29 UTC

Fair point. You can still use monotonic IDs with these, via either interleaving chunks to each DB, or with a central server that allocates them – the latter approach is how Slack handles it, for example.

fovc

0 replies

4h49m

2024-07-06 13:39:13 UTC

Scale isn’t the only reason to have distributed systems. You could very well have a tiny but distributed system

SoftTalker

1 replies

15h37m

2024-07-06 02:51:23 UTC

DBRE? I guess DBA is too old fashioned for the cool kids?

sgarland

0 replies

15h24m

2024-07-06 03:03:47 UTC

Listen, I didn't make the title up, I just grabbed onto it from the SRE world because I love databases.

There are some pragmatic differences I've found, though - generally, DBAs are less focused on things like IaC (though I know at least one who does), SLIs/SLOs, CI/CD, and the other things often associated with SRE. So DBRE is SRE + DBA, or a DB-focused SRE, if you'd rather.

globular-toast

0 replies

9h22m

2024-07-06 09:06:31 UTC

The whole point of UUID is distributed creation. There's nothing about random ones (UUIDv4) that makes it better for this purpose.

Spivak

0 replies

18h20m

2024-07-06 00:08:18 UTC

Random UUID's are super useful when you have distributed creation of UUID's, because you avoid conflicts with very high probability and don't rely on your DB to generate them for you

See Snowflake IDs for a scheme that gives you the benefit of random UUIDs but are strictly increasing. Which is really UUIDv7 but fits in your bigint column. No entropy required.

stouset

3 replies

17h9m

2024-07-06 01:18:56 UTC

UUIDv7 are good enough that you can just avoid thinking and use it as a “default”. The worst consequences of doing so are some slightly impacted performance and leak of some timing information but these are extremely minor drawbacks.

Using completely random UUIDs is IMO the worst choice. It’s fine right up until it isn’t, and then you are stuck in hell with no good way out.

groestl

2 replies

4h26m

2024-07-06 14:01:38 UTC

Using completely random UUIDs is IMO the worst choice

Can't say that without context. I've worked in systems where even the version bits were randomized, and for good reason (i know, technically, no UUID anymore).

stouset

1 replies

3h55m

2024-07-06 14:33:16 UTC

I’m talking about default choices. UUIDv7 has few significant downsides for most cases. UUIDv4 has serious downsides. There are obviously cases where the latter is appropriate but that will come from unique requirements.

groestl

0 replies

3h33m

2024-07-06 14:55:09 UTC

How about secure by default?

munk-a

2 replies

21h7m

2024-07-05 21:20:40 UTC

To:

Am I wrong here and this is something that really matters and you should invest more time in?

Specifically, no - you don't need to worry about it. Reconfiguring your tables to use a different style of unique identifier if your tables have a unique identifier is a bit of a pain but no more so than any other instance of renaming a column - if you want to minimize downtime you add the new column, migrate data to the column, deploy code that utilizes the new column and then finally retire the old column. Even if the previous version of the table lacked any sort of unique key it is still possible to add one after the fact (it's a bit technically harder to properly keep them in sync but it is possible to do safely).

It's just a question of the cost of doing so and the benefits of it - I work in a system that exclusively uses integral keys and our data is such that we don't really suffer any downsides from that choice - if you're working in a larger system with less confidence in the security practices of other teams then avoiding sequential keys so that you have obscurity to fall back on if someone really drops the ball on real security isn't the worst idea... but I think the really compelling reason to prefer UUIDs is for the power of distributed generation... that really only applies to inherently decentralized or astoundingly large products though - and if your product eventually grows to astoundingly large you'll have plenty of time to switch first (probably the wake-up call will be closing in on running out of 4 bit serial unique keys).

big_whack

1 replies

20h36m

2024-07-05 21:52:29 UTC

Reconfiguring tables to use a different kind of unique ID (primary key in this context) can be a much bigger pain than an ordinary column rename if it is in use by foreign key constraints.

snicker7

0 replies

15h44m

2024-07-06 02:43:49 UTC

OR if the primary key is exported out of the DB, i.e. for constructing URLs.

sgarland

0 replies

17h57m

2024-07-06 00:31:31 UTC

You still get performance hits from Visibility Map lookups, and WAL bloat.

As a DBRE, I believe it always matters, and you should invest time in it. Pragmatically, it is unlikely to have noticeable effects until your tables are at least in the 1E5 rows range, if not higher. Unfortunately, by that point, it’s likely that you’ll find other things of higher importance to deal with, so the problem will only grow.

nextaccountic

0 replies

5h56m

2024-07-06 12:31:47 UTC

Postgres doesn't automatically reorder rows by clustering on a primary key but you can perform clustering manually, and if the table isn't completely filled it can attempt to maintain rows in order (as long as there is room for it)

https://www.postgresql.org/docs/current/sql-cluster.html

masklinn

0 replies

21h53m

2024-07-05 20:35:02 UTC

While Postgres does not cluster table, uuids do affect indexes significantly, random insertion causes increased index slack which translates to cache bloat and thus longer traversal and lower cache residency.

kevin_nisbet

0 replies

21h26m

2024-07-05 21:01:47 UTC

So far my impression is that there are a whole lot of other things I need to worry about in Postgres before I spend time considering serial vs. random UUID vs. ordered UUID. Am I wrong here and this is something that really matters and you should invest more time in?

Like any sort of optimization I believe this will depend on your workload and what's important.

For me when I switched to UUIDv7 a few months ago, it was basically no effort to switch v4 to v7 on a relatively new system. I was observing much higher batch insertion latencies than I expected, and producing inserts that touch less of the btree on an index created a very noticeable reduction in insertion latencies. But my workloads and latencies may look nothing like yours. On amazon RDS instances with EBS volumes and relatively low memory, insertion latency stood out, so using strategies that reduce the number of disk blocks that are needed has an outsized performance impact.

This of course would produce different results on different hardware / system sizing.

feydaykyn

0 replies

9h14m

2024-07-06 09:13:54 UTC

A dev at $WORK thought it wasn't a big deal too because they "tested" on their unrestricted dev setup with 3 rows in the table.

Of course production crashed when their latest sql query did a multijoin on the real table with a few millions of rows. The size of the uuid needed to join filled the available RAM, everything slowed to a crawl and the system collapsed.

The uuid as primary key can be seen as a risk multiplicator : it will amplify any performance issue you may have, converting a temporary slowness into a full system stop.

declan_roberts

21 replies

19h8m

2024-07-05 23:20:26 UTC

Call me old fashion but I really like integer autoincrement primary keys. It's easy to understand and obviously simple to sort. Furthermore when working on large batch projects you can just store the last primary key as your high water mark and get everything greater than that.

I suppose TSID works to this end, but certainly more complicated.

tommy_axle

9 replies

17h49m

2024-07-06 00:38:42 UTC

It is simpler but like everything else it depends on the application. For a private app you can generally get away with it. Something that's more public facing? The ID will most likely leak information. As systems mature and you introduce things like replication, having IDs that are more universal starts looking good. In general, starting off with a uuid like uuid v4 or nanoid is a good bet.

lazide

8 replies

17h36m

2024-07-06 00:51:48 UTC

The info leak, specifically, is [https://en.m.wikipedia.org/wiki/German_tank_problem].

sgarland

7 replies

16h32m

2024-07-06 01:56:16 UTC

Oh no, someone might know the number of customers, or the rate of signups.

Traditional businesses can figure this out by sitting in the parking lot. Why SaaS has decided it’s a huge problem is beyond me.

lazide

2 replies

14h33m

2024-07-06 03:55:20 UTC

By sitting in every parking lot, yes. Which requires physical presence. And hedge funds do indeed do it against some targets specifically as a leg up, and pay quite a bit of money to do so, presumably because it is worth it to them.

It certainly helped the Allies in the war, as previous intelligence had the rate of tank production much higher, and they were expending a lot of effort trying to exceed that previous false number.

Information is power, etc. etc.

As to if it's worth it for your SaaS to mitigate is up to you of course. I know I notice when things like Invoice #'s, my userID, customerID, etc. get shown, and it's a small number, or small delta. And that impacts my behavior. Does it also impact someone like a VC’s behavior? Or a competitor? Unknown.

But Hell, a bunch of people right now are probably launching SaaS's with NO-OP auth flows for documents or images, or with trivial external SQL injection flaws, which is a way bigger problem.

But it doesn't make the vulnerability/problem non-existent. Also feel free to use 16 bit auto incrementing primary keys for all your core tables if you want.

But if we're in an engineering discussion on the topic, knowledgeable folks will bring up the issues with it, because that is what they do. Most would feel it is their duty to do it, even.

sgarland

0 replies

5h6m

2024-07-06 13:22:09 UTC

But if we're in an engineering discussion on the topic, knowledgeable folks will bring up the issues with it, because that is what they do.

To be clear, this is not directed at you specifically, and I have no idea what your level of expertise is on anything.

In general, I’ve found HN commenters level of knowledge to be fairly bimodal. They’re either regurgitating things they read on a Medium blog, or they really know their shit.

Every time this topic comes up, people delightedly mention the German Tank Problem, but I have never, not once, seen anyone post an actual example of when a modern business got rekt by a competitor using knowledge gained from monotonic IDs.

Re: security viz. AuthZ, my stance is the same as SQL injection – it’s such a trivially easy problem to avoid that it shouldn’t be a consideration for harming the performance of the DB. (Not that SQL injection mitigations cause performance impacts; sorry, that analogy didn’t work well)

akira2501

0 replies

42m

2024-07-06 17:45:52 UTC

Which requires physical presence.

It requires physical access. Presence can be achieved by placing a dash camera or window camera strategically.

and they were expending a lot of effort trying to exceed that previous false number.

The idea here being that if the allies just made more tanks than the axis they would automatically win? It's a lionized story of intelligence agency cleverness built squarely on the back of insane military "strategy."

karmajunkie

1 replies

12h58m

2024-07-06 05:30:15 UTC

i mean, you might care if the investors you’re trying to woo for that bridge round figure out your churn is a lot higher than you’re willing to admit… or worse, your traction is terrible.

sgarland

0 replies

5h17m

2024-07-06 13:11:08 UTC

How is a monotonically incrementing integer going to reveal churn or traction? You’re not reusing IDs.

bruce511

1 replies

13h39m

2024-07-06 04:49:01 UTC

Yes, sure, it leaks some information - but to be fair printing an invoice also leaks information.

For me the priority is security. If I get a link (visible or invisible) that contains a numeric ID, there's the possibility to tweak that link with another number.

Ideally, the server treats that number as suspect. Every. Single . Time. In practice I only need one developer to miss the check in one place and I have a serious security issue. Like leaking my entire customer list serious. This has happened multiple times -that we know of-.

Yes obscurity is not security. But sequential numbers are insecure by design, and it takes a lot of well-maintained walls to protect them. One crack leaves them very vulnerable.

sgarland

0 replies

5h13m

2024-07-06 13:15:33 UTC

Call me naïve, but surely you can have fuzzing tests in CI?

I’m also going to use this as yet another example of why getting rid of QA in favor of Ship It Now was a bad idea.

laserDinosaur

8 replies

4h59m

2024-07-06 13:29:07 UTC

Call me old fashion but I really like integer autoincrement primary keys.

Just hope you never have to merge tables from two databases together.

hu3

7 replies

4h39m

2024-07-06 13:49:27 UTC

This is exceptionally rare in most projects.

I know of only one person in my entire career that had to do this. And they managed it just fine despite working with auto-incrementing big ints.

Yet some folks advocate that all projects should pay an expensive insurance against this elusive event of two databases being merged.

laserDinosaur

5 replies

4h19m

2024-07-06 14:08:53 UTC

And they managed it just fine despite working with auto-incrementing big ints.

I wonder how. I've had to do several big merges in my career, and it was always a nightmare because of all the external systems which were already referencing and storing those pre-existing ints. Sure, merging the databases is easy if you don't mind regenerating all the Id's, but it's not usually that simple.

grey-area

4 replies

3h54m

2024-07-06 14:34:25 UTC

Simplest way is to keep the identifiers from DB A and increment all the identifiers from DB B by an offset. Third parties complicates things of course but internally it can be pretty simple, so maybe they just didn't have too many third parties using the IDs.

hu3

3 replies

3h46m

2024-07-06 14:41:50 UTC

That was it if I recall.

They wrote a small script with the logic involved in the merging. PKs and FKs of only one database had to be incremented by an offset of max(table.pk) + safe margin.

They did this for each table.

Once this script was tested multiple times with subsets of each database, they stopped production and ran the script against it (with backup fallbacks). A small downtime window in a Sunday.

And that was it. The databases never had to pay the UUID tax, before or after.

laserDinosaur

2 replies

3h3m

2024-07-06 15:25:09 UTC

they stopped production

Oh I see, we're talking about two entirely different worlds here, lol.

hu3

0 replies

2h20m

2024-07-06 16:08:01 UTC

Not being able to stop production database for a very short window once in a lifetime is another exceptionally rare business case.

I've seen architecture astronauts make their business pay unreasonable tech insurances by adding complexity to avoid simply pausing production for some minutes when it could have been much cheaper this way.

And from my understanding, in the case I mentioned, they chose to stop production to simplify the process. But they didn't have to.

A mixture of replication plus code changes to write in two databases could also have solved the issue.

Most business die because they can't move fast enough. Not because their production database stopped for a few minutes.

grey-area

0 replies

2h3m

2024-07-06 16:25:19 UTC

Stopping production on db B isn't really a requirement, just makes it easier.

VeejayRampay

0 replies

1h48m

2024-07-06 16:39:57 UTC

I swear, it really reads like "oh you like SOME TECHNOLOGY? we'll se how you like it when FARCICALLY RARE EVENT happens

foreigner

1 replies

10h18m

2024-07-06 08:09:49 UTC

Be advised using the last auto-incremented value as a "high water mark" and getting everything greater than that is not 100% reliable, because in some scenarios auto-incrementing values can be written out of order.

Koffiepoeder

0 replies

4h31m

2024-07-06 13:57:25 UTC

Yup, not many people seem to know about CACHE vals and/or sequence reservation. It's a problem that I've had to explain a couple of times already to colleagues. And unfortunately hidden assumptions are oft the most dangerous kind.

netcraft

14 replies

21h18m

2024-07-05 21:10:27 UTC

Another day, another article saying not to use UUIDs as PKs. I've maintained systems using UUIDs stored as char(36) with million record tables without issue - This is not an endorsement, just explaining that this is bikeshedding. Should you use v7 when you can? Sure. Would int/bigint be faster in your benchmarks? Sure. But the benefits totally outweigh the speed differences until you get to a very large system. But instead of worrying about this, spend your energy on a million other things first and then celebrate when UUIDs become your bottleneck.

paulddraper

3 replies

20h45m

2024-07-05 21:43:23 UTC

I haven't maintained any sizable database system without any issues, least of all performance ones.

I call BS.

netcraft

1 replies

19h8m

2024-07-05 23:20:24 UTC

lol, I of course don't mean that I have no issues, performance or otherwise. Only that I have never had an issue with UUIDs in postgres

paulddraper

0 replies

17h19m

2024-07-06 01:09:21 UTC

The "issue" is that you're doubling the size of your identifiers, and increasing memory usage of indexes.

And memory is the key factor of any sizable database.

tomnipotent

0 replies

20h3m

2024-07-05 22:24:42 UTC

This gave me a good chuckle, and has generally been my experience. Systems grow often in unpredictable or unintuitive ways.

You can pay the cost for something upfront, and the cost of maintaining it, and in the long term paid too much for something you didn't actually need.

Alternatively you can wait to pay it until you're certain you need it but the work involved has become much more significant, in which it can cost more than it would have to have built and maintained it from the beginning.

Compounding the issue is the build-up-front scenario costs fade with time and you don't really think about them, but build-when-you-need-it always creates a stir even if the costs are less overall than build-up-front.

Either way something will go wrong no matter how many times you predict where the cards will fall.

sgarland

2 replies

16h23m

2024-07-06 02:05:03 UTC

Milliseconds matter, especially when they compound. If your DB can return a SELECT in sub-msec time (to its network boundary, obviously) instead of 10 msec, that adds up when a given page might require a dozen or more trips.

Also, I have never seen devs (PMs, really – devs are the unfortunate souls slogging through tickets) suddenly care about performance-related tech debt. Why would they, when you can just click a button and double your DB’s hardware? Boom, problem solved… until it isn’t. Eventually, you run out of scaling, and since you probably don’t have a DBA/DBRE (else they’d have been screaming at you for months), it’s going to be extremely painful to solve now.

The bare minimum I’m asking – as a DBRE – is to use UUIDv7 and store them in Postgres’ native UUID type. That’s all. That’s an incredibly small amount of effort to put forth.

mewpmewp2

1 replies

13h59m

2024-07-06 04:29:19 UTC

But then by default you are leaking potentially business sensitive data with your id if you are using it as public facing, which is unsecure design by default.

I would rather have secure data by default and opt in to optimise when it is clear this info is fine to leak.

sgarland

0 replies

5h14m

2024-07-06 13:13:55 UTC

See other reply; I don’t believe that exposing this is as big an issue as people think. But even with that, there’s also no reason to do so. Internal ID friendly to the DB and eternal, random ID that does get exposed is a common practice. Or use JWE/JWT, and never show either.

jrochkind1

1 replies

16h58m

2024-07-06 01:30:00 UTC

Thanks, good to hear.

If you are using PG, simply using it's native UUID type instead of char(36) seems like a no-opportunity-cost obvious optimization choice at least though, if you have a choice?

netcraft

0 replies

1h46m

2024-07-06 16:41:38 UTC

yes absolutely- that system was built a long time ago before the uuid type - you should def not store things in chars (and probably shouldnt use char at all, use text)

was saying that even with that poor implementation we still were not having issues using uuids

arp242

1 replies

14h7m

2024-07-06 04:20:41 UTC

A million rows is quite small.

A string will use 36 bytes per row. bigserial will use 8 bytes per row. At 4 billion rows that's about 100G. Now imagine a row with 3 foreign keys to other tables with string UUIDs and you're wasting 300G (vs UUID type) or 400G (vs. bigserial), for no good reason. And doing things like "where id = ?" will be slower. You will be able to keep fewer rows cached in memory. Etc.

It's absolutely not a bikeshed. And migrating all of this later on can be a right pain so it's worth getting it right up-frong.

It's also not more effort to do things right: usually it's exactly the same effort as doing it wrong.

groestl

0 replies

4h12m

2024-07-06 14:15:39 UTC

And migrating all of this later on can be a right pain so it's worth getting it right up-frong

I've never had to move from uuids to integers. I've had to move from integers to uuids plenty of times though.

therealdrag0

0 replies

15h25m

2024-07-06 03:02:42 UTC

Million records isn’t very many. But ya we have tables with billions of records and v4 UUIDs hasn’t been a blocker.

groestl

0 replies

4h15m

2024-07-06 14:13:15 UTC

When UUIDs become your bottleneck.

When UUIDs become your bottleneck you'll be celebrating for picking UUIDs, because now you can move to a distributed architecture and not worry about IDs.

eerikkivistik

0 replies

21h9m

2024-07-05 21:18:41 UTC

I was gonna say... When your system is large enough to run into this specific performance bottleneck, pop a bottle and celebrate, you are making enough money to solve that problem.

While knowing this information is useful, most services fail in different domains and problems way before you reach that point. I'm not sure people really comprehend how hard you can hit a single machine before you need to distribute a workload.

AtlasBarfed

12 replies

21h33m

2024-07-05 20:54:48 UTC

UUIDs are guaranteed to be unique?

They often use tricks like including the MAC address of the generator machine and other ways to increase uniqueness assurances.

It was my understanding that uuids are simply very very unlikely to duplicate in situations with random generation.

munk-a

7 replies

20h45m

2024-07-05 21:42:49 UTC

Your understanding is correct but you're underselling very very in this context. It is astronomically unlikely to hit a collision with the advised generation methods. If you want a possibly easier to grasp parallel git relies on SHA hashes never colliding and will break in a really awful way if you can produce two commits in a tree with the same hash - it's so astoundingly unlikely that people are okay summarizing it as "Never gonna happen" - it certainly will eventually, but it might not happen until the earth is swallowed by the sun.

lulzury

4 replies

19h32m

2024-07-05 22:56:35 UTC

What you stated makes intuitive sense, but it does make me wonder why the RFC states the following in the security considerations:

Implementations SHOULD NOT assume that UUIDs are hard to guess. For example, they MUST NOT be used as security capabilities (identifiers whose mere possession grants access). Discovery of predictability in a random number source will result in a vulnerability.

https://datatracker.ietf.org/doc/html/rfc9562#name-security-...

jrochkind1

2 replies

16h50m

2024-07-06 01:38:13 UTC

I don't know the math here specifically, but being hard to guess is a different quantity than chance of collision when following the algorithm.

That is, if you aren't trying to have a collision and following the algorithm that has that aim, a collision can be exceedingly unlikely; but they can still be easy to guess if you are trying to predict someone else's assignment.

stoperaticless

1 replies

8h38m

2024-07-06 09:50:10 UTC

to illustrate point above:

Sequence that easy to guess, but will never collide until it wraps back to 1.

n1=1,n2=2,…

(Caveat: if generated only by single node/thread)

jrochkind1

0 replies

2h42m

2024-07-06 15:46:11 UTC

Thanks, not sure what's up with the downvotes, it's simply a fact of the math, and in particular the math here.

A computer random number generator with a seed deterministically based on MAC address is another less trivial demonstration of the idea.

necovek

0 replies

14h51m

2024-07-06 03:37:04 UTC

That's mostly focusing on broken RNG properties (i.e. static seed for a pseudo random generator, an outright bad algorithm, or bad entropy source for non-pseudo RNG). They make random numbers predictable, thus your UUIDs become predictable.

Cryptographic hashes are usually also dependent on RNGs having good properties (encryption has been broken due to bad RNGs too), but they are a bit more involved to figure out even if you know the input randomness, and are easier to invalidate (change the secret).

aero_code

1 replies

19h8m

2024-07-05 23:20:06 UTC

"it certainly will eventually" - I think even that is underselling how unlikely it is for Git 256-bit hashes to collide. I calculated (taking into account the birthday paradox), that even if 8 billion people on Earth each created a Git commit every second, that they would have to do that non-stop for 1,588,059,911 trillion years before there's a 50% chance that any of the two commits have the same hash. Our sun is predicted to only last 0.005 trillion years more.

√(2^256)×1.1774÷8000000000÷3600÷24÷365 = 1,588,059,910,945,875,138,261

UUIDs are more likely to collide but still basically impossible. You'd have to generate √(2^122)×1.1774 = 2714899559048203259 to have 50% chance of a collision. Just to store the UUIDs for that database would take 39,506,988 TB of space. If you aren't thinking about your database not fitting on millions of drives, don't think about UUID collisions.

openthc

0 replies

14h29m

2024-07-06 03:59:08 UTC

Collision Calculators:

- https://alex7kom.github.io/nano-nanoid-cc/ - https://devina.io/collision-calculator

throwawayffffas

2 replies

20h55m

2024-07-05 21:33:04 UTC

They are not theoretically guaranteed they are in practice though. 2^128 and 122 are big numbers. Even if you are producing a billion per second you have a 50% chance of not getting a collision for 100 years.

jacobgorm

1 replies

20h44m

2024-07-05 21:44:28 UTC

I've used 128 secure-random bits for ages, not caring for any of the UUID version nonsense. Per the birthday paradox, I need to have 2*64 entries in my tables to reach 50% collision probability, and it will be a while before I can afford that much storage anyhow.

efilife

0 replies

2h55m

2024-07-06 15:33:07 UTC

And of course the sanest person gets downvoted, as always

wongarsu

0 replies

21h22m

2024-07-05 21:06:21 UTC

UUIDv2 uses Mac addresses, but those turned out to be mostly a bad idea. Today when people say UUID they mean UUIDv4, which is just 124 random bits (and 4 version bits). Assuming a good random number generator it's basically impossible to generate the same UUIDv4 twice by pure chance. Even if you make billions of them per second it's vanishingly unlikely to happen.

nubinetwork

9 replies

19h29m

2024-07-05 22:59:12 UTC

If you're generating random UUIDs as the primary key, how do you not run into key collisions? Having to search the entire table before inserting is slow, and catching the error and trying again is also annoying.

davisp

4 replies

19h19m

2024-07-05 23:08:48 UTC

Worrying about UUID collisions is like worrying about being hit in the head by a meteor. Sure, its technically possible, but it happens so rarely that worrying about a collision as a performance concern is just a misunderstanding on how UUIDs work.

And, it’s so random that if you ever do see a collision you should immediately start looking for a compromised system or bug. This is basically how GitHub discovered the OpenSSL bug that had removed too much entropy from the RNG setup.

kevincox

3 replies

16h43m

2024-07-06 01:44:44 UTC

You don't need to worry about a collision in a UUIDv4 that you created on your server. But I have seen a surprising number of applications that took a UUID generated client side and basically upserted it. Allowing taking over resources who's ID was known via the insert API (even if the update API has proper access control).

groestl

2 replies

4h8m

2024-07-06 14:20:20 UTC

UUID generated client side and basically upserted it

Read and take notes. This is crazy in untrusted environments.

kevincox

1 replies

2h59m

2024-07-06 15:28:45 UTC

Generating IDs on the client can be very useful for offline-first systems. But you need to check for conflicts and permissions on the server (or be sure to keep the IDs secret which I wouldn't recommend).

groestl

0 replies

2h22m

2024-07-06 16:06:20 UTC

Agreed, but in that case "upsert" is also weird, since I'd structure such a system around an immutable log datastructure.

netcraft

0 replies

19h9m

2024-07-05 23:19:14 UTC

The point of UUIDs is that they're inherently unique. You can generate them in the database or in your application, or anywhere. You can move records between databases without worrying about a collision. Its this reason that theyre so useful and so many people use them that others like to write articles like this.

You do not need to search. I have generated hundreds of millions and have never hit a duplicate. Its technically possible, but its so vanishingly rare it will be something to brag about, not worry about. If you're really worried about it, and building something that cannot tolerate an error, put a try/catch around it and detect a PK failure and try again. But it will be a waste of time.

lazide

0 replies

17h33m

2024-07-06 00:55:16 UTC

It should be so rare that if it did happen, you likely have a serious bug somewhere. Crash dumping and investigating is the right course of action, not building in retry logic.

So if you are using it as a PK value, only ever insert and if there is a duplicate, blow up loudly.

Performance wise, that isn’t particularly impactful IMO.

jrochkind1

0 replies

16h55m

2024-07-06 01:32:59 UTC

If it's indexed in a unique index, as a primary key certainly would be, the DB is of course already checking for collision on insert, so there's no reason for application code to do it. (If it did, it would be an indexed lookup rather than a table scan).

So if there's a collision you'll get an error. You can write application code to handle the error (by re-generating a new ID and re-trying). Or you can figure it's so unlikely (googling for uuidv4 says "1 in 2.71 x 1018", which is pretty huge; not sure for uuidv7 which will be somewhat more likely) that you aren't going to worry about it, and if an error happens, oh well, errors happen sometimes, depending on the domain you are in, which is probably what many apps do and do fine with it.

I would guess that pg's built in uuidv4-generating implementation might re-try on colision under-the-hood, but i haven't checked to see, and it may also just count on the improbability and raise an error if it happens!

joshuamcginnis

0 replies

19h19m

2024-07-05 23:09:06 UTC

This identifier is a 128-bit quantity that is generated by an algorithm chosen to make it very unlikely that the same identifier will be generated by anyone else in the known universe using the same algorithm. Therefore, for distributed systems, these identifiers provide a better uniqueness guarantee than sequence generators, which are only unique within a single database.

https://www.postgresql.org/docs/current/datatype-uuid.html

sbuttgereit

8 replies

19h29m

2024-07-05 22:59:16 UTC

The article had a link to the PostgreSQL commitfest for UUIDv7 support, but as far as I can't tell it looks unlikely that it will actually be in PostgreSQL 17. The most recent action was the committer being removed from the task and I believe version 17 is already well past feature freeze.... Is my understanding correct? This is what I think is going on, but I can't find any substantiated facts too point me to a definite conclusion.

I was hoping to see it in version 17, but can't get a really good read on what's going on with this feature.

klabb3

5 replies

19h19m

2024-07-05 23:08:39 UTC

What database support is needed? Assuming Postgres already has the uuid type and that you can (and often should) do the actual generation of them in application code?

tyre

2 replies

19h11m

2024-07-05 23:16:41 UTC

Why should they be done in application code?

eproxus

1 replies

18h48m

2024-07-05 23:39:44 UTC

There’s no need since they are random, so you might as well generate them on your application servers which are easier to scale, to offload the write database.

bozey07

0 replies

18h25m

2024-07-06 00:02:57 UTC

IMO marking the row as generated makes its intent clearer. It also means less code - no generating UUIDs, shorter queries = more robust.

For my workloads at least, I think it's worth the tradeoff of forcing Postgres to muster 15 random bytes.

sbuttgereit

1 replies

16h57m

2024-07-06 01:30:46 UTC

you can (and often should) do the actual generation of them in application code

I can absolutely see scenarios where it would be desirable or convenient to create the these IDs in the application code. However, I can equally assert that you can, and often should, let the database be the source of truth for record identity when using database surrogate keys for that purpose... but I won't.

The truth is there are many kinds of architectures and many kinds of applications the details of which have a direct bearing on where or where not certain operations should be conducted. In many environments in which I work, the idea of "the application" can be murky. In many enterprise environments databases can be driven by and integrated into many applications, possibly from different software developers and vendors with varying degrees of influence in how they internally operate. In such environments the database can be the great unifier and placing substantial control of data in the database itself can make sense in these heterogenous application environments: but it's very much a circumstantial call.

I would expect if there really were one true way to develop application or how to use databases to achieve goals, we could well be asking why they built these functions in the first place... I suspect it's because there are a reasonable number of valid use cases to justify the lift. I'd like to think my use cases fit the envisioned valid use cases thus my interest and question.

klabb3

0 replies

6h10m

2024-07-06 12:18:26 UTC

I can absolutely see scenarios where it would be desirable or convenient to create the these IDs in the application code.

I feel like I’m taking crazy pills. Isn’t the whole (or 99% of the) point of uuids that you can generate them client/application side for idempotency, because entity creation starts there and you want an id earlier than the first round trip completion? If they’re generated by the db (with world-knowledge) then you can auto increment a 64 bit is significantly more performant.

jrochkind1

1 replies

17h1m

2024-07-06 01:26:44 UTC

That is too bad, I too would be very eager to see natively supported in Postgres. It doesn't seem like it should be so complex to implement on top of pg's existing uuid support, but famous last words?

sbuttgereit

0 replies

14h33m

2024-07-06 03:54:53 UTC

It looks like the feature had made good progress with the contention really being around the fact that the standard hadn't yet become an RFC. However, the new UUID standard did reach that milestone back in early May, but the feature kinda went radio silent... at least insofar as any public record... up to a couple days ago when the assigned reviewer/committer was removed.

But no word on if it's delayed, or there's some problem, etc. It's still tagged as 17... and some of the related work has been committed, but not the UUIDv7 itself so it seems.

brigadier132

6 replies

20h46m

2024-07-05 21:41:48 UTC

My strategy is to use v4 Uuids for anything that is not inserted frequently and don't need to be ordered (think user ids) and v7 ids for things that are.

If your dataset is small the overhead from Uuids wont matter, if your dataset is large the randomness of Uuids will save your ass when you migrate to a distributed solution.

rareitem

5 replies

19h24m

2024-07-05 23:04:23 UTC

Is there a reason to not always use v7 by default?

davisp

3 replies

19h13m

2024-07-05 23:14:51 UTC

Information leakage since they have a timestamp component. Some people may not care, but plenty of folks do. As others have said, anything security related likely should UUIDv4. UUIDv7 is basically an engineering compromise on security (they leak a timestamp) vs performance (random reads and writes to an index aren’t as performant as localized reads and writes).

rareitem

2 replies

19h10m

2024-07-05 23:18:14 UTC

Thanks, I get it. Do you have any example where leaking a timestamp could pose a security risk? I can't think of any.

mewpmewp2

0 replies

14h8m

2024-07-06 04:20:15 UTC

If it is public, any cases where for an entity, when it was created could be potentially sensitive information. E.g. some sort of legal documents, product, ecommerce data that could potentially reveal data to your competitors, etc.

lazide

0 replies

17h36m

2024-07-06 00:52:23 UTC

Timing attacks?

brigadier132

0 replies

18h54m

2024-07-05 23:33:40 UTC

Information leakage like the other person said but also if you ever move to a distributed db v7 uuids can lead to hotspots.

rootedbox

5 replies

20h28m

2024-07-05 22:00:36 UTC

Just a heads up about UUID 7.. be careful when using.

par the RFC

If UUIDs are required for use with any security operation within an application context in any shape or form then [RFC4122] UUIDv4 SHOULD be utilized.

spoiler

2 replies

20h10m

2024-07-05 22:18:06 UTC

I was thinking of adopting UUIDv7 for some of my stuff. So, I'm curious: why is this the case? Is it because of the time component?

vbezhenar

0 replies

19h44m

2024-07-05 22:43:48 UTC

1. UUIDv7 allows to extract timestamp

2. UUIDv7 allows to predict first half, if you know the timestamp.

3. UUIDv7 provides 62 bits of randomness compared to 122 bits for UUIDv4.

Whether that's a problem for your particular use-case or not, it's up for you to decide. I don't think that UUIDv7 is "insecure", it just provides different trade-offs and in some situations it might be less secure compared to UUIDv4, but I hardly see any attack vector where you could issue 2^62 requests to brute-force the ID.

lulzury

0 replies

19h36m

2024-07-05 22:52:27 UTC

That's what the RFC states:

Timestamps embedded in the UUID do pose a very small attack surface. The timestamp in conjunction with an embedded counter does signal the order of creation for a given UUID and its corresponding data but does not define anything about the data itself or the application as a whole. If UUIDs are required for use with any security operation within an application context in any shape or form, then UUIDv4 (Section 5.4) SHOULD be utilized.

https://datatracker.ietf.org/doc/html/rfc9562#name-security-...

jandrewrogers

1 replies

18h57m

2024-07-05 23:31:21 UTC

Unfortunately, UUIDv4 is often specifically prohibited for some sensitive applications due to many cases of broken RNGs undermining the security guarantees of UUIDv4. Deterministic generation plus encryption is a common choice in these cases.

jrochkind1

0 replies

17h0m

2024-07-06 01:28:34 UTC

The important point is that UUIDv7 is even worse (and unsuitable) for that application so anything prohibited UUIDv4 for those reasons would not be okay with v7 either!

londons_explore

4 replies

20h24m

2024-07-05 22:04:30 UTC

It would be nice for these comparisons to also include 'int64' so people can see how much of an overhead UUID's are compared to the traditional approach.

tills13

3 replies

18h33m

2024-07-05 23:55:29 UTC

The problem here is that auto increment ints are guessable.

The size of that problem depends on your situation

sgarland

1 replies

16h36m

2024-07-06 01:52:37 UTC

If your API is relying purely on the user ID for AuthZ, you have much bigger problems.

t-writescode

0 replies

12h50m

2024-07-06 05:38:13 UTC

It's not just authz, it can also be a reflection of how many customers you have, and that can also be its own problem.

Or, let's say you are a middleware company and you're trying to contract your services to two other companies. These companies are competitors of each other, but they're also the biggest in their space.

If you have use, let's say, autoincrementing ints for orders or operations, then the companies could use the distance between order numbers as a way of calculating how much business the other business is doing.

If that functionality is possible / easy, that may be a reason for those customers to stop using you, and you've now lost one or two of your major contracts.

Apparently this has been brought up by others and is called the German Tank Problem.

lazide

0 replies

17h37m

2024-07-06 00:50:46 UTC

Also auto increment IDs that are exposed publicly cause [https://en.m.wikipedia.org/wiki/German_tank_problem].

quotemstr

3 replies

21h40m

2024-07-05 20:48:15 UTC

If you have an option to choose, take a look at TSID maintained by Vlad Mihalcea.

TSID: > A Java library for generating Time-Sorted Unique Identifiers (TSID).

Wouldn't this TSID thing be more useful if it were implemented as a set of PostgreSQL stored procedures or something than a Java library? Not everyone uses Java.

stoperaticless

0 replies

8h20m

2024-07-06 10:07:55 UTC

- Plenty of use-cases want to generate it on application side, so they can just “insert”, instead of “insert+select the id”

- not everybody uses Postgre (either way, somebody will be left out)

busterarm

0 replies

21h27m

2024-07-05 21:01:31 UTC

It's forked from https://github.com/f4b6a3/tsid-creator. While this is also in Java, it has many reimplementations in other languages.

avkrpatel

0 replies

11h15m

2024-07-06 07:13:32 UTC

Ulid is also similar to it.

tantalor

2 replies

18h23m

2024-07-06 00:05:20 UTC

inserting UUID v7 is ~2x faster and inserting regular UUID v4

Okay but why

mythrwy

1 replies

18h16m

2024-07-06 00:12:20 UTC

It tells you in the previous section:

"UUID - even though always looks similar - comes in multiple variants. Java's UUID.randomUUID() - returns UUID v4 - which is a pseudo-random value. For us the more interesting one is UUID v7 - which produces time-sorted values. It means that each time new UUID v7 is generated, a greater value it has. And that makes it a good fit for B-Tree index."

necovek

0 replies

15h0m

2024-07-06 03:28:23 UTC

The benchmark seems to include generation in the total time. Not sure that's a useful comparison of how b-tree index behaves in each case (as UUIDv7 has fewer bits of randomness, it's also cheaper to generate).

scotty79

2 replies

21h36m

2024-07-05 20:51:52 UTC

Isn't B-Tree with UUIDv4 keys getting more balanced than with UUIDv7? Doesn't longer insert time result in faster searches later?

stoperaticless

1 replies

8h24m

2024-07-06 10:03:51 UTC

Related pondering:

- UUIDv4 causes page fragmentation (ideally, on disk pages of data would be stored in sorted fashion; HDDs still exist)

- if in your app fresh records have more reads, having them close will help with read speed (being cpu cache friendly)

hu3

0 replies

3h31m

2024-07-06 14:57:04 UTC

To contribute if I may:

Page fragmentation from UUIDv4 can be desirable for SQLite because marking the same page as dirty can give SQLite little opportunity to sync it to disk. Causing contention. In this scenario your Write Ahead Log (WAL) file just grows and grows.

SQLite introduced WAL2 which means switching between 2 WAL files instead of one, helping to mitigate unbounded contention. But it's recent. Most tools don't have SQLite supporting that yet.

sarreph

2 replies

15h20m

2024-07-06 03:08:03 UTC

We chose ULID for our Postgres PK recently, and this article helped a lot in making that decision: https://brandur.org/nanoglyphs/026-ids

I personally prefer ULID since it is compat with a UUID type and you also get a timestamp lexicographically built into the ID so that sorting by ID also means sorting by timestamp. There are multiple PG extensions to make it easy to drop in and use.

saaspirant

1 replies

15h12m

2024-07-06 03:15:43 UTC

How do you deal with ulid exposing the timestamp (since is lexicographically sortable) ? Maybe your ULID is not public facing? Or this is not an issue for your application?

I want to use something url friendly too since uuid sucks..

presentation

0 replies

5h38m

2024-07-06 12:49:49 UTC

You could probably just use random uuids and then encode/decode them to ULID whenever you read/write them.

bob1029

2 replies

8h20m

2024-07-06 10:07:42 UTC

I see some comments conflating privacy of sequence statistics with global uniqueness considerations and UX.

If your concern is globally unique identifiers (i.e. so that you can merge tables across multiple instances of your database), then UUID is exactly what you want. This is entirely what it is designed for.

If your concern is the privacy of sequence statistics, then UUID incidentally solves your problem. It may not be precisely what you want, and could continue to leak private information depending on the specific variant used. If you want privacy of sequence statistics, then I would suggest something like a sha256 hash of the primary key concatenated with a cryptographic salt stored in a separate column. These make excellent identifiers in places like APIs and URLs.

If you desire a unique identifier that has a high quality UX, then this is in addition to the above columns. This sequence generally has lower entropy than the cryptographic approach (or the UX would suffer), so additional measures should be taken to protect the privacy of the identifiers (e.g. expire redemption tokens after a period of time and reissue via email).

Autoincrementing integers are really nice if you don't actually need UUIDs. Lots of tricks you can apply with them.

almog

1 replies

51m

2024-07-06 17:36:42 UTC

It may not be precisely what you want, and could continue to leak private information depending on the specific variant used. If you want privacy of sequence statistics, then I would suggest something like a sha256 hash of the primary key concatenated with a cryptographic salt stored in a separate column. These make excellent identifiers in places like APIs and URLs.

Is there a reason not to use version 4 UUID and if time ordering is needed, save timestamp explicitly as another column?

bob1029

0 replies

41m

2024-07-06 17:47:33 UTC

Type 4 guids are a reasonable choice.

I personally prefer the extra 128 bits of entropy, but I have no evidence that says it's more secure in practice.

aabhay

2 replies

21h36m

2024-07-05 20:51:49 UTC

I think insert performance is a bad way to evaluate performance here, no? While B-Tree performance for time sorted keys is better on insert, what about during large transactions?

In SQLite, my assumption was that the consensus was towards UUID4 rather than 7 because it meant less likelihood for page cache contention during transaction locks? Would that not also roughly map onto a Postgres-flavored system? Or dues Postgres only have row-level locking?

samokhvalov

1 replies

21h26m

2024-07-05 21:02:26 UTC

there is also a problem of data locality and blocks present in caches (page cache, buffer pool) at any given time, in general -- UUIDv4 is losing to bigint and UUIDv7 in this area

shi

0 replies

6h26m

2024-07-06 12:02:26 UTC

This is a valid point but it highly depends on the use case and larger context on whether this will be relevant or not. If you have a table where you fetch multiple rows close to each other in paginated manner this would be relevant for the performance but if you only fetch individual records by uuid, data locality wouldn’t increase the performance.

Pinus

2 replies

10h22m

2024-07-06 08:06:15 UTC

I’m not sure I understand the bit about time-ordered values being a better fit for B-trees. Actually, I’m sure that I don’t understand it... If anything, I’d naively assume that inserting already-sorted values would be a worst case, always going down the same branch of the tree and triggering re-balancing operations left and right (figuratively — literally mostly left!). The Wikipedia article on B-trees notices that there are some shortcuts that can be taken if one knows one is going to do a batch insert of sorted values. Is that what happens? But then how does Postgres know this?

conradludgate

1 replies

9h57m

2024-07-06 08:30:45 UTC

Btrees don't need to be rebalanced. They only split. The cost of splitting to the root is counteracted by the cache locality of always accessing the same set of pages. Inserting into the last page only needs O(logn) page accesses and O(logn) page writes worst case.

Pinus

0 replies

9h21m

2024-07-06 09:06:51 UTC

Ah, of course. Thanks!

rjurney

1 replies

14h47m

2024-07-06 03:40:41 UTC

I've started using python-ulid as ULIDs seem superior to UUIDs. https://pypi.org/project/python-ulid/

openthc

0 replies

14h39m

2024-07-06 03:48:55 UTC

I love ULID too; but it's really just UUIDv7 (or v8) with a mustache (it's all just 128bit IDs). And in the PG world, where there isn't native support for ULID one can use UUIDv7/8 and bridge that gap. We use ULID extensively in exposed IDs, it's very URL friendly, copy/paste friendly, etc -- but it's a UUID datatype in the DB. (ie: ULID => UUID => DB)

osigurdson

1 replies

17h40m

2024-07-06 00:47:44 UTC

I'm surprised there is no mention of hash based indexes (i.e. CREATE INDEX ... USING hash...) since lookups would conceivably always use equality.

openthc

0 replies

14h31m

2024-07-06 03:56:48 UTC

With UUIDv7/v8 (and ULID) there are some timestamps in the front half. I've seen spots where the query was in the style of `uuid_col >= 'SOME_UUID_0000' AND ulid_col <= 'SOME_UUID_FFFF'` When one is using them ast the timing for record create/insert these things happen.

hu3

1 replies

22h10m

2024-07-05 20:17:48 UTC

Would have been nice to also include bigserial/bigint in the INSERT performance comparison to understand the impact of migrating those to UUIDv4 and UUIDv7.

AdamJacobMuller

0 replies

22h0m

2024-07-05 20:27:42 UTC

Also testing using postgres-native uuid generation

branko_d

1 replies

8h10m

2024-07-06 10:18:16 UTC

I'm surprised author didn't mention foreign keys: since primary key is often referenced by foreign keys, then if you have a "fat" PK all your FKs will also be "fat". This can be solved by using the UUID as an alternate key and the regular integer as a primary key in the "top" table, and then referencing that integer from all the "child" tables.

Obviously, this has ramifications for data write perf, but may well be worth it depending on the use case. Technically, you could also make the integer alternate (and leave UUID as primary), but that may not de desirable for clustered / index-organized tables (for DBMSes that support them).

groestl

0 replies

4h18m

2024-07-06 14:10:28 UTC

If the UUID is used for external representation, now you need to join a row to get it.

blits

1 replies

9h38m

2024-07-06 08:49:40 UTC

You might also be interested in TypeID, which is extension of UUIDv7 and also sortable and unlike UUIDv7 more human readable.

It is inspired by Stripe IDs and looks like that: user_2x4y6z8a0b1c2d3e4f5g6h7j8k

I've recently built a postgres extension for it that allows to use it like UUID: https://github.com/blitss/typeid-postgres

hu3

0 replies

4h13m

2024-07-06 14:14:55 UTC

love the underscore instead of hyphen. easy to double click for selecting text.

GiorgioG

1 replies

5h17m

2024-07-06 13:10:53 UTC

UUIDs are miserable to work with. For 99% of use cases, sequential IDs are fine. Can they be guessed? Sure, but your software should guard against unauthorized access. Security through obscurity or randomness is a poor excuse for using UUIDs as PKs. If you don't want to expose your IDs, use a slug.

hu3

0 replies

4h42m

2024-07-06 13:46:25 UTC

Agreed.

The dev experience of debugging with UUIDs involved degrades so much, it is depressing.

For example it's much harder to spot patterns and wrong IDs in SQL query results if you are looking at these giant blobs of random characters called UUIDs.

vog

0 replies

20h27m

2024-07-05 22:01:28 UTC

Note that the article's link to the UUID v7 standard is meanwhile outdated. You should instead head directly to RFC 9562:

https://datatracker.ietf.org/doc/html/rfc9562

(which wasn't yet finished at the time of the article)

samokhvalov

0 replies

21h31m

2024-07-05 20:57:17 UTC

some related stuff:

- https://commitfest.postgresql.org/48/4388/ (original patch created live https://www.youtube.com/watch?v=YPq_hiOE-N8)

- https://postgres.fm/episodes/uuid

- https://postgres.fm/episodes/partitioning-by-ulid

- https://gitlab.com/postgres-ai/postgresql-consulting/postgre...

rizky05

0 replies

18h26m

2024-07-06 00:02:29 UTC

I've been using gen_random_uuid() (postgres's function) as default value in primary key. is it any downside ?

necovek

0 replies

14h46m

2024-07-06 03:41:53 UTC

The benchmark for UUIDv7 insertion time includes UUID generation: when looking at simply index update cost, I'd like to see those decoupled.

FWIW, I do expect UUIDv7 to take roughly half the time to generate if it has roughly half the random bits of UUIDv4 and an unprotected source of entropy is used.

kortex

0 replies

2h51m

2024-07-06 15:37:17 UTC

Has there ever been an attack/leak based on time metadata in UUIDs/ULIDs/Snowflakes/etc (not sequential), in a domain outside of sensitive ones (HIPAA/healthcare, defense)?

I'm skeptical that for the vast majority of uses, have time data in the IDs (and also not leaking it from actual timestamp fields) is an issue, and for the cases where it does, just use random IDs.

Even theoretically, is there a way to translate timestamps into a German tank problem? If I give you a sampled set of timestamps (with no other node identifiers), can you estimate the cardinality of the greater dataset?

hakanderyal

0 replies

11h1m

2024-07-06 07:26:54 UTC

Using ULID has probably the best ROI among all the architectural decisions I've made. Generating IDs on the server before insert is helpful, and not just for distributed system issues.

If I don't want to leak the timestamp, I just use an auto generated integer along with it.

avkrpatel

0 replies

11h15m

2024-07-06 07:12:43 UTC

Instead of uuid7 why cant they use ulid?