return to table of content

Sqids – Generate short unique IDs from numbers

parhamn
28 replies
19h55m

Side note: there are some business insights you can get from a company using serial ids.

i.e if you sign up and get user id 32588 and make another account a few days later, you can tell the growth rate of the company.

And this is possible with every resource type in the application.

I do wonder how much the url bar junk thing matters these days. I tend to use uulids (waiting on uuid v7 wide adoption), and they're a bit ugly, but most browsers hide most of the urls now anyway. The fact that there is a builtin time component comes in clutch sometimes (e.g. object merging rules).

pacificmint
9 replies
19h36m

you can tell the growth rate of the company.

You can even do this when you don’t know the exact interval by using probabilities. The Allies used this method to estimate German tank production in World War II by analyzing the serial numbers of captured or destroyed tanks.

This is know as the German Tank Problem [1]

[1] https://en.wikipedia.org/wiki/German_tank_problem

leobg
5 replies
17h53m

Very interesting.

I’m a lawyer and using sequential IDs in a fraud case right now, to determine the number of victims.

Unfortunately, so far, I only have the IDs of two victims, and those are from just within about a month, whereas the fraud has likely been going on for several years. Just simply extrapolating that growth rate isn’t going to be very accurate.

Also, I suspect that the perpetrators did not start at ID 1.

namtab00
3 replies
17h12m

ehm, yeah, n=2 will not get you anything useful...

that'll be like trying to determine the average salary in a company with only two known ones, which could be the janitor's and the CEO's

selcuka
1 replies
15h22m

that'll be like trying to determine the average salary in a company with only two known ones, which could be the janitor's and the CEO's

Ironically that would be somewhat close to the actual average.

mcherm
0 replies
12h43m

It would be significantly above the average unless the company is ridiculously top-heavy or has shockingly little variation in salary. Or if the "salary" for the CEO ignores certain compensation (eg: paid a salary of $1 + stock options).

pixel8account
0 replies
6h2m

Even with n=1 you can get something useful. IIRC "on average" if you have ID x than the best population estimation is 2*x. Of course the error margin is immense, but it's still better than nothing.

infogulch
0 replies
16h40m

You might try to use the information to find more victims first.

lhamil64
1 replies
19h19m

It also makes it slightly easier to perform certain attacks since it's trivial to figure out other IDs.

ozim
0 replies
8h39m

Making non-guessable IDs for broken authorization is security by obscurity.

If you have integer IDs it is also trivial to find authorization flaws on your own. Any pentester will go for it right away.

If you make non guessable IDs they might skip it and go look for other stuff.

rkagerer
0 replies
9h53m

I would have introduced random, increasing skips in the sequence to make my army look 10x bigger.

throwaway47747
2 replies
15h26m

Can confirm, when I worked in VC we used this to verify order volume for a number of startups we were evaluating. For one startup, I wrote a bot to place a small order a few times a day, and log the order number.

ChrisCinelli
1 replies
12h43m

Did you use this for due diligence to verify that the data reported by the entrepreneur in the pitch was good or you were looking at some companies in a specific space and checking the company that has more orders?

throwaway47747
0 replies
11h37m

The former.

paulddraper
2 replies
18h39m

most browsers

Not chrome...

Also, links are a thing in chat, etc

parhamn
1 replies
18h23m

Heres what a recent youtube (which squid documents as a sample use case) link I shared looked like:

https://www.youtube.com/watch?v=fFMzQ3tYTFU&pp=ygURY2hQImVzZ...

Or Twitter:

https://x.com/elonmusk/status/172853302828286055507?s=20

Or TikTok:

https://www.tiktok.com/@<userId>/video/730292574259232054785...

While I tend to strip the tracking params and there are extensions that do this, I don't think most people do. These URLs are pretty 'ugly'.

So if the links that are being shared most on the internet (YT, TikTok, Twitter) don't care, you probably shouldn't either. I think the onus is on the UI layers (Chat apps, etc) to show urls how they look best on their respective platforms.

Edit: to this point, it looks like HN truncates these to make them less ugly too.

wodenokoto
0 replies
12h6m

If you use the share button and not the url bar you get prettier and shorter urls.

Don’t know how common that is. Wouldn’t be surprised if no -techies don’t know how to copy from url-bar.

hot_gril
2 replies
12h32m

Yes but just using Sqids doesn't fix this. Sqids are decodable. You need to use a random or otherwise unpredictable input.

What's the advantage of uuid7 (sequential + random) vs uuid4 (full random) for this?

krembo
1 replies
10h47m

You can extract the timestamp from UUID7/8 so it also can reveal business information.

hot_gril
0 replies
10h5m

That's why I'm asking, it seems like a liability.

raggi
1 replies
18h2m

I see so many organizations add weird slowdowns from debts associated with this. I reflect on some of the most successful tech businesses of the last decade and remember that all their APIs exposed this kind of data early on and many still do.

Does anyone have an example they can reference of a business being harmed by this information being out there?

wodenokoto
0 replies
12h8m

I don’t know any stories of digital businesses but there was a case where someone went and counted customers at the door only to realize that the company was lying on the yearly reports.

So I guess that kinda harmed that fraudulent business strategy…

aequitas
1 replies
18h9m

At an internship long ago, my boss instructed me to always add a few extra to the auto incremented order ID so customers couldn’t guess how business was going if they happen to order stuff quickly in a row.

qumpis
0 replies
15h52m

How big and random were these "few extras"?

taneq
0 replies
13h41m

I always take note of invoice numbers for this exact reason, they give you a feel for how busy they are.

maxcan
0 replies
2h20m

Yes, that's how I know I was roughly the 600,000th person to sign up for thefacebook.com.

hinkley
0 replies
16h31m

And this is the stuff you get if you manage to get your access control right.

Get it wrong and we jump from actionable business metadata to actionable business data (like perhaps which of your customer's customers are poachable)

diznq
0 replies
1h29m

The solution I prefer is to simply just encrypt the data such as IDs.

Instead of giving user an ID in response, user gets hmac(cipher(Data, secret_key), secret_key) + cipher(Data, secret_key) and then some simple pre-request handler just iterates over query params / form data and decrypts them if signature matches.

It also works as a really nice CSRF protection as user ID of currently signed user can be embedded into Data and checked if current user.id == decrypted data.id.

Another nice advantage is that you can deny the request right in the beginning as you know ahead of time that the provided data is not valid (signature doesn't match), saving some DB queries.

The down side is that URL gets pretty long though, but if that's hidden by browser or user doesn't care, it's a non-issue

cortesoft
0 replies
12h8m

Unless they are doing master-master replication so are incrementing by something other than 1

progne
18 replies
21h25m

In a Ruby app we just convert to a high base, like

  > 1234567890.to_s(36)
  => "kf12oi"
That gets us most of the way there, but Sqid has a Ruby library and lets you set a much higher base, including upper case characters, and I suppose, emoji. We're going to need much bigger numbers before that space savings makes much difference. I like it, but it's hard to know when something like that is worth adding a dependency.

vyrotek
14 replies
21h11m

I believe a big part of the idea is for the hash to be unpredictable as well.

If I figure out you're using (36) then I know the next number 1234567891 is "kf12oj".

Not the case with Sqids.

echelon
5 replies
21h5m

I'd prefer to use crockford-encoded entropy with Stripe-style token prefixes to create unique ID namespaces. Run in through a bad words filter, and it's perfect.

user_1hrpt0xpax7ps

file_xpax7psaz0tv6az0tv6

Etc.

In distributed systems you can use the trailing bytes to encode things like author cluster, in case you're active-active and need to route subsequent writes before create event replication.

Easy to copy, debug, run ops/incall against. If you have an API, they're user-friendly.

Of course you still want to instruct people the prefixes are opaque.

wombatpm
3 replies
20h49m

Yeah don’t forget the bad words filter. I worked on an IKEA mailing where the list processing house was adding an autogenerated discount code to the address label. The customers received codes with BOOB, DICK, TWAT, and CUNT embedded within. People were not happy.

otteromkram
2 replies
19h21m

Did they never make an IKEA purchase after that or did they get over it like a normal adult?

I don't work retail, but something tells me people will make a stink out of just about anything if it meant potentially free products or other compensation.

Plus, are you filtering just English curse words or all curse words for countries that use Latin characters?

jl6
1 replies
16h44m

The risk is not in offending someone, but in that someone posting the rude string on social media in real or mock indignation, causing the outrage machine to turn on your brand. There’s a steady supply of bottom-feeding journalists waiting to write the article “IKEA’s new system is sending hate messages to customers”.

richev
0 replies
12h40m

Can you cite an example of a journalist writing an article that makes such an accusation?

sandGorgon
0 replies
12h17m

this is very interesting. do you know how amazon creates its order id ? they are all numeric

hot_gril
3 replies
21h7m

You can easily brute-force this. Sqids also says it's not good for sensitive data.

8organicbits
2 replies
19h55m

It looks like an easy brute force too, there's no compute-hard operations here. I guess you could scramble your alphabet? Otherwise Uk always comes after bM, etc.

posix86
1 replies
14h58m

My understanding is that you can re-order the source alphabet, and encode numbers with swapped characters. Unless you know of 36 numbers that they're exactly 1 id apart, you will always have uncertainty to what the ids actually map to.I guess given a large, large number of ids along with the order they're assigned (which might be given through the time at which they're assigned), you could create a pribabilistic statement on the actual order of the 36 characters based on the fact that most numbers increase/decrease faster the bigger they are. (this fact is e.g. used to detect fabricated bank statements - if the first digit of any number in the statement is equally likely to br 1 or 9, the numbers are randomly generated whereas if they're real, 9 is less likely than 1)

8organicbits
0 replies
5h54m

Scambling the source alphabet should have an effect similar to a monoalphabetic substitution cypher. This is not strong cryptography. If the attacker has any ability to generate IDs quickly, like by creating user accounts or other resources they can create many IDs with known ordering. Likely effective against non-serious attempts.

paulddraper
2 replies
18h37m

No, squids are predictable, you can't use them to hide information.

They call it out on their front page.

richbell
1 replies
13h50m

I haven't looked at the implementation yet but HashIds (the former project name) required a salt. It would be weird if they changed that.

slig
0 replies
4h41m

The `salt` in hashids just shuffled the alphabet. Now they removed the `salt`, but you can have the same level of "obfuscation" as before if you shuffle the alphabet yourself before calling the library.

pelagicAustral
0 replies
18h52m

Correct me if I'm wrong, but, It cannot be unpredictable, which makes the library redundant for security concerns, which would be the one business case to seek for anything other than an UUID (which is already built into Ruby).

candiddevmike
1 replies
21h9m
paledot
0 replies
12h4m

Ironically its default encoding is base 55. I figured if you're going to use emoji you should at least use the size of the space to maximize the numbers of bits per grapheme beyond what is possible with ASCII, but apparently not.

Which I guess is the part where senior (citizen) programmers like me get triggered as promised in the README.

exxos
0 replies
20h53m

I didn't think of that, but this is a nice trick!

no_wizard
14 replies
22h46m

I like the idea, though I use nanoid with the safe letter dictionary (it excludes letters used for profanity[0])

They should use a similar dictionary approach IMO because I looked at the implementation and it’s hardcoded to look for “bad” words

Otherwise looks real straightforward! I’d love to see some performance test suites for it

[0]: https://github.com/sqids/sqids-javascript/blob/ebca95e114932...

[1]: though with UUID v4 so common to generate and well optimized in most languages I wonder if these userland solutions are really better. You can always generate a UUID and re-encode with base32 or base64 with also is well optimized in most languages

lxgr
12 replies
22h38m

it excludes letters used for profanity

That doesn't seem possible. How would that work?

I looked at the implementation and it’s hardcoded to look for “bad” words.

If you mean https://github.com/y-gagar1n/nanoid-good, that seems to be doing the same thing.

In general, I'm a bit weary of solutions that "guarantee no bad words" – this is usually highly language-specific: One language's perfectly acceptable name is another language's swear word.

njharman
3 replies
20h52m

That doesn't seem possible. How would that work?

agree; b00b, DlCK, cntfcker

But I suppose, if user doesn't get to craft input, the collision space of converted numerical ids and words like above is sufficiently small to be ignorable.

Sharlin
2 replies
19h44m

Besides vowels, nanoid excludes 0, 1, 3, 4, 5, I, l, x, X, v, V, and other lookalikes, so the chances of generating something naughty in any language are close to zero.

jl6
1 replies
16h29m

Humans have a high capacity for spotting rudeness. Nanoid’s nolookalikesSafe alphabet would allow blwjb69FKmyD7CK.

(Sorry)

Two4
0 replies
1h50m

Buy me drink first, jeez

Sharlin
2 replies
22h14m

Omit vowels and you're 90% of the way there; omit the vowel-looking digits 0,1,3,4 and you're probably >99% of the way there.

gberger
1 replies
20h4m

fxck

Sharlin
0 replies
19h48m

Which is, evidently, why nanoids also excludes x and X, as well as v and V (fvck).

livrem
1 replies
21h3m

Looks like the dictionaries used are from this file?

https://registry.npmjs.org/naughty-words/-/naughty-words-1.2...

From a quick look, the lists are pretty short, except for the one with English words that at least have some 404 words, but I can imagine there are far more bad words that you want to avoid than just those?

ape4
0 replies
19h28m
Silasdev
1 replies
21h7m

It's particularly funny because their example docs for .NET outputs "B4aajs", which to any Swedish l33t speaking individual, would read "Bajs", which means "shit"

owyn
0 replies
12h9m

Somewhere there's a database for every bad word and every bad typo in every language and that one just got added.

no_wizard
0 replies
22h30m

This is the implementation: https://github.com/CyberAP/nanoid-dictionary

We use it in a highly internationalized product spanning multiple languages and haven’t yet ran into a complaint or value on audit that would constitute something offense in any language per our intl content teams anyway.

That isn’t to say it’s 100% (and simply enough we don’t audit every single URL) but I suspect we would have gotten at least a user heads up by now

Never the less we are moving our approach to uuids that get base32 encoded for some of our use case for this. They’re easier to work for us in many scenarios

tttp
0 replies
22h36m

I tried something similar with a fixed alphabet that guarantees no profanity and a checksum (luhn)

https://github.com/tttp/dxid

habitue
14 replies
21h54m

Skipping profanity seems like a liability in this design. It means in order to preserve the encoding you need to make the banned word list immutable, otherwise old sqids will decode to the wrong thing when you get them back.

runlevel1
6 replies
20h54m

The stupid simple way I did this ages ago was:

1. Start with a-z.

2. Drop all vowels, numbers, most homoglyphs, and the letter 'x'.

3. Map digits 0-9 to one of the remaining letters.

4. Stringify the integer and replace the digit in each decimal place with its corresponding character.

For my use-case, all the numbers were >7 digits long, so the odds of you getting an offensive acronym were reasonably low unless you started combining them.

But there's no perfect solution. As this dataset shows, you can find offense in almost anything if you look hard enough:

California Personalized License Plate Requests Flagged for Review 2015-2016: https://docs.google.com/spreadsheets/d/18IUVU9Q4uN_lxqNd5AsN...

arp242
2 replies
19h14m

Many of those reviewer comments are utterly moronic. And that is my polite opinion.

How does this work? Is there a review board? Is it put to public review? A few of them like "dick out" and "shtlord" are reasonable, but many of them seem so bonkers it looks like the work of trolls.

Anyway, TIL that 1970s Intel was a MS-13 gang outfit and that Octocat really means "eight vaginas".

Zecc
1 replies
14h0m

Many of those reviewer comments are utterly moronic.

Reason for review: hostile, insulting, or degrading

/s

arp242
0 replies
3h56m

WONTFIX: behaves as intended.

air7
1 replies
19h11m

California Personalized License Plate Requests Flagged for Review 2015-2016: https://docs.google.com/spreadsheets/d/18IUVU9Q4uN_lxqNd5AsN...

Wow this is a funny peek into a weird perdicment where people need to justify that they have a good reason to have a specific license plate.

Some seems obviously ok such as:

INT13H

314 PI

And some are obviously not:

DRY(hand emoji)JOB

DICK OUT

Come to think of it: Can license plates have emojis now?!

joeframbach
0 replies
17h58m

California allows one hand, star, or heart shape in the plate.

hinkley
0 replies
16h16m

Most numbers can be used as letters or phonemes.

I could give a fuck about avoiding swear words, but if you want to avoid slurs and eyebleach-inducing ideas and still have any sort of compact representation, I suspect we have to look not at problematic letters but problematic groups of letters. There's nothing intrinsically wrong with the letter E. Not with G, I, N, or R, but you can sure get a lot of attention you don't want by arranging them in the wrong order. K and Y aren't bad either, unless you're hating on Jewish people.

So maybe there's a 5:4 or a 5:3 encoding out there where you avoid making syllables.

Etheryte
3 replies
21h30m

I don't think this holds, you can enforce filtering in the encoding step, i.e. be strict about what you output, but always decode, even if the input is profanity. This means you can also be backwards compatible if you update the list etc. So in short, the old maxim of be strict about your outputs and lenient about your inputs.

fimdomeio
2 replies
21h15m

From their FAQ: "The best way to ensure your IDs stay consistent throughout future updates is to provide a custom blocklist, even if it is identical to the current default blocklist."

lights0123
0 replies
19h12m

The *encoding* changes. The decoding stays consistent:

Decoding IDs will usually produce some kind of numeric output, but that doesn't necessarily mean that the ID is canonical. To check that the ID is valid, you can re-encode decoded numbers and check that the ID matches.

The reason this is not done automatically is that if the default blocklist changes in the future, we don't want to automatically invalidate the ID that has been generated in the past and might now be matching a new blocklist word.

Etheryte
0 replies
19h18m

In that case it sounds like a shortcoming on their part. There is no fundamental reason to have that limitation. I understand it can make the implementation easier to not have it, but in my opinion being blocklist change agnostic would be a much better value offering.

8organicbits
1 replies
19h25m

Agreed, this is a big risk made worse that the default word list can change over time.

https://sqids.org/faq#future-blocklist

rafram
0 replies
15h54m

It should at least take a blocklistVersion parameter (or similar), even if passing the wrong version just generates an error.

kaetemi
0 replies
19h20m

It's a base62 encoder that takes multiple integers as input. Probably a bit-length prefixed encoding. I am assuming it just pads an extra junk integer to re-roll the encoded number.

wslh
11 replies
16h32m

I offered something similar here [1] and it is used by many companies including Philip Morris, and the Argentinian tax agency for the same purposes.

The technique I used (I should publish it as open source) is using a Feistel cipher [2] with a key. The Feistel network could be adjusted to almost any size and the key used in every round is an expansion of a general key using a key derivation function [3] (KDF3 if I remember well).

Basically it is a symmetric cipher of arbitrary size.

[1] https://www.nektra.com/products/secure-coupon-code-generator...

[2] https://en.wikipedia.org/wiki/Feistel_cipher

[3] https://en.wikipedia.org/wiki/Key_derivation_function

smashed
3 replies
15h46m

I'm not a cryptographer and did not understand half the things you said except symmetric crypto.

I think there are 2 problems with this approach:

How do you prevent the double spend problem, for example duplicate entry tickets. You would have to mark the ticket as used in a central database anyway to prevent it

What happens if the secret key material is compromised? Anyone can issue new valid numbers, etc..

Please correct me if I'm wrong.

wslh
2 replies
15h3m

No problem, the two answers here:

1/ You don't have duplicates because a Feistel network assures you there is no duplicates: every input has a different output with a fixed key in every round.

2/ If the secret is compromised anyone can issue valid numbers in the same way that if your secret encryption keys for encrypting your data exposes it. The idea is that the secret key material is never compromised as it is assumed in all security cases. You should custody with the right measures based on the attack vectors you have. The custody of secrets is independent of this method.

Please let me know if you have more questions.

pixel8account
1 replies
5h46m

The idea is that the secret key material is never compromised as it is assumed in all security cases.

That's not true, we have (perfect) forward secrecy, backwards secrecy and key rotation mechanisms because we often care what happened after the key is inevitably compromised. In this case the problem makes it hard to "rotate" the keys in a meaningful way, but I'm yet to see a proof it's impossible.

wslh
0 replies
5h3m

I think we are talking about different things here or the conversation is not clear.

The assertion you mentioned is the "what" assumption while what you said is the "how" we came to that assertion. Most probably you take my word idea in a specific sense other than I wanted to use. My answer was informal because we are here in HN and not writing a paper.

ZephyrP
3 replies
15h10m

Feistel ciphers are a good technique for doing just this but it's also worth noting that if all you are looking for is "produce a pseudorandom permutation of 1..N without actually shuffling a list of numbers" you can also use an LFSR as well.

wslh
2 replies
15h8m

The difference is that the method is more secure than an LFSR.

ZephyrP
1 replies
15h8m

of course, but an LFSR is going to be faster (provided you have a reasonable number of rounds in your Feistel cipher), have some situationally desirable statistical properties and is easier to adapt than a format-preserving encryption technique like a Feistel network.

wslh
0 replies
15h0m

Sorry, but faster for the computing capabilities we have in every electronic device is no significant in the application. I have not seen a customer inquiring about the speed of this process. They were looking to a method that is more secure than others.

For example, in the case of Philip Morris was about winning prizes and imagine if they tried other methods before where some smart people reversed the method.

rdpintqogeogsaa
1 replies
10h40m

Interestingly, it seems there might have been independent invention of the same idea[0]. Have you checked for any patents in the space?

[0] https://bytes.grubhub.com/why-we-use-crypto-when-generating-...

wslh
0 replies
5h7m

Thank you for the article. I was not aware of that article. My first implementation was done in 1997 or 1998. I am involve in computer security for three decades and I was familiar with Feistel networks. I am not a cryptographer but I have reverse engineered algorithms using this structure. Never checked patents because I think this is a natural construction and I always wondered why there were no other public ideas around this problem. This is not the first time it happened to me.

nwroot
0 replies
10h58m

Securecouponcodes.com is down. Was looking for PHP example.

sandstrom
7 replies
18h26m

Neat library!

We're using randomly generated strings for many things. IDs, password recovery tokens, etc. We've generated millions of them in our system, for various use-cases. Hundreds of thousands of people see them every day.

I've never heard any complaints about a random content-id being "lR8vDick4r" (dick) or whatever.

But nowadays our society is so afraid of offending anyone, that profanity filters has extended all the way to database IDs and password recovery tokens.

(there are some legit cases, like randomly generated IDs for user profiles shared in public URLs, that users have to live with, but even there just make the min length 8 and you're unlikely to have any full-word profanity as the complete ID; put differently, I don't understand why they made the block list an opt-out thing)

3cats-in-a-coat
2 replies
17h58m

The block list is 2/3 of the (minified) library. I found this entire choice odd.

First, it's highly incomplete because you can find at least 10x more combinations spelling the same "word". And probably 10x more slurs that aren't in this block list. Second, because it's hardcoded in your source. Third, because there are more elegant solutions.

Such as to pick an alphabet that can't spell readable words unless you're trying really hard to read a slur into it. Say this (no vowels or digits):

bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ (length 42)

The full lower+upper+digits alphabet they use is 62. Feels like you're losing a lot, but... not really.

- A 128-bit id in base 62 = 22 letters.

- A 128-bit id in base 42 = 24 letters.

JUST TWO MORE LETTERS. And it's one more letter for 64-bit id (11 vs 12). And we can avoid this entire silliness. The problem is the author doesn't realize that logN is... logarithmic, I suppose.

darkpatterns
0 replies
17h22m

Totally agree. 2/3 is wild, especially given it seems like you could mitigate most of the risk just by removing vowels from the dictionary.

3cats-in-a-coat
0 replies
4h1m

A slight mod, I'd remove Y despite not exactly a vowel, and add back digits that can't be interpreted as vowels.

bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ25679

Gives us base 45. And below is a JS snippet to make an id. There's your lib.

    function id(num) {
        num = BigInt(num);
        const dict = "bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ25679";
        let id = '';
        while (num > 0n) {
            id += dict[Number(num % 45n)];
            num /= 45n;
        }
        return id || dict[0];
    }
Example:

    id(123456789012345678901234567890n);

    "bq99hC6fbtjLrkxLPm"

noirscape
1 replies
17h28m

There's other "general" rules when it comes to random human-readable tokens such as not using Os and Is if your strings include numbers - people can and will confuse them with 0s and 1s if they have to type them over.

Most gift card tokens for example don't allow the use of those two (or quietly correct it) to avoid making that mistake.

davchana
0 replies
17h1m

California State Driving License and or ID Numbers are One Alphabet & then 7 digits. But always, if the character at second place is Zero, everybody reads it as Alphabet O.

hobo_mark
0 replies
18h7m

Well if you don't filter, things like this may happen:

https://github.com/compiler-explorer/compiler-explorer/issue...

Me and you will know it's just random characters, but if HR enters the chat...

doublemint2203
0 replies
18h7m

I'm actually more convinced the problem is corporate risk management to deal with the tendencies of social media to overhype issues by design, rather than a statement on society

dfc
6 replies
22h59m

It's weird under "Get Started" they have links to 40 different languages. You can only get started with 15 of the 40 languages listed, the other 25 are skeleton repos asking for people to start the repo to indicate interest.

LeFever
2 replies
22h36m

It’s kinda clever. The people most likely to look at this project are also likely ideal candidates for implementing the library in a new language (Developer, FOSS enthusiast, interested in the project, need the library in a language they’re familiar with that isn’t implemented yet).

Also, the language pills differentiate between those that have been implemented (color logo, dark text, bold) and those that aren’t (grayscale).

alas44
0 replies
22h16m

Also can help track which languages people click on, probably a good proxy of where there would be the need to develop a lib

4kimov
0 replies
22h18m

Good points. Those pages also contain links to old implementations (Hashids), because a lot of projects still use those and want to be able to find them.

hooverd
1 replies
22h43m

Maybe a slam dunk first FOSS contribution?

ctoth
0 replies
21h17m

This seems like a perfect use case for an LLM :)

vyrotek
0 replies
21h2m

The approach definitely works. Some time ago I saw .NET listed but discovered it wasn't complete. I was eager to replace an existing Hashids implementation so I made some comments, shared a starter-snippet, and then someone was excited enough to complete in just a few days. It was great to see how quick the community stepped in. Maybe there was a bit of Cunningham's Law in effect with my contribution, ha.

https://github.com/sqids/sqids-dotnet/issues/2#issuecomment-...

s4i
5 replies
12h51m

If the original numeric ID can be figured out from the sqid string, then what’s the point of the conversion?

hot_gril
2 replies
10h46m

The point is to make a long number more human-friendly. But I don't get why these are non-sequential then.

s4i
1 replies
8h4m

Unless we are talking about very long numbers, surely the numbers are easier for people to deal with, say over the phone, etc?

hot_gril
0 replies
7h59m

I was going to say it's for very long numbers, but I tried an example, and 123645634 becomes ARsz1pHw789c7ESzhy. The output is actually longer, and more complicated.

Looks like it uses a hash. I was expecting it to just convert the base except with something to skip profanity, which would give you something much shorter going base 10 to base ~36. Tbh I don't see why it's like this.

nnf
1 replies
12h13m

The original numeric ID(s) can only be decoded if you know the original alphabet that was used for encoding.

hot_gril
0 replies
10h47m

I wouldn't bank on that information not getting out. It may even be easy to reverse-engineer. Sqid says not to use it this way.

andix
5 replies
17h20m

Actually I'm a bit disappointed that it can't format 128 bit integers or byte arrays. That would allow formatting UUIDs. I'm not a huge fan of public facing integer IDs. There is always the risk of leaking some kind of critical information with ascending IDs.

So I will probably keep Base64URL formatting my UUIDs to make them shorter for URLs, QR Codes and so on.

Quick example:

  20b30b32-d421-4cfb-bdbc-9a4e0475abea
  =>
  MguzICHU-0y9vJpOBHWr6g
It's important to keep in mind that there are different ways to convert an UUID to a byte array (little/big endian, order of segments, ..)

hippich
1 replies
6h4m

I actually used sqids algo for something very different where I had to encode arbitrary sized byte arrays. And with Ruby it was very simple to remove limit - I think it was matter of monkey-patching https://github.com/sqids/sqids-ruby/blob/main/lib/sqids.rb#L... to return infinity.

The reason for the limit is most likely to ensure interoperability with libraries in other languages where working with bignums is much more complicated.

andix
0 replies
5h36m

This limit could easily be lifted with byte array encoding support.

Mogzol
1 replies
16h56m

Even if it could format 128 bit integers, I don't see how that would be any better than base64-encoding them. Base64 is already pretty close to the maximum efficiency you can get for shortening UUIDs while still using URL-safe characters.

andix
0 replies
16h41m

To add some of the additional features Squids provide. You can also convert any smaller integer to a string with Base64.

stavros
0 replies
16h34m

The shortuuid library can do that:

https://pypi.org/project/shortuuid/

James_K
5 replies
22h32m

I guess they haven't heard of base-64.

dymk
2 replies
22h25m

That doesn't solve the same set of problems as TFA. Randomized output order for sequential input, skips IDs that include profanity.

jbverschoor
0 replies
18h1m

There's no info on how this actually works. Actually I have no idea what it does in the first place.. From the FAQ:

How can I make my IDs unique?

The library accepts a custom alphabet from which it can generate IDs. Simply pre-shuffle the default alphabet that's provided.

I also have no idea how to use the library.. What are the 3 numbers parameters for encode()? The actual code: https://github.com/sqids/sqids-ruby/blob/main/lib/sqids.rb

I'd simply base-58 a GUID instead. At least you'll know what kind of constraints / sorting the IDs will have.

James_K
0 replies
18h4m

There is no need for randomised output, and they've said themselves that you can recover the input from the output. It's literally just so that it looks like something cool and technological is happening. You can skip IDs that contain base64 profanity just by doing the reverse encoding on your block list and skipping those IDs. Or you could exclude vowels from your alphabet and do a base 54 encoding. Or you could grow up and accept that your 4 millionth ID which no one is going to read might have a naughty word in it (horror).

xjia
0 replies
22h29m
majkinetor
0 replies
22h19m

One of the points is also to use custom alphabet.

notfed
4 replies
9h51m

I appreciate that the author clearly states that security, i.e., output can't be reversed back to the input, is a non-requirement. We can't criticize the author too much for that either, because, as a rule, "random-looking id generator" algorithms will always be either not secure, or not short, or not collision-free. Or they'll be a key-value database.

A secure "random-looking id generator" is called a block cipher. Block ciphers with less than 128 bits of output are widely considered insecure: this corresponds to about 22 base64 characters.

Going further, you probably do want to use a secure algorithm. History is full of people thinking "we don't need this to be secure, it's not used for security-sensitive things" and later regretting it. Some case studies at [1].

[1] https://www.schneier.com/blog/archives/2016/04/security_risk... .

pixel8account
1 replies
6h36m

It's good to consider this but... Plenty of sites expose user ID as a regular integer. In some cases you might want to avoid this (leaking user count to competitors etc), but I have never heard about anyone calling this a vulnerability.

woofcat
0 replies
3h21m

I hear this all the time. Every 3PPT report I see is cranky if you have userid=2345 as you can enumerate it.

Personally I think it's stupid but this is a tempting solution.

geek_at
1 replies
8h23m

i.e., output can't be reversed back to the input

But in the "Not good for" section it states the opposite: "[not good for] User IDs can be decoded, revealing user count" so it can be reversed?

residentcoder
0 replies
7h16m

Added parenthesis to make what your parent commentor is saying clearer:

I appreciate that the author clearly states that security, (i.e., output can't be reversed back to the input), is a non-requirement.
hot_gril
4 replies
15h0m

I think the "unique IDs" part of the title throws people off and brings security to mind. "The main use of Sqids is purely visual" is what you need to know. It's not necessarily for IDs, it's just a user-friendly way to encode/decode numbers.

paulddraper
2 replies
14h52m

Exactly. An alphanumeric encoding for integers.

mlhpdx
1 replies
14h42m

An argument could be made for calling it “compression of the string encoding”. Since HTTP is string oriented, that makes some sense as an optimization of sorts, perhaps.

hot_gril
0 replies
11h20m

It doesn't seem compressed, in fact the output of this is less compressible than the input. It's made for human interface.

hot_gril
0 replies
12h41m

Also, why is it non-sequential? That suggests it's unpredictable, but it's not. Just realized that was the thing bothering me about this.

whalesalad
3 replies
22h51m

This used to have a totally different name iirc, they used to be called hashids

unfunco
0 replies
4h9m

I've always thought it was called bijection.

resoluteteeth
0 replies
22h49m

Yeah, it says that both in the page title and the logo at the upper left

ChrisArchitect
0 replies
15h48m

A recent Show HN: from the dev (who's also in this thread) explaining the rebrand https://news.ycombinator.com/item?id=38185437

revenga99
3 replies
18h55m

is there anyway to generate short unique id's from UUID's? snowflake is incredibly slow when joining UUID => UUID columns.

s4i
0 replies
12h59m

Encoding the UUID in e.g. base 64 or Crockford’s base 32 (instead of the standard hex+dashes) saves you some space.

resoluteteeth
0 replies
17h16m

Isn't the thing that makes UUIDs UUIDs that they have enough bits that they are guaranteed to be unique without any synchronization?

I don't think you could reduce the amount of random bits (I guess there are some non-random parts in standard UUIDs but not a significant amount) while preserving that property unless you add back some other form of synchronization to ensure that there aren't collisions which seems like it would defeat the purpose.

bjt
0 replies
15h50m

If you know which type of uuid you have (v1, v4, etc) then you can take a look at how many bits of randomness it has, how many total items you have, and compute the probability of a collision if you just take a subset of the bits and use that as an ID.

In theory it's definitely possible. The 128 bits you get in a UUID is a LOT of randomness for an identifier. Postgres BIGINTs are just 64 bits. Instagram's sharded IDs are just 64 bits. (See below.)

You can test it. If you're using uuidv4 (which is 100% random bits, minus a few for the version), you could make a new column in your table in Snowflake, populate it with the first 64 random bits of your existing uuid column, then see if you have any collisions.

https://instagram-engineering.com/sharding-ids-at-instagram-...

orf
3 replies
19h2m

How do you adjust or evolve the blocklist with this, without making previously generated IDs incorrect?

The ID is simply incremented if it is blacklisted [1]. So the ID is fixed to the blacklist content, and adjusting it in any way invalidates certain segments of previously generated IDs?

1. https://github.com/sqids/sqids-rust/blob/9f987886bc06875d782...

timwis
0 replies
7h2m

Include a version prefix perhaps?

ec109685
0 replies
18h0m

They address it here: https://sqids.org/faq#future-blocklist

You're right that you can't update the blocklist in a backwards compatible way.

Guillaume86
0 replies
15h51m

Didn't check the code but you could encode offseted hash + the offset to avoid blacklisted words and the decoder would decode any version of the hash (offseted or not).

filleokus
3 replies
19h21m

Not Good For:

User IDs - Can be decoded, revealing user count

Suppose you don't want to leak the count, what's a resonable way of implementing that?

You can of course have a uuid v7 / uulids or something as the primary key. Or have it as a public facing primary key, mapping back to a sequential ID PK (there might be some performance hits with larger PK's in e.g postgres? or is that just fud?)

But you could also generate a public ID with something like encrypt(seq_id, secret) and then encode it with whatever alphabet and or profanity filter you'd like - right? The issue then is that all public ID's would be long (and of course dealing with a decrypt operation on all incoming requests).

Don't know what's best really.

8n4vidtmkvmk
2 replies
19h6m

Add an offset, multiply by a large prime number, and modulo. I don't think you can recover the original number without figuring out the prime.

filleokus
1 replies
18h15m

Ah, that's neat. Why is the offset necessary?

erhaetherth
0 replies
16h9m

Might not be, but I like to start with a big number instead of 0 or 1 to fill all the bits. For example, if your prime is 100019, then your first number in binary is 00011000011010110011 but if your max number is something like 2^53 (00100000000000000000000000000000000000000000000000000000) then you have a lot of unfilled bits. The way I have mine set up, the output ID is always exactly 9 chars. 99% of the time it's naturally 9 chars because just by probability most of the numbers will be large, but some of them come out 8 chars and then I just pad up to 9 so it's nice and consistent.

dumbo-octopus
3 replies
22h39m

Odd design decision in that if you provide your own blocklist, it overwrites their (extensive) default list instead of adding to it.

And in general the algorithm is surprisingly complicated for something that could be replaced with simply base64 encoding, the given example (1,2,3) base64 encodes to a string with just one more letter than this algorithm.

That said I do appreciate the semicolon-free-style. I don't typically see that in libs besides my own.

https://github.com/sqids/sqids-javascript/blob/main/src/sqid...

8organicbits
2 replies
19h49m

The problem is their block list will change over time. If you don't override it, then your IDs won't decode right when you update. This is a huge risk.

You have to account for scenarios where a new word might be introduced to the default blocklist

https://sqids.org/faq#future-blocklist

Honestly, I think they need to rethink this. Otherwise you've got different library versions for different languages each using different default blocklists, none of which are compatible.

wtetzner
1 replies
16h49m

If you don't override it, then your IDs won't decode right when you update.

They'll decode fine. Encoding might change.

8organicbits
0 replies
9h10m

I think your right, but that violates the uniqueness guarantee. It doesnt seem like something that should change over time by default.

SebRollen
3 replies
15h6m

The second example for each language sample where the generated squid ends up being “B4aajs” essentially reads as “P0ooop” to a Swedish speaker.

Which is fine, they don’t propose to filter “bad” words in other languages, but kind of funny when that’s one of the highlighted examples, right next to the goal of filtering words. Goes to show how hard it is to filter profanity generally for international audiences

SheepSlapper
2 replies
14h50m

Could you just remove vowels and hit 99.9% of profanity in all languages? Ditto for removing their 0-9 equivalents, if you're really worried about it. Quick out of the box support for that via being able to define a custom alphabet.

jabagawee
1 replies
14h40m

You might still have issues with generating sequences like "XtrmlyBdWrd" that are still recognizable.

SheepSlapper
0 replies
13h10m

Well until we figure out a way to remove pattern matching from humans... use GUIDs if that's an issue. Removing vowels fixes "spelling almost all bad words explicitly", though I'm open to being proven wrong with fun new swears in exotic (to me) languages :)

The problem of "pick any N symbols that don't make any profanity in any language across all time" isn't what this is solving, nor should it have to. Take the same concept but use whitelisted words to build the token if you're that adverse to computer generated, fill in the blank naughty words. Keep "pen" and "island", among other things, off that list ;)

ComputerGuru
3 replies
20h42m

I haven’t been able to find a case for this because ids either need to be unique or they’re not going to be large. If they’re unique, I’m using uuid or ulid (uuidv7 of tomorrow) as the sortable primary key type to avoid conflicts without using the db to generate and maintain sequences.

Where do you have unique ids that aren’t the primary key? I would be more interested in a retrospectively unique truncated encoding for extant ulid/uuid; ie given that we’ve passed timestamp foo, we know that (where no external data is merged) we only need a bucketed time granularity of x for the random component of the id to remain unique (for when sortability is no longer needed).

Or just more generally a way to convert a ulid/uuidv7 to a shorter sequence if we are using it for external hash table lookups only and can do without the timestamp component.

swyx
0 replies
19h26m

why is ulid the uuidv7 of tomorrow?

canadiantim
0 replies
17h49m

I’m currently using sqids as slugs to have a shorter url than just using my uuid primary key

Bytewave81
0 replies
20h27m

The idea is that you encode and decode database IDs with this. You wouldn't save them separately unless you were using it for a purpose other than shareable "identifiers" which don't leak significant amounts of database state. Imagine something like a link shortener where you want to provide a short link to users, but don't want it to just be a number.

jsf01
2 replies
22h36m

What’s the use case for passing in an array of numbers? Typically when generating an ID my input is either a single random number, a string that’s being hashed, or nothing at all.

4kimov
1 replies
22h19m

[shard_number, primary_id_number, timestamp]

dumbo-octopus
0 replies
22h15m

But then why not just arbitrary text?

Schnitz
2 replies
20h12m

It would be great to have a quick primer on why this is better than what people typically homebrew, like base62 encoding a random number.

sneak
0 replies
20h10m

Database PKs usually aren’t random, which AFAIK is what is usually used as the number in this case.

8organicbits
0 replies
20h3m

If you use a random number then you need to store it somewhere to map back to the original. Sqids is an encoding, you can decode the sqid back to the original without storage overhead.

Features like the profanity filter avoid creating URL routes like /user/cuntFh.

Cross language support allows interop between the encoder and decoder across microservices written in different languages.

BraverHeart
2 replies
17h30m

I see many people in this thread saying that this is a good way to hide insights from ids/numbers, I don't understand, aren't the generated values easily decoded? couldn't I just decode a couple of numbers to get that insight? What am I missing.

jonasdoesthings
0 replies
17h26m

The docs state:

  Not Good For:  
  [...]  
  User IDs  
    Can be decoded, revealing user count
So yeah, just using a sequential id and encoding the number with this library is not a viable idea if you want to hide your insights.

hot_gril
0 replies
10h44m

I noticed this too, and imo it's a design flaw that people get misled this way. Sequential inputs should yield sequential outputs, otherwise you might think it's meant to be unpredictable like SHA256.

waffle_ss
1 replies
20h40m

I wrote a Ruby gem to address this problem of hiding sequential primary keys that uses a Feistel network to effectively shuffle int64 IDs: https://github.com/abevoelker/gfc64

So instead of

    /customers/1
    
    /customers/2
You'll get something like

    /customers/4552956331295818987
    
    /customers/3833777695217202560
Kinda similar idea to this library but you're encoding from an integer to another integer (i.e. it's format-preserving encryption). I like keeping the IDs as integers without having to reach for e.g. UUIDs

wslh
0 replies
15h56m

I recommend to review my comment where I also use a Feistel cipher [1] but the difference is that it is not limited to int64 but I can even use 8 bits. Obviously loosing security properties but working as an obfuscation method. If you use a random source with relatively few bits you should check if there are duplicates while with the Feistel cipher you are sure there isn't.

[1] https://news.ycombinator.com/item?id=38418198

its-summertime
1 replies
22h0m

For a similar thing, (X bytes to X bytes, no collisions) https://en.wikipedia.org/wiki/Format-preserving_encryption is a good page

jchook
0 replies
21h34m

Also see Knuth Hash and k-dimensional equidistribution.

canU4
1 replies
22h10m

Sad that it is not for user ids

8organicbits
0 replies
20h10m

I think that's only if you don't want to leak user count when your ID is an autoincrement. Elsewhere people mention cryptographicly remapping integers, which could work (by itself, or before passing the ID to sqids).

Use
1 replies
20h23m

Why should you hide your user count?

sneak
0 replies
20h8m

The rate of change over time can be used against you; many people consider their businesses’ month-over-month growth (or lack thereof) to be private information.

“$WEBSITE did 50,000 signups a month during the beginning of the pandemic, but now struggles to sign up a thousand a week” is a story.

Alifatisk
1 replies
1h18m

On a side note, "Sqids ... is an open-source library that lets you generate YouTube-looking IDs from numbers.", "The main use of Sqids is purely visual."

If the purpose of it is to give a friendlier url / id, who not use something like friendly_id instead? (http://norman.github.io/friendly_id).

The url is readable and searchable through the history.

I would much rather prefer people using "www.website.com/channel/video/a-dog-walking" instead of "www.website.com/channel/video/3cXv8c".

chrismorgan
0 replies
8m

These are called slugs <https://en.wikipedia.org/wiki/Clean_URL#Slug>. If you’re willing to commit to a persistent slug, please do: they are nicer. But for many applications, e.g. most user-generated content, you can’t be confident they won’t change. There’s a reason for the general advice to not use data as a primary key in databases, but to use a separate ID.

3cats-in-a-coat
1 replies
22h1m

I don't get it, that's like two lines of code, why does it have a library and even a domain

k2xl
0 replies
21h59m

Also wondering this

zie
0 replies
1h13m

What we did 20yrs ago was generate random numbers and then ensure the # wasn't already used. We do this only for employee ID #'s and it works fine for us. It's definitely not ideal though.

urza
0 replies
21h4m

I wanted to say that I use similar project called HashIDs, but I see that HashIDs rebranded to Sqids :)

swyx
0 replies
19h49m

saving it to my list of uid implementations https://github.com/swyxio/brain/blob/master/R%20-%20Dev%20No...

packetlost
0 replies
22h5m

The name (but not function) seems really close to squuids from Datomic/Clojure.

mfbx9da4
0 replies
6h49m

Was hoping for more of a high level technical explanation of how it differs from alternatives in the FAQ

kryptogeist
0 replies
19h7m

Damn, those squids are getting smart

dustingetz
0 replies
19h22m

anyone have a copy pasta for the widest possible alphabet (i.e. extended unicode safe chars)

darigo
0 replies
12h23m

Makes me think of something totally different, which is how Urbit converts randomly generated numerical IDs into sorta memorable names.

For example, the number 5,702,400 becomes ~sorreg-namtyv. And 1,742,733,824 becomes ~master-morzod.

I enjoy the quirkiness of the names.

chupapimunyenyo
0 replies
20h35m

Hashids seems way better than their new implementation

c2xlZXB5Cg1
0 replies
22h52m

Reminds me of proquints https://github.com/dsw/proquint

But 127.0.0.1 looks more "readable" to me than lusab-babad

bufferoverflow
0 replies
19h35m

Do we really need a library for that? Shouldn't it be a simple function?

andrewstuart
0 replies
18h30m

What is the decimal range of these values?

akoboldfrying
0 replies
17h45m

This looks handy.

Given the retry-on-bad-word feature, I was sceptical of the no-collision claim -- but after looking at the JS source code, I'm confident it's correct.

For each number encoded, one character from the alphabet is "held back" to use as a delimiter (with the particular character chosen changing as each number is processed, presumably to make the output "look more random"). The very first character output is essentially a "free choice" that selects the initial permutation of the alphabet to use.

The algorithm is implemented as a function encodeNumbers() that calls toId() to encode each number, then checks for bad words and recurses with a new value of "increment" if it finds any. To prove correctness it's helpful to imagine an "in-between" function, tryEncodeNumbers(numbers, alphabet), which the outer encodeNumbers() calls in a loop to do most of its work (pseudocode):

  function encodeNumbers(numbers) {
    offset = sum(numbers)
    do {
      result =  tryEncodeNumbers(numbers, permute(alphabet, offset + increment++))
    } while (badWordIn(result))
    return result
  }
Here permute() is a function that permutes the characters in its first (string) argument according to its second (integer) argument in some arbitrary way.

The toId(num, alphabet) function, which simply encodes a single number using "digits" taken from the characters in the alphabet parameter, is clearly injective with respect to the num parameter provided that alphabet contains no duplicate characters -- that is, if we hold some duplicate-free alphabet string fixed, every distinct value of num produces a distinct encoded string as output. (For example, toId(42, "0123456789") gives "42", and no other value of num produces this string when the alphabet remains unchanged.) tryEncodeNumbers(numbers, alphabet) first outputs a character representing alphabet (i.e., its initial alphabet -- which is its complete internal state), then joins together a bunch of these toId()-encoded numbers, with an extra character in between each that is known not to appear as a digit in the preceding number, permuting the alphabet in an alphabet-dependent but num-independent way for each encoded number output. Because the alphabet permutation is independent of the input array of numbers, this means that the alphabet used to encode the i-th number depends only on the initial alphabet and i. This means that, again holding its initial alphabet fixed, tryEncodeNumbers() is likewise injective with respect to the input array of numbers. (Suppose it were not: Then there is an initial dupe-free alphabet, and two distinct arrays of numbers, that produce the same output. Find the first position where the two arrays differ. Since the final results are identical by assumption, one of the two encoded outputs for this position must be a prefix of the other. But if the two encoded outputs are of different lengths, the shorter one must either terminate the entire string, making it shorter than the other encoded string, or be immediately followed by a character that cannot appear in the output of toId() with the alphabet used at this position, both of which contradict the assumption that the resulting strings are equal. Therefore toId() must have output identical strings for the two different numbers at this position, when given the same alphabet. But this contradicts injectivity of toId(), so this (non-injectivity of tryEncodeNumbers()) is impossible.)

The final step is to see that, if encoding two inputs with the top-level encodeNumbers() function gives the same first character C, it must be because the first successful (bad-word-free) loop iteration for the first input passed the same alphabet to tryEncodeNumbers() as the first successful loop iteration for the second input. If the remainders of the two encoded strings are also equal, then by injectivity of tryEncodeNumbers() for fixed alphabet choice their input number arrays must have also been the same. Since no restrictions were placed on the two inputs, this holds for all possible input pairs -- that is, it is impossible for encodeNumbers() to produce the same encoded output string for two different inputs.

ForHackernews
0 replies
17h7m

Oh, they finally fixed the badly named "hashids"

8organicbits
0 replies
21h51m

The mention of one-time passcodes seems odd. Those need to be unguessable, but don't need to be unique. If you supply a suitable random source, then I suppose it works, but the "padded with junk" feature makes these look more complex than they really are.

The standard choice of 4 to 8 random digits works well and it's clear what level of security they provide. Digits are easier to understand than case sensitive latin characters, especially when your native language uses a different character set.

1-6
0 replies
21h8m

Sqids vs Squids. Missing the 'U' for unique but nevertheless a unique shortened version of the regular spelling.