I prefer to read the unified diff and commits don't matter as much.
I prefer to read the unified diff and commits don't matter as much.
don't know why, but recent teams around me have always made strict rules about the number of commits in PRs. I just wanted to tell them the same thing you said: "Why don't you just look at the diffs?" Curious for other opinions. (Sorry, not really about this particular topic.)
I prefer to have clear commits that tell a tidy story. For example:
* Refactor function `foo` to accept a second parameter
* Add function `bar`
* Use `bar` and `foo` in component `Baz` to implement feature #X
If you give me a commit history like this, I can easily validate that each step in your claimed process does what you describe.
If you instead give me a messy history and ask me to read the diff, you might know that the change to file `Something.ts` on line 125 was conceptually part of the refactor to `foo`, but I'll have to piece that together myself. It's not obvious to the person who didn't write the code what the purpose of any given change was supposed to be.
This isn't a huge deal if your team's process is such that each step above is a PR on its own, but if your PRs are at the coarseness of a full feature, it's helpful to break down the sub-steps into smaller (but sane and readable) diffs.
Funny that two of your commits don't actually tell us why they exist: one simply describes the diff (which you should never need, lol?) and the other proxies that responsibility to some other system.
You could have simply randomized the text in each commit, put the ticket id and the one "why" in the merge commit body, and ended up with the same amount of real information.
The first line of the commit message isn't about including information that couldn't be gleaned from the commit. That can be done in subsequent lines. The first line is for two purposes:
* Priming the reader so they are able to quickly interpret what they're seeing when they open the commit.
* Making it easy to search or scan for a specific change.
The last commit message in my example would probably have included the name of the feature as well as the ticket number, but I couldn't be bothered to invent an actual feature name.
DRY doesn't really apply to technical writing, at least not as extremely as you seem to think it should. Headings are supposed to summarize the contents, and that's what commit messages are: headings.
I like to leave comments like this too:
loop i up to n times
break when false
check value returned is not null
This is reasonable, but the problem I encounter is how stifling it seems to ask others to structure their work so specifically. By way of comparison, getting compliance on conventional commit messages is a challenge, and that's an appreciably smaller ask than this.
Oh, for sure. This is how I structure my own PRs, but I've certainly never bothered to ask a coworker to do so, I just appreciate it when I see it.
That said, OP is in an environment where it sounds like this kind of structure is already the cultural norm.
From another one who tries to do the same (but doesn't enforce it):
Thanks!
In the context of a GitHub PR you can't leave reviews on commits other than the current tip commit of the PR branch, so structuring things this way is just wasted effort.
What you should be doing is breaking down PRs more finely so that your unrelated refactors are all separate single-commit PRs. That of course requires that your PR review round-trip time is fast.
I'm pretty sure I've left comments on a commit before in a GitHub PR. The comment just goes in the right place in the PR diff, assuming no changes, or comments can actually be attached to commits themselves (which is what happens when a comment becomes stale—it retains a reference to the original commit).
Commit and push often. Put a novel explaining yourself in the PR. And that's enough IMO.
Commit and push often. Put a novel explaining yourself in the PR. And that's enough IMO.
Someone reading the git changelog 5 years down the line most likely wouldn't be able to find your "novel" in the PR, and definitely won't appreciate it if, instead of a "novel", you ended up with a "short call" where you explained to the assigned reviewer what you actually did in your 50 "wip" commits.
Someone reading 5 year old git logs is lost to begin with.
When debugging I routinely explore git blame and read the changelog. This sometimes leads to 3, 5 or even 10 years old code. Doesn't mean I'm lost.
A good practice is to rebase your commits into a single commit before creating a PR. You are free to commit as many times as you want to while doing your work. This minimizes the noise in the log.
It's only a good practice if the PR is a single logical change.
Easy workaround. Start with feature branch `f`.
1. Branch `f-prime` from master.
2. Squash merge `f` to `f-prime`.
3. Pull request `f-prime` to master.
4. Profit.
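In git terms that's roughly (a sketch; the commit message is made up):

```sh
# Sketch of the workaround above, assuming master is the target branch
git checkout master
git checkout -b f-prime
git merge --squash f        # stage all of f's changes as one set, no commit yet
git commit -m "Feature X as a single change"
# then open the pull request from f-prime instead of f
```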
Squash is our git given right.
Commit your code and commit it often. There's no reason not to.
Sure, but then there's nothing wrong with rebasing it and making a nicer story for other people that want to review it.
Diffs are great but sometimes they're just as overwhelming in a huge PR. It's nice to first follow 5-10 commits in chunks of logical change.
Don't send huge PRs. They are hard or impossible to review anyway, good commit history or not.
I don't know why people are obsessed with squash merging. I always rebase (when needed) to preserve commit history. It's a good practice, and it makes it easier to spot errors after fixing conflicts.
I suspect squashers use the wrong tools. Use Sourcetree or, if you are on Linux, SmartGit. You can see a detailed log, which makes it much easier.
Sure, commit often while you're working.
But then when you're done, turn it into a series of patches for a reviewer to read. In the words of Greg Kroah-Hartman, "pretend I'm your math teacher and show your working".
In a maths assignment, you spend ages making a big mess on a scrap of paper. Then when you've got the solution, you list the steps nice and clearly for the teacher as if you got it right first time. In software development, if you're not a dick, you do the same. You make a big old mess with loads of commits, then when you're done and it's review time, you turn it into a series of tidy commits that are easy for someone to review one-by-one.
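Concretely, that tidy-up step is usually an interactive rebase before asking for review (a sketch, assuming the branch was cut from main):

```sh
# Rewrite the messy history against the base branch
git rebase -i main

# In the todo list that opens, turn the mess into a tidy series, e.g.:
#   pick   1a2b3c4  Refactor foo to accept a second parameter
#   squash 5d6e7f8  wip
#   reword 9a0b1c2  Add bar
#   fixup  3d4e5f6  typo
```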
Why on Earth did people flag this? Indeed, you won't have a good time sending a series of 50 "wip" commits to any kernel mailing list. Having a good split with proper commit messages and a cover letter will make your code much easier to understand, both for current reviewers and for any future "code archeologist" who has to fix a bug in that code 10 years down the line.
Am I living in a bubble, or do all the glorified 500k TC FAANG devs from HN really routinely submit changes consisting of a tangled mess of 50 "wip" commits for code review without any repercussions?
Commit and commit often, but then clean up the history into discrete, readable chunks.
If your PRs are tiny it's not a big deal, but with 190 files changed in this one, it absolutely should have been rebased into a more reasonable commit history.
Also continuously integrate (from trunk) if you want to hit that moving target sooner.
Unless you’d like to maintain your train of thought.
I don’t want to interrupt my flow with intermediary commits.
Same. Do whatever you want in your feature branch, what matters is the Files list and the description in the PR. The whole thing gets squashed into a single commit anyway (which also makes reverting much easier).
Reverts are also easy even if one merges the whole branch. Just revert the merge commit.
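(For reference, reverting a merge commit needs the -m flag to say which parent is the mainline; a minimal sketch:)

```sh
# Undo everything the feature-branch merge brought in,
# keeping parent 1 (the mainline) as the baseline.
git revert -m 1 <merge-commit-sha>
```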
I almost never look at them, but once in a while it is really great to see the thought process that led to something.
I don't think any method is gonna make it easy to grok 3,336 added lines and 5,421 removed
This is the answer.
What if it was 190 files changed in 1 commit, would that make a difference?
It might.
With commits like "typo", you might as well squash these into the commit which introduced the typo in the changeset.
If there are changes across many files, and the changes were made automatically with some search-and-replace (or some refactoring tool), then having a commit that's only that automatic change makes it easy to look at that commit and tell what the changes were. Presumably, non-automatic changes are going to be smaller.
I guess roughly, if it makes sense to apply a changeset that changes 5 things, you'd want 5 commits. Having commits like "typo" means there are more commits, but squashing those 5 things together makes it harder to discern the granular change.
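For the "typo" case specifically, git has a built-in flow for folding a fix into the commit that introduced it (a sketch; the base branch name is an assumption):

```sh
# Mark the fix as belonging to the commit that introduced the typo...
git commit --fixup=<sha-of-offending-commit>

# ...then let rebase fold it in automatically before opening the PR.
git rebase -i --autosquash main
```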
lgtm
Props to whoever actually reviewed that, you are a warrior
Or a ghost.
Those two work very closely together, so probably not as nightmarish as it may appear to an observer. But, the two of them are most certainly warriors.
I've got a bunch of invites if folks want them:
bsky-social-etdu7-njigu
bsky-social-2ktcs-uwoxg
bsky-social-6f5nh-36gnq
bsky-social-ciwro-3gzk5
bsky-social-y4h57-dxh3g
Grabbed bsky-social-6f5nh-36gnq, thanks!
Damn, they all gone.
There you go folks:
bsky-social-h3d4w-u6yn4
bsky-social-74bqi-vkmcq
bsky-social-n3fdq-46nxz
bsky-social-yippe-32vdr
bsky-social-l2fbt-xnscx
Gone in 60 seconds.
Any more?
bsky-social-lbjkg-gcxs4
bsky-social-zigwm-f3qpq
bsky-social-2jlu7-apy5a
bsky-social-6ct52-4egmz
bsky-social-cy64m-53sqn
Maybe if you stop posting them with the easily-greppable first part they won't be so easy to scrape.
No one seems to have taken the two codes I put up without the prefix for about an hour, so this is likely the case
e: second one now used, first still up
e: both used
I wonder what they're being used for. The UI doesn't expose it, but the Bluesky API will tell you who redeemed your invites. Open the site, watch for a "com.atproto.server.getAccountInviteCodes" request in your browser's network inspector, look in the "usedBy" field in the response JSON, and append the DID value there onto "https://bsky.app/profile/". Any commenters in the parent chain who got scraped want to take a look?
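If you'd rather not eyeball the JSON, here's a rough sketch of pulling those DIDs out, assuming you've saved the getAccountInviteCodes response from the network inspector as response.json (no assumptions about the exact response shape; it just greps for any usedBy field):

```sh
# Print a profile URL for every usedBy DID anywhere in the response
jq -r '.. | .usedBy? // empty | "https://bsky.app/profile/" + .' response.json
```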
I get "$username joined using your invite code!" in notification tab that leads to the user profile. So far the user hasn't done anything.
grep "bsky-" internet.txt
Ya'll got anymore of that.
Any more?
Some more for y'all!
bsky-social-ge2mz-mfmpi
bsky-social-hykwa-x3ox4
bsky-social-gh4mt-2od6p
bsky-social-dejzy-mmcxf
edit: all gone :(
Snagged bsky-social-hykwa-x3ox4
4 more: (prepend bsky-social-)
7poji-p36pm
irn4h-ncvic
2hb2e-xhxnb
2k4na-5qiqu
And they are gone.
Either people were really prepared for these codes to appear, or they are being scraped. Regardless, they're all gone
It seems a temporary, anonymous, private, receive-only dropbox (not the USB-drive-replacement kind) on the Internet is an unsolved problem. It doesn't have to be completely out-of-band like email; it could be just an encrypted public reply decoded via `cat | base64 -d | openssl rsautl -decrypt -inkey temp.key`, as long as up to a few bits (70 in this instance) of encrypted content would be allowed on the platform.
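Fleshed out, the flow could look something like this (a sketch only; key names and sizes are arbitrary, and rsautl is deprecated in newer OpenSSL in favour of pkeyutl):

```sh
# Requester: generate a throwaway keypair and post the public key
openssl genrsa -out temp.key 2048
openssl rsa -in temp.key -pubout -out temp.pub

# Sender: encrypt the code against the posted public key and reply with the base64
echo -n "bsky-social-xxxxx-xxxxx" | openssl rsautl -encrypt -pubin -inkey temp.pub | base64

# Requester: decode the reply and decrypt with the private key
echo "<base64 reply>" | base64 -d | openssl rsautl -decrypt -inkey temp.key
```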
piracy websites were using base64 encoding for this purpose a while ago, but now it seems they moved on to a proprietary algorithm
e: burner email didn't work, sorry
e4: check out dns on [my username].com
can't believe it's gone, too many smart people on this website lol
Not called hn for nothing! Glad I'm absolute bottomest on the floor in terms of intelligence or ability here
Nope that might be me. I started this comment thread and I didn't even manage to snag one despite getting direct replies multiple times with codes.
Thanks a lot!
These seem to be gone.
to the people doing this: your codes will most likely be instantly stolen by bots and not real people
The codes are all gone. That was fast.
E: Happy to take one, if somebody happens to have a spare one left. Email is in my bio.
they're all exhausted now :(
Is Bluesky still invite only?
Yes. I have invite codes if you would like one. Email in my profile.
Edit: they're all gone!
Is your invite code offer open to other randoms like myself? I tried to register on bsky months ago and still haven't been approved.
I have some extra if you'd like one. Let me know how to get it to you and I will.
Would also like one if you have an extra. Thx in advance. (Click on username to see my email in profile)
Sent you a code.
I’d also like an invite if anyone still reading has any.
I'd love to have one if you have any left. (email in my bio)
Got an invite from another user. Thx!!!
If you're offering I'd love one. My email is my username on hackernews at gmail.com
Sent you an invite code.
Could you send me one? Email in my profile. Thanks.
I'm still trying to get one if anyone sees this. Keep missing the ones posted. Email in profile.
Thanks.
I'm happy to give them to any HN user but I'm afraid I have only three left and there are three emails in my inbox asking for invites, so if one is you then congratulations! Otherwise, sorry.
Was browsing around your website (mentioned in profile), noticed https://0x85.org/contact.html only mentions Twitter and email. Maybe the bluesky omission is intentional, but probably it just hasn't been updated yet? I'm not on bsky myself, currently having fun on mastodon and I'm not familiar with bsky enough to know what I'm missing out on, but for other folks I figured I'd mention it
Hey thanks. It's just outdated, what with young kids and grad school. Appreciate the note.
if anyone still has any codes, please DM me one, email in my profile. Thanks muchly
I'll also add my 4 invite codes if anyone wants them
EDIT: I'm fresh out for now, sorry!
I'd also love a code if anyone has any to spare (email in bio)
If anyone else still has invites I am also interested.
My mail is at the bottom of my bio.
I still have a single invite left.
I'll take it if you still have it? chris@extrastatic.com.
Sent.
Thanks, and really enjoyed your blog while searching for your email! :)
It is, but not as a "growth hack" or anything. It's just a way of limiting growth while the system is scaled (in terms of the backend and abuse prevention).
There's a dedicated waitlist for developers that will get you access quite quickly: https://atproto.com/blog/call-for-developers
It's pretty hard not to see it as a growth hack given that posts can't even be viewed without an account. That seems pretty transparently to be a system to create a feeling of FOMO/exclusivity, to make it so that you don't only need an account to participate, you need an account to even see what the network is or to follow anyone on it at all.
As a comparison, Cohost limited account setup when it launched as a way to limit growth. But it didn't lock viewing the entire site behind an account requirement because... come on. What does that have to do with scaling, we all know why that restriction is there :)
To be fair, it seems to be working. Needing to seek out and find invite codes means that signups are more visible -- signup codes get shared over social media and that means mentioning Bluesky publicly and keeping it in people's minds. It also forces people to ask publicly about access, which makes the network feel more exclusive and turns every signup or expression of interest into an advertisement for the network. It's a good marketing strategy, and I suspect that a nontrivial portion of Bluesky's current buzz comes from that marketing strategy, so I can understand why it hasn't been abandoned yet. I mean, look at the current thread; if people didn't need to coordinate publicly on HN to get access then this subthread wouldn't exist and then there wouldn't be a public thread where a bunch of people express interest in trying out the network -- and that publicly expressed interest in this very subthread makes Bluesky feel more in-demand.
In fact, this is such an effective marketing strategy that I've seen Bluesky users complain that invite codes are too common now and that their invite codes aren't in as much demand as they used to be. That FOMO loop is so powerful that it's even affecting the people who already have access to the network who enjoyed the feeling of being in control of an artificially scarce resource.
But sure, all of this is definitely not a growth hack, I believe you ;)
Regardless of whether it's good marketing, the account requirements make the platform a lot less relevant in any serious discussions about the direction of social media, because despite its plans for the future for federation and access, what Bluesky is today is a platform that is in practice even more locked down than Twitter is.
Cool, but maybe let people actually use your service before everyone forgets what it is?
They have over 1.8 million users currently, or do you mean PDSes specifically? Federation is in open beta on a test network, you can try it out today if you'd like.
I have been on the wait list since they launched. They seem to mostly rely on invites.
bsky-social-scbch-eolha
bsky-social-fs26y-d6gnv
bsky-social-2lx5u-ntrdv
bsky-social-hboq7-dyuue
bsky-social-b2v3f-3a23q
damn, seems already all gone
bsky-social-lkzsp-7x7ja
bsky-social-p4vwr-nrthu
bsky-social-bdu6c-6tbv4
bsky-social-fkpgk-oestw
Got one, thank you sir!
Thank you so very much! :-)
I have a few invites. Email me and I'll pass them out. :)
all gone
I’ve some invites lying around. DM me if you want one.
I'd like to take you up on that, if you still have one going?
It shouldn't be too difficult to find an invite? They hand them out pretty frequently.
Got a few invites, DM me on twitter, substack or masto if you want them (listing on https://bitecode.dev)
Come on, "over 1.8 million users" is not an impressive number
These kinds of moves make me think they're not serious about scaling up. Wouldn't surprise me if they end up as an also-ran.
Maybe not impressive, but none of my customers' services has ever had 1.8 million users, and yet they (my customers) do well.
That. And none of the big social media platforms were big at the start either.
Yes, but 1.8M at a time when people are longing for a Twitter alternative is just leaving money on the table.
There are two usual strategies for growing: 1) low cost, organic, and slow, or 2) high cost: throw a lot of money at advertising, saturate all media, grow quickly or bust.
The exceptions are those rare products that, despite low-cost marketing, sell themselves so well that their organic growth is fast and in a few months everybody uses them.
Maybe Bluesky doesn't have the money to advertise, or it is not compelling enough. As one data point: I know about Mastodon, but I think I learned about Bluesky only today. I went to their site and there is nothing to explain how it works except that it's some social thing. I learned more by reading the comments here. Apparently it's being marketed at a very low cost.
They have over 1.8 million users currently
How many of them are active?
Why sha256 hash the user info to get a two-character target directory? Wouldn't md5 be much faster and solve the same problem?
At their scale maybe they're worried about collisions?
Or, like me, they're drowning in security tooling from corporate and don't want to have to carve out exceptions for md5 usage in each.
At their scale maybe they're worried about collisions?
With their scheme, collisions are already guaranteed to happen if they have >256 users.
I guess the parent meant an abusable non-uniform distribution of collisions (they have collisions anyway, as they take only the first two characters, according to the GP comment)
It could be they didn't want to explain the md5 usage, yeah. But that's kinda nuts if they do this every query.
It's probably not healthy to have broken cryptographic hashes running around. If you don't need a secure hash there are plenty of fast non-cryptographic hashes.
There's nothing about security here. By this logic you should probably stop using hashmaps, then? :)
That's literally not their logic.
They said:
if you need security, don't use md5.
If you don't need security, use something faster than md5.
md5 is neither secure nor fast, why use it at all?
This is probably not about collisions but about filesystem limitations (max number of files in a directory).
I've done something similar and that's absolutely what it was. I'm no pro, knew I wasn't doing it the right way, but it was for a personal side project and Windows starts to get weird when you have a million files in a single directory.
Having a good hash uniformly distribute content helps scaling (by sharding of data).
At a guess: that hash is performed relatively few times, so any performance difference is lost in the noise floor. Never having to answer "why did you use this insecure hash" or eliminating/minimising any possibility of a class of security problem is worth more.
This has nothing to do with security. It's just wasted CPU. I imagine you have to do this every time you make a query to look up the user's DB?
Security is not a concern here. It's just literally bucketing ids. Also, this is not needed with modern file systems.
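For illustration, the bucketing being discussed amounts to something like this (a sketch; the input string and directory layout are assumptions, not the PDS's actual code):

```ts
import { createHash } from "node:crypto";

// Derive a two-character shard directory from a user identifier.
// 256 possible buckets, so collisions within a bucket are expected and fine.
function shardDir(userId: string): string {
  return createHash("sha256").update(userId).digest("hex").slice(0, 2);
}

// e.g. databases live at users/<shard>/<userId>.sqlite
console.log(shardDir("did:plc:example"));
```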
Love SQLite, but in general there are many challenges with any kind of schema-per-tenant or database-per-tenant setup. Consider the luxury of row-level security in a shared instance, where your migration either works or rolls back. Not now! If you are doing a data migration and failed to account for some unexpected data, now you have people on different schema versions until you figure it out. Now, yes, if you are at sharding scale this may occur anyway, but consider that before you hit that point, a single database is easiest.
You will possibly want to combine the data for some reason in the future as well. Or, move ownership of resources atomically.
I'm not opposed to this setup at all and it does have its place. But we are running away from schema-per-tenant setup at warp speed at work. There are so many issues if you don't invest in it properly and I don't think many are prepared when they initially have the idea.
The funny thing is that about a decade ago, the app was born on a SQLite per tenant setup, then it moved to schema per tenant on Postgres, now it's finally moving to a single schema with RLS. So, the exact opposite progression.
If you are doing a data migration and failed to account for some unexpected data, now you have people on different schema versions until you figure it out.
That shouldn't be a big issue. Any service large/complex enough to care does the schema upgrades in phases, so it's 1. Make code future compatible. 2. Migrate data. 3. Remove old schema support.
So typically it should be safe to run between steps 1 and 2 for a long time. (Modulo new bugs of course) As an ops-y person I'm comfortable with the system running mid-migration as long as the steps as described are used.
That shouldn't be a big issue. Any service large/complex enough to care does the schema upgrades in phases, so it's 1. Make code future compatible. 2. Migrate data. 3. Remove old schema support.
Exactly this, schema migrations should be an append, deprecate, drop operation over time.
I wish there were ways to enforce this on the db so you never accidentally grabbed a table lock during these operations.
definitely have shot myself in the foot with postgres on this
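A hypothetical example of that append/deprecate/drop flow (table and column names made up):

```sql
-- Phase 1 (append): add the new column; existing rows and old code are untouched
ALTER TABLE accounts ADD COLUMN email_normalized TEXT;

-- Phase 2 (backfill/deprecate): migrate data while both columns exist and code can read either
UPDATE accounts SET email_normalized = lower(email) WHERE email_normalized IS NULL;

-- Phase 3 (drop): remove the old column only after no deployed code reads it
ALTER TABLE accounts DROP COLUMN email;
```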
I wish there were ways to enforce this on the db so you never accidentally grabbed a table lock during these operations.
You can use a linter for PostgreSQL migrations https://squawkhq.com/
And Sqitch is a wonderful Perl tool for this as well
now you have people on different schema versions until you figure it out.
That can be a good thing if your product has, say, < 100 customers, as each might have different upgrade timelines and needs. I even know of businesses like this who do custom work for some, so they essentially aren't even running the same code (gasp).
I guess it depends on the business structure.
Totally correct. But not a good thing in our case!
The funny thing is that about a decade ago, the app was born on a SQLite per tenant setup, then it moved to schema per tenant on Postgres, now it's finally moving to a single schema with RLS.
To be fair, RLS was not available yet a decade ago :) It appeared in PostgreSQL 9.5 in 2016.
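For anyone who hasn't used it, the single-schema-with-RLS setup looks roughly like this (table, column, and setting names are hypothetical):

```sql
-- Hypothetical tenant isolation with Postgres row-level security
ALTER TABLE posts ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON posts
  USING (tenant_id = current_setting('app.tenant_id'));

-- The application sets the tenant once per connection/transaction:
SET app.tenant_id = 'did:plc:example';
```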
If you are doing a data migration and failed to account for some unexpected data, now you have people on different schema versions until you figure it out. Now, yes, if you are at sharding scale this may occur anyway, but consider that before you hit that point, a single database is easiest.
This can be accounted for and handled. Though if schema issues are enough of a scare, I wonder if a document-db-style embeddable database like CouchDB/PouchDB might make more sense.
I don't know - I have experience working with monster DBs in production and never again. Under large enough load every change becomes risky because you can't fully test performance corner cases. Having a free-tier user take out your prod because they found a non-indexed code path is also a classic.
What do they mean by "Since SQLite does not support concurrent transactions"? It supports them, as long as you don't access the .db file through a file share (UNC, NFS, etc.) - https://www.sqlite.org/wal.html
I've been using this to update/read a DB from multiple threads/processes on the same machine. You can also do snapshotting with the SQLite backup API, if you want a consistent view and don't want to hold a transaction open (or copy in-memory).
But maybe I'm missing something here... Also, I haven't touched SQLite in years, so not sure...
Writers merely append new content to the end of the WAL file. Because writers do nothing that would interfere with the actions of readers, writers and readers can run at the same time. However, since there is only one WAL file, there can only be one writer at a time.
I think the OP meant that updates have to run sequentially.
Which just means the lock happens at user scope in this case instead of per table or row. This limitation still causes so much confusion when it’s a completely reasonable design.
I've been importing data into SQLite databases that are being actively written to for years. It just throws an exception if the database is locked, and I retry. Do 10k-row batches, with a small sleep between. No issues. It helps if your use case doesn't really care about data being in order, I guess.
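If it helps anyone hitting the same "database is locked" errors, the usual knobs are WAL mode plus a busy timeout (a sketch; the timeout value is arbitrary):

```sql
PRAGMA journal_mode = WAL;   -- readers no longer block on the single writer
PRAGMA busy_timeout = 5000;  -- wait up to 5s for the write lock instead of erroring immediately
```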
Have patience: eventually hctree [1] will become stable, and we'll get to choose between the traditional backend and the newly implemented one that supports concurrency!
[1] https://sqlite.org/hctree/doc/hctree/doc/hctree/index.html
Did you disable auto checkpointing? Wouldn’t checkpointing result in potential corruption or at least data loss if two processes do that simultaneously? Or is that scenario exhaustively prevented with a lock file?
Nope! I was mistaken - it's really multiple readers, single writer. Probably I was assuming things the whole time and did not spend enough time thoroughly checking - granted, most of the DBs I've done with SQLite were more about reading than writing.
So, I stand corrected!
Probably means there's still no row-level locking (at least last I checked) and very limited table-level locking. The writer still grabs a lock on the entire DB, per the docs.
This is the tradeoff of SQLite: it is extremely fast, as long as you mostly have only one user. With WAL you can get multiple readers, but it doesn't scale the same way that e.g. PostgreSQL does.
it works if there is low traffic, but as soon as you get bigger transactions or the number of concurrent writes gets heavier, you will at some point (even with WAL enabled) get "database locked" issues. You can work around that at the application level to a certain point, but in general, if you are at that point, you should really consider using another database backend.
I am curious: do the HN folks know if Bluesky is more active than nostr or the Mastodon network?
Less active than Mastodon, I'd assume more active than Nostr.
But the interesting thing for me isn't activity — it's the people on there.
Of the cohort who had >100k followers on Twitter, I think more of them post regularly on Bluesky than post on Mastodon. Bluesky definitely has a more cohesive feel, especially because there's currently just one instance & mod team.
Mastodon, and the Fediverse in general, deliberately make user-interaction design decisions to limit many of the issues common to social media. Think mob culture, addiction, and the like.
I wonder if Bluesky intends to follow suit. For example, hiding action counts (repeats, favourites, etc.) until the user acts on one.
Things like these may be strange for those accustomed to Twitter, but personally, it's what makes me stick with smaller instances on the Fediverse.
My bet is regardless of any initial good intentions, since BlueSky is a company, market pressures will inevitably force them into dark patterns like we see on every other commercial social network (going back to the early days of the companies, Facebook, Twitter, and even Google looked really good early on until all were corrupted by profit motive). My belief is that the profit motive is necessarily at odds with free communication.
To me, the ActivityPub network (Mastodon and friends) is relatively unique in the social media space in having no direct commercial pressures (the protocol is developed by W3C) and therefore being inoculated against the causes for these dark patterns.
I'm a donating supporter of the Mathstodon.xyz instance, but (sadly?) most of "math Twitter", at least the education-focused university faculty, ended up on BlueSky. I think there's a strong appeal for "a straight forward Twitter clone without Musk" for a lot of people.
I don't know about nostr, but I find it is a lot less active than Mastodon. In general the tech accounts I am interested in have moved to Mastodon rather than bluesky. I imagine this would depend on whose activity you are interested in, and where they have chosen to migrate to
Bluesky is much smaller than Mastodon, but how active it feels will depend on who you're following. It also has an Algorithm (TM); I never really missed this when I went from Twitter to Mastodon as I mostly used the linear timeline anyway, but I gather that some people find that Mastodon feels empty/inactive without one.
Bluesky is pretty positive and definitely lacks the "American Suburbia HOA" energy that some Mastodon instances have.
It's pretty active during North American hours.
They’re all just arbitrary ghettos that aren’t dissimilar to each other. None of them matter in terms of influence but are like nice Reddit boards for certain interests.
Less active than Mastodon, more active than nostr.
https://vqv.app/stats/chart is useful for Bluesky and draws on the Bluesky firehose for data. https://stats.nostr.band/ seems useful for nostr.
Slightly related: is Bluesky moderated well enough, or do I get lots of rightwing and conspiracy crap like on Twitter currently?
I'd really love to have some more civilized hub again that isn't full of hate and anti-intellectualism.
Are you assuming that hate and anti-intellectualism are exclusively a rightwing thing?
Not exclusively, but on the mainstream internet in 2023? Yeah, more or less, bar a few tankies.
On Twitter, the place that hired Tucker Carlson after Fox News dumped him? Yeah it is. No need for "both sides"-ing on this one.
It's anti-intellectual and uncivilized, but not because of rightwing conspiracy content. There is a strong culture of intolerance and censorship of viewpoints that diverge from the norm.
It seems a lot nicer than Twitter. Though I'd wonder how much of that is just that it's invite-only right now. I haven't really gotten into it, for a variety of reasons (happy enough with Mastodon for most stuff, no decent client apps, vaguely suspicious of the involvement of Dorsey) but it seems... fine?
Get out of your bubble
I think it is still too small, and people seem quite nice there. But that has its drawbacks, as I keep returning to Twitter due to the slow migration in recent months.
It is a shame, as it seems like a nice alternative that has some cool ideas.
Interesting... I like the strategy of having each user be 1:1 with a DB. What would be done for data that needs to be aggregated across users though? If I'm subscribed to another user and they post, how does my DB get updated with that new post? Or is this meant just for durable data and not feed data (like profile data, which users are followed / not followed / etc.) and all the interactive stuff happens separately?
I like that "connection pooling" is just limiting the number of open handles in a LRU cache. It's also interesting because instead of having to manage concurrency at the connection level, it handles it at the tenancy level since each DB connection is single-threaded. You could build up per-DB rate limiting on top of this pretty easily to prevent abuse by a given user.
Is there a straightforward way to set up Litestream to handle any arbitrary number of DBs?
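On the pooling point above, a rough sketch of "connection pooling as an LRU of open handles" (illustrative only; better-sqlite3 and the size limit are my choices, not necessarily what the PDS does):

```ts
import Database from "better-sqlite3";

const MAX_OPEN = 100;
// Map iteration order is insertion order, so re-inserting on access gives LRU behaviour.
const open = new Map<string, Database.Database>();

function getDb(path: string): Database.Database {
  const existing = open.get(path);
  if (existing) {
    open.delete(path);
    open.set(path, existing); // mark as most recently used
    return existing;
  }
  if (open.size >= MAX_OPEN) {
    // Evict the least recently used handle before opening a new one.
    const [oldestPath, oldestDb] = open.entries().next().value!;
    oldestDb.close();
    open.delete(oldestPath);
  }
  const db = new Database(path);
  db.pragma("journal_mode = WAL");
  open.set(path, db);
  return db;
}
```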
To summarise the relevant details, the "AppView" service is responsible for the sorts of queries that aggregate across users, and that has its own database setup - I think postgres but I'm not 100% sure on that.
You're right, as usual. AppView is on a Postgres cluster with read replicas doing timeline generation (and other things) on-demand. We're in the process of moving it toward a beefy ScyllaDB cluster designed around a fanout-on-write system.
The v1 backend system was optimized for rapid development and served us well. The v2 backend will be somewhat less flexible (no joins!) but is designed for much higher scale.
Does the BGS pull all the tenants' individual SQLite data? Or does the PDS push new posts to the BGS?
The BGS (which is an atproto "relay" service) subscribes to all PDS event streams on the entire network, and aggregates and relays them.
This way it's possible to get all network data from a single place (the BGS) rather than having to connect to every PDS, which is simpler for consumers and dramatically reduces the workload of PDS hosts.
Some details about event streams here, although the APIs are still evolving: https://atproto.com/specs/event-stream
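To give a feel for what consuming one of these event streams looks like, a minimal sketch (the com.atproto.sync.subscribeRepos endpoint and host are my assumptions here, and the frames are binary CBOR, so this only counts them):

```ts
import WebSocket from "ws";

// Subscribe to a PDS/relay repo event stream and count frames.
const ws = new WebSocket("wss://bsky.social/xrpc/com.atproto.sync.subscribeRepos");

let frames = 0;
ws.on("message", () => {
  frames += 1;
  if (frames % 1000 === 0) console.log(`received ${frames} frames`);
});
ws.on("close", () => console.log("stream closed"));
```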
The BGS handles "big-world" networking. It crawls the network, gathering as much data as it can, and outputs it in one big stream for other services to use. It’s analogous to a firehose provider or a super-powered relay node.
"Big-world" networking by Big Tech-to-be Bluesky with super-powers, I wonder? Is this BGS also going to be federated, or is that the big centralized beating heart of this platform managed exclusively by BS?
This seems like a very misleading title, the Bluesky PDS is the meant-for-selfhosting thing they distribute, not the bluesky service as experienced and used by most of its users.
AFAIK there's only one version of the software so "the service" runs the same thing that you self-host. SQLite seems like it will simplify the single-user case though.
That's right. This is the same code Bluesky is running on our new PDS hosts. It's all open source.
The main motivation in moving from a big central Postgres cluster to single tenant SQLite databases is to make hosting users much more efficient, inexpensive, and operationally simpler.
But it's also part of the plan to run regional PDS hosts near users, increasing performance by decreasing end-to-end latency.
The most experimental part of this setup is using Litestream to replicate these many SQLite databases (there are almost 2 million user repositories) to cloud storage. But we're not relying on this alone; we're also going to maintain standard SQLite ".backup" snapshots.
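For anyone curious what the Litestream side of that can look like, a minimal sketch of a config replicating two per-user databases to object storage (paths, bucket, and layout are made up, and not necessarily how Bluesky runs it):

```yaml
# litestream.yml (illustrative only)
dbs:
  - path: /data/pds/a3/did_plc_alice.sqlite
    replicas:
      - url: s3://pds-backups/a3/did_plc_alice
  - path: /data/pds/b7/did_plc_bob.sqlite
    replicas:
      - url: s3://pds-backups/b7/did_plc_bob
```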
No, this actually moves every single user currently on the service onto this setup. Everyone gets their own SQLite under the hood.
The “Personal” in PDS doesn’t mean it is only for self-hosting.
Bluesky has a main PDS instance at https://bsky.social that serves almost all of the Bluesky user base.
There is a good overview of the architecture here:
https://blueskyweb.xyz/blog/5-5-2023-federation-architecture
Here’s a snippet from the protocol roadmap they published 3-4 weeks ago [1]:
Multiple PDS instances
The Bluesky PDS (bsky.social) is currently a monolithic PostgreSQL database with over a million hosted repositories. We will be splitting accounts across multiple instances, using the protocol itself to help with scaling.
On the surface, this looks like the worst combined with the awful. I hope someone will write a good article with some hard numbers to explain the benefits and analyze the assumed flaws, because this could be something really fascinating to learn about.
Can you explain why this looks like the "worst combined with the awful" to you?
To me, on the surface, particularly assuming you are building a distributed system to be run and deployed by many users, some of whom are not professional sysadmins (which I believe is likely to be a goal here, and should be), this seems like quite a sane choice. I'd definitely expect a design goal to be avoiding the need to set up/configure/look after any additional database or other servers.
This looks like someone building their own file-based database system, in TypeScript, while still using mature features of database servers. So instead of trusting the optimized, regularly maintained, and battle-tested solution, they build something themselves. This smells ugly, like something that will scale poorly in performance and will have security and tooling problems.
Simplification of installation does not seem like a good enough reason to trust your whole backend to this. Installing and maintaining a database server is not that hard today; it is well established and documented, unlike this. But I also don't know enough about this app. Maybe this is just one of several options, meant for a specific use case? Using this in a standalone desktop app would make sense, while still offering a mature SQL backend for server installations.
I'm not a professional coder, only side-projects. Never formally taught. I looked at the solution and thought it kinda sounds like something I'd come up with. Like when I didn't know how to use data tables and would hold data in an array of arrays to form the rows and columns. Somewhat clever, "works", but would probably make my professional coder friends vomit if I explained it to them.
Using SQLite is most certainly not "building their own filebased database-system"
SQLite is just about as mature and well-tested as it gets in the entire world of software: https://www.sqlite.org/testing.html
Each user's data is naturally partitioned at the atproto repository level, so this is the sweet spot for per-user SQLite databases. It would make total sense for a PDS instance to have just a single user on it, and in fact that is likely for many self-hosters. It's also worth noting that the PDS software already had SQLite support, which made this change somewhat easier.
There are legitimate trade-offs to this kind of a system, but it comes out way ahead in this case, and it's not as wild as it may seem to those not familiar with the power of SQLite.
A major consideration is that we're planning to run at least 100+ instances, which would require operating 100+ high availability (primary+replica) Postgres clusters. This would be a huge amount of operational and financial overhead.
We chose this route because it is better for us as a small team, with relatively limited resources. But it also has the property of being much easier for self-hosters, of which we hope there will be many.
Can someone who knows more about Bluesky explain what data is stored in SQLite and what isn't? Because I assume it isn't messages etc. between users.
I assume that messages between users are stored in those SQLite DBs.
Think email. When you send an email and CC five other people as well then seven people now have the same copy of the email stored on their email servers. That is, there’s no central database that contains a single email that is referenced by others.
This is basically how sharding with relational DBs works as well.
This sort of data denormalization is almost a requirement as applications scale and especially for many-to-many applications that have a high write to read ratio.
Low write to read and you can get away with a single master to many slave relational DB architecture for quite astonishing numbers of requests and data!
By messages, do you mean direct messages (private messages between two parties)? Because Bluesky doesn't have those at the moment. There's only public messages broadcast to the world.
Haven't done any research to determine if there are plans for direct messages.
It's all your posts and replies as a user. While they currently host the only* PDS themselves, the end goal is for every end user to have their own PDS. Inrupt/SOLID calls this concept a "pod".
*(actually they just onboarded a second production PDS yesterday.. progress!)
Will the BGS also be federated, or is that to be the centralized big spider in Bluesky's web?
In theory you can migrate between BGSes, but you can always just use one at any point in time.
In practice no one will switch because it makes no sense to do it. If there happen to ever be more than one real BGS contender, it will be from something like Cloudflare that will just replicate everything Bluesky Inc decides.
I don't know if it does not make sense. AFAIU these BGSes could be special-purposed e.g. for a business, community or topic of interest. Why wouldn't it make sense to synchronise the collected data between these BGSes and get a combined view on the data? With just a single BGS we have another centralized big tech platform. I think decentralized BGSes are a major factor in how interested people are in becoming part of the ecosystem.
The BGS is a "dumb" relay and mirror of the network, so it generally shouldn't matter which one your client app is ultimately sourcing data from.
But yes, anyone is free to operate a BGS. It does necessarily require a non-trivial amount of storage, compute, and bandwidth. A funded startup, a well-funded non-profit, or just about any cloud provider could likely afford to run one.
It's also entirely possible to operate a BGS that only mirrors a slice of the network (for instance, only users in one country) if desired, which could in some cases make it affordable for a single user or small coop to operate.
Always happy to see more server-side SQLite/Litestream adoption, which we've also been using to build our new apps.
SQLite + Litestream is an even better choice for tenant databases; it's vastly cheaper to replicate/back up to S3/R2 than expensive managed cloud databases [1] (up to 3900% cheaper vs SQL Server on Azure).
What does 3900% cheaper mean? I don't get it.
yeah.. a weird way to say 39 times cheaper ;)
100% of, say 42 is 42. So 100% less than 42 is 0.
3900% cheaper makes no sense.
I sure hope they don’t ever want to change their db structure.
Why not use Postgres with RLS (row-level security)?
- simpler db client
- simpler cloud architecture
- simpler resource management
- simpler partial backups/restore
- simpler compliance with law enforcement
- partitioning might be easier, e.g. when handling "user account storage which should be undo-able for a while". For example, long-term absent users' data could be moved to cold storage; blocked/deleted users' data could be moved to some scheduled-for-deletion space, allowing it to be undone for a while but then reliably auto-deleted; a copy of a user's data where crime detection triggered (e.g. CSAM) could be moved to a quarantine space; etc. Each of these spaces can be a completely different server with different storage methods, retention policies, virtual access control, and physical access control. Sure, you can have all of that with RLS + partitioning + triggers + roles in Postgres, but since it's the personal data store of a single user you don't need cross-user FK constraint enforcement, and this approach makes it much easier to make sure you don't miss anything wrt. access control or forget to partition/move some columns of a new table, etc.
- maybe simpler billing for storage ("just" size of DB)
Now, simpler doesn't mean better, but it often pays off as long as you don't run into the limits of what is possible with the simpler architecture (and as far as I can tell you can shard this approach really nicely, so there at least shouldn't be scaling-performance limits; scaling-cost and future feature-complexity limits might still apply).
You didn't systematically document "harder".
Anyone interested in joining Bluesky, please grab these. I have extra and I've already invited all my Twitter mutuals I wanted to invite.
Edit: I'm all out now :)
All used :( Do you have any more?
At a previous fintech role the company would store customer accounts as encrypted sqlite3 files on blob storage ... this worked out decently well for our access patterns.
How did they lock the file when re-uploading it after edits?
That sounds like centralizing
good luck with running updates.
This should make leaving the service rather simple. Download your sqlite file and throw up a simple local-only html front end to the data and you're solid.
That looks like the PR from hell - 190 files changed, 143 commits? Mostly with names like "tidy" and "wip"