I often go down rabbit holes like this, trying to collapse and simplify the application stack.
But inevitably, as an application grows in complexity, you start to realize _why_ there's a stack, rather than just a single technology to rule them all. Trying to cram everything into Postgres (or lambdas, or S3, or firebase, or whatever other tech you're trying to consolidate on) starts to get really uncomfortable.
That said, sometimes stretching your existing tech is better than adding another layer to the stack. E.g. using postgres as a message queue has worked very well for me, and is much easier to maintain than having a totally separate message queue.
I think the main takeaway here is that postgres is wildly extensible as databases go, which makes it a really fun technology to build on.
I have experience with certain technologies, e.g. SQS and Postgres.
Say I'm on your team, and you're an application developer, and you need a queue. If you're taking the "we're small, this queue is small, just do it in PG for now and see if we ever grow out of that" — that's fine. "Let's use SQS, it's a well-established thing for this and we're already in AWS" — that's fine, I know SQS too. I've seen both of these decisions get made. (And both worked: the PG queue was never grown out of, and generally SQS was easy to work with & reliable.)
But what I've also seen is "Let's introduce bespoke tech that nobody on the team, including the person introducing it, has experience in, for a queue that isn't even the main focus of what we're building" — this I'm less fine with. There needs to be a solid reason why we're doing that, and that we're going to get some real benefit, vs. something that the team does have experience in, like SQS or PG. Instead, this … thing … crashes on the regular, uses its own bespoke terminology, and you find out the documentation is … very empty. This does not make for a happy SRE.
Ok. I get that. But to play devil's advocate: with that mentality we'd never learn a new technology and would still be stuck on punch cards. And I don't have the time anymore for hobby projects. I'd say it's ok to introduce something new as long as it's one thing at a time and not an entire new stack in the "a rewrite will solve all problems" projects.
To me this argument sounds like “I don’t have time for hobby projects, so I’m going to treat this professional one as a hobby”.
I always start a professional project with technologies I am intimately familiar with - have used myself, or have theoretical knowledge of and access to someone with real experience.
There has never been a new shiny library/technology that would have saved more than 10% of the project time, in retrospect. But there have been many that would have cost 100% more.
This isn't a dichotomy.
That is the point of DDD, SoA, Clean, and Hexagonal patterns.
Make a point to put structures and processes in place that encourage persistence ignorance in your business logic as the default and only violate that ideal where you have to.
That way if you outgrow SQL as a message bus you can change.
This mindset also works for adding functionality to legacy systems or breaking apart monoliths.
Choosing a default product to optimize for delivery is fine, claiming that one product fits all needs is not.
Postgres does have limits when being used as a message or event bus, but it can be low risk if you prepare the system to change if/when you hit those limits.
Letting ACID concepts leak into the code is what tends to back organisations into a corner that is hard to get out of.
Obviously that isn't the Kool-Aid this site is selling. This advice in particular is destructive unless you are intentionally building a monolith:
"Simplify: move code into database functions"
At least for any system that needs to grow.
I was not saying "Postgres is all you'll ever need". I was just replying to
As a general principle.
I take your point but you don’t explain how you came to be intimately familiar with those technologies in the first place. Applied consistently, this logic would seem to preclude becoming familiar with anything.
For projects where I have a paying customer, this rule is absolute; I do not experiment on my client's time (and dime) unless they specifically request it.
But I do have projects which I finance myself (with myself as customer), and which do not have a real deadline. I can experiment on those. Call them "hobby" projects if you insist.
Well, project requirements always rank higher, and many projects require some piece I am unfamiliar with (a new DB, e.g. MSSQL; a new programming language; etc.). That means one does get familiar on a need basis, even applying this approach robotically.
If a project requires building the whole thing around a new shiny technology with few users and no successful examples I can intimately learn from ... I usually decline taking it.
I'm okay with new technology, actually, but the person introducing it has to be able to champion it & do the work of debugging issues and answering questions about its interactions with the rest of the system. I.e., they have to be responsible for it.
The last part in my parent comment is more of a "it was chucked over the fence, and it is now crashing, and nobody, not even the devs that chose it, know why".
I do have examples of what you describe, too: a dev I worked with introduced a geospatial DB to solve issues with geospatial queries being hard & slow in our then-database (RDS did not, at the time, support such queries) — so we went with the new thing. It used Redis's protocol, and was thus easy to get working with¹. But the dev that introduced it to the system was capable of explaining it, dealing with issues with it — to the extent of "upstream bugs that we encounter and produce workarounds", and otherwise being a lead for it. That new tech, managed in that way by a senior eng., was successful in what it sought to do.
The problematic parts/components/new introductions of new tech … never seem to have that. That's probably partly the problem: it's such an inherently non-technical issue at its heart. The exact thing almost doesn't matter.
IME it's not. When there are problems, it's never just one new thing at a time.
And the particular system I had in my mind while writing the parent post was, in fact, in the category of "a rewrite will solve all problems".
Some parts of the rewrite are doing alright, but especially compared to the prior system, there are just so. many. new. components. 2 new queue systems, new databases, etc. etc. It's hard enough to learn one, particularly without someone championing its success; it's another thing entirely to self-learn and self-bootstrap on 6 or 8 new services.
¹(Tile38)
This desire can sometimes be so strong that people insist on truly wacky decisions. I have before demonstrated that Postgres performs perfectly well (and in fact exceeds) compared with a niche graph database, and heard some very strange reasons for why this approach should be avoided. A lot of the time you hear that it's engineers who chase shiny technology, but I've seen first hand what can happen when it's leadership.
Can you expand on Postgres vs Graph Databases?
IIRC, the biggest graph DB in the world, TAO from Facebook, is based on an underlying MySQL. There must be a good reason why FB prefers a SQL DB over a dedicated graph DB.
It is easy to represent a graph in Postgres using edge and node tables. For the use case we have, it is more performant to query such a setup for many millions of relationships vs using the big names in graph databases.
You just need a little bit of appropriate index selection and ability to read the output of EXPLAIN ANALYZE to do so.
There are probably use cases where this doesn't hold, but I found in general that it is beneficial to stick to Postgres for this, especially if you want some ability to query using relations.
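As a rough sketch of the edge/node approach described above (table and column names here are illustrative, not from any particular production schema), reachability queries fall out of a recursive CTE, and the index choices are what EXPLAIN ANALYZE helps you validate:

```sql
-- Illustrative graph-in-Postgres schema.
CREATE TABLE nodes (
  id    bigint PRIMARY KEY,
  label text
);

CREATE TABLE edges (
  src bigint NOT NULL REFERENCES nodes(id),
  dst bigint NOT NULL REFERENCES nodes(id),
  PRIMARY KEY (src, dst)       -- also serves forward traversal (src -> dst)
);
CREATE INDEX ON edges (dst);   -- supports reverse traversal (dst -> src)

-- All nodes reachable from node 1, up to 5 hops out.
WITH RECURSIVE reachable(id, depth) AS (
  SELECT dst, 1 FROM edges WHERE src = 1
  UNION                        -- UNION (not UNION ALL) drops duplicate rows
  SELECT e.dst, r.depth + 1
  FROM edges e
  JOIN reachable r ON e.src = r.id
  WHERE r.depth < 5            -- depth cap guards against cycles
)
SELECT DISTINCT id FROM reachable;
```

The depth cap and the `UNION` deduplication are what keep cyclic graphs from recursing forever; whether the planner uses the primary key or the `(dst)` index is exactly the kind of thing you confirm with EXPLAIN ANALYZE.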
Often referred to as resume driven development.
RDD leaves serious wreckage in its wake.
I've been on both sides of this..
RabbitMQ and Elasticsearch for a public-facing site: a dedicated queue for workers to denormalize and push updates to Elastic. Why? Because the $10k/month RDBMS servers couldn't handle the search load and were overly normalized. Definitely a hard sell.
I've also seen literally hundreds of lambda functions connecting to dozens of DynamoDB tables.
I'm firmly in the camp of use an RDBMS (PostgreSQL my first choice) for most things in most apps. A lot of times you can simply apply the lessons from other databases at scale in pg rather than something completely different.
I'm also more than okay leveraging a cloud's own MQ option, it's usually easy enough to swap out as/if needed.
I worked in a team that did this. It was mostly staffed by juniors, and the team leader wasn't very interested in the technical aspects, they just went to a page, checked that the new feature worked alright and gave the green light.
So over the years these juniors have repeatedly chosen different tech for their applications. Now the team maintains like 15-20 different apps and among them there's react, Vue, angular, svelte, jQuery, nextjs and more for frontends alone. Most use Episerver/Optimizely for backend but of course some genius wanted to try Sanity so now that's in the mix as well.
And it all reads like juniors built it. One app has an integration with a public api, they built a fairly large integration app with an integration db. This app is like 20k lines of code, much of which is dead code, and it gets data from the public api twice a day whereas the actual app using the data updates once a day and saves the result in its own Episerver db. So the entire thing results in more api traffic rather than less, the app itself could have just queried the api directly.
But they don't want me to do that, they just want me to fix the redundant integration thing when it breaks instead. Glad I'm not on that team any more.
I think SQS is cheap enough to build on as a messaging queue even if you're not hosting within AWS.
Among the widely underrated AWS services are SNS and SES, and they are not a bad choice even if you're not using AWS for compute and storage.
SQS is at-least-once; PG can give you exactly-once.
Not sure why this is making people upset.
Because it's incorrect. If you have any non-postgres side-effect, you can't have exactly-once (unless you do 2PC or something like that). There isn't any technology that gives you "exactly once" in the general case.
That's not how exactly-once is defined for a queue. We are talking about the semantics the queue system provides.
Nobody will understand it like that.
Anyone who has ever selected a queue service/product will understand it like that, because that's one of the most prominent features those products highlight:
SQS Standard queues support at-least-once message delivery.
NATS offers "at-most-once" delivery
etc.
SQS FIFO has exactly-once processing
Well, that's a stretch: it has a "5 minute window". You can hold a lock on a row in a PG queue for as long as you need.
pgmq (which is linked in this gist) provides an API for this functionality. It can be 0 seconds, or 10 years if you want. It's not a row lock, which can be expensive; in pgmq it's built into the design of the visibility timeout. FOR UPDATE SKIP LOCKED is there to ensure that only a single consumer gets any message, and then the visibility timeout lets the consumer determine how long the message should remain unavailable to other consumers.
You get exactly once when you consume with pgmq and run your queue operations inside transactions in your postgres database. I can't think of an easy way to get some equivalent on SQS without building something like an outbox.
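The core of the pattern being described boils down to something like this sketch (simplified and illustrative; pgmq's actual schema and API differ, and it uses a visibility timeout rather than plain delete-on-read):

```sql
-- Hypothetical minimal jobs table.
CREATE TABLE jobs (
  id         bigserial PRIMARY KEY,
  payload    jsonb NOT NULL,
  visible_at timestamptz NOT NULL DEFAULT now()
);

-- Consume one message: claim it, do the work, remove it, all in one transaction.
BEGIN;

WITH next_job AS (
  SELECT id
  FROM jobs
  WHERE visible_at <= now()
  ORDER BY id
  LIMIT 1
  FOR UPDATE SKIP LOCKED       -- concurrent consumers skip rows we hold
)
DELETE FROM jobs
WHERE id IN (SELECT id FROM next_job)
RETURNING payload;

-- ...any other database writes for this message happen here,
-- inside the same transaction as the dequeue...

COMMIT;  -- dequeue and side effects commit (or roll back) together
```

If the consumer crashes before COMMIT, the lock is released, the row is still there, and another consumer picks it up: that atomicity between "message consumed" and "work recorded" is the exactly-once property the parent comment is talking about, and it only holds while the side effects live in the same database.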
SQS FIFO has exactly-once processing
That's not what the parent post was referring to. If SQS (or your ability to talk to it) is down and your database isn't, what do you do?
The problem is rarely cost, it's operational overhead.
Using SQS for a queue rather than my already-existing Postgres means that I have to:
- Write a whole bunch of IaC, figuring out the correct access policies
- Set up monitoring: figure out how to monitor, write some more IaC
- Worry about access control: I just increased the attack surface of my application
- Wire it up in my application so that I can connect to SQS
- Understand how SQS works, how to use its API
It's often worth it, but adding an additional moving piece into your infra is always a lot of added cognitive load.
+1. And then you have to figure everything out one more time when you decide to move to (or to add support for) Azure/GCP.
My saying has always been: be nice to the DB
Don't use it any more than you have to for your application. Other than network IO, it's the slowest part of your stack.
Would you say it's slower than file IO too?
It’s not slow by itself. It’s a single point of bottleneck that will inevitably become slow as you cram everything into it.
...but by trying to avoid the bottleneck and moving things to the backend, you make things 10x worse resource-wise for the DB. So it is not an easy tradeoff.
Take any computation you can do in SQL like "select sum(..) ...". Should you do that in the database, or move each item over the network and sum them in the backend?
Summing in the database uses a lot less resources FOR THE DB than the additional load the DB would get from "offloading" this to backend.
More complex operations would typically also use 10x-100x less resources if you operate on sets and amortize the B-tree lookups over 1000 items.
The answer is "it depends" and "understand what you are doing"; nothing about it is "inevitable".
Trying to avoid computing in the DB is a nice way of thinking you maxed out the DB ...on 10% of what it should be capable of.
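The sum example above looks roughly like this (table and column names are made up for illustration):

```sql
-- In the database: one round trip, one row over the wire.
-- The aggregation happens next to the data.
SELECT sum(amount)
FROM orders
WHERE created_at >= now() - interval '1 day';

-- "Offloaded" to the backend: every matching row crosses the network,
-- gets parsed and materialized application-side, only to be reduced
-- to the same single number.
SELECT amount
FROM orders
WHERE created_at >= now() - interval '1 day';
```

Both queries make the DB scan the same rows, but the second also makes it serialize and ship all of them, which is the extra load the parent comment is pointing at.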
Yes. Aggregations and search are often best done as close to the data as possible, in the DB.
Rendering html, caching, parsing api responses, sending emails, background jobs: Nope.
Basically, use the database for what it’s good at, no more.
Kind of irrelevant since a DB provides some guarantees that a simple file does not by default.
GP was responding to a comment comparing it to network IO in terms of bottlenecks in your application stack ...?
Well, it is file IO, plus processing on top. But it's not that simple, since if your data is small enough it can all be loaded into memory, allowing you to sidestep any file IO. But you still have the processing part...
Handling business logic in the database is often going to be an order of magnitude faster than the application layer of some of the popular language stacks (looking at you, Rails, Node, etc). It will also outlive whatever web stack of the day (and the acquisition, which often requires a rewrite of the application layer but keeps the general database structure; been there, done that).
Maybe faster... but I've met very few developers who are good DBAs (who understand procedures, cursors, permissions, etc.). Database schema versioning / consistency is a whole other level of pain too.
That sounds like a social problem, not a technical problem.
PG works really well as a message queue, and there are several excellent implementations on top of it.
Most systems are still going to need Redis involved just as a coordinator for other pub/sub related work unless you're using a stack that can handle it some other way (looking at BEAM here).
But there are always going to be scenarios as an application grows where you'll find a need to scale specific pieces. Otherwise though, PostgreSQL by itself can get you very, very far.
Worth noting that Postgres has a pubsub implementation built in: listen/notify.
https://www.postgresql.org/docs/current/sql-notify.html
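The basic shape of it, with a made-up channel name and payload:

```sql
-- Session A: subscribe to a channel.
LISTEN order_events;

-- Session B: publish. The payload is a plain string
-- (capped at roughly 8000 bytes).
NOTIFY order_events, '{"order_id": 42, "status": "shipped"}';

-- Equivalent function form, usable from triggers and
-- wherever the channel name needs to be computed:
SELECT pg_notify('order_events', '{"order_id": 42, "status": "shipped"}');
```

Notifications fire on transaction commit, so a rolled-back transaction never notifies listeners, which is a nice property compared to publishing to an external broker mid-transaction.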
Yep, I should add that. One of the libraries in my list (that I maintain) is WalEx: https://github.com/cpursley/walex/issues
It subscribes to the Postgres WAL and lets you do the same sort of thing you can do with listen/notify, but without drawbacks like the need for triggers or character limits.
What's the drawback to a trigger? I would think that any overhead you recouped by avoiding a trigger would be offset by the overhead of sending the entire WAL to your listener, rather than the minimized subset of events that listener is interested in.
(To be clear I do see other downsides to listen/notify and I think WalEx makes a lot of sense, I just don't understand this particular example.)
You don’t send the entire WAL, just what you subscribe to - and you can even filter via SQL: https://github.com/cpursley/walex?tab=readme-ov-file#publica...
This post describes some of the other issues with listen/notify trigger approach: https://news.ycombinator.com/item?id=36323698
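As an illustration of the filtering point (the table and predicate here are invented; see the WalEx README for its own configuration options), Postgres 15+ lets a logical-replication publication filter rows with a plain SQL predicate, so subscribers only ever see the subset they asked for:

```sql
-- Only rows matching the WHERE clause are published to
-- subscribers of this publication.
CREATE PUBLICATION order_events
  FOR TABLE orders WHERE (status = 'shipped');
```

That filtering happens inside Postgres before anything hits the wire, which is why "subscribing to the WAL" doesn't have to mean "receiving the entire WAL".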
Going to add this to my research list.
Ping me if you have any questions. Long time fan of your blog.
That is really cool to hear, thank you.
And I should have mentioned it before, but we have an open call for speakers for the Carolina Code Conference. This would make for an interesting talk I think.
Oh yea, definitely aware of it. I believe many of the queuing solutions utilize it as well.
I've read a lot of reports (on here) that it comes with several unexpected footguns if you really lean on it though.
It's also worth noting that by using PG as a message queue, you can do something that's nearly impossible with other queues - transactionally enqueue tasks with your database operations. This can dramatically simplify failure logic.
On the other hand, it also means replacing your message queue with something more scalable is no longer a simple drop-in solution. But that's work you might never have to do.
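A minimal sketch of the transactional-enqueue pattern described above (table and column names are illustrative):

```sql
-- The business write and the job enqueue commit or roll back together,
-- so you never get an order without its job, or a job for an order
-- that was never created.
BEGIN;

INSERT INTO orders (customer_id, total)
VALUES (42, 99.50);

INSERT INTO jobs (payload)
VALUES ('{"task": "send_receipt", "customer_id": 42}'::jsonb);

COMMIT;
```

With an external queue, achieving the same guarantee means building an outbox table plus a relay process, which is exactly the failure logic this pattern lets you skip.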
Some applications never grow that much
But it surely will! It will!
See, now that we're profitable, we're gonna become a _scale up_, go international, hire 20 developers, turn everything into microservices, rewrite the UI our customers love, _not_ hire more customer service, get more investors, get pressured by investors, hire extra c-levels, lay-off 25 developers and the remaining customer service, write a wonderful journey post.
The future is so bright!
Honestly, that's one of the reasons I never want to monetize my work and stay miles away from the software industry. Modern world is all web apps that require you to subscribe to 20 different 3rd party services to even build your app. So you rack up bills before your product is even remotely lucrative...
Building an app with no third party dependencies seems impossible nowadays. At least if you plan to compete.
I mean, you _can_ host your staging environment on a Minisforum PC hidden in your closet and then deploy to Hetzner, and probably save a _ton_ unless your service benefits from things like, say, autoscaling or global low-latency access.
Niches where you can get away with that are limited, not just by technical challenges but because large parts of the social ecosystem of IT won't like that. But they do exist. There's also still things that aren't webapps _at all_, there's software that has to run without internet access. It's all far apart and often requires specialized knowledge, but it exists.
Yea I mostly learned web dev so far but wanted to get into IoT stuff so I might find something cool to do in there.
I would go further and even say 'most' applications never grow that much.
The same argument of UNIX design patterns (single responsibility, well-defined interfaces and communication protocols) vs monolithic design patterns comes up a lot. I think that it's mainly because both are effective at producing products; it's just that they both have downsides.
I read a meme yesterday about how you can just interject "it's all about finding that balance" into any meeting and people will just agree with you. I'm gonna say it here.
Sometimes a flexible tool fits the bill well. Sometimes a specialized tool does. It's all about finding that balance.
Thank you for coming to my TED talk.
Just noting that sometimes one can do both: separate Postgres DBs/clusters for different use cases, separate instances of a web server for TLS termination, caching, routing/rewriting, and (edit) static asset serving. The benefit is an orderly architecture with fewer different dependencies.
I think a lot of the industry struggles with the idea that maybe there is no "one size fits all", and what makes sense when you're a one-person company with 100 customers probably doesn't make sense when you're a 1000-person company with millions of customers.
If you use a stack meant for a huge userbase (with all the tradeoffs that comes with it) but you're still trying to find market fit, you're in for a disappointment
Similarly, if you use a stack meant for smaller projects while having thousands of users relying on you, you're also in for a disappointment.
It's OK to make a choice in the beginning based on the current context and environment, and then change when it no longer makes sense. Doesn't even have to be "technical debt", just "the right choice at that moment".
Yep. And Postgres is a really good choice to start with. Plenty of people won't outgrow it. Those who do find it's not meeting some need will, by the time they need to replace it, have a really good understanding of what that replacement looks like in detail, rather than just some hand-wavy "web scale".
True enough, and with modern hardware that barrier is relatively high. IIRC Stack Overflow was handling several million users on a single database server over a decade ago... We've got over 8x the compute power and memory now.
Still need to understand the data model and effects on queues though.
The more I do fullstack work, the more I see an obesity crisis. I understand the need to modularize (I dearly think I do), but god: you have the relational model, reimplemented in your framework, re-encoded in a middleware to handle URL parsing, then one more layer to help integrate things client-side. I find that insane. And PostgREST was a refreshing idea.
Seriously. There's like 7000 duplicates of the very same data layer in a single stack: database, back-end ORM/data mapper, front end, and various caching things in between. Things like PostgREST and Hasura are great paired with fluent clients.
And then there's the failed microservice case... what some people describe as a distributed monolith, where data has to be passed around through every layer, with domain logic replicated here and there.
That's a nicely balanced view. I've been working on the intersection between dev, sec and ops for many, many years and one of the most important lessons has been that every dependency is a liability. That liability is either complexity, availability, security, wasting resources or projects or key people disappearing. Do anything to avoid adding more service, library or technology dependencies; if necessary, let people have their side projects and technological playgrounds to distil future stacks out of.
There are good reasons to go OLAP or graph for certain kinds of problems, but think carefully before adding more services and technologies because stuff has a tendency to go in easily but nothing ever leaves a project and you will inevitably end up with a bloated juggernaut that nobody can tame. And it's usually those people pushing the hardest for new technologies that are jumping into new projects when shit starts hitting the fan.
If a company survives long enough (or cough government), a substantial and ever increasing amount of time, money and sec/ops effort will go into those dependencies and complexity cruft.
This is very much the way I'm pushing in our internal development platform: I want to offer as few middlewares as possible, but as many as necessary. And ideally these systems are boring, established tech covering a lot of use cases.
From there, Postgres ended up being our relational storage for the platform. It is a wonderful combination of supporting teams by being somewhat strict (in a flexible way) as well as supporting a large variety of use cases. And after some grumbling (because some teams had to migrate off of SQL Server, or off of MariaDB, and data migrations were a bit spicy), agreement is growing that it was a good decision to commit to a DB like this.
We as the DB-Operators are accumulating a lot of experience running this lady and supporting the more demanding teams. And a lot of other teams can benefit from this, because many of the smaller applications either don't cause enough load on the Postgres Clusters to be even noticeable or we and the trailblazer teams have seen many of their problems already and can offer internally proven and understood solutions.
And like this, we offer relational storage, file storage, object storage and queues, and that seems to be enough for a lot of applications. Only now, after a few years, are we adding in OpenSearch as a service for search, vector storage and similar use cases.
Architecturally, there are other cases besides message queues where there's no reason for introducing another layer in the stack, once you have a database, other than just because SQL isn't anybody's favorite programming language. And that's the real reason there's a stack.
On top of that, a lot of discourse seems to happen with an assumption that you only make the tech/stack choice once.
For the majority of apps, just doing basic CRUD with a handful of data types, is it that hard to just move to another DB? Especially if you're in framework land with an ORM that abstracts some of the differences, since your app code will largely stay the same.