I tried a similar approach in a previous startup - treat data as documents and store in a JSONB field.
Postgres was awesome and handled this brilliantly, but the lack of schema and typing killed it. We just ended up fighting data quality the whole time. We couldn't assume that any document had all the required fields, or that they were in a format that made sense, e.g. the Price field sometimes had currency symbols, sometimes used UK/US comma-and-period formatting, and sometimes European formatting. Sorting by Price meant complicated parsing of every record first (sketched below).
We moved back to relational tables.
I won't say I'd never do this again, but I would definitely not just throw JSON documents to a database and expect good things to happen.
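To make the price example concrete, here's a minimal sketch of the contrast (table and field names are invented): sorting on a free-form price string means normalizing it on the fly, and the naive version silently mis-parses European formatting, whereas a typed column pushes the problem back to write time.

```sql
-- Hypothetical table: documents in a JSONB column, price stored as free text.
-- Naive normalization: strip symbols, drop commas, cast. Works for "$1,234.56",
-- silently mis-parses European "1.234,56" -- exactly the trap described above.
SELECT doc->>'price' AS raw_price
FROM listings
ORDER BY replace(regexp_replace(doc->>'price', '[^0-9.,]', '', 'g'), ',', '')::numeric;

-- With a typed column, bad values are rejected (or normalized) at write time instead.
ALTER TABLE listings ADD COLUMN price numeric(12,2);
```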
If you have schema requirements, why _not_ do it with tables and columns instead? The point of the jsonb column would be precisely to store arbitrary stuff that has no schema.
I usually see this (dynamic scripting langs, schemaless databases) play out as a very good environment to prototype in: very few barriers to changing things or making stuff up as you go. Then the prototype is not discarded but "iterated" into the actual production code, and the (now needlessly) flexible JSON store is grandfathered in.
I have yet to come across an actual use case for JSON stores in a production app with an established design. What the hell do you mean you have no idea what the data might hold?? Why are you storing unknown, undefined, undefinable things??? Or perhaps there actually is a schema, i.e. fields we rely on being present, but we were too lazy to define it formally?
Adjacent data that is very alike for most usage but different in subtle ways.
Classified ads: the additional details for a Car are different from those for a Shirt, but both are ads. And adding a nearly infinite number of columns, or a flexible system of detail tables, is significantly worse than unstructured JSON (sketch below).
Another would be records from different, but related systems. Such as the transaction details for an order payment. Paypal data will be different from your CC processor, or debit transaction, but you can just store it as additional details for a given "payment" record.
Another still would be in healthcare, or any number of other systems where the structures will be different from one system to another depending on data interchange, where in the storage you don't care too much, only in the application layer will it make any necessary difference.
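To illustrate the classified-ads case, here's roughly the layout I mean (a sketch, all names invented): the columns every ad shares stay relational, and the per-category extras go in a JSONB details column.

```sql
-- Hypothetical classified-ads layout: shared columns relational,
-- per-category details in JSONB.
CREATE TABLE ads (
    id       bigserial PRIMARY KEY,
    category text  NOT NULL,             -- 'car', 'shirt', ...
    title    text  NOT NULL,
    price    numeric(12,2),
    details  jsonb NOT NULL DEFAULT '{}'::jsonb
);

-- A car ad and a shirt ad share the table but carry different details.
INSERT INTO ads (category, title, price, details) VALUES
  ('car',   'Reliable hatchback', 4500, '{"mileage_km": 98000, "transmission": "manual"}'),
  ('shirt', 'Linen shirt',          25, '{"size": "M", "colour": "blue"}');
```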
Customers sent custom things.
Healthcare. Clinical data is unstructured and constantly changing. Would you build out a table with 2000 columns that changed yearly? What about 5000?
Data owned by another service/application/org that is written and read without modification.
For example, a healthcare document that wasn't built or technically owned by the application storing it.
For example, a web text editor that serializes its state as JSON.
Not JSON, but: a web scraper storing HTML documents.
These have structure, it's only that the structure is built/maintained outside of the application storing it. You could of course transform it but I think it's a bit obvious where that might not be worth the cost/complexity.
Well, there are a decent number of those use cases, but the one I find most compelling is storing an exact record of what another service sent me in case I need it later. I pull out and store in my typed schema the relevant bits of, say, an OpenID Connect response, but I'll often store the entire response as well, in case I want to add functionality later based on additional fields in that response and roll forward without needing an intermediate state.
You can always change the source of truth so that, instead of the JSON text, it's the properly normalized schema, and then have a VIEW that returns JSON for apps that need JSON. And if need be you can denormalize a bit and keep a JSONB column that stores the document as a serialization of the row's other columns.
That's what's really nice about SQL: you can perform these sorts of schema surgeries and still retain backwards-compatibility using VIEWs.
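A minimal sketch of that pattern, with invented table names: the normalized table stays the source of truth, and a VIEW re-serializes each row as a document for consumers that still want JSON.

```sql
-- Hypothetical normalized table: the source of truth.
CREATE TABLE orders (
    id          bigint PRIMARY KEY,
    customer_id bigint NOT NULL,
    total       numeric(12,2) NOT NULL
);

-- Backwards-compatible document view: the whole row serialized to JSONB.
CREATE VIEW orders_json AS
SELECT o.id, to_jsonb(o) AS doc
FROM orders o;
```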
A use case for me is calling a lot of different third-party APIs that return similar things but in widely different data structures, with a lot of information specific to each site. I take what is standard across all the APIs and store that in my relational database, but anything I didn’t use I store in a JSONB field just in case I need it some time in the future.
In other words I’m not really using anything in that JSONB field… at least right now.
We're evaluating solutions to ingest business app events and infrastructure logs. A document-ish, JSON-ish "DB"/search engine seems like the best solution, because no one does structured logs properly.
Now we have to pick the stack: ELK, Loki + Grafana, Graylog, or maybe just dump it into MongoDB?
I think my learning was "all data has a schema, eventually".
We thought it was just a bunch of data documents. But it turned out that to actually use that data in an application it had to have predictability, we had to know certain fields were present and in a fixed format. You know: a schema.
I'm confused about this. Since when can't you enforce a schema on a document you're storing? Did you have no way to change the data before it hit the database?
NoSql stores still have schemas.
Sounds like most/all of these issues would be solved by validation at the app layer.
True, but for example with db-level validations I don’t have to worry that a new developer ships a batch job that runs raw inserts for performance reasons and bypasses app layer validations.
I'm not advocating for no database level validations. One can and should have both in my opinion.
Postgres supports constraints on jsonb as well.
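For example (hypothetical table and keys), a CHECK constraint can at least require the fields you rely on and check their JSON types:

```sql
-- Hypothetical table "payments" with a JSONB column "doc".
ALTER TABLE payments
  ADD CONSTRAINT payment_doc_shape CHECK (
        doc ? 'provider'
    AND doc ? 'amount'
    AND jsonb_typeof(doc->'amount') = 'number'
  );
```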
I would generally advocate for no write access outside of the app as well. Certainly for new developers. Get some tests on that batch job.
FWIW I think OP was referring to app code still, but code opting in to skipping app data validations. Rails for example makes this very easy to do, and there are tons and tons of people out there selling this as a performance improvement (this one trick speeds up bulk operations!!!!). There are times where it’s useful, but it’s a very sharp knife that people can easily misuse.
So yeah, anyway, have both db constraints and app validations.
Thanks for pointing that out, I did misread that and have seen/written such things in Rails land. We used activerecord-import, which gives the option of running validations.
But yeah, layering your safety nets is generally wise. I would include testing and code review in that as well.
What? You should be exposing your data access layer as a microservice and all your db level validations should be done there...
You can even write a small DSL syntax language to make it easier for developers to use, and perhaps an Obviously Redundant Middleware that sits between them to convert their programming-language objects to the DSL. Add some batch support, perhaps transactional locks (using Mongo, we want to be webscale after all) and perhaps a small terminal-based client, and voila, no one should ever need to deal with petty integrity in the db again.
This post is a fantastic example of Poe's Law. You had me for a second.
App layer validation is for the user’s sanity. DB layer validation is for the developer’s sanity.
Ehhh, most dbs are managed by migrations in the application, so it's essentially duplicated logic.
Most databases eventually have a human typing an INSERT statement or the moral equivalent. It's not duplication to have your correctness functions protect that, too.
I like this as a general rule, but I almost always have more refined validation at the app layer -- particularly when using jsonb.
Tool of choice is ecto which has excellent support for maintaining jsonb structure.
The view I now have is that for a relational table, yes, you have to suffer through migrations, but at least they are in SQL. For document-based stores you still have migrations; they are just implemented in code.
JSON documents sound great, especially initially, but end up being a maintenance nightmare.
The problem with that - in my experience - is that migrating the structure of thousands (if not millions) of documents is way slower than running a DDL command, as it means reading each document, parsing it, modifying it, and writing it back. Many DDL commands are just metadata updates to the system catalogs, so they are quite fast (e.g. adding a new column with a default value). With documents you wind up with millions of single-row updates.
This can be mitigated by doing a "lazy" migration when a document with the old structure is first read. But that makes the code much more complicated.
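Roughly the contrast, with invented table names: the relational change is essentially a catalog update (for non-volatile defaults, since Postgres 11), while the document version rewrites every row unless you migrate lazily.

```sql
-- Relational: effectively instant, no table rewrite for a constant default.
ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'pending';

-- Documents in JSONB: the "same" change touches every stale row.
UPDATE order_docs
SET doc = jsonb_set(doc, '{status}', '"pending"')
WHERE NOT (doc ? 'status');
```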
Or, to put it another way: yes, you have to write and maintain an upfront schema, but a document-based system has a schema too, it's just distributed amongst 50 codebases and 10 people's heads.
I don’t do much backend work at the moment but I found using JSONB fields for prototyping endpoints to be quite handy. Once I had a feature figured out and working, I’d create a proper schema.
I've found Postgres' Hstore type useful for this, too.
I often create a "metadata" hstore field on a table and use it for random bits of data I don't want to create an actual field for yet. When I find that the application needs that bit of data and in a certain format, I'll move it into an actual field.
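The pattern looks something like this (a sketch, names are made up; requires the hstore extension):

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE widgets (
    id       bigserial PRIMARY KEY,
    name     text NOT NULL,
    metadata hstore NOT NULL DEFAULT ''::hstore  -- grab bag for not-yet-promoted bits
);

-- Later, once a key proves it deserves a real field:
ALTER TABLE widgets ADD COLUMN weight_grams integer;
UPDATE widgets SET weight_grams = (metadata -> 'weight_grams')::integer,
                   metadata     = metadata - 'weight_grams';
```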
hstore is basically the flat precursor to JSONB, and IMO it's as good as obsolete. I wouldn't use hstore in a new project, just JSONB.
Well, this is not a relational issue, is it? It's a data normalization issue.
Yes, true. But that was a single example off the top of my head. There were lots of others.
Not saying the approach isn't suitable for some use cases. Just that I'd be really careful that this is one of those use cases next time.
This is an issue regardless of your type of document storage. There are different options for verifying document details and versioning when working with document-oriented databases. A typical approach is to have a per-record version number and pass documents through a verifier as part of insert, update, or query to ensure all migrations to the current version have occurred.
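For instance, the batch flavour of that is just an UPDATE keyed off the stored version (field and table names below are hypothetical); the lazy flavour runs the same transformation in the verifier when an old document is read.

```sql
-- Bump every document still on version 1: backfill a field and record the new version.
UPDATE events
SET doc = doc || '{"currency": "USD", "schema_version": 2}'::jsonb
WHERE COALESCE((doc->>'schema_version')::int, 1) < 2;
```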
You know, I’ve found that LLMs are awfully good at taking random JSON crap and “converting it” to a consistent format. You don’t even need a large LLM… something like ChatGPT 3.5 or an equivalent class of LLM can do the work just fine.
It’s not perfect and requires a fair amount of effort to nail the prompt but when it works it works.
Had a similar experience. What killed it for me is that no statistics can be gathered for JSONB columns. This in turn really messes with the query planner once you do something like `select a.* from a join b on a.id = b.a_id where b.my_jsonb_column ->> 'foo' = 'bar';`.
Given the lack of statistics, the query planner loves going for a nested loop rather than a hash or merge join where those would be appropriate, leading to abysmal performance (a partial workaround is sketched below).
There is a thread[0] on the PostgreSQL mailing list about adding at least some statistics for JSONB columns, but it has gone nowhere since 2022.
[0]: https://www.postgresql.org/message-id/flat/c9c4bd20-996c-100...
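One partial workaround, assuming Postgres 14+ for the extended-statistics form (on older versions an expression index also gets per-expression statistics): give the planner statistics on the exact expression used in the filter. Table and column names below are taken from the example query above.

```sql
-- Postgres 14+: expression statistics on the extracted key.
CREATE STATISTICS b_foo_stats ON ((my_jsonb_column ->> 'foo')) FROM b;
ANALYZE b;

-- Older versions: an expression index also causes ANALYZE to gather stats for it.
CREATE INDEX b_foo_idx ON b ((my_jsonb_column ->> 'foo'));
ANALYZE b;
```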