The books piqued my curiosity enough to write my own little database
I think many developers go through this phase at some point. I wouldn't even try to fight it. You learn so much about what won't work by doing this. Extremely valuable lessons, if you can spare the time.
Building my own databases gave me more respect for the existing solutions than anything else. Getting the bytes to & from disk quickly isn't the hard part. It's doing it reliably for years on end while supporting use cases you couldn't have even dreamed of.
I often wonder: given how much of the complexity of modern DBMSes exists to satisfy constraints imposed by use-cases that only apply in certain business domains... what efficiencies could we gain if we designed a domain-specific DBMS with the knowledge that use-cases outside the domain are off-limits and can be ignored?
For example, I currently use general-purpose DBs to deal with datasets that are fundamentally append-only. But what if I had a DB that only supported working with append-only data? A DB that literally has no concept of an update or delete of existing rows — you've got insert, and maybe dropping a table/dataset entirely, and that's it. Could such a DB get away with not implementing MVCC transactions? Could such a DB avoid using a write-ahead log (because the tables are each their own write-ahead logs)? Would it store data more efficiently? Could its indexing be chunkwise-atomic rather than whole-table-atomic and so require less locking? Etc.
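To make that concrete, a hypothetical API for such a store might be nothing more than this (all names are made up; just a sketch of the idea):

```java
// Hypothetical API for an append-only store: no update, no delete of individual rows.
// All names here are illustrative, not an existing library.
import java.util.List;
import java.util.function.Predicate;

public interface AppendOnlyTable<R> {
    long append(R row);                  // returns the row's immutable position
    R get(long position);                // random access by position
    List<R> scan(Predicate<R> filter);   // full or filtered scan
    // No update(), no delete() of individual rows -- only dropping the whole table.
    void drop();
}
```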
Have a look at my DB project: https://sirix.io | https://github.com/sirixdb/sirix
https://sirix.io/docs/concepts.html and the in-progress tutorial https://sirix.io/docs/jsoniq-tutorial.html may be especially helpful.
It basically turns updates into appends and is based on a persistent tree structure: the header with a reference to the (revision) root page has to be swapped atomically; other than that, the revision indexes for new data are always appended. In order to reduce copy-on-write overhead for updated page fragments, a sliding snapshot is applied to the data pages.
Naturally, unchanged pages are simply referenced (e.g. through offsets into the file), so they are shared between revisions.
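Roughly, a commit looks like this (a minimal sketch of the copy-on-write idea, not SirixDB's actual code):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of a copy-on-write commit: append the new/changed pages, force them to disk,
// then overwrite the small fixed-size header so it points at the new revision root.
// Illustrative only, not SirixDB's actual implementation.
final class CowCommit {
    private static final long HEADER_OFFSET = 0L;   // fixed-size header at the start of the file

    static void commit(FileChannel channel, ByteBuffer newPages) throws java.io.IOException {
        long newRootOffset = channel.size();         // new revision root lands at the end of the file
        channel.write(newPages, newRootOffset);      // append copied/changed pages
        channel.force(true);                         // data must be durable before the header flips

        ByteBuffer header = ByteBuffer.allocate(Long.BYTES);
        header.putLong(newRootOffset);
        header.flip();
        channel.write(header, HEADER_OFFSET);        // single small in-place write...
        channel.force(true);                         // ...and the new revision is published
    }
}
```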
What's also special is a path summary of all unordered paths in a resource, which enables smaller, tailored, user-defined secondary indexes and other query rewrites :-)
How does Sirix compare to LMDB (esp. MDBX)?
(I ask because AFAIK LMDB derivatives do a similar-sounding thing: within a write transaction, pages are updated by first allocating pages from the freelist and writing out new copies of those pages with the changes included; these changes recurse upward, because the pages store a B-tree, until a modified copy of the root page is made; a commit-log pointer is updated to point to the new root page; and then the old, rewritten pages are put into the freelist.)
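In code shape, roughly this (a sketch of my understanding, with made-up names, not LMDB's actual internals):

```java
import java.util.Deque;

// Rough shape of an LMDB-style copy-on-write commit: modified pages are rewritten into
// pages taken from the freelist, the change recurses up to the root, the commit pointer
// is flipped to the new root, and the superseded pages are later returned to the freelist
// (once no reader still needs them). Purely illustrative -- all names are made up.
final class CowBTreeTxn {
    static final class Page { long pageNo; }

    long commit(Deque<Page> dirtyLeafToRoot, Deque<Long> freelist, Deque<Long> released) {
        long childsNewPageNo = -1;
        long newRootPageNo = -1;
        for (Page p : dirtyLeafToRoot) {                 // walk leaf -> root
            long newPageNo = freelist.isEmpty() ? allocateAtEndOfFile() : freelist.pop();
            if (childsNewPageNo != -1) {
                repoint(p, childsNewPageNo);             // the parent's copy references the child's copy
            }
            writeCopy(p, newPageNo);
            released.push(p.pageNo);                     // old version becomes reusable after commit
            childsNewPageNo = newPageNo;
            newRootPageNo = newPageNo;
        }
        return newRootPageNo;                            // caller flips the commit pointer to this
    }

    // Stubs, just to keep the sketch self-contained.
    private long allocateAtEndOfFile() { return 0; }
    private void repoint(Page parent, long childPageNo) { }
    private void writeCopy(Page page, long pageNo) { }
}
```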
Note that if your updates are much smaller than a page, you're gonna have a bad time with LMDB. Optimizations like WAL and group commit exist for a reason.
Because it has to copy and write entire pages instead of only forcing a flush of log records to a WAL?
Oh, it seems it's also because of the random in-place writes to the B-tree.
Basically, it retains all revisions. Furthermore, the main document index is a keyed trie, much like a hash array mapped trie: it stores an array as a tree and uses compact page layouts (bitmap pages, four-reference pages, full pages) to reduce page sizes when pages are not full. However, Sirix assigns monotonically increasing, immutable, unique node identifiers, so most inner pages are full of references to the next-level pages (checksums of the child pages are also stored along with the references, as in ZFS). The height of the tree grows dynamically. Currently every inner page stores at most 1024 references, so it's a very wide tree, but we should experiment with other sizes.
The leaf pages of the trie store either the data itself (the nodes), the nodes of the path summary, or the nodes of the secondary indexes...
Thus, we have the main document index, but a RevisionRootPage also holds references to the tries that store the secondary indexes. The secondary indexes are read into main memory / reconstructed from the leaf pages of their tries (usually small), as is a small path summary.
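In effect the unique node ID is the key into the trie; with a fan-out of 1024, each level consumes 10 bits of it. A minimal sketch of the descent (illustrative only, not the real page layouts):

```java
// Sketch: turning a monotonically assigned node key into a path through a trie with a
// fan-out of 1024 references per inner page (10 bits per level). Illustrative only;
// SirixDB's real page layouts (bitmap / four-reference / full pages) are more involved.
final class TriePath {
    static final int FANOUT_BITS = 10;                   // 2^10 = 1024 references per inner page
    static final int FANOUT_MASK = (1 << FANOUT_BITS) - 1;

    /** Reference slots to follow, from the revision root down to the leaf page. */
    static int[] slots(long nodeKey, int height) {
        int[] path = new int[height];
        for (int level = height - 1; level >= 0; level--) {
            path[height - 1 - level] = (int) ((nodeKey >>> (level * FANOUT_BITS)) & FANOUT_MASK);
        }
        return path;
    }

    public static void main(String[] args) {
        // Node key 1234 in a trie of height 2: slot 1 at the root, slot 210 in the next level.
        System.out.println(java.util.Arrays.toString(slots(1234L, 2)));   // prints [1, 210]
    }
}
```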
The data pages are not simply copied... only the nodes that changed, or that fall out of a sliding window, are written again. Thus, in the worst case a page may have to be reconstructed in memory from at most a small number N of page fragments, so it needs a device suitable for fast, small, random parallel reads and sequential writes.
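The reconstruction itself is conceptually just a newest-wins merge over those fragments (sketch, not the actual code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of rebuilding a full data page from at most N page fragments: walk the
// fragments from newest to oldest and keep the first (newest) version of each
// record slot. Illustrative only.
final class PageReconstruction {
    record Record(int slot, byte[] payload) {}

    /** Each fragment holds only the records written in its revision; newest fragment first. */
    static Map<Integer, Record> reconstruct(List<List<Record>> fragmentsNewestFirst) {
        Map<Integer, Record> page = new HashMap<>();
        for (List<Record> fragment : fragmentsNewestFirst) {
            for (Record r : fragment) {
                page.putIfAbsent(r.slot(), r);   // an older fragment never overrides a newer one
            }
        }
        return page;
    }
}
```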
Currently, to get rid of old revisions you have to copy a resource starting from a given revision and apply all updates up to the most recent revision, with intermediate commits, because each resource uses only one data file (a resource is equivalent to a table in a relational system). Thus, the data files are basically logs. Another file simply stores offsets and timestamps, read into memory, to retrieve a given revision.
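Resolving a revision by timestamp then boils down to a binary search over that small in-memory array (sketch):

```java
// Sketch: resolving "the revision as of time T" from in-memory arrays of commit
// timestamps and revision-root offsets loaded from the companion file.
// Illustrative only, not the actual SirixDB implementation.
final class RevisionLookup {
    private final long[] commitTimestamps;      // sorted ascending, one entry per revision
    private final long[] revisionRootOffsets;   // parallel array: offset of each revision root

    RevisionLookup(long[] commitTimestamps, long[] revisionRootOffsets) {
        this.commitTimestamps = commitTimestamps;
        this.revisionRootOffsets = revisionRootOffsets;
    }

    /** Offset of the newest revision committed at or before the given timestamp, or -1. */
    long offsetAsOf(long timestampMillis) {
        int lo = 0, hi = commitTimestamps.length - 1, answer = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (commitTimestamps[mid] <= timestampMillis) { answer = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        return answer == -1 ? -1 : revisionRootOffsets[answer];
    }
}
```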
https://sirix.io/docs/concepts.html and https://sirix.io/docs/jsoniq-tutorial.html should help you gain a deeper understanding.
HTH and let me know if you're interested in more details :-)
Thanks for asking
The append-only constraint is super nice until it's not. Developing a way to manage this as the dataset scales is challenging; replaying a 100 GB log after a crash could become a problem. I've built entire product prototypes around something like this, but you always reach a point where it has a bug while you were way up in business-logic land, and then it feels like being ripped right down to hell. It's no longer fun after the first few cases of that experience.
So what did you learn? What were the bugs? GP is referring exactly to your experience - what would this database look like if you kept refining it?
I could see this as being very useful for e.g. security system logging, banking and other temporal transactions, even for VCS.
I learned that the single-writer principle is maybe the most important thing if you care about performance.
If I kept going, we'd have a perfect time traveling database of everything that ever happened in the enterprise.
I was even proposing a hashing technique that would provide cryptographic guarantees that our log has not been tampered with, given a pre-shared seed signature. This would be shared with all of our customers and placed in their vaults. At any point, they would be permitted to audit all activity we have logged and recompute our hashes themselves.
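The chain itself is simple (a sketch of the idea; a real scheme would use HMACs or signatures rather than bare hashes):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Sketch of a tamper-evident log: each entry's hash covers the previous hash, so
// recomputing the chain from the pre-shared seed exposes any altered or dropped entry.
// Illustrative only -- a production scheme would use HMACs or digital signatures.
final class HashChain {
    static String link(String previousHashHex, String entry) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        sha256.update(previousHashHex.getBytes(StandardCharsets.UTF_8));
        sha256.update(entry.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(sha256.digest());
    }

    public static void main(String[] args) throws Exception {
        String hash = "pre-shared-seed";                  // shared with the customer out of band
        for (String entry : new String[] {"txn-1", "txn-2", "txn-3"}) {
            hash = link(hash, entry);
            System.out.println(entry + " -> " + hash);    // the customer can recompute this independently
        }
    }
}
```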
Really, the vision was to make B2B consulting with banks and other financial institutions as transparent as possible.
Performance was pretty bananas too. It easily outstripped both SQL Server and SQLite in testing for our hot path, by more than 10x. We weren't doing SQL command processing, though; very application-specific access patterns.
The LMAX Disruptor, or its equivalent abstraction, will be at the heart of any database I ever attempt to write again.
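The core of it is the single-writer shape (a toy sketch of the principle, not the Disruptor API, which does this far more efficiently with a pre-allocated ring buffer):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy sketch of the single-writer principle: many threads submit commands, but exactly
// one thread applies them, so the storage state itself needs no locking on the hot path.
// The LMAX Disruptor implements this idea with a lock-free, pre-allocated ring buffer.
final class SingleWriterStore {
    private final BlockingQueue<Runnable> commands = new ArrayBlockingQueue<>(1 << 16);
    private final Thread writer = new Thread(this::drain, "single-writer");

    void start() { writer.start(); }

    /** Callable from any thread; the mutation itself runs only on the writer thread. */
    void submit(Runnable mutation) throws InterruptedException { commands.put(mutation); }

    private void drain() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                commands.take().run();   // the only thread that ever touches the storage state
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```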
I think you're speaking here about:
1. queue-based event-store systems, e.g. combining a compacting Kafka topic with a CQRS reducer agent to serve as a store-of-record + snapshot-state representation respectively;
2. where you've likely restructured what were fundamentally CRUD-esque operations into CQRS commands and events, to hold in the store, just so that the reducer can fold them back down to CRUD updates to a snapshot living in an RDBMS?
I do agree that this kind of system can get really messy/painful once you have a lot of data.
But I'm thinking more about:
3. "the Dynamo/Hadoop model, in the small": a single-process client-server row-store, but where a table is made of immutable, read-only 64MB chunks, with the newest chunk being an "open" buffer that becomes "closed" as soon as it's filled up; and where these chunks are of a known file-format such that you could directly compose them outside the DBMS in parallel and push them to the DBMS as chunks to be ingested "whole" (i.e. "mounted" as micro-partitions of the table);
4. where the business-domain's types are already fundamentally immutable datatypes, that don't need any "projecting" into a CQRS representation; i.e. where the thing the DB exists to store (and query on!) is the immutable business data, not some latest-computed-state reduction over it, so there's no need to ever play a log into a CQRS reducer, let alone replay that log.
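A rough sketch of the storage primitive in point 3, with made-up names (sealed 64MB chunks plus one open buffer, and the ability to mount externally built chunks):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of point 3: a table made of immutable, size-capped chunks, with a single "open"
// chunk receiving appends until it fills up and is sealed. Sealed chunks could equally be
// composed outside the DBMS and mounted whole as micro-partitions. All names are made up.
final class ChunkedTable {
    static final int CHUNK_CAPACITY_BYTES = 64 * 1024 * 1024;

    static final class Chunk {
        final List<byte[]> rows = new ArrayList<>();
        int sizeBytes = 0;
        boolean sealed = false;
    }

    private final List<Chunk> sealedChunks = new ArrayList<>();  // read-only micro-partitions
    private Chunk open = new Chunk();

    void append(byte[] row) {
        if (open.sizeBytes + row.length > CHUNK_CAPACITY_BYTES) {
            open.sealed = true;                  // becomes immutable; eligible for indexing/compaction
            sealedChunks.add(open);
            open = new Chunk();
        }
        open.rows.add(row);
        open.sizeBytes += row.length;
    }

    /** Ingest a chunk composed outside the DBMS ("mount" it as a micro-partition). */
    void mount(Chunk prebuilt) {
        prebuilt.sealed = true;
        sealedChunks.add(prebuilt);
    }
}
```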
I know that, for example, Clickhouse's MergeTree engine is at its core akin to the kind of system I'm describing in 3 above — but because Clickhouse is designed as a general-purpose RDBMS (and so still offers an API that includes UPDATE and DELETE) rather than being purpose-built for use-case 4 above, it needs to do a whole bunch of stuff on top of that constrained chunked-immutable storage primitive, to virtualize those updates/deletes pre-merge, and to present MVCC transactions. Same with, for another example, CouchDB: an immutable-data-store core, with a ton of logic on top to allow the user to pretend they can update/delete.
If you imagine a version of Clickhouse or CouchDB that was solely focused on delivering use-case 4 above, then you could strip away all the "extra stuff." For use-case 4, the "64MB immutable-once-full micro-partitions" paradigm is literally all that's needed to losslessly convey all domain state; and so a storage engine akin to the one described in 3 is all you need to support it.
(If you're wondering, the business domain I'm working in, where this pertains, is: analytical querying of blockchain data. All the "CQRS events" [blockchain blocks and their transactions] come from third parties, and are all guaranteed-immutable upon insert, if not guaranteed-canonical. [But canonicity can be tracked as a log of chain tip changes, like a git commit log.] If you don't care about blockchains, though, domains with similar needs for immutable append-only analytical stores include financial forensic accounting, and structured API audit-logging as state for rule engines inside Intrusion Detection Systems.)
That's fundamentally how SirixDB approaches this (it basically also stores checksums), as written in another reply :-)
Every commit directly syncs the binary data to durable storage (currently a file) and incrementally adds data. Furthermore, it optionally stores the changes (type of change / ctx node / updatePosition...) in JSON files. For instance, lately I've implemented a simple copy mechanism based on this: copy a given revision and optionally apply all changes, with intermediate commits, to also copy the full history up to the most recent revision. However, the main idea is to use the change tracking for diff visualizations as well... maybe even to stream these via WebSockets.
BTW, a production-ready system along these lines may be Datomic.
And it also reminds me of this paper: https://dl.acm.org/doi/abs/10.5555/3275366.3284969
That's more or less what Kafka is.
Kafka doesn't have fast random-access to a row-tuple in a stream by its primary key, let alone declarative indexing by compound keys / computed expressions.
Kafka with those things would be equivalent to what I'm describing, yes.
What about storing the data, and thus the indexes, in Kafka? Would that make sense? Currently I'm storing SirixDB resources in files, but instead of offsets into a file, the index pages could optionally be stored in Kafka (or Pulsar...). Is Kafka too slow for this, or only for fetching specific row-tuples? We could build a combined storage layer, caching the pages locally or also storing them in the file system, and asynchronously storing them in Kafka, S3, or whatever.
The irony is an extremely popular general-purpose database like PostgreSQL can sometimes work better* for niche use cases than less popular databases designed specifically for that niche use case.
* better, or perhaps nearly as well such that it’s not worth learning how to use and maintain the less popular single-purpose database.
I certainly wouldn't recommend anyone start with a special-purpose DBMS. General-purpose DBMSes will get you quite far indeed. (They got my data API company to millions in ARR!)
But when you hit a certain scale, and your general-purpose DBMS starts to struggle with things like "delivering millions of sub-1ms reads per second", and you start to look at what would be required to scale horizontally to thousands of elastic nodes with low start-up times and good QoS while serving your 100s-of-TBs dataset... you might just start Greenspunning something. Something that you might not at first realize is a domain-specific DBMS. But, once you do realize that, you may wonder whether someone has already done it better.
And that (rather small) set of people is the intended customer base of a domain-specific DBMS.
And how many serious problems would be caused by people not correctly understanding what use-cases could be ignored in their domain?
IMHO, complexity in the service of giving an easier-to-reason-about interface is usually well worth the cost.
Precisely this type of thinking resulted in S3, which was the answer to: “what if we drop everything in a traditional file system that is not strictly required”?
Several modern database platforms are similarly designed around one or more traditional interface assumptions being dropped.
IMHO, the biggest one is “interactive transactions” or alternatively: long-running transactions. Anything external to the database that can interact with it in one transaction requires locks and all of the complexity that comes with it, such as dealing with deadlocks.
Dropping the requirement for “interactive Telnet-like SQL shells” can massively simplify things.
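Concretely, the contrast is between a transaction handle that stays open across client round-trips and an API that takes the whole unit of work in one call (hypothetical interfaces, just to illustrate):

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical interfaces contrasting the two models. With an interactive transaction,
// locks or snapshots can live across arbitrarily many client round-trips; with a one-shot
// model, the whole transaction runs server-side, so its duration is bounded by server work.
interface InteractiveDb {
    Txn begin();                              // the client may now hold this open indefinitely
    interface Txn {
        List<Object[]> query(String sql);
        void execute(String sql);
        void commit();
    }
}

interface OneShotDb {
    /** The entire transaction is submitted at once and run to completion by the server. */
    <R> R submit(Function<TxnContext, R> transaction);
    interface TxnContext {
        List<Object[]> query(String sql);
        void execute(String sql);
    }
}
```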
Yes, I worked on a distributed analytical database that only supported bulk appends (no transactions). I think there are quite a few OLAP databases like that. You don't really need the ACID properties at all for bulk appends, just idempotence.
The "advanced" part of Andy Pavlo's great database course discusses classic database design compromises and their tradeoffs, among other things.
See http://www.cs.cmu.edu/~pavlo/
It's a lot easier to go distributed if you're append-only, that's for sure.
There are niche databases like that.
Check out https://evitadb.io/ for a very different use case. It has a rich query language supporting various e-commerce aggregations (as in faceting). They claim it benchmarks 100x faster than Postgres for this specific use case.
Although it looks very cool, if I'm not building an e-shop, I'm not using it. There will be some unfavorable tradeoffs for my case.
Check out TigerBeetle!