The books piqued my curiosity enough to write my own little database
I think many developers go through this phase at some point. I wouldn't even try to fight it. You learn so much about what won't work by doing this. Extremely valuable lessons, if you can spare the time.
Building my own databases gave me more respect for the existing solutions than anything else. Getting the bytes to & from disk quickly isn't the hard part. It's doing it reliably for years on end while supporting use cases you couldn't have even dreamed of.
I often wonder: given how much of the complexity of modern DBMSes exists to satisfy constraints imposed by use-cases that only apply in certain business domains... what efficiencies could we gain if we designed a domain-specific DBMS with the knowledge that use-cases outside the domain are off-limits and can be ignored?
For example, I currently use general-purpose DBs to deal with datasets that are fundamentally append-only. But what if I had a DB that only supported working with append-only data? A DB that literally has no concept of an update or delete of existing rows — you've got insert, and maybe dropping a table/dataset entirely, and that's it. Could such a DB get away with not implementing MVCC transactions? Could such a DB avoid using a write-ahead log (because the tables are each their own write-ahead logs)? Would it store data more efficiently? Could its indexing be chunkwise-atomic rather than whole-table-atomic and so require less locking? Etc.
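To make that concrete, a hypothetical API for such a store might be nothing more than this (all names are made up; just a sketch of the idea):

```java
// Hypothetical API for an append-only store: no update, no delete of individual rows.
// All names here are illustrative, not an existing library.
import java.util.List;
import java.util.function.Predicate;

public interface AppendOnlyTable<R> {
    long append(R row);                  // returns the row's immutable position
    R get(long position);                // random access by position
    List<R> scan(Predicate<R> filter);   // full or filtered scan
    // No update(), no delete() of individual rows -- only dropping the whole table.
    void drop();
}
```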
Have a look at my DB project: https://sirix.io | https://github.com/sirixdb/sirix
https://sirix.io/docs/concepts.html and the in-progress tutorial https://sirix.io/docs/jsoniq-tutorial.html may be especially helpful.
It basically turns updates into appends and is based on a persistent tree structure: the header with a reference to the (revision) root page has to be swapped atomically; other than that, the revision indexes for new data are always appended. In order to reduce copy-on-write overhead for updated page fragments, a sliding snapshot is applied to the data pages.
Naturally, unchanged pages are simply referenced (e.g. through offsets into the file), so they are shared between revisions.
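Roughly, a commit looks like this (a minimal sketch of the copy-on-write idea, not SirixDB's actual code):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of a copy-on-write commit: append the new/changed pages, force them to disk,
// then overwrite the small fixed-size header so it points at the new revision root.
// Illustrative only, not SirixDB's actual implementation.
final class CowCommit {
    private static final long HEADER_OFFSET = 0L;   // fixed-size header at the start of the file

    static void commit(FileChannel channel, ByteBuffer newPages) throws java.io.IOException {
        long newRootOffset = channel.size();         // new revision root lands at the end of the file
        channel.write(newPages, newRootOffset);      // append copied/changed pages
        channel.force(true);                         // data must be durable before the header flips

        ByteBuffer header = ByteBuffer.allocate(Long.BYTES);
        header.putLong(newRootOffset);
        header.flip();
        channel.write(header, HEADER_OFFSET);        // single small in-place write...
        channel.force(true);                         // ...and the new revision is published
    }
}
```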
What's also special is a path summary of all unordered paths in a resource, which enables smaller, tailored, user-defined secondary indexes and other query rewrites :-)
How does Sirix compare to LMDB (esp. MDBX)?
(I ask because AFAIK LMDB derivatives do a similar-sounding thing: within a write transaction, pages are updated by first allocating pages from the freelist and writing out new copies of those pages with the changes included; these changes recurse upward, because the pages store a B-tree, until a modified copy of the root page is made; a commit-log pointer is updated to point to the new root page; and then the old, rewritten pages are put into the freelist.)
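In code shape, roughly this (a sketch of my understanding, with made-up names, not LMDB's actual internals):

```java
import java.util.Deque;

// Rough shape of an LMDB-style copy-on-write commit: modified pages are rewritten into
// pages taken from the freelist, the change recurses up to the root, the commit pointer
// is flipped to the new root, and the superseded pages are later returned to the freelist
// (once no reader still needs them). Purely illustrative -- all names are made up.
final class CowBTreeTxn {
    static final class Page { long pageNo; }

    long commit(Deque<Page> dirtyLeafToRoot, Deque<Long> freelist, Deque<Long> released) {
        long childsNewPageNo = -1;
        long newRootPageNo = -1;
        for (Page p : dirtyLeafToRoot) {                 // walk leaf -> root
            long newPageNo = freelist.isEmpty() ? allocateAtEndOfFile() : freelist.pop();
            if (childsNewPageNo != -1) {
                repoint(p, childsNewPageNo);             // the parent's copy references the child's copy
            }
            writeCopy(p, newPageNo);
            released.push(p.pageNo);                     // old version becomes reusable after commit
            childsNewPageNo = newPageNo;
            newRootPageNo = newPageNo;
        }
        return newRootPageNo;                            // caller flips the commit pointer to this
    }

    // Stubs, just to keep the sketch self-contained.
    private long allocateAtEndOfFile() { return 0; }
    private void repoint(Page parent, long childPageNo) { }
    private void writeCopy(Page page, long pageNo) { }
}
```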
Note that if your updates are much smaller than a page, you're gonna have a bad time with LMDB. Optimizations like WAL and group commit exist for a reason.
Because it has to copy and write entire pages instead of only forcing a flush of log records to a WAL?
Oh, it seems it's also because of the random in-place writes to the B-tree.
Basically, it retains all revisions. Furthermore, the main document index is a keyed trie, much like a hash array mapped trie: it stores an array as a tree and uses compact page layouts (bitmap pages, four-reference pages, full pages) to reduce page sizes when pages are not full. However, Sirix assigns monotonically increasing, immutable, unique node identifiers, so most inner pages are full of references to the next-level pages (checksums of the child pages are also stored along with the references, as in ZFS). The height of the tree grows dynamically. Currently every inner page stores at most 1024 references, so it's a very wide tree, but we should experiment with other sizes.
The leaf pages of the trie store either the data itself (the nodes), the nodes of the path summary, or the nodes of the secondary indexes...
Thus, we have the main document index, but a RevisionRootPage also holds references to the tries that store the secondary indexes. The secondary indexes are read into main memory / reconstructed from the leaf pages of their tries (usually small), as is a small path summary.
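In effect the unique node ID is the key into the trie; with a fan-out of 1024, each level consumes 10 bits of it. A minimal sketch of the descent (illustrative only, not the real page layouts):

```java
// Sketch: turning a monotonically assigned node key into a path through a trie with a
// fan-out of 1024 references per inner page (10 bits per level). Illustrative only;
// SirixDB's real page layouts (bitmap / four-reference / full pages) are more involved.
final class TriePath {
    static final int FANOUT_BITS = 10;                   // 2^10 = 1024 references per inner page
    static final int FANOUT_MASK = (1 << FANOUT_BITS) - 1;

    /** Reference slots to follow, from the revision root down to the leaf page. */
    static int[] slots(long nodeKey, int height) {
        int[] path = new int[height];
        for (int level = height - 1; level >= 0; level--) {
            path[height - 1 - level] = (int) ((nodeKey >>> (level * FANOUT_BITS)) & FANOUT_MASK);
        }
        return path;
    }

    public static void main(String[] args) {
        // Node key 1234 in a trie of height 2: slot 1 at the root, slot 210 in the next level.
        System.out.println(java.util.Arrays.toString(slots(1234L, 2)));   // prints [1, 210]
    }
}
```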
The data pages are not simply copied... only the nodes that changed, or that fall out of a sliding window, are written again. Thus, in the worst case a page may have to be reconstructed in memory from at most a small number N of page fragments, so it needs a device suitable for fast, small, random parallel reads and sequential writes.
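The reconstruction itself is conceptually just a newest-wins merge over those fragments (sketch, not the actual code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of rebuilding a full data page from at most N page fragments: walk the
// fragments from newest to oldest and keep the first (newest) version of each
// record slot. Illustrative only.
final class PageReconstruction {
    record Record(int slot, byte[] payload) {}

    /** Each fragment holds only the records written in its revision; newest fragment first. */
    static Map<Integer, Record> reconstruct(List<List<Record>> fragmentsNewestFirst) {
        Map<Integer, Record> page = new HashMap<>();
        for (List<Record> fragment : fragmentsNewestFirst) {
            for (Record r : fragment) {
                page.putIfAbsent(r.slot(), r);   // an older fragment never overrides a newer one
            }
        }
        return page;
    }
}
```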
Currently, to get rid of old revisions you have to copy a resource starting from a given revision and apply all updates up to the most recent revision, with intermediate commits, because each resource uses only one data file (a resource is equivalent to a table in a relational system). Thus, the data files are basically logs. Another file simply stores offsets and timestamps, read into memory, to retrieve a given revision.
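Resolving a revision by timestamp then boils down to a binary search over that small in-memory array (sketch):

```java
// Sketch: resolving "the revision as of time T" from in-memory arrays of commit
// timestamps and revision-root offsets loaded from the companion file.
// Illustrative only, not the actual SirixDB implementation.
final class RevisionLookup {
    private final long[] commitTimestamps;      // sorted ascending, one entry per revision
    private final long[] revisionRootOffsets;   // parallel array: offset of each revision root

    RevisionLookup(long[] commitTimestamps, long[] revisionRootOffsets) {
        this.commitTimestamps = commitTimestamps;
        this.revisionRootOffsets = revisionRootOffsets;
    }

    /** Offset of the newest revision committed at or before the given timestamp, or -1. */
    long offsetAsOf(long timestampMillis) {
        int lo = 0, hi = commitTimestamps.length - 1, answer = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (commitTimestamps[mid] <= timestampMillis) { answer = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        return answer == -1 ? -1 : revisionRootOffsets[answer];
    }
}
```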
https://sirix.io/docs/concepts.html and https://sirix.io/docs/jsoniq-tutorial.html should help you gain a deeper understanding.
HTH and let me know if you're interested in more details :-)
Thanks for asking
The append-only constraint is super nice until it's not. Developing a way to manage this as the dataset scales is challenging; replaying a 100 GB log after a crash could become a problem. I've built entire product prototypes around something like this, but you always reach a point where it has a bug while you were way up in business-logic land, and then it feels like being ripped right down to hell. It's no longer fun after the first few cases of that experience.
So what did you learn? What were the bugs? GP is referring exactly to your experience - what would this database look like if you kept refining it?
I could see this as being very useful for e.g. security system logging, banking and other temporal transactions, even for VCS.
I learned that the single-writer principle is maybe the most important thing if you care about performance.
If I kept going, we'd have a perfect time traveling database of everything that ever happened in the enterprise.
I was even proposing a hashing technique that would provide cryptographic guarantees that our log has not been tampered with, given a pre-shared seed signature. This would be shared with all of our customers and placed in their vaults. At any point, they would be permitted to audit all activity we have logged and recompute our hashes themselves.
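The chain itself is simple (a sketch of the idea; a real scheme would use HMACs or signatures rather than bare hashes):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Sketch of a tamper-evident log: each entry's hash covers the previous hash, so
// recomputing the chain from the pre-shared seed exposes any altered or dropped entry.
// Illustrative only -- a production scheme would use HMACs or digital signatures.
final class HashChain {
    static String link(String previousHashHex, String entry) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        sha256.update(previousHashHex.getBytes(StandardCharsets.UTF_8));
        sha256.update(entry.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(sha256.digest());
    }

    public static void main(String[] args) throws Exception {
        String hash = "pre-shared-seed";                  // shared with the customer out of band
        for (String entry : new String[] {"txn-1", "txn-2", "txn-3"}) {
            hash = link(hash, entry);
            System.out.println(entry + " -> " + hash);    // the customer can recompute this independently
        }
    }
}
```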
Really, the vision was to make B2B consulting with banks and other financial institutions as transparent as possible.
Performance was pretty bananas too. It easily outstripped both SQL Server and SQLite in testing for our hot path, by more than 10x. We weren't doing SQL command processing, though; very application-specific access patterns.
The LMAX Disruptor, or its equivalent abstraction, will be at the heart of any database I ever attempt to write again.
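The core of it is the single-writer shape (a toy sketch of the principle, not the Disruptor API, which does this far more efficiently with a pre-allocated ring buffer):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy sketch of the single-writer principle: many threads submit commands, but exactly
// one thread applies them, so the storage state itself needs no locking on the hot path.
// The LMAX Disruptor implements this idea with a lock-free, pre-allocated ring buffer.
final class SingleWriterStore {
    private final BlockingQueue<Runnable> commands = new ArrayBlockingQueue<>(1 << 16);
    private final Thread writer = new Thread(this::drain, "single-writer");

    void start() { writer.start(); }

    /** Callable from any thread; the mutation itself runs only on the writer thread. */
    void submit(Runnable mutation) throws InterruptedException { commands.put(mutation); }

    private void drain() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                commands.take().run();   // the only thread that ever touches the storage state
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```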
I think you're speaking here about:
1. queue-based event-store systems, e.g. combining a compacting Kafka topic with a CQRS reducer agent to serve as a store-of-record + snapshot-state representation respectively;
2. where you've likely restructured what were fundamentally CRUD-esque operations into CQRS commands and events, to hold in the store, just so that the reducer can fold them back down to CRUD updates to a snapshot living in an RDBMS?
I do agree that this kind of system can get really messy/painful once you have a lot of data.
But I'm thinking more about:
3. "the Dynamo/Hadoop model, in the small": a single-process client-server row-store, but where a table is made of immutable, read-only 64MB chunks, with the newest chunk being an "open" buffer that becomes "closed" as soon as it's filled up; and where these chunks are of a known file-format such that you could directly compose them outside the DBMS in parallel and push them to the DBMS as chunks to be ingested "whole" (i.e. "mounted" as micro-partitions of the table);
4. where the business-domain's types are already fundamentally immutable datatypes, that don't need any "projecting" into a CQRS representation; i.e. where the thing the DB exists to store (and query on!) is the immutable business data, not some latest-computed-state reduction over it, so there's no need to ever play a log into a CQRS reducer, let alone replay that log.
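A rough sketch of the storage primitive in point 3, with made-up names (sealed 64MB chunks plus one open buffer, and the ability to mount externally built chunks):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of point 3: a table made of immutable, size-capped chunks, with a single "open"
// chunk receiving appends until it fills up and is sealed. Sealed chunks could equally be
// composed outside the DBMS and mounted whole as micro-partitions. All names are made up.
final class ChunkedTable {
    static final int CHUNK_CAPACITY_BYTES = 64 * 1024 * 1024;

    static final class Chunk {
        final List<byte[]> rows = new ArrayList<>();
        int sizeBytes = 0;
        boolean sealed = false;
    }

    private final List<Chunk> sealedChunks = new ArrayList<>();  // read-only micro-partitions
    private Chunk open = new Chunk();

    void append(byte[] row) {
        if (open.sizeBytes + row.length > CHUNK_CAPACITY_BYTES) {
            open.sealed = true;                  // becomes immutable; eligible for indexing/compaction
            sealedChunks.add(open);
            open = new Chunk();
        }
        open.rows.add(row);
        open.sizeBytes += row.length;
    }

    /** Ingest a chunk composed outside the DBMS ("mount" it as a micro-partition). */
    void mount(Chunk prebuilt) {
        prebuilt.sealed = true;
        sealedChunks.add(prebuilt);
    }
}
```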
I know that, for example, Clickhouse's MergeTree engine is at its core akin to the kind of system I'm describing in 3 above — but because Clickhouse is designed as a general-purpose RDBMS (and so still offers an API that includes UPDATE and DELETE) rather than being purpose-built for use-case 4 above, it needs to do a whole bunch of stuff on top of that constrained chunked-immutable storage primitive, to virtualize those updates/deletes pre-merge, and to present MVCC transactions. Same with, for another example, CouchDB: an immutable-data-store core, with a ton of logic on top to allow the user to pretend they can update/delete.
If you imagine a version of Clickhouse or CouchDB that was solely focused on delivering use-case 4 above, then you could strip away all the "extra stuff." For use-case 4, the "64MB immutable-once-full micro-partitions" paradigm is literally all that's needed to losslessly convey all domain state; and so a storage engine akin to the one described in 3 is all you need to support it.
(If you're wondering, the business domain I'm working in, where this pertains, is: analytical querying of blockchain data. All the "CQRS events" [blockchain blocks and their transactions] come from third parties, and are all guaranteed-immutable upon insert, if not guaranteed-canonical. [But canonicity can be tracked as a log of chain tip changes, like a git commit log.] If you don't care about blockchains, though, domains with similar needs for immutable append-only analytical stores include financial forensic accounting, and structured API audit-logging as state for rule engines inside Intrusion Detection Systems.)
That's fundamentally how SirixDB approaches this (it basically also stores checksums), as written in another reply :-)
Every commit directly syncs the binary data to durable storage (currently a file) and incrementally adds data. Furthermore, it optionally stores the changes (type of change / ctx node / updatePosition...) in JSON files. For instance, lately I've implemented a simple copy mechanism based on this: copy a given revision and optionally apply all changes, with intermediate commits, to also copy the full history up to the most recent revision. However, the main idea is to use the change tracking for diff visualizations as well... maybe even to stream these via WebSockets.
BTW, a production-ready system along these lines may be Datomic.
And it also reminds me of this paper: https://dl.acm.org/doi/abs/10.5555/3275366.3284969
That's more or less what Kafka is.
Kafka doesn't have fast random-access to a row-tuple in a stream by its primary key, let alone declarative indexing by compound keys / computed expressions.
Kafka with those things would be equivalent to what I'm describing, yes.
What about storing the data, and thus the indexes, in Kafka? Would that make sense? Currently I'm storing SirixDB resources in files, but instead of offsets into a file, the index pages could optionally be stored in Kafka (or Pulsar...). Is Kafka too slow for this, or only for fetching specific row-tuples? We could build a combined storage layer, caching the pages locally or also storing them in the file system, and asynchronously storing them in Kafka, S3, or whatever.
The irony is an extremely popular general-purpose database like PostgreSQL can sometimes work better* for niche use cases than less popular databases designed specifically for that niche use case.
* better, or perhaps nearly as well such that it’s not worth learning how to use and maintain the less popular single-purpose database.
I certainly wouldn't recommend anyone start with a special-purpose DBMS. General-purpose DBMSes will get you quite far indeed. (They got my data API company to millions in ARR!)
But when you hit a certain scale, and your general-purpose DBMS starts to struggle with things like "delivering millions of sub-1ms reads per second", and you start to look at what would be required to scale horizontally to thousands of elastic nodes with low start-up times and good QoS while serving your 100s-of-TBs dataset... you might just start Greenspunning something. Something that you might not at first realize is a domain-specific DBMS. But, once you do realize that, you may wonder whether someone has already done it better.
And that (rather small) set of people is the intended customer base of a domain-specific DBMS.
And how many serious problems would be caused by people not correctly understanding what use-cases could be ignored in their domain?
IMHO, complexity in the service of giving an easier-to-reason-about interface is usually well worth the cost.
Precisely this type of thinking resulted in S3, which was the answer to: “what if we drop everything in a traditional file system that is not strictly required”?
Several modern database platforms are similarly designed around one or more traditional interface assumptions being dropped.
IMHO, the biggest one is “interactive transactions” or alternatively: long-running transactions. Anything external to the database that can interact with it in one transaction requires locks and all of the complexity that comes with it, such as dealing with deadlocks.
Dropping the requirement for “interactive Telnet-like SQL shells” can massively simplify things.
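Concretely, the contrast is between a transaction handle that stays open across client round-trips and an API that takes the whole unit of work in one call (hypothetical interfaces, just to illustrate):

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical interfaces contrasting the two models. With an interactive transaction,
// locks or snapshots can live across arbitrarily many client round-trips; with a one-shot
// model, the whole transaction runs server-side, so its duration is bounded by server work.
interface InteractiveDb {
    Txn begin();                              // the client may now hold this open indefinitely
    interface Txn {
        List<Object[]> query(String sql);
        void execute(String sql);
        void commit();
    }
}

interface OneShotDb {
    /** The entire transaction is submitted at once and run to completion by the server. */
    <R> R submit(Function<TxnContext, R> transaction);
    interface TxnContext {
        List<Object[]> query(String sql);
        void execute(String sql);
    }
}
```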
Yes, I worked on a distributed analytical database that only supported bulk appends (no transactions). I think there are quite a few OLAP databases like that. You don't really need the ACID properties at all for bulk appends, just idempotence.
The "advanced" part of Andy Pavlo's great database course discusses classic database design compromises and their tradeoffs, among other things.
See http://www.cs.cmu.edu/~pavlo/
It's a lot easier to go distributed if you're append-only, that's for sure.
There are niche databases like that.
Check out https://evitadb.io/ for a very different use case. It has a rich query language supporting various e-commerce aggregations (as in faceting). They claim it benchmarks 100x faster than Postgres for this specific use case.
Although it looks very cool, if I'm not building an e-shop, I'm not using it. There will be some unfavorable tradeoffs for my case.
Check out TigerBeetle!