Seems weird to start with “not talking about using something like SQLite where your data is still serialized”, then end up with a home-grown transaction log that requires serialization and needs to be replicated, which is how databases are replicated anyway.
If your load fits entirely on one server, then just run the database on that damn server and forget about “special architectures to reduce round-trips to your database”. If your data fits entirely in RAM, then use a ramdisk for the database if you want, and replicate it to permanent storage with standard tools. Now that’s actually simple.
I do feel like this largely summarizes as "we built our own SQLite + Raft replication", yeah. But without SQLite's battle-tested reliability or the ability to efficiently offload memory back to disk.
So, basically, https://litestream.io/. But perhaps with faster switching thanks to an explicit Raft setup? I'm not a litestream user, so I'm not sure about the subtleties, but it sounds awfully similar.
That overly simplified summary aside, I quite like the idea and I think the post does a pretty good job of selling the concept. For a lot of systems it'll scale more than well enough to handle most or all of your business even if you become abnormally successful, and the performance will be absurdly good compared to almost anything else.
Rqlite would be a better comparison. It is actually SQLite + Raft.
https://github.com/rqlite/rqlite
rqlite author here, happy to answer any questions.
So, some dumb questions if you don’t mind:
- In the GitHub README you mention etcd/Consul. Is rqlite suitable for transaction processing as well?
- I am imagining a dirt-simple load balancer over two web servers. They are a CRUD app backed by a database. What are the disadvantages of putting rqlite on each server compared to, say, having a third backend database?
It depends on what kind of transaction support you want. If your transactions need to span rqlite API requests then no, rqlite doesn't support that (due to the stateless nature of HTTP requests). That sort of thing could be developed, but it's substantial work. I have some design ideas, so it may arrive in the future.
If you need to ensure that a given API request (which can contain multiple SQL statements) is processed atomically (all SQL statements succeed or none do), that is supported [1] -- there's a rough sketch of what that looks like below the links. That's why I think of rqlite as closer to the kind of use cases that etcd and Consul support, rather than something like Postgres -- though some people have replaced their use of Postgres with rqlite! [2]
[1] https://rqlite.io/docs/api/api/#transactions
[2] https://www.replicated.com/blog/app-manager-with-rqlite
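For illustration, an atomic multi-statement request against the HTTP API looks roughly like this when written as a small Go client (the /db/execute endpoint and the transaction parameter are the ones described in [1]; the table, columns, and values are made up):

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // Two statements in one request; the ?transaction parameter asks rqlite
        // to apply them atomically -- either both succeed or neither does.
        // Table and column names are made up for illustration.
        stmts := []byte(`[
            "INSERT INTO foo(name) VALUES('fiona')",
            "INSERT INTO foo(name) VALUES('sinead')"
        ]`)

        resp, err := http.Post(
            "http://localhost:4001/db/execute?transaction",
            "application/json",
            bytes.NewReader(stmts),
        )
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        body, _ := io.ReadAll(resp.Body)
        fmt.Println(string(body)) // per-statement results, or errors if rolled back
    }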
Thank you - so my takeaway is that rqlite is well suited for distributed “publishing” of data a la etcd, but it is possible to use it as a Postgres replacement. Thanks, I will give it a go.
As for your second question, I don't think you'd benefit much from that, for two reasons:
- rqlite is a Raft-based system, with quorum requirements. Running a 2-node system doesn't make much sense. [1]
- All writes go to the Raft Leader (rqlite makes sure this happens transparently if you don't initially contact the Leader node [2]). A load balancer, in this case, isn't going to let you "spread load". What a load balancer is useful for when it comes to rqlite is making life simpler for clients -- they just hit the load balancer, and it will find some rqlite node to handle the request (redirecting to the Leader if needed).
[1] https://rqlite.io/docs/clustering/general-guidelines/#cluste...
[2] https://rqlite.io/docs/faq/#can-any-node-execute-a-write-req...
I'll throw in an "ehh... sorta", though rqlite is quite neat and very much worth considering.
The main caveat here is that rqlite is an out-of-process database, which you communicate with over HTTP. That puts it on similar ground to e.g. Postgres, just significantly lighter weight, and somewhat biased in favor of running it locally on every machine that needs the data.
So minimum read latency is likely much lower than with Postgres, but it's still noticeable compared to in-process stuff, and you lose other benefits of in-process SQLite, like trivial extensibility.
There's dqlite, which is an in-process SQLite with Raft.
https://dqlite.io/
They basically only save on serialization & deserialization at query time, which I would consider an infinitesimal saving in the vast majority of use cases. They claim to be able to build some magical index that's not possible with existing disk-based databases (I didn't read the linked blog post). They lose access to a nice query language and entire ecosystems of tools and domain knowledge.
I fail to see how this little bit of saving justifies all the complexity for run-of-the-mill web services that fit on one or a few servers, as described in the article. The context isn't large-scale services where a 1 ms/request saving translates to $$$, and the proposal doesn't (vertically) scale anyway.
You should probably RTFA before making broad assumptions about their solution and how it works. Most of what you wrote is both incorrect and addressed in the article.
Telling people to RTFA is against site guidelines. And I read the entire article before making this comment. If you think I’m wrong, you reply with what’s wrong, not some useless “you’re wrong, RTFA”.
The only thing in my comment that’s not directly based on the article is a handwavy 1ms/request saving estimate, and since they don’t provide any measurement, it’s anyone’s guess.
Is telling people to RTFA against the guidelines?
The guideline specifically advises doing what GP did: instead of commenting on whether or not someone read the article, tell them that the article answers their questions.
One thing I forgot to mention: if you use a not-in-process RDBMS on the same machine you also incur some socket overhead. But that’s also small.
SQLite doesn't do Raft. There isn't any simple way to do replicated SQLite. (In fact, writing your own database is probably the simplest way currently, if SQLite + Raft is actually what you want.)
What about rqlite?
I think it's important to understand that every startup goes through three phases: Explore, Expand, Extract. What's simple in one phase isn't simple in the others.
A transactional database is simple in Expand and Extract, but adds overhead during the Explore phase, because you're focusing on infrastructure issues rather than product. Data reliability isn't critical in the Explore phase either, because you just don't have customers, so you just don't have data.
Having everything in memory with bknr.datastore (without replication) is simple in the Explore phase, but once you get to the Expand phase it adds operational overhead to make sure the data is consistent.
But by the time I've reached the Expand phase, I've already proven my product and I've already written a bunch of code. Rewriting it with a transactional database doesn't make sense, and it's easier to just add replication on top of it with Raft.
Having Explored with a transactional database: I really can't agree. Just change your database; migrations are easy and should be something you're comfortable doing at any time, or you'll get stuck working around them for 100x more effort in the future.
That was the biggest disconnect I had as well. SQL databases have the _best_ data migration tooling and practices of any data system. It’s not addressed in the article how migrations are handled with this system, but I’m assuming it’s a hand-rolled set of code for each one.
I think SQL databases make the most sense during the Explore phase, and you switch off of them once you know you need an improvement somewhere (like latency or horizontal scalability).
Good question!
And this comes down to the difference between the Explore phase and the Expand phase.
In the Explore phase, data migration was just running code on the production server via a REPL. Some migrations such as adding/removing fields are just a matter of hot-reloading code in CL, so there weren't a lot of migrations that we had to manually run.
In the Expand phase, once you add replication, this does become hard, and we did roll out our own migration framework. But by that point we already had a lot of code built out, so we weren't going to replace it with a transactional database.
Essentially, we optimized for the Explore phase, and "dealt with the consequences" in the Expand phase (but the consequences aren't as bad as you might think).
I'd assume in the beginning you do not want to spend time writing a bunch of highly difficult code until you've proven your idea/product. Then when you're big enough and have the money, start replacing things where it makes sense. It seems to be the strategy used by many companies.
Unless, of course, your startup is in the business of selling DBMSes.
Absolutely. By the way, if it wasn't clear from my blog post, in the Explore phase I used an existing library to do this. It was only in the Expand phase that I put this existing library behind Raft replication.
I’ve done tons of traditional application-server-plus-database-on-the-same-server projects. There’s zero infrastructure issue there. You keep implying that a not-in-process RDBMS has to run on its own server, and that’s super strange. Not to mention that having a separate DB server doesn’t add much overhead at all in the early stage, even if you’re doing it for the very first time (been there, done that).
Agreed. Reinventing the WAL means reinventing (or ignoring) all the headaches that come with it. I got the impression it takes them a long time to recover from the logs, so they likely haven't even gotten as far as log checkpointing.
But if the blogger learned SQLite, how would they have a topic to blog about?
Also, no benchmarks. It's quite odd that an argument grounded in performance claims doesn't bother to put out any hard data on this project. I'm talking about basic things: how does this contrived, custom, ad-hoc setup compare with a vanilla, out-of-the-box SQLite deployment? Which one performs worse, and by how much? How does the performance difference show up in request times and infrastructure cost? Does it actually pay off to replace the dozen lines of code needed to onboard SQLite with a custom, in-development, ad-hoc setup? I mean, I get the weekend-personal-project vibe of this blog post, but if this is supposed to be a production-minded project, then step zero would have been a performance test of the default solution. Where is it?
I probably did a bad job then, because everything in the blog post was meant to be developer productivity claims, not performance claims. (I come from a developer productivity background. I'm decent at performance stuff, but it's not what excites me, since for most companies my size performance is not critical as long as it scales.)
The OP starts out by talking about periodically dumping everything in RAM to disk. I’d say that’s your checkpointing.
Trading systems bluntly keep everything in RAM, in preallocated structures. It all depends on the kind of tradeoffs you're willing to make.
I used to work on a telecom platform (think something that runs 4G services), where every node was just part of an in-memory database that replicated using 2PC and did periodic snapshots to avoid losing data. Basically, processes were colocated with their data in the DB.
Very Erlang/OTP. Joe Armstrong used to rant to anyone who would listen that we used databases too often. If data is important, multiple nodes probably need a copy of it. If multiple nodes need a copy, you probably have plenty of durability.
Even if you weren't using Erlang, his influence (and Ericsson's in general) permeates the telecom industry.
I worked on a lottery/casino system that was similar: an in-memory database (memory-mapped files), with a WAL for transaction replay/recovery. There was also a periodic snapshot capability. It was incredibly low latency on late-'90s-era hardware.
You don’t even need a ramdisk, IMHO; databases already cache everything in memory, and only writes reach the disk.
Just try and cold-start your database and run a fairly large select twice.
Also, the OS will cache a lot of the reads even if your database isn’t sophisticated enough or tuned correctly. Still, it could be a fun exercise, as with all things on here.
Any half-decent DBMS bypasses the page cache, except for LMDB.
Then you get used to the near-zero latency that in-RAM data gives you, and when it outgrows your RAM, it's a pain in the butt to move it to disk :)
Setting up a single server with database replication and restore functionality is arguably more complex than setting this up.
There are libraries available to wrap your stuff with this algorithm, and the benefit is that you write your server as if it ran on a single machine; then, when you launch it in prod across multiple machines, everything just works.
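For example, with something like hashicorp/raft (the Go library rqlite builds on), "wrapping your stuff" mostly means implementing its FSM interface over whatever in-memory state you already have; the library handles log replication, elections, and snapshot plumbing. A rough sketch, with a made-up key/value state and command format:

    package kvfsm

    import (
        "encoding/json"
        "io"
        "sync"

        "github.com/hashicorp/raft"
    )

    // A command as it appears in the replicated log (illustrative format).
    type command struct {
        Key   string `json:"key"`
        Value string `json:"value"`
    }

    // KV is the application state: a plain in-memory map. You write it as if
    // it ran on a single machine; the Raft library decides when Apply runs.
    type KV struct {
        mu   sync.Mutex
        data map[string]string
    }

    func New() *KV { return &KV{data: make(map[string]string)} }

    // Apply is invoked once a log entry is committed on a quorum of nodes.
    func (k *KV) Apply(l *raft.Log) interface{} {
        var c command
        if err := json.Unmarshal(l.Data, &c); err != nil {
            return err
        }
        k.mu.Lock()
        defer k.mu.Unlock()
        k.data[c.Key] = c.Value
        return nil
    }

    // Snapshot and Restore let the library compact the log into a point-in-time copy.
    func (k *KV) Snapshot() (raft.FSMSnapshot, error) {
        k.mu.Lock()
        defer k.mu.Unlock()
        cp := make(map[string]string, len(k.data))
        for key, v := range k.data {
            cp[key] = v
        }
        return &snapshot{data: cp}, nil
    }

    func (k *KV) Restore(rc io.ReadCloser) error {
        defer rc.Close()
        m := make(map[string]string)
        if err := json.NewDecoder(rc).Decode(&m); err != nil {
            return err
        }
        k.mu.Lock()
        defer k.mu.Unlock()
        k.data = m
        return nil
    }

    type snapshot struct{ data map[string]string }

    func (s *snapshot) Persist(sink raft.SnapshotSink) error {
        if err := json.NewEncoder(sink).Encode(s.data); err != nil {
            sink.Cancel()
            return err
        }
        return sink.Close()
    }

    func (s *snapshot) Release() {}

Writes then go through the library's Apply call on the Leader, and reads can just hit the local map.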