When people learn about distributed systems outside of work, how do they actually get hands-on experience with it (assuming they don't go spinning up a bunch of machines on aws/gcp/azure/etc)? I find it easiest to learn by doing, writing simple proofs of concept, but that seems a bit harder to do in this area than in others. What is the hello world/MNIST of messing around with distributed systems?
Frankly, just build something.
Use a small k8s distro (kind, minikube, k3s) and build something that talks amongst itself and is resilient.
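If you'd rather start with zero infrastructure, the smallest useful toy I know of is a handful of local processes heartbeating each other. Here's a minimal sketch in Python (everything here is illustrative, not a real design): each peer serves a health endpoint and polls the others, so you can kill one and watch the rest notice.

    # heartbeat.py - toy failure detector; run one copy per port:
    #   python3 heartbeat.py 8001 8002 8003   (listen on 8001, poll the rest)
    #   python3 heartbeat.py 8002 8001 8003
    #   python3 heartbeat.py 8003 8001 8002
    # ...then kill one and watch the others report it DOWN.
    import sys, time, threading
    import http.server
    import urllib.request

    me, peers = sys.argv[1], sys.argv[2:]

    class Ping(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        def log_message(self, *_):  # silence the default request log
            pass

    srv = http.server.HTTPServer(("", int(me)), Ping)
    threading.Thread(target=srv.serve_forever, daemon=True).start()

    while True:
        for p in peers:
            try:
                urllib.request.urlopen(f"http://localhost:{p}/", timeout=1)
                state = "alive"
            except OSError:
                state = "DOWN"
            print(f"[{me}] peer {p} {state}")
        time.sleep(2)

Once that's boring, move the same peers onto separate VMs or k8s pods and start breaking the network instead of the processes.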
Sites like leetcode are great for coding practice and improvement, because you get to compare your solutions to those of others. Sadly, just building something on your own helps you learn the moving parts, but not optimal, neat, or best-practice solutions.
Sites like leetcode overindex on the rote abilities that you can stamp out. Actually building something is exploring in a creative way, which builds a deeper understanding.
If you want to be "optimal, neat, or best practice", read a book and get stuck in tutorial hell. If you actually want to learn how to do something, literally go do it. Nobody has ever built anything of value (whether financial, intellectual, or emotional) by leetcoding.
Any suggestions for books to read?
What you should read, or start with, are the designs for Cassandra, Kafka, FoundationDB, etc.
The problems they're trying to solve are related to really large distributed systems that fail a lot, and their design decisions are basically "this is how we worked around that problem."
You can also look for the LISA archives (https://www.usenix.org/publications/loginonline/thirty-five-...). System administrators were the first people that had to deal with large distributed systems at scale, and university system administrators led the charge.
You might want to hunt down the comp.sys.admin archives (I can't remember the newsgroup anymore).
Most of the ideas and issues behind distributed computing are obvious if you think about them. Much of the actual implementation, and the mitigations for those issues, is not obvious, though.
And there's also the client side of distributed computing, which I don't think is discussed as much.
As an example, exponential backoff is one of the go-to techniques for clients when the servers are under load. Unfortunately that doesn't really work IRL, because instead of spreading the load you get waves of load coming back over and over. Likewise on the server side you have problems with peak load.
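For what it's worth, the standard mitigation for those retry waves is adding jitter, i.e. randomizing the backoff so clients that failed at the same moment don't all come back at the same moment. A minimal sketch in Python (names are illustrative, not any particular library's API):

    import random, time

    def call_with_backoff(op, attempts=6, base=0.1, cap=10.0):
        """Retry op() with capped exponential backoff plus full jitter."""
        for attempt in range(attempts):
            try:
                return op()
            except Exception:
                if attempt == attempts - 1:
                    raise
                # Sleeping a *random* amount up to the exponential cap is
                # what breaks up the waves; sleeping exactly base * 2**attempt
                # reproduces them.
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))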
The only one I would really suggest is Designing Data Intensive Applications. But it is very DB-centric.
https://www.amazon.com/Designing-Data-Intensive-Applications...
"Just build something" is good advice but it's easier to find some kind of fun thing to build with, say, a web framework that's educational and maybe not a complete throwaway either.
Maybe some Internet of Things applications would provide a good avenue for some distributed systems exploration?
The easiest way is to fire up a bunch of VMs.
The cheapest way is to pick up an old ThinkStation (or other tower), load it up with 128 GB (or more) of RAM, and install ESXi on it. That's a perfectly good baseline, and you can run about thirty 4 GB Linux VMs on it.
Ideally you'd have a bit less than 1 core per VM, just so it's a bit slow. Lots of people assume your nodes are quick, but in real life they may not be. And really, most of the time your machines won't be doing squat.
You might want to have SSDs in there too, because ESXi doesn't have RAID capability (or at least mine didn't). I don't think you can get a cloud instance that uses spinning disks anymore, and you wouldn't use them in real life anyway.
A 2 TB drive is cheap these days, or just slap in all those old small SSDs. Everyone has a bunch of them left over, and they're perfect.
ESXi needs the RAID to be handled by another device; the simplest case is a hardware RAID card with disks locally attached to it. You can also attach remote disks/volumes from other systems, with or without RAID, over the network/SAN/etc. using an HBA, a special network card, or the software iSCSI initiator stuff in ESXi. You can even have something like a Windows server act as the iSCSI volume host and attach to it over the normal network, if you don't really care about reliability: the ESXi OS will not appreciate it if you ever turn the remote volume host off, or if the network drops out.

It's really too bad the free and cheap ESXi licenses are going away; it was always so nice to work with...
That's right, Broadcom bought them and the party's over. Download your ESXi while you can!
Most systems are actually "distributed", even your CRUD apps and CLI tools that write to disk. The better question is "how do you learn to deal with distributed-systems intricacies in places where it matters (such as finance)?", the answer to which is super simple;
Yeah, people miss this. If your app interacts with another app - bam distributed.
It's almost as if the world isn't single-threaded.
FWIW, I built hraftd[1] many years ago to make it easy to play with a simple distributed system, but one that uses a production-grade implementation of Raft[2]. You can spin up a cluster in seconds on a single machine, kill nodes, watch a new Leader get elected, and so on.
It's written in Go, so it'll help if you are familiar with Go. But the code is not difficult to understand even if you aren't.
[1] https://github.com/otoolep/hraftd
[2] https://github.com/hashicorp/raft
Oh, and more background here: https://www.philipotoole.com/building-a-distributed-key-valu...
I don't think it's a trivial thing to do outside of work. At most you can play with Kubernetes and the cloud, but in an interview the lack of experience will come out, because I think some stuff can only be learned at work. Especially scalability.
Some people are just up front about it - I've read a lot, and practiced the best I can, but am looking for some real-world experience to marry that to.
I've been using libvirt on a decently large Linux box I keep under my desk.
I have my current thing set up to create VMs from a downloaded cloud-VM image with minimal updates (add my ssh pubkey and install Python), and then use Ansible for everything further.
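If you ever want to poke at libvirt programmatically rather than through virsh or virt-manager, the Python bindings are a gentle entry point. A tiny sketch (assumes the libvirt-python package and a local qemu:///system hypervisor):

    import libvirt

    # Connect to the local system hypervisor and list every defined VM.
    conn = libvirt.open("qemu:///system")
    for dom in conn.listAllDomains():
        print(dom.name(), "running" if dom.isActive() else "stopped")
    conn.close()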
You take traditional non-distributed systems and push them to their limits in some regard.
"Singularity" systems are an abstraction afforded to us by the grace of the hardware we run them on. If you start pushing their performance hard enough, however, you inevitably get distributed behavior.
This is also a good potential career reason to try to make software that is as performant as possible - you'll get all the tasty edge cases and complexity war stories to talk about.
By designing twitter in a 45 min interview.
It's not as easy as playing with more 'normal' stuff, but I usually use VMs on a local hypervisor like ESXi, or a bunch of old desktop/server hardware if I have enough space/power/cooling at the time. Winter helps; big stuff often runs loud and hot. To get specialized hardware when needed, eBay or 'trash' from work and such can help a lot.
Gossip Glomers might be fun if you’re looking for some hands-on exercises :)
https://fly.io/dist-sys/
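To give a flavor: the first challenge is an "echo" node speaking Maelstrom's JSON-over-stdin/stdout protocol. The official challenges are in Go, but the protocol is plain JSON, so here's a rough sketch in Python:

    # echo.py - sketch of a Maelstrom "echo" node (Gossip Glomers part 1).
    # Maelstrom delivers one JSON message per stdin line; replies go to stdout.
    import json, sys

    for line in sys.stdin:
        msg = json.loads(line)
        body = msg["body"]
        reply = {"in_reply_to": body["msg_id"]}
        if body["type"] == "init":    # handshake at startup
            reply["type"] = "init_ok"
        elif body["type"] == "echo":  # the challenge: send the payload back
            reply["type"] = "echo_ok"
            reply["echo"] = body["echo"]
        else:
            continue
        print(json.dumps({"src": msg["dest"], "dest": msg["src"],
                          "body": reply}), flush=True)

The later challenges (broadcast, grow-only counter, Kafka-style log) layer real distributed-systems problems on top of that same message loop.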
1. A bunch of communicating local processes.
2. A bunch of communicating local VMs (easier with a beefier machine like my current desktop).
3. Mininet (there are other options) to simulate a network environment; you can fully control the topology very easily. Lighter weight than (2), more control for simulating different network effects than (1) alone. (Quick sketch below.)
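For (3), a minimal Mininet sketch (assumes Mininet is installed and run as root); the nice part is that hosts, switches, and links are just Python objects you can script:

    from mininet.net import Mininet
    from mininet.topo import SingleSwitchTopo

    # Three hosts hanging off a single switch; swap in a custom Topo
    # subclass to model whatever topology or partition you care about.
    net = Mininet(topo=SingleSwitchTopo(3))
    net.start()
    net.pingAll()                # sanity check: full connectivity
    h1 = net.get('h1')
    print(h1.cmd('ip addr'))     # run arbitrary commands on a "host"
    net.stop()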