
Faster CRDTs (2021)

jzelinskie
65 replies
20h30m

What are some real world apps using CRDTs that have really good experiences?

IIRC Notion was supposed to be one of them but realistically taking notes with two people in Notion is almost unusable compared to Google Docs.

yazaddaruvala
18 replies
20h7m

In practice, Git

Some CRDT purists would say "it's not perfectly conflict-free so it's not a CRDT".

Sure[0], but for the rest of us who are pragmatic about best-effort conflict resolution, Git is likely the most successful CRDT application.

[0] https://en.wikipedia.org/wiki/No_true_Scotsman

josephg
9 replies
19h19m

I really wish someone would make a git-like tool on top of CRDTs. I want conflicts when merging commits like git does, but I also want the CRDT features - like better cherry-picking and cleaner merging when merging several branches together. And of course, live pair programming.

CRDTs store a superset of the information git stores - so we should be able to use that information to emit git style conflicts when we want. But I think nobody has implemented that yet. (Pijul probably comes closest.)

Groxx
5 replies
19h13m

I suspect a major reason why CRDTs haven't been a clear dominator in VCSes is that the "conflict free" decision is not necessarily the correct decision. It's merely a consistent one.

When editing is relatively "live", the conflicts are small enough that the automatic choice is probably also correct. But merging your month of changes into a dozen other people's month of changes doesn't mean the result will build correctly (or even look sane) when you've changed the same file. Seeing the issue and fixing it manually gives you a chance to correct it at the time it's relevant, rather than "it all changed, good luck".

---

If you're interested in distributed-VCS systems that have some different semantics than git, https://pijul.org/ might be interesting. And Jujutsu is... complicated, as it abstracts across a few and brings in a few things from multiple VCSes, but it's mostly git-like afaict: https://github.com/martinvonz/jj

No doubt there are others too (if anyone has favorites, please post! I'm curious too)

bmacho
1 replies
6h48m

I remember jj, pijul or another CRDT-git website offering a JavaScript demo (I can't find it now). I tried having user A remove a line while user B modifies it, and the merge converged to B's modified line. I don't think that kind of automatic conflict resolution is the future.

Groxx
0 replies
6h12m

Yeah, I don't think so either. Conflicts are good - the knowledge that someone else also did something [here] is useful, and external context can completely change what the correct next step is. The info needed will never be fully encoded into the text.

That said, a clearer model for why it conflicts, and how things got there, can definitely help understand what happened and therefore what to do next. Git isn't great there, you're just given the final conflict pair.

nrr
0 replies
18h42m

Fossil (https://fossil-scm.org) is actually a proper grow-only set from what I can tell: merge conflicts upon synchronization are encoded in the DAG as forks of the history at the conflicting points in time. If no action is taken after a fork, all clones will eventually see the exact same view of the history.

The tooling will normally bail with a "would fork" message if a `fossil update` (and, hence, local resolution of those conflicts) wasn't undertaken before pushing, but this is not at all necessary to preserve useful operation.

n0w
0 replies
9h49m

CRDTs seem to give the best experience when they correctly model the "intent" of changes.

But a diff between two different states of raw text can't convey the intent of a code change (beyond very simple changes).

This is why I think CRDTs haven't caught on for VCSes and I'm not sure they _could_ without some kind of structured editing.

josephg
0 replies
11h10m

the "conflict free" decision is not necessarily the correct decision. It's merely a consistent one.

Yep. But there's no reason you couldn't build a system that supports both approaches - conflict-free when live pair programming, but conflicts are still generated when merging long lived branches. As I say, text CRDTs store all the data needed to do this. Just, nobody (as far as I know) has built that.

felipefar
0 replies
17h32m

I've been researching CRDTs for a reference manager that I'm building (https://getcahier.com), and thought that some hybrid of automatic and user-operated conflict resolution would be ideal, as you described. But current efforts are mostly directed to automatic resolution.

brynb
0 replies
1h8m

(hi seph, hope all’s well) — i did exactly this with Redwood and showcased it multiple times during Braid meetings. alas, nobody was very interested in trying to push it forward

OJFord
0 replies
16h4m

I would want the git-like tool to have semantic diffs, and so have much better conflict resolution as a result; not CRDTs and more obtuse & confused conflicts than it's already capable of.

kiitos
4 replies
19h56m

CRDT literally means Conflict-free Replicated Data Type. Expecting CRDTs to be conflict-free isn't purism, it's simple validation. Git is, inarguably, not a CRDT.

yazaddaruvala
2 replies
16h21m

CRDT literally means Conflict-free Replicated Data Type.

Git could be "conflict-free" with a simple `rand(ours, theirs)`.

It would be useless, but technically "conflict-free". Is the addition or removal of that rand function really, pragmatically the difference in the answer to "what is a CRDT?"

jakelazaroff
0 replies
2h11m

`rand` wouldn’t work because all peers must reach the same state without coordination.
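A minimal sketch of why determinism matters here (a toy last-write-wins register, all names hypothetical): ties are broken by a deterministic order over (timestamp, peer id), so every replica picks the same winner with no coordination - which `rand(ours, theirs)` cannot guarantee.

```python
# Toy last-write-wins register: ties are broken deterministically by
# (timestamp, peer_id), so all replicas agree without coordinating --
# unlike rand(ours, theirs), which would diverge.

class LWWRegister:
    def __init__(self, peer_id, value=None, timestamp=0):
        self.peer_id = peer_id
        self.value = value
        self.timestamp = timestamp

    def set(self, value, timestamp):
        self.value = value
        self.timestamp = timestamp

    def merge(self, other):
        # Deterministic total order: later timestamp wins,
        # higher peer_id breaks exact ties.
        if (other.timestamp, other.peer_id) > (self.timestamp, self.peer_id):
            self.value = other.value
            self.timestamp = other.timestamp
            self.peer_id = other.peer_id

a = LWWRegister("a"); a.set("ours", 5)
b = LWWRegister("b"); b.set("theirs", 5)   # concurrent: same timestamp
a.merge(b)
b.merge(LWWRegister("a", "ours", 5))       # merge in the other direction
assert a.value == b.value == "theirs"      # both replicas picked one winner
```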

Groxx
0 replies
14h31m

Adding extra rules on top of git to try to turn it into a CRDT doesn't make git one, even if you succeed (and rand would not succeed). You can do that with a pencil and paper, but that doesn't make paper a CRDT.

LAC-Tech
0 replies
14h48m

There are CRDTs that keep "multiple versions", which look an awful lot like conflicts to me, e.g. the Multi-Value Register in this paper:

https://inria.hal.science/inria-00555588/

vlovich123
2 replies
19h50m

This comes up every time, but there are three criteria for CRDTs and git fails two of them. Even ignoring the requirement for automatic conflict resolution (which git can't meet, since automatic resolution fails as a matter of course), and ignoring that the conflict resolution algorithm has to be part of the data type (it's not), git fails the requirement that separate copies must eventually converge when brought back online. That's demonstrably false: people may use different conflict resolution mechanisms, and the commit graph of a resolved conflict may itself differ, resulting in different histories from the process of bringing git online.

This is because the commit graph itself isn't a CRDT. If I commit A and then B but someone's clone only has B applied, you can't synchronize even automatically; different git histories don't resolve in any way automatically at all, and once you pick a solution manually your copy will not have the same history as anyone else who tries to redo your operations.

No true Scotsman doesn’t apply here because there is a very precise definition of what makes a CRDT that is a clear delineating line.

yazaddaruvala
1 replies
16h29m

1. The application can update any replica independently, concurrently and without coordinating with other replicas.

2. An algorithm (itself part of the data type) automatically resolves any inconsistencies that might occur.

3. Although replicas may have different state at any particular point in time, they are guaranteed to eventually converge.

[0]

Again, in theory it fails 2 and 3. However, in practice 3 is a normal part of working with git in a team - barring a hard fork, which is equivalent to a deep copy of a CRDT. Like any deep copy of a data type, a CRDT's deep copy can be used in non-conformant ways (a "fork" is just VCS jargon for a deep copy, or sometimes a shallow one).

If I commit A and then B but someone’s clone only has B applied, you can’t synchronize even automatically; different git histories don’t resolve in any way automatically

Maybe I don't understand your point specifically, but this example seems entirely solved by --rebase. In practice --rebase is typical, and best described as "do your best to automatically resolve histories; I'll handle any of the complex conflicts".

All that said, I already agreed: "academically Git is not a CRDT". However, and I'm happy to disagree with you, in practice Git is the most popular CRDT.

[0] https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...

fragmede
0 replies
16h11m

given how easy it is to run into merge conflicts doing normal things with git, I can't say that I'd agree that in practice git is a CRDT either.

yazaddaruvala
11 replies
16h9m

What are some real world apps using CRDTs that have really good experiences?

Since people loved my other hot take :)

I suppose "good experience" can be subjective, but Blockchains are also in practice a distributed trust-less CRDT that is used by a LOT of people.

- With redundant validation, across more than 2 nodes, used to solve for conflict resolution.

- and proof of work/stake used to solve for distributed trust.

Again, academically, maybe not a CRDT because they will not "always converge". Hard forks can happen. However, pragmatically, Blockchains are a CRDT.

josephg
6 replies
11h26m

Author here

Again, academically, maybe not a CRDT because they will not "always converge". Hard forks can happen. However, pragmatically, Blockchains are a CRDT.

The term "CRDT" is an academic term. So the academic definition matters. They're not "pragmatically" a CRDT because that's simply not how we decide what is and isn't a CRDT.

The proper definition is any algorithm, replicated across multiple computers in a network with these 3 properties (at least according to wikipedia):

1. The application can update any replica independently, concurrently and without coordinating with other replicas.

2. An algorithm (itself part of the data type) automatically resolves any inconsistencies that might occur.

3. Although replicas may have different state at any particular point in time, they are guaranteed to eventually converge.

Blockchains meet criteria 2 and 3. But in my opinion, property 1 is too much of a stretch.

In a blockchain, you can't meaningfully update your local replica independently and without coordination. Sure - you can add local transactions. But blockchains care about global consensus, not the local state. A transaction in a blockchain isn't generally considered to have happened until it's been included in a block & witnessed throughout the network.

This is in stark contrast to, say, Apple Notes (which uses a CRDT internally). In Notes, I can edit a note on my phone and the note changes. There's no consensus required before my new edits are considered to be legitimate.
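To make property 1 concrete, here is the textbook grow-only counter (my own sketch, not from the article): each replica bumps only its own slot, so local edits need no coordination at all, and an element-wise max merge makes replicas converge regardless of sync order.

```python
# Textbook grow-only counter (G-Counter): each replica increments
# only its own slot, so local updates never require coordination.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> that replica's increment count

    def increment(self):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + 1

    def merge(self, other):
        # Element-wise max: commutative, associative, idempotent.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment()   # offline edits on replica a
b.increment()                  # concurrent edit on replica b
a.merge(b); b.merge(a)         # sync in either order
assert a.value() == b.value() == 3
```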

halfcat
3 replies
4h35m

Help me understand this more clearly. At what level of abstraction does the CRDT definition apply? At the data structure level, the application level, the end user level?

We can, for instance, use a simple directed acyclic graph of immutable data types as a data structure that can handle asynchronous updates and result in eventual consistency across all computers.

We have a node in this DAG that says there’s a meeting at 1pm. Two people update this meeting time. One updates it to start at 2pm, and the other updates it to start at 3pm, and these changes happen simultaneously.

The data structure now has a tree with the original parent node (1pm), and two new child nodes with new times (2pm and 3pm). All computers are updated cleanly and contain the same updated data structure. All conflicts handled. Zero conflicts exist at the data structure level.

At the application level, the app now shows there’s a meeting “at 2pm or 3pm”. And all apps on all computers will reflect this same, consistent information.

But to the people, there is a conflict in the meaning of this data. Is the meeting at 2pm or 3pm? This is somewhat analogous to a git merge conflict, where the “what to do about it” depends on the meaning to a human.

Like I get the impression that a lot of people don’t consider the “meeting at 2pm or 3pm” to be a CRDT because it “doesn’t converge to a single value”.

But from a physics perspective, there’s some binary data, some changes happen, and the new binary data is now replicated to all end user devices in the same state.

jakelazaroff
2 replies
3h39m

The CRDT definition applies at the data structure level.

Let’s model your meeting example as a set. Any local changes would atomically remove the current meeting time and add a new one.

Multiple concurrent edits could result in two meeting times in the set. In terms of user experience, that’s a conflict — but from the CRDT’s perspective, it’s still a single consistent state upon which all peers eventually agree.

Or, put slightly differently: the CRDT’s job is purely to make sure the peers can always agree on the same state. Any semantic conflicts within that state are an application-level concern.
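The meeting example can be sketched as a toy multi-value register (my own illustration, not a library): each local edit supersedes the values its author has seen, so concurrent edits survive side by side, and every peer converges to the same {2pm, 3pm} state at the data structure level.

```python
# Toy multi-value register: a local "set" supersedes every value the
# editor has seen and adds a uniquely-tagged replacement. Concurrent
# edits supersede the same old value but not each other, so both survive.

class MVRegister:
    def __init__(self):
        self.added = set()    # (tag, value) pairs ever written
        self.removed = set()  # tags superseded by some later edit

    def set(self, tag, value):
        self.removed |= {t for t, _ in self.current()}
        self.added.add((tag, value))

    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed

    def current(self):
        return {(t, v) for t, v in self.added if t not in self.removed}

a = MVRegister(); a.set("t0", "1pm")
b = MVRegister(); b.merge(a)      # both peers start from "1pm"
a.set("a1", "2pm")                # concurrent edits...
b.set("b1", "3pm")
a.merge(b); b.merge(a)
# Identical, conflict-free state on both peers; the "2pm or 3pm"
# ambiguity is left to the application layer.
assert {v for _, v in a.current()} == {v for _, v in b.current()} == {"2pm", "3pm"}
```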

Slightly shameless plug, but I wrote a post showing how a CRDT fits into a toy application if you want to see how the layers fit together: https://jakelazaroff.com/words/building-a-collaborative-pixe...

yazaddaruvala
1 replies
2h59m

In terms of user experience, that’s a conflict — but from the CRDT’s perspective, it’s still a single consistent state upon which all peers eventually agree.

So then Git uses a CRDT internally? Merge conflicts are user experience "semantic conflicts" within the Git application.

The internal state in the Git commit tree contains all commits, consistent across the synced replicas. Without further input, all the repos would render the exact same user-experience conflict.

"The data type i.e. Git's internal state is conflict free - but there are times the user needs to correct the semantic state."

"The Git application wants to guarantee semantic state, so unless the semantic conflicts are resolved, the Git internal state i.e. 'data type' will continue to be conflict-free using the 'rollback technique' to reconcile conflicts with the local copy's 'data type'."

Roughly: git fetch would be the conflict-free update of the internal data type. While git merge would be the application level, semantic conflict resolution.

jakelazaroff
0 replies
2h28m

I don’t know enough about Git’s internals to be able to say. My gut says no; the (state-based) CRDT “rules” are that merges must be commutative, associative and idempotent, and I think if you tried to merge conflicting changes on multiple remotes in different orders, you would get different internal states. Not sure though!
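The three state-based merge laws mentioned above can be spot-checked mechanically. A grow-only set, whose merge is plain union, passes all three - which is the bar being described (my own sketch, not a claim about Git's internals):

```python
# Spot-check the state-based CRDT merge laws for set union.
def merge(a, b):
    return a | b  # grow-only set: merge is union

x, y, z = {"e1"}, {"e2"}, {"e1", "e3"}

assert merge(x, y) == merge(y, x)                      # commutative
assert merge(merge(x, y), z) == merge(x, merge(y, z))  # associative
assert merge(x, x) == x                                # idempotent
```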

yazaddaruvala
0 replies
3h27m

Sure - you can add local transactions. But ...

A transaction in a blockchain isn't generally considered to have happened until it’s been included in a block & witnessed throughout the network.

For the sake of argument:

- It’s not meaningful for Apple Notes’ local edits to only be local. Sure, they will render, but if a tree falls and no one hears it, did it make a sound? So let’s ignore local edits.

- A sentence in Apple Notes isn’t generally considered to have happened until it’s been rendered in every client and users can see it if they look.

Is that any less fair of a statement? Is this game of nuanced word selection meaningful at all for answering the question “Is Apple Notes a popular application of CRDTs?”

Meanwhile, keep in mind that in a blockchain, entire sets of nodes could be firewalled from each other for days, each subsection performing consensus and creating blocks. Days later the firewall could collapse and the nodes will need to automatically reconcile. If I abstract away the internal replicas and say the two “offline parts” of the CRDT need to keep “local” state and reconcile once both/all parts come back online, would this abstraction help meet the strict definition?

I know you probably still strongly disagree - but coming from a non-academic background these word games around CRDTs hurt rather than help people’s understanding and use of them.

Instead we should be asking, how can we improve Git as a CRDT (where perfect "conflict-free" is the enemy of good enough) to better automerge more conflicts? How can we improve the general theory of CRDTs based on what we learned from implementations in “video game A”, “blockchain Z”, and “web application framework J”?

espadrine
0 replies
6h25m

I would say blockchains fit this definition. What I take away from that is that this definition is not restrictive enough.

Similarly, a list of operations with Lamport timestamps fits this definition. However, just like the blockchain, it makes no effort to preserve user intent.

To go to an extreme, a merge operation which replaces the current content with the empty string is a CRDT per this definition.
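Spelling out that extreme case (my own two-line illustration): the degenerate merge is order-independent and every replica trivially converges to the empty string, satisfying the letter of the definition while destroying all user intent.

```python
# Degenerate merge: replicas converge (to "") no matter the order of
# merges, so it passes the eventual-convergence wording -- while
# preserving nothing the users wrote.
def merge(a, b):
    return ""

assert merge("draft A", "draft B") == merge("draft B", "draft A") == ""
assert merge(merge("x", "y"), "z") == merge("x", merge("y", "z")) == ""
```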

plesiv
1 replies
12h41m

The cat runs, eats and has fur. The cat is a dog.

jffhn
0 replies
2h58m

"Plato had defined Man as an animal, biped and featherless, and was applauded. Diogenes plucked a fowl and brought it into the lecture room with the words, "Here is Plato's man."" (Diogenes Laertius, Lives of Eminent Philosophers)

EGreg
1 replies
16h4m

I think blockchains are overkill because they're trying to get a global consensus about every transaction in the world, which is very expensive (with electricity, stake etc.). They're trying to solve the additional problem of preventing "forking" of a history. Whereas if you allow forks then CRDTs are more lightweight.

Internet Computer canisters, Merkle DAGs (like IOTA), Hashgraphs and Holochains are probably a better architecture than Blockchains.

Having said that, I think Merkle Trees are the future, for minimizing the amount of sync. They let you see exactly what changed, and so on.

https://joelgustafson.com/posts/2023-05-04/merklizing-the-ke...

https://github.com/canvasxyz/okra-js

There's already great software done by Dat (later called Hypercore, now called Holepunch) that has an append-only log based on the SLEEP protocol. That's essentially all you need, in my opinion. They've built everything from key-value databases (hyperbee) to full-scale file systems (hyperdrive) on top of it. Beaker browser has been built on top of it.

I mean, merging state-based CRDTs is cool, but you may as well just use operation-based CRDTs with something like Hypercore or the older PouchDB.

maccard
5 replies
20h0m

Doesn’t google docs use crdt?

surfmike
4 replies
19h58m

I believe it uses Operational Transforms

EGreg
2 replies
15h53m

OT vs CRDT: https://stackoverflow.com/questions/26694359/differences-bet...

In my opinion, the problem with CRDTs is they don't really have a great concept of access control and enforcing rules. Since everyone's a peer, you can only do things like "last write wins" etc.

In a video game, for instance, I could disconnect and then spam the peers with tons of updates, claiming they were all done in the past, and they'd have to accept them all.

PouchDB and IPFS let nodes sort of make decisions about whom to replicate from, but it's rudimentary: https://pouchdb.com/guides/replication.html

Hypercore is much better because it is an append-only log where every chunk is signed: https://docs.pears.com/how-tos/replicate-and-persist-with-hy...

However, it's still not enough, because hypercore only supports a single writer. There has been some work on hypercore multiwriter in the last few years (which I followed with interest) but none of it is Byzantine fault-tolerant, and it was ultimately dropped. In fact, if you leak the key, anyone can probably corrupt a hypercore rather easily and prevent it from making forward progress.

This is why at the end of the day, you still go back to the ideas of preventing forks. The problem is that blockchains do it via global consensus, which is hugely wasteful and overkill. But you do need something to run essentially "smart contracts", i.e. software that runs "in the cloud" which is infeasible to tamper with. Like Internet Computer's canisters for example... https://internetcomputer.org/docs/current/concepts/canisters...

Ultimately, these architectures will end up being far more reliable and resilient than the Web, because the Web is based around hosting and location ("where" something is) rather than content ("what" something is). IPFS and cids are a step towards that future, but other projects like ICP and Freenet are leapfrogging them... they have smart contracts based around WebAssembly.

Incidentally, I interviewed Ian Clarke about the original freenet, and he recently announced the new freenet, which is based around webassembly, and the swarming is done using "small-world topology" instead of kademlia DHT: https://www.youtube.com/watch?v=yBtyNIqZios

(Here is my interview with him a couple years ago: https://www.youtube.com/watch?v=JWrRqUkJpMQ)

josephg
1 replies
11h15m

Author here. I had a chat with PHV about this (a principal author of the Local First Software paper).

He sees the access control problem like scribbling on your copy of the US Constitution. If you print out the constitution, you can make whatever marks on the page that you want. But nobody else is obliged to copy your changes.

It's the same as Git in that sense. I can fork a repository and submit pull requests. But I can only push my changes upstream if I have write access to the repository.

If you need a single authoritative "truth" - like a video game - then you can still do that by having one peer act as a hub and it can decide which edits to merge & broadcast. If a video game used a CRDT, you might still want a server - and that server might still want to reject any incoming edits that are older than a few seconds.

But I generally agree - I don't think CRDTs add a lot of value in multiplayer video games. CRDTs make sense when a local user wants to edit some data, and those changes are useful locally even if you never share them with other peers. So, creative work is the use case I think about the most - like video editing, writing, design, programming, and so on.

In the video game industry, I think the best use for CRDTs would be to make Unity / Unreal / etc collaborative spaces.

maccard
0 replies
9h11m

In the video game industry, I think the best use for CRDTs would be to make Unity / Unreal / etc collaborative spaces.

Unreal has a collaborative implementation - https://www.youtube.com/watch?v=MPIpOdNmNGE It's demo'ed here for Virtual Production. I believe that it's called the "concert framework" in the engine itself. I'm unsure what it's implemented based on though.

maccard
0 replies
9h27m

Ah, thank you.

yazaddaruvala
4 replies
11h22m

Thinking about it more, I’ll provide another that people will not immediately appreciate.

Any and all networked video games with some form of rollback or correction. Best effort with a fallback to rollback might actually be the “best”, i.e. most ergonomic, experience among the CRDTs that are most widely used.

Again, not academically a CRDT because technically game state is not perfectly replicated to every client. Each client only gets partial game state.

Additionally, game clients require low latency syncing, which could academically be considered “coordinating”. Even tho the client actually accepts and renders the input’s results locally probabilistically before any conflict resolution / rollback is returned to the client for correction.

Again people are likely going to be pedantic but with three post now, I’d like to hope, people might see the common theme:

The most popular, highly ergonomic, best implementations of CRDTs actually break the academic rules of CRDTs.

This is a relatively typical trap of an overly academic mental model. Most real world algorithms and data types are actually more creative than their academic “rulesets”. eg Timsort.

Especially if you’re building a product for actual use (as opposed to for review in a paper), don’t fall into the over-engineered/academic trap. Be creative: learn the academic rules, then intentionally break those rules. Build what actually adds value and make it ergonomic, rather than trying to perfectly implement a concept that academics defined so stringently it’s only useful for other academics.

sorrythanks
3 replies
6h16m

The most popular, highly ergonomic, best implementations of CRDTs actually break the academic rules of CRDTs.

There's a popular, highly ergonomic implementation called Automerge[0] that would beg to disagree with you.

[0]: https://automerge.org/

yazaddaruvala
2 replies
3h55m

FWIW, never heard of it before but I like it.

Meanwhile, a lot of people on these threads would say Automerge is not a CRDT. By definition Automerge.getConflicts means Automerge has conflicts and is not conflict-free.

I on the other hand prefer and applaud Automerge for breaking the rules! I hope it continues to see more success.

vlovich123
1 replies
44m

I think you are grossly misunderstanding what does and does not make a CRDT. The existence of conflicts is fine - they just need to be resolved automatically and in an eventually consistent manner, so that all clients converge to the same state once they’re all online. “Conflict-free” refers to there not being any unsolvable conflicts or ambiguity in solving those conflicts. That’s why git isn’t a CRDT but automerge is:

The values of properties in a map in automerge can be conflicted if there are concurrent "put" operations to the same key. Automerge chooses one value arbitrarily (but deterministically, any two nodes who have the same set of changes will choose the same value) from the set of conflicting values to present as the value of the key.

Sometimes you may want to examine these conflicts, in this case you can use getConflicts to get the conflicts for the key.

Basically it’s saying maybe you want to look at other “older” values that may have been observed even though the conflict was resolved (eg showing the user previous copies of the values).
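A rough sketch of that “arbitrary but deterministic” resolution for one map key (my own illustration, not Automerge’s actual implementation): the winner is chosen by a total order over (timestamp, actor id) that every node computes identically, while the losers remain inspectable, in the spirit of getConflicts.

```python
# Hedged sketch (not the real Automerge internals): concurrent puts
# to a single key are resolved by a deterministic total order over
# (timestamp, actor_id); losing values stay available for inspection.

def resolve(puts):
    """puts: list of (timestamp, actor_id, value) for one key."""
    ordered = sorted(puts, reverse=True)  # same order on every node
    winner = ordered[0][2]
    losers = [v for _, _, v in ordered[1:]]
    return winner, losers

puts = [(7, "actor-b", "blue"), (7, "actor-a", "red")]  # concurrent puts
winner, losers = resolve(puts)
assert winner == "blue"   # identical winner on every replica
assert losers == ["red"]  # the conflict is resolved, not hidden
```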

yazaddaruvala
0 replies
40m

https://automerge.org/docs/documents/conflicts/

"The only case Automerge cannot handle automatically, because there is no well-defined resolution, is when users concurrently update the same property in the same object (or, similarly, the same index in the same list). In this case, Automerge picks one of the concurrently written values as the "winner", and it ensures that this winner is the same on all nodes"

"The object returned by getConflicts contains the conflicting values, both the "winner" and any "losers". You might use the information in the conflicts object to show the conflict in the user interface."

To me, this seems identical to how git works. Specifically git fetch (does automatic resolution storing all of the changes), vs git merge (showing the conflicts in the user interface).

minkles
3 replies
19h3m

Most of iCloud's services use CRDTs underneath, I believe. That includes Notes, Reminders and possibly Photos as well. FoundationDB is part of the backend as well. I was told this by a drunken former Apple SRE in a bar :)

octopoc
0 replies
1h44m

The other day I saw a bug in it, when editing a Note that another person was also editing. We were editing separate paragraphs. My cursor kept resetting to the beginning of a word after each letter I typed (and the Swype-style keyboard had the same problem). My point is, even Apple Notes doesn't get it quite right.

chipdart
3 replies
4h56m

What are some real world apps using CRDTs that have really good experiences?

Google Wave and Google Docs use CRDTs.

There's a fun blog post or two on how an academic paper completely botched its benchmarks of the underlying algorithms: it put together a piss-poor implementation of Google Wave's algorithm and proceeded to claim the algorithm itself sucked, when the implementation was the real problem.

vlovich123
0 replies
48m

Google Docs also uses OT, not CRDTs. OT is easier to develop by an order of magnitude, and even offline mode is possible to implement in a good-enough way.

tkone
0 replies
47m

Google Docs was originally OT-based as well. I'm not sure about the current state of it.

jitl
2 replies
19h22m

Today Notion is a last-write-wins system with limited intention-preserving operations for list data (like block ordering). Text is last-write-wins, each block text or property is a last-write-wins register. We're working on a new CRDT format for block text.

felipefar
1 replies
17h58m

Do you use last-write-wins using the received order of operations on the server or using a logical clock?

I believe that for Notion the collaboration use case is more important than async editing, so a CRDT makes more sense, but for text editing apps that favor asynchronous collaboration maybe explicit conflict resolution is more reasonable.

jitl
0 replies
17h46m

No clocks on the write side

jdvh
1 replies
18h57m

Thymer[1] uses CRDTs for everything. It's an IDE for tasks and planning. It's a multiplayer app, end-to-end encrypted and offline first, optionally self-hosted, and an entire workspace is a single graph. So CRDTs were the logical choice.

All operations in Thymer get reduced to a handful of CRDT transformations. It doesn't matter whether you are moving or copying text, changing "frontmatter" attributes, dragging cards, uploading files, or adding tags. It's all done with the same handful of CRDT operations. Although this was a lot of extra work up front (we're not using any libraries) the benefits make it totally worth it. When your application state is a single graph you can move text between pages, link between pages (with backlinks), have transclusions, and do all sorts of cool stuff without having to worry about synchronization. CRDTs guarantee that all clients converge to the same state. And because CRDTs are by their nature append-only you get point-in-time versioning for free! We did end up having to make a couple of compromises for performance, though. Version history is not available offline (too much data) and in some cases we resort to last-writer-wins conflict resolution. On balance I think CRDTs are very much worth it, especially if you design an app with CRDTs in mind from day one. I probably wouldn't use CRDTs if I had to retrofit multiplayer in a more conventional AJAX app. Mutations in CRDTs are first applied optimistically, and then when the authoritative sequence of events is determined all clients need to revert their state to the last shared state and then re-apply all events in the correct order (thereby guaranteeing that all clients end up in the same state). Sometimes your app might need to revert and re-apply days worth of changes if you've been offline for a while. This all happens behind the scenes and the user doesn't know how many tree transformations are happening in the background but I guess my point is that CRDTs affect the design of the entire application. Most apps that are popular today were designed back when CRDT transformations were not yet well understood.

[1] https://thymer.com (almost ready for beta)
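The revert-and-reapply loop described above can be sketched like this (my own simplification, not Thymer's code): local ops render optimistically, and when the authoritative order arrives, each client rebuilds from the last shared state so everyone converges.

```python
# Simplified optimistic-apply / rollback-reapply loop: local edits
# render immediately; the authoritative order later triggers a revert
# to the last shared state and a re-apply in canonical order.

def apply(state, op):
    # Toy operation: append a tagged edit to a list-shaped document.
    return state + [op]

class Client:
    def __init__(self):
        self.shared = []    # last state confirmed authoritative
        self.pending = []   # optimistic local ops, not yet confirmed
        self.state = []     # what the user currently sees

    def local_edit(self, op):
        self.pending.append(op)
        self.state = apply(self.state, op)  # optimistic render

    def authoritative(self, ops):
        # Canonical order decided: advance the shared state, then
        # revert and re-apply any still-unconfirmed local ops on top.
        for op in ops:
            self.shared = apply(self.shared, op)
        self.pending = [p for p in self.pending if p not in ops]
        self.state = self.shared
        for p in self.pending:
            self.state = apply(self.state, p)

a, b = Client(), Client()
a.local_edit("a1"); b.local_edit("b1")   # concurrent optimistic edits
canonical = ["b1", "a1"]                 # authoritative order
a.authoritative(canonical); b.authoritative(canonical)
assert a.state == b.state == ["b1", "a1"]  # both clients converge
```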

pnw
0 replies
18h40m

This looks really cool, signed up for the beta!

tlarkworthy
0 replies
11h19m

It's not a CRDT, it's LWW at the cell level, transmitted pretty much at typing speed, so conflicts are rare

tin7in
0 replies
9h17m

Saga: https://saga.so is a collaborative workspace with a strong focus on speed/simplicity.

spacecadet
0 replies
20h14m

Not a public high-bandwidth example, but a consulting team I was on implemented CRDTs in a private mobile app used for a compliance documentation use case. More of a precaution: the expected number of end-users editing a record at any given time is 1, but while reviewing the client's system we found it can support multiple, and sometimes their employees don't communicate/follow procedures. We first tried to convince them to make changes on their end, but their development team straight up refused.

leonseled
0 replies
13h48m

Muse. I love how I can write something on my iPad with the Pencil and see the strokes in real time on the Mac app.

I then switch to my Mac to drag images and screenshots and see them synced to the iPad in real time. Then I annotate with the Apple Pencil.

One of the best “whiteboarding” tools I’ve used.

darby_nine
0 replies
4h53m

I mean, two people contending over one piece of state is never going to be a good experience. Thankfully this is mostly not useful in the first place.

danielvaughn
0 replies
20h15m

Figma uses a CRDT-ish approach. I'm working on one at the moment, but it'll be a while before it's publicly available. At the moment, I'm using Loro: https://loro.dev

WolfOliver
0 replies
5h7m

I'm the creator of MonsterWriter. MonsterWriter uses a very simple CRDT for JSON payload (references index, export config). For the collaboration on the actual text it uses OT. When I started to implement this 7 years ago, CRDTs weren't a thing.

LAC-Tech
0 replies
14h52m

I think a lot of things. CRDTs predate the term "CRDT", they seem to be a mathematical construct/pattern found in the wild, that smart people codified and formalised.

IME CRDTs just lay out the rules you need to follow for syncing eventually consistent nodes properly, ie without weird surprises or data loss.

As for concrete examples:

- in the famous Amazon Dynamo Whitepaper, they perfectly describe a CRDT in section 4.4. (https://www.allthingsdistributed.com/files/amazon-dynamo-sos...), though I don't think this made it to modern Dynamo DB

- I believe CouchDB follows all the CRDT laws

- TomTom uses them

I think the TL;DR is: if you're going to sync two data stores that can be written to without coordination, learn the CRDT laws and make sure you follow them. They're fairly simple mathematical properties at the end of the day.
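The "CRDT laws" referred to here boil down to the merge function forming a join semilattice: commutative, associative, and idempotent. A minimal illustration with a grow-only counter (my own sketch, not from the Dynamo paper):

```python
# A tiny G-Counter. Each replica tracks per-node increment counts, and
# merging takes the per-node maximum. Because max is commutative,
# associative, and idempotent, replicas can sync in any order, any
# number of times, and still converge.

def merge(a, b):
    """Join two replica states by taking the per-node maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

r1 = {"node-a": 3}
r2 = {"node-a": 1, "node-b": 2}
r3 = {"node-c": 5}

assert merge(r1, r2) == merge(r2, r1)                        # commutative
assert merge(merge(r1, r2), r3) == merge(r1, merge(r2, r3))  # associative
assert merge(r1, r1) == r1                                   # idempotent
```

Any sync protocol whose merge satisfies those three properties gets "no weird surprises or data loss" for free, which is the point being made about Dynamo's section 4.4.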

felipefar
11 replies
18h3m

CRDTs are powerful, but it's unfortunate that they leave behind a trail of historical operations (or elements), both in their ops- or state-based variants. Even with compression, it's still a downside that makes me concerned about adopting them.

Even so, the discussion surrounding them made me excited by the possibility of implementing conflict-free (or fine-grained conflict resolution) algorithms over file-based storage providers (Dropbox, Syncthing, etc.).

jakelazaroff
2 replies
17h54m

If you’re not building something fully decentralized, you may be able to loosen some of the constraints CRDTs require. As an example, if you can guarantee that all clients have received changes later than X date, you can safely drop any operations from before that date.
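A minimal sketch of that pruning rule (illustrative names only, assuming a central server that tracks per-client acknowledgements): everything at or below the minimum acknowledged version is safe to drop.

```python
# If every client has acknowledged changes up to some version, operations
# at or below the minimum acknowledged version can never be needed for a
# future merge, so the log can be truncated there.

def prune(ops, client_acks):
    """Drop ops at or below the lowest version acknowledged by all clients."""
    safe = min(client_acks.values())
    return [op for op in ops if op["version"] > safe]

ops = [{"version": v, "data": f"op{v}"} for v in range(1, 6)]
acks = {"alice": 3, "bob": 4}

# Alice has only seen up to version 3, so ops 4 and 5 must be kept.
assert [op["version"] for op in prune(ops, acks)] == [4, 5]
```

This is exactly the constraint that gets hard to enforce in a fully decentralized setting, as the replies below this comment discuss.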

shipp02
1 replies
11h31m

Could you also make it fully decentralized but require clients to come online within a deadline (1 day, a week) or risk losing their local changes? This would also allow trimming history, at the cost of some sync functionality.

jakelazaroff
0 replies
5h56m

Yes, you can go that direction as well, although in a decentralized architecture there’s no shared notion of “coming online”.

For example:

1. you have four peers collaborating on a document

2. for some extended period peer A only communicates with peer B and peer C only communicates with peer D (and vice versa)

3. the peers truncate the operations at some point within that period

4. each pair of peers now has a document irreconcilable with the other even though all peers “came online” recently

eviks
2 replies
14h41m

What's the concern if you can delete history?

orthecreedence
1 replies
12h26m

In a distributed system, which is often a place CRDTs thrive, deleted history means a new client cannot be brought up to the current state unless there is some form of state summarization. Doing this summarization/checkpointing in a consistent manner is difficult to do without some form of online consensus mechanism, which is also difficult to do in a distributed system with clients popping in and out all the time.

josephg
0 replies
11h0m

It depends on the system. Some approaches (like Greg Little's Shelf or my Eg-walker algorithm for text) make this trivial to implement.

josephg
1 replies
11h2m

Author here. I've had this conversation with people a lot. And people in the CRDT world also talk about it a lot. But in practice, at least with text editing, the overhead is so tiny that I can't imagine it ever coming up in practice. Diamond types - my post-CRDT project - does (by default) grow without bound over time. But the overhead is usually less than 1 byte for every character ever typed. If I turn on LZ4 compression on the stored text, documents edited with diamond types are often smaller than the resulting document state. Even though we store the entire editing history!

I know a bunch of ways to solve this technically. But I'm just not convinced it's a real problem in most systems.

(I have heard of it being a problem for someone using yjs for a 3d modelling tool. While dragging objects, they created persistent edits with each pixel movement of the mouse. But I think it's smarter to use ephemeral edits for stuff like that - which aren't supported by most CRDT libraries.)

Git also suffers from this problem, by the way. Repositories only grow over time. And they grow way faster than they would if modern CRDT libraries were used instead. But nobody seems bothered by it. (Yes, you can do a shallow clone in git. But almost nobody does. And you could also do that with CRDTs as well if you want!)

Vinnl
0 replies
5h46m

I don't think the concern was necessarily about overhead, but about sensitive data. This is a problem in Git too: people make the mistake of reverting a commit with sensitive values and think it's gone, but the history is still out there.

Edit: or maybe that was the concern, but this other concern exists too :)

jchanimal
1 replies
16h51m

The full op-log plus deterministic merging is a great fit for immutable block storage, which can have other security, performance, and cost benefits. I'm building Fireproof[1] to take advantage of recent research in this area. An additional benefit to content addressing the immutable data is that each operation resolves to a cryptographically guaranteed proof (or diff), enforcing causal consistency and allowing stable references to be made to snapshots. This means your interactive, offline-capable, losslessly-merging database can run on the edge or in the browser, but still have the integrity you'd have looked to a centralized database or the blockchain for in the past. (E.g. you can drop a snapshot CID into a PDF for signature, or a smart contract, and remove all ambiguity about the referenced state.)

[1] https://github.com/fireproof-storage/fireproof

jitl
0 replies
15h40m

how do such immutable systems deal with redaction eg for GDPR delete requests? Do you need to repack the whole history, and break the existing signature chain?

LAC-Tech
0 replies
14h47m

CRDTs are powerful, but it's unfortunate that they leave behind a trail of historical operations (or elements), both in their ops- or state-based variants. Even with compression, it's still a downside that makes me concerned about adopting them.

There's nothing inherent in the concept of a CRDT that requires you to leave behind a trail of historical operations or elements.

You'd be better off directing your criticism at specific implementations rather than making this blanket statement about what is, at the end of the day, a set of mathematical laws that certain data types / databases follow.

ericyd
4 replies
19h24m

Why is WASM 4x slower than native execution?

I thought it was because every string operation had to be copied into WASM memory and then back into JS when the result was computed. Am I wrong? Am I misunderstanding the context? Genuinely curious!

josephg
1 replies
19h7m

Author here. This post was from a few years ago but from memory I controlled for that. So the problem wasn't FFI.

I loaded the whole history into wasm before I started timing, then processed it in an inner loop that was written in rust, running in the wasm context itself. There were only 2 calls to wasm or something. The 4x slowdown wasn't FFI. The algorithmic code itself was actually running 4x slower.

It'd be interesting to rerun the benchmark now. I assume compilers are better at emitting wasm, and wasm runtimes have gotten faster. I'm sure I've still got the benchmarking code around somewhere.

ericyd
0 replies
6h3m

Interesting, thanks for the reply!

echelon
0 replies
19h16m

That strikes me as the most plausible culprit.

The one that keeps tripping me up in unrelated domains is that the multithreading story is not easy or fully supported by libraries and tooling. We've run game engines and utility binaries (ffmpeg, zip, etc.) in the browser and they're super slow because of this.

Validark
0 replies
9h12m

I think the better question is why would someone expect them to be the same? I haven't worked on a WASM interpreter or JIT, but how often is it better to go through multiple layers of translation instead of one? Translating high level code to WASM, or any assembly language, makes you lose a lot of the "intent" embedded in the higher level code. In lower-level code, you often see a series of language-specific idioms for accomplishing the goal, that may or may not have direct equivalents on your actual machine. In the case of modern x86-64, you have a ton of instructions that are far more powerful than what you can do in WASM.

Decompilers exist of course, and maybe a list of macro-op fusions exists where a WASM JIT can do a relatively simple pattern match and get good native code (probably not though, and good luck with cross-platform optimization). And, LLVM is definitely not perfect, there's definitely low-hanging fruit where a post-processing optimizer could make improvements. So it is theoretically possible to make WASM faster than LLVM's native emit by doing something equivalent to an optimization step LLVM should have but doesn't at the moment.

But I highly, highly doubt that you'd get just as good results without some seriously well-laid plans, which I doubt exist, or by creating a set of instructions which is, in effect, a superset of what target ISAs support. From what I've seen, it's a subset, so good luck trying to canonicalize the operations and merge them back together in real time. Not entirely impossible of course, but it would be a serious feat of engineering.

Intuitively, if I wrote a book in English, then it got translated to a very different language that's also limited to only a few thousand words, then translated back to English, it wouldn't be exactly the same text. There would be concepts that have to be explained with a paragraph that, in English, would only take a single word. Getting my English back out would require a 1-to-1 translation of everything, or a list of paragraph-to-1 translations that both translators agree upon.

pjz
3 replies
17h59m

And why 32 entries? I ran this benchmark with a bunch of different bucket sizes and 32 worked well. I have no idea why that worked out to be the best.

If you were using 2-byte ints, this is likely because cache lines are 64 bytes, so 32 entries would be exactly one cache line, letting each cache line hold an entire bucket, thus reducing those expensive main memory transfers.
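The arithmetic behind that guess, written out (assuming 2-byte entries and 64-byte cache lines, neither of which the article confirms):

```python
# With 2-byte entries, a 32-entry bucket fills exactly one 64-byte cache
# line, so walking a bucket touches a single line instead of several.

CACHE_LINE_BYTES = 64
ENTRY_BYTES = 2  # e.g. a u16 index or character code

entries_per_line = CACHE_LINE_BYTES // ENTRY_BYTES
assert entries_per_line == 32

# With 4-byte entries the sweet spot would land at 16 instead.
assert CACHE_LINE_BYTES // 4 == 16
```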

VHRanger
1 replies
17h16m

Yeah when benchmarking by batch sizes it's common to see huge jumps associated with the memory hierarchy:

- word size (64 bits)
- cache line fetch size (generally 64 bytes, as mentioned above)
- OS page size (4-16 KB)
- L1 size (~80 KB/core)
- L2 size (low single-digit megabytes)

hinkley
0 replies
14h24m

With lots of bizarre artifacts if you don’t force alignment.

taeric
0 replies
2h38m

I really like the way Knuth benchmarks many of his later programs. He basically puts a counter for how many times something has to be loaded from memory. Would be curious to know if you could approximate how many times you have to clear cache lines, in the same way?

IshKebab
3 replies
21h1m

(2021) and Automerge's Rust implementation seems to have landed so it would be interesting to see an updated benchmark.

josephg
2 replies
19h38m

Author here. Yjs has also been rewritten in rust (yrs) and it’s significantly faster than the JavaScript version.

I’ve also got a new, totally different approach to solving this problem too.

It would definitely be good to update the benchmarks. Everything has gotten faster.

demircancelebi
1 replies
18h48m

would love to read the totally different approach

og2023
0 replies
18h13m

Absolutely support this, so do I!

luke-stanley
2 replies
21h12m

Could be good to have the date put with the title?

gchamonlive
0 replies
20h1m

July 31 2021

dang
0 replies
19h41m

Added above.

fredrikholm
2 replies
21h24m

I remember stumbling over this post a few years ago. Really entertaining post, one of my favorites in recent years.

anaclet0
1 replies
21h9m

IIRC the title was CRDTs go brrr

josephg
0 replies
19h24m

Author here. CRDTs go brrr was my working title and it’s still in the url. I should probably rename it back to that - so many people latch on to that title anyway. The meme value is strong.

kmoser
1 replies
17h27m

This is one of those rare articles which, although much of the material is over my head, I couldn't stop reading because it's written so well.

josephg
0 replies
11h9m

Author here. Very kind :)

JohnDeHope
1 replies
21h9m

Yeah, new rule: I don't believe anything in a published scientific paper until it has been independently verified for the third time. I don't even want to hear about it, before then, unless I read the journal the original (or second) paper was published in. What I'd really like, and would subscribe to even as a lay person, is the JOURNAL OF STUDIES WHOSE FINDINGS HAVE BEEN SUCCESSFULLY REPRODUCED FOR THE THIRD TIME. I'd pay for a subscription to that.

lylejantzi3rd
0 replies
19h49m

Me too.

josephg
0 replies
10h58m

Can someone explain to me please why CRDTs are slow?

Author here. The main reason was that a lot of CRDT libraries were written by academics, who don't have the time, skill or interest in optimising them.

Since writing this article a few years ago, all the major CRDT libraries have gotten orders of magnitude faster.

riedel
0 replies
11h46m

Quoting the current github Readme [0]: >And since that blog post came out, performance has increased another 10-80x (!).

[0] https://github.com/josephg/diamond-types

arkh
0 replies
11h4m

Seeing the hierarchical structure used, I wonder if they tried using a nested set model instead. No idea if a possible gain in read operations would offset the losses on insertion.