Jepsen: Datomic Pro 1.0.7075

adrianco
26 replies
22h0m

I was a fly on the wall as this work was being done and it was super interesting to see the discussions. I was also surprised that Jepsen didn’t find critical bugs. Clarifying the docs and unusual (intentional) behaviors was a very useful outcome. It was a very worthwhile confidence building exercise given that we’re running a bank on Datomic…

belter
10 replies
20h24m

  I was also surprised that Jepsen didn’t find critical bugs.

From the report..."...we can prove the presence of bugs, but not their absence..."

cdchn
4 replies
18h17m

"Absence of evidence is not evidence of absence."

andreareina
1 replies
17h8m

If you've looked, it is. The more and the better you look, the better evidence it is.

kelseyfrog
0 replies
16h36m

If you run it through Bayes' theorem, it adjusts the posterior very little.

nine_k
0 replies
2m

s/evidence/proof/.

Evidence of absence ("we searched really carefully and nothing came up") does update the Bayesian priors significantly, so the probability of absence of bugs can now be estimated as much higher.
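
For a toy version of that update in Clojure (illustrative numbers only; the 0.1 miss rate is an assumption, not a claim about Jepsen's actual sensitivity):

  (let [p-bugs 0.5                   ; prior: critical bugs exist
        p-miss 0.1                   ; P(search finds nothing | bugs exist)
        p-none (+ (* p-miss p-bugs)  ; total P(search finds nothing)
                  (* 1.0 (- 1 p-bugs)))]
    (/ (* p-miss p-bugs) p-none))
  ;; => 0.0909... -- well below the 0.5 prior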

kelseyfrog
0 replies
16h38m

Thank you. I've updated my initial guess of p(critical bugs | did not find critical bugs) from 0.5 to 0.82 given my estimate of likelihood and base rates.

jupp0r
3 replies
17h21m

In practical terms, if you are a database and Jepsen doesn't find any bugs, that's as much assurance as you are going to get in 2024 short of formal verification.

stuarthalloway
1 replies
3h8m

Formal verification is very powerful but still not full assurance. Fun fact: Testing and monitoring of Datomic has sometimes uncovered design flaws in underlying storages that formal verification missed.

nine_k
0 replies
5m

What kind of flaws? I would expect performance problems.

pests
0 replies
14h29m

The work Antithesis has been doing here has me really excited as well.

vasco
0 replies
19h50m

That's consistent with the usual definition of "finding" anything.

killingtime74
6 replies
19h22m

Did you not do this work yourself before you started running the bank on it?

cdchn
5 replies
18h16m

I doubt any organization that isn't directly putting lives on the line is testing database technology as thoroughly and competently as Jepsen. Banks' job is to be banks, not to be Jepsen.

killingtime74
4 replies
15h11m

I would have thought they would be more rigorous, since mistakes could threaten the very viability of the business -- which is why I assume most are still on mainframes. (I've never worked at a bank.)

raverbashing
0 replies
5h22m

Banks are not usually run by people who go for the first fad.js they see; they can usually also think further ahead than five minutes.

Also, I'm sure they engineer their systems so that every operation and action is logged multiple times and have multiple redundancy factors.

A main transaction DB will not be a "single source of truth" for any event. It will be the main source of truth, but the ledger you see in your online bank is only a simplified view into it.

harperlee
0 replies
13h35m

Banks have existed since long before computers, and thus have ways to detect and correct errors that are not purely technological (such as double-entry bookkeeping, backups, supporting documentation, and separate processes). So a bank can survive a DB doing nasty things at a low enough frequency that it is not detected beforehand, and they don’t need to “prove it in Coq” to show that everything is correct.

cdchn
0 replies
11h40m

Mistakes don't threaten them that much. When Equifax (admittedly not a bank) can make massive, negligent fuckups and still be a going concern, there isn't much heat there. Most fuckups a bank makes can be unwound.

Foobar8568
0 replies
13h23m

Anyone who has worked in a bank and is happy with its solutions is either a fool, clueless, or a politician.

Banks have to answer to regulation and they do by doing the bare minimum they can get away with.

fiatjaf
5 replies
8h41m

What bank is that, if I may ask?

swah
3 replies
8h26m

The first fully digital Brazilian bank; it got pretty big in a decade.

I'd love to hear the story from the first engineers, how they got support for this, etc. They never did tech blog posts though...

jonahbenton
0 replies
6h12m

There are some videos, both of the start and of their progress. Some of the most impressive work I have ever seen, remarkable.

SOLAR_FIELDS
1 replies
2h51m

Given that Rich Hickey designed this database, the outcome is perhaps unsurprising. What a fabulous read - anytime I feel like I’m reasonably smart, it’s always good to be humbled by a Jepsen analysis.

nine_k
0 replies
6m

A good design does not guarantee the absence of implementation bugs. But a good design can make introducing bugs harder / less probable. That must be what happened here, and then it's a case to study and maybe emulate.

amgreg
24 replies
23h21m

It struck me that Jepsen identified clear situations leading to invariant violations, but Datomic’s approach seems to have been purely to clarify their documentation. Does this essentially mean the Datomic team accepts that the violations will happen, but doesn’t care?

From the article:

  From Datomic’s point of view, the grant workload’s invariant violation is a matter of user error. Transaction functions do not execute atomically in sequence. Checking that a precondition holds in a transaction function is unsafe when some other operation in the transaction could invalidate that precondition!

stuarthalloway
9 replies
22h4m

As Jepsen confirmed, Datomic’s mechanisms for enforcing invariants work as designed. What does this mean practically for users? Consider the following transactional pseudo-data:

  [[Stu favorite-number 41]
   ;; maybe more stuff
   [Stu favorite-number 42]]

An operational reading of this data would be that early in the transaction I liked 41, and that later in the transaction I liked 42. Observers after the end of the transaction would hopefully see only that I liked 42, and we would have to worry about the conditions under which observers might see that I liked 41.

This operational reading of intra-transaction semantics is typical of many databases, but it presumes the existence of multiple time points inside a transaction, which Datomic neither has nor wants — we quite like not worrying about what happened “in the middle of” a transaction. All facts in a transaction take place at the same point in time, so in Datomic this transaction states that I started liking both numbers simultaneously.

If you incorrectly read Datomic transactions as composed of multiple operations, you can of course find all kinds of “invariant anomalies”. Conversely, you can find “invariant anomalies” in SQL by incorrectly imposing Datomic’s model on SQL transactions. Such potential misreadings emphasize the need for good documentation. To that end, we have worked with Jepsen to enhance our documentation [1], tightening up casual language in the hopes of preventing misconceptions. We also added a tech note [2] addressing this particular misconception directly.

[1] https://docs.datomic.com/transactions/transactions.html#tran...

[2] https://docs.datomic.com/tech-notes/comparison-with-updating...

aphyr
4 replies
21h50m

To build on this, Datomic includes a pre-commit conflict check that would prevent this particular example from committing at all: it detects that there are two incompatible assertions for the same entity/attribute pair, and rejects the transaction. We think this conflict check likely prevents many users from actually hitting this issue in production.

The issue we discuss in the report only occurs when the transaction expands to non-conflicting datoms--for instance:

  [Stu favorite-number 41]
  [Stu hates-all-numbers-and-has-no-favorite true]

These entity/attribute pairs are disjoint, so the conflict checker allows the transaction to commit, producing a record which is in a logically inconsistent state!

On the documentation front--Datomic users could be forgiven for thinking of the elements of transactions as "operations", since Datomic's docs called them both "operations" and "statements". ;-)

stuarthalloway
2 replies
20h58m

Mea culpa on the docs, mea culpa. Better now [1].

In order for user code to impose invariants over the entire transaction, it must have access to the entire transaction. Entity predicates have such access (they are passed the after db, which includes the pending transaction and all other transactions to boot). Transaction functions are unsuitable, as they have access only to the before db. [2]

Use entity predicates for arbitrary functional validations of the entire transaction.
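
For illustration, a minimal entity-predicate sketch (the attribute, predicate, and guard-entity names here are hypothetical):

  (require '[datomic.api :as d])

  ;; An entity predicate is an ordinary function of [db eid] that is
  ;; passed the after db, i.e. it sees the entire pending transaction.
  (defn balance-non-negative?
    [db eid]
    (<= 0 (or (:account/balance (d/entity db eid)) 0)))

  ;; Registered on a guard entity via :db.entity/preds, and invoked by
  ;; asserting :db/ensure in the transaction:
  ;; [[:db/add account-id :db/ensure :account/validate]]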

[1] https://docs.datomic.com/transactions/transactions.html#tran...

[2] https://docs.datomic.com/transactions/transaction-functions....

lgrapenthin
1 replies
18h49m

Somewhat unrelated, re: the docs: it appears that "Query" opens a dead link.

JB024066
0 replies
4h45m

Thanks for the report! Just fixed the link.

Voultapher
0 replies
20h39m

The man, the myth, the legend himself. I haven't ceased to be awed by how often the relevant person shows up in the HN comment section.

Loved your talks.

puredanger
3 replies
21h15m

Datomic transactions are not “operations to perform”, they are a set of novel facts to incorporate at a point in time.

Just as a git commit describes a set of modifications: do you, or should you, care about the order in which the adds, updates, and deletes occur within a single git commit? OMG no, that sounds awful.

The really unusual thing is that developers expect intra-transaction ordering, and accept it, from every other database. OMG, that sounds awful; how do you live like that?

cdchn
1 replies
18h12m

Do developers not expect intra-transaction ordering from within a transaction?

kccqzy
0 replies
17h7m

It depends on the previous experience of said developers, and such expectation varies widely.

voganmother42
0 replies
20h12m

Nested transactions or savepoints also exist in other systems.

aphyr
9 replies
23h6m

Yeah, this basically boils down to "a potential pitfall, but consistent with documentation, and working as designed". Whether this actually matters depends on whether users are writing transaction functions which are intended to preserve some invariant, but would only do so if executed sequentially, rather than concurrently.

Datomic's position (and Datomic, please chime in here!) is that users simply do not write transaction functions like this very often. This is defensible: the docs did explicitly state that transaction functions observe the start-of-transaction state, not one another! On the other hand, there was also language in the docs that suggested transaction functions could be used to preserve invariants: "[txn fns] can atomically analyze and transform database values. You can use them to ensure atomic read-modify-update processing, and integrity constraints...". That language, combined with the fact that basically every other Serializable DB uses sequential intra-transaction semantics, is why I devoted so much attention to this issue in the report.

It's a complex question and I don't have a clear-cut answer! I'd love to hear what the general DB community and Datomic users in particular make of these semantics.
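
To make the pitfall concrete, here is a sketch of the unsafe pattern (the :my/withdraw function and the attribute names are hypothetical; assumes datomic.api required as d and an open connection conn):

  ;; A transaction function that checks a precondition against the
  ;; before db, then asserts the new value. Installed as an entity:
  ;; {:db/ident :my/withdraw :db/fn withdraw}
  (def withdraw
    #db/fn {:lang "clojure"
            :params [db account amount]
            :code (let [bal (:account/balance
                              (datomic.api/entity db account))]
                    (assert (<= amount bal) "insufficient funds")
                    [[:db/add account :account/balance (- bal amount)]])})

  ;; Both calls see the same before db (balance 100), so both checks pass:
  (d/transact conn [[:my/withdraw account-id 80]
                    [:my/withdraw account-id 80]])
  ;; Each expands to [:db/add account-id :account/balance 20]; identical
  ;; datoms don't conflict, the transaction commits, and 160 has been
  ;; "withdrawn" against a balance of 100.

Note that if the two amounts differed, the expanded datoms would conflict and the pre-commit check described above would reject the transaction.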

refset
5 replies
19h39m

I don't know whether it was intentional or not, but IIRC DataScript opted for sequential intra-transaction semantics instead.

stuarthalloway
2 replies
5h35m

It is worth noting here that Datomic's intra-transaction semantics are not a decision made in isolation; they emerge naturally from the information model.

Everything in a Datomic transaction happens atomically at a single point in time. Datomic transactions are totally ordered, and this ordering is visible via the time t shared by every datom in the transaction. These properties vastly simplify reasoning about time.

With this information model, intermediate database states are inexpressible. Intermediate states cannot all have the same t, because they did not happen at the same time. And they cannot have different ts, as they are part of the same transaction.
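
A small sketch of what that looks like from the peer API (entity names illustrative; the tx id shown is made up):

  (require '[datomic.api :as d])

  (let [{:keys [tx-data]} @(d/transact conn
                             [[:db/add stu :favorite-number 42]
                              [:db/add stu :greeting "hello"]])]
    (distinct (map :tx tx-data)))
  ;; => (13194139534340) -- a single tx entity, hence a single t, shared
  ;; by every datom in the transaction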

refset
1 replies
3h35m

Thank you for the explanations. Do you happen to know why transactions ("transaction requests") are represented as lists and not sets?

stuarthalloway
0 replies
2h54m

When we designed Datomic (circa 2010), we were concerned that many languages had better support for lists than for sets, in particular having list literals but no set literals.

Clojure of course had set literals from the beginning...

huahaiy
0 replies
17h5m

Correct. I don't know about DataScript's intention, but it is intentional for Datalevin, as we have tests for sequential intra-transaction semantics.

aaroniba
0 replies
33m

Yes. Perhaps this is a performance choice for DataScript, since DataScript does not keep a complete transaction history the way Datomic does? I would guess this helps DataScript process transactions faster. There is a GitHub issue about it here: https://github.com/tonsky/datascript/issues/366

nickpeterson
2 replies
22h57m

I feel like “enough rope to shoot yourself” is kind of baked into any high power, low ceremony tool.

stuarthalloway
1 replies
21h24m

As a proponent of just such tools I would say also that "enough rope to shoot(?) yourself" is inherent in tools powerful enough to get anything done, and is not a tradeoff encountered only when reaching for high power or low ceremony.

nickpeterson
0 replies
6h17m

I always loved the broken phrase because it implies something really went terribly wrong ;)

aaroniba
2 replies
13h17m

I think the article answers your question at the end of section 3.1:

"This behavior may be surprising, but it is generally consistent with Datomic’s documentation. Nubank does not intend to alter this behavior, and we do not consider it a bug."

When you say "situations leading to invariant violations" -- that sounds like some kind of bug in Datomic, which this is not. One just has to understand how Datomic processes transactions, and code accordingly.

I am unaffiliated with Nubank, but in my experience using Datomic as a general-purpose database, I have not encountered a situation where this was a problem.

aphyr
1 replies
12h30m

This is good to hear! Nubank has also argued that in their extensive use of Datomic, this kind of issue doesn't really show up. They suggest custom transaction functions are infrequently written, not often composed, and don't usually perform the kind of precondition validation that would lead to this sort of mistake.

aaroniba
0 replies
43m

Yeah, I've used transaction functions a few times but never had a case where two transaction functions within the same d/transact call interacted with each other. If I did encounter that case, I would probably just write one new transaction function to handle it.

SoftTalker
0 replies
22h36m

Sounds similar to the need to know that in some relational databases, you need to SELECT ... FOR UPDATE if you intend to perform an update that depends on the values you just selected.
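
The closest built-in Datomic analogue is probably the :db/cas (compare-and-swap) transaction function, which rejects the transaction if the value changed since you read it (entity and attribute names here are hypothetical; assumes datomic.api required as d and an open connection conn):

  ;; Commits only if :account/balance is still 100; otherwise the whole
  ;; transaction aborts and can be retried with a fresh read.
  (d/transact conn [[:db/cas account-id :account/balance 100 90]])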

amluto
5 replies
15h25m

I wonder if Datomic’s model has room for something like an “extra-strict” transaction. Such a transaction would operate exactly like an ordinary transaction except that it would also check that no transaction element reads a value or predicate that is modified by a different element. This would be a bit like saying that each element would work like an independent transaction, submitted concurrently, in a more conventional serializable database (with predicate locking!), except that the transaction only commits if all the elements would commit successfully.

This would have some runtime cost and would limit the set of things one could accomplish in a transaction. But it would remove a footgun, and maybe this would be a good tradeoff for some users, especially if it could be disabled on a per-transaction basis.
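
A rough sketch of the proposed check, assuming per-element read and write sets of [entity attribute] pairs could be computed (hypothetical; not a Datomic API):

  (require '[clojure.set :as set])

  (defn extra-strict-ok?
    "True when no element reads an [e a] pair that another element writes."
    [elements] ; each element: {:reads #{[e a] ...} :writes #{[e a] ...}}
    (every? (fn [i]
              (let [others (concat (take i elements)
                                   (drop (inc i) elements))]
                (empty? (set/intersection
                          (:reads (nth elements i))
                          (apply set/union #{} (map :writes others))))))
            (range (count elements))))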

lgrapenthin
4 replies
14h53m

I wouldn't use it. The footgun is imaginary. I've used Datomic for ten years and I can assure you that I've never stepped on it. As a Datomic user you see transactions as clean small diffs, not as complicated multi-step processes. This is actually much more pleasant to work with.

amluto
2 replies
14h26m

Now I’m curious: what’s a useful example of a Datomic transaction that reads a value in multiple of its elements and modifies it?

lgrapenthin
0 replies
13h31m

You could include two transaction functions that constrain a transaction to different properties of the same fact, and then alter that fact. I don't know of a practical use case, nor have I ever encountered one; it would be extremely rare IME.

hlship
0 replies
12h18m

In traditional databases, only the database engine has a scalable view of the data - that’s why you send SQL to it and stream back the response data set. With Datomic, the peer has the same level of read access as the transactor; it’s like the database comes to you.

In this read and update scenario, the peer will, at its leisure, read existing data and put together update data; some careful use of compare-and-set, or a custom transaction function, can ensure that the database has not changed between reads and writes in such a way that the update is improper, when that is even a possibility - a rarity.

At scale, you want to minimize the amount of work the transactor must perform, since it is so aggressively single-threaded. Offloading work to the peer is amazingly effective.

aphyr
0 replies
12h23m

This is also good to hear! I'm not sure whether I'd call it a "footgun" per se--that's really an empirical question about how Datomic's users understand its model. I can say that as someone with some database experience and a few weeks of reading the Datomic docs, this issue actually "broke" several of the tests I wrote for Datomic. It was especially tricky because the transactions mostly worked as expected, but would occasionally "lose updates" or cause updates intended for one entity to wind up assigned to another.

Things looked fine in my manual testing, but when I ran the full test suite Elle kept catching what looked like serious Serializability violations. Took me quite a while to figure out I was holding the database wrong!

luc4sdreyer
3 replies
11h57m

I'm a bit worried that most of the links on https://www.datomic.com/ are broken.

JB024066
0 replies
3h23m

I think we just fixed that one. Sorry for the hiccups!

koito17
3 replies
22h46m

This is the first time I've tried reading a Jepsen report in depth, but I really like the clear description of Datomic's intra-transaction behavior. I didn't realize how little I understood the difference between Datomic's transactions and those of SQL databases.

One thing that stands out to me is this paragraph

  Datomic used to refer to the data structure passed to d/transact as a “transaction”, and to its elements as “statements” or “operations”. Going forward, Datomic intends to refer to this structure as a “transaction request”, and to its elements as “data”.

What does this mean for d/transact-async and related functionality from the datomic.api namespace? I haven't used Datomic in nearly a year. A lot seems to have changed.

stuarthalloway
2 replies
20h50m

Datomic software needed no changes as a result of Jepsen testing. All functionality in datomic.api is unchanged.

klysm
1 replies
17h50m

Congrats, that is a rare outcome!

aphyr
0 replies
12h35m

Yeah, I think this is next to Zookeeper as one of the most positive Jepsen reports. :-)

jwr
3 replies
9h34m

This is a fantastic detailed report about a really good database. I'm also really happy to see the documentation being clarified and updated.

As a side note: I so wish Apple would pay for a Jepsen analysis of FoundationDB. I know Aphyr said that "their tests are likely better", but if indeed Jepsen caught no problems in FoundationDB, it would be a strong data point for another really good database.

mdaniel
2 replies
3h4m

I would never, ever want to take food out of aphyr's mouth, but is there something specific that makes creating the Jepsen tests either somehow out of reach of a sufficiently motivated contributor, or so prohibitively expensive that a "gofundme-ish" setup wouldn't get it done?

I am (perhaps obviously?) not well-versed enough in that space to know, but when I see "wish $foo would pay for" my ears perk up, because there is so much available capital sloshing around, and waiting on Apple to do something is (in my experience) a long wait.

SOLAR_FIELDS
1 replies
2h45m

I have heard from people who paid for a Jepsen test that he is eye-wateringly expensive (and absolutely, rightfully should be; there are very few people in the world who can conduct analyses on this level), but maybe achievable with a gofundme.

I am not sure, for the same reason, that designing a DIY Jepsen suite correctly is really achievable for the vast majority of people. Distributed systems are very hard to get right, which means that testing them is very hard to get right as well.

PeterCorless
0 replies
15m

He provides a good and unique service. He's worth every penny. Note that for some companies, the real "expense" is dedicating engineering hours to fix the shit he lit on fire in your code.

thom
2 replies
21h55m

I’ve not really spent much time with Datomic in anger because it’s super weird, but is any of this surprising? Datomic transactions are basically just batches, and I always thought it was single-threaded, so obviously it doesn’t have a lot of race conditions. It’s slow and safe by design.

rtpg
1 replies
10h40m

Well the example of "incrementing x twice in the same transaction leads to x+1, not x+2" seems pretty important! I imagine you gotta be quite careful!

stuarthalloway
0 replies
7h23m

What does the following expression return?

(let [x 1] [(inc x) (inc x)])

In Clojure the answer is [2 2]. A beginner might guess [2 2] or [2 3]. Both are reasonable guesses, so a beginner needs to be quite careful!

But that isn't particularly interesting, because beginners always have to be quite careful. When you are learning any technology, you are a beginner once and experienced ever after. Tool design should optimize for the experienced practitioner. Immutability removes an enormous source of complexity from programs, so where it is feasible it is often desirable.
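
The Datomic analogue of the example above, sketched with a hypothetical :my/inc transaction function, behaves the same way: both elements read the same before db (assumes datomic.api required as d and an open connection conn).

  ;; Suppose :my/inc reads the current value and asserts (inc value).
  (d/transact conn [[:my/inc stu :favorite-number]
                    [:my/inc stu :favorite-number]])
  ;; With x = 1, each expands to [:db/add stu :favorite-number 2]; the
  ;; identical datoms don't conflict, so the result is 2 -- x+1, not x+2.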

poidos
1 replies
15h53m

Really nice work as always. I love reading these to learn more about these systems, for little tidbits of writing Clojure programs, and for the writing style. Thanks for what you do!

aphyr
0 replies
12h36m

Thank you!

khalidx
1 replies
2h3m

Oh, boy, have I been waiting for this one! I've been building my own datomic-like datastore recently and this is going to be useful. Reading it now.

I enjoyed the MongoDB analyses. Make sure to check those out too, as well as the ones for Redis, RethinkDB, and others.

Would be great if there was an analysis done for rqlite/dqlite or turso/libsql at some point in the future.

bfors
1 replies
3h19m

For those who aren't aware, the name Jepsen is a play on Carly Rae Jepsen, the singer behind "Call Me Maybe". In my opinion, a perfect name for a distributed systems research effort.

fulafel
0 replies
3h27m

The data model in Datomic is pretty intuitive if you're familiar with triple stores / RDF. But these similarities aren't very often referenced by the docs or online discussions. Is it because people are rarely familiar with those concepts, or is the association with semantic web things considered potentially distracting (or am I missing something and there are major fundamental differences)?

baq
0 replies
23h0m

aphyr you bastard I've got work to do today.

CrazyPyroLinux
0 replies
22h29m

aphyr has given some conference talks on previous analyses (available on YouTube) that are informative and entertaining.