
Is something bugging you?

jimbokun
38 replies
3h32m

> The biggest effect was that it gave our tiny engineering team the productivity of a team 50x its size.

I feel like the idea of the legendary "10x" developer has been bastardized to just mean workers who work 15 hours a day 6.5 days a week to get something out the door until they burn out.

But here's your real 10x (or 50x) productivity. People who implement something very few people even considered or understood to be possible, which then gives amazing leverage to deliver working software in a fraction of the time.

giantg2
17 replies
2h15m

I'm tired of hearing about 10x engineers. I just want to be a good 1x engineer. Or good at anything in life, really.

JimDabell
12 replies
1h22m

The “10x engineer” comes from the observation that there is a 10x difference in productivity between the best and the worst engineers. By saying that you want to be a 1x engineer, you’re saying you want to be the least productive engineer possible. 1x is not the average, 1x is the worst.

mathgradthrow
5 replies
1h19m

the worst engineer certainly has negative productivity, so I'm not sure that your explanation can possibly be the correct one.

JimDabell
4 replies
1h17m

I’m explaining what the terms “10x” and “1x” mean, not asserting that the original observation is correct under all circumstances.

randomdata
1 replies
29m

Except you haven't explained it at all. Sackman, Erickson, and Grant found that some developers were able to complete what was effectively a programming contest in a 10th of the time of the slowest participants. This is the origin of the 10x developer idea.

You, on the other hand, are claiming that 10x engineers are 10 times more productive than the worst engineers. Completing a programming challenge in a 10th of the time is not the same as being 10 times more productive, and obviously your usage can't be an explanation, even as one you made up on the spot, as the math doesn't add up.

JimDabell
0 replies
8m

That was designed as a repeatable experiment, which seems entirely reasonable when you want to conduct a study. Why are you characterising that as “a programming contest”? That seems like an uncharitably distorted way of describing a study.

That study also does not exist in isolation:

https://www.construx.com/blog/the-origins-of-10x-how-valid-i...

mathgradthrow
1 replies
1h9m

i believe the original was for an entire organization's performance, and was also done in 1977. Since they are averages, it makes "sense" to conclude that the best of a good team is 10x better than the average of the worst team. Not really what the experiment concludes, but what can you do.

JimDabell
0 replies
1h5m

The first was 1968, but there have been more studies since.

https://www.construx.com/blog/the-origins-of-10x-how-valid-i...

randomdata
2 replies
1h7m

I'm not sure your math works.

What we do know is that the worst engineers provide negative productivity. If 1x is the worst engineer, then let's for the sake of discussion denote x as -1 in order for the product to be negative. Except that means the 10x engineer provides -10 productivity, actually making them the worst engineer. Therein lies a conflict.

What we also know is that best engineer has positive productivity, so that means the multiplicand must always be positive. Which means that it is the multiplier that must go negative, meaning that a -1x and maybe even a -10x engineer exists.

moritzwarhier
0 replies
25m

Thank you. This sounds so trivial at first, but your reductio ad absurdum at the beginning of your comment really nails it.

Throw into the mix the fact that productivity is hard to measure as soon as more than one person works on something, and that doesn't even begin to consider the economic aspects of software.

And even when ignoring this point, there's that pesky short-term vs long-term thing.

Also, how do you define the term "productivity"? I was assuming that you mean something along the lines of (indirect, if employed) monetary output.

JimDabell
0 replies
53m

You are arguing against the idea that there is a factor of ten difference in productivity between the best and the worst engineers. That’s fine if you want to do that, but that’s explicitly where the term “10x engineer” comes from and what defines its meaning. So if you disagree with the underlying concept, there is no way for you to use terms like “[n]x engineer” coherently since you disagree with its most fundamental premise. You certainly shouldn’t reinvent different meanings for these terms.

SkyBelow
1 replies
28m

Even if this was the origin of the term, it still doesn't make sense, because the best engineers can solve problems the worst would never be able to solve at all. The difference between the best and worst is much more than 10x the worst. Maybe the worst who meets certain minimums at a company, but then the best would also be limited by those willing to work for what the company pays, and I hypothesize that the minimums of the lower bound and the maximums of the upper bound are correlated.

JimDabell
0 replies
5m

It sounds like you disagree with the concept of a 10x engineer then. In which case you should avoid using the term, rather than making up a new definition.

hattmall
0 replies
39m

Hmm, I never thought of it that way. I just heard "10x employees" and fit it to what I knew, which is that 90% of the work is accomplished by about 10% of workers. The other 90% really only get 10% done. So most developers are somewhere on a scale of 0.1-1, with 1 being a totally competent and good developer. The 10x people are just different though; it's like a pro athlete compared to a regular player. It's not unique to software development, though it may stand out and be sought after more. I've noticed it in pretty much every industry. Some people are just able to achieve flow state in their work and be vastly more productive than others, be it writing code or laying sod. I don't find that there's a lot in between 1 and 10, though.

tnel77
0 replies
1h22m

It depends on the day if I feel like a 2x or a 0.1x engineer. Keep at it. You are not alone!

loeg
0 replies
1h19m

Spend less time on HN and you might get more done.

datameta
0 replies
2h9m

The truest 10x engineer I ever encountered was a memory firmware guy with ASIC experience who absolutely made sure to log off at 5 every day after really putting in the work. Go-to guy for all parts of the codebase, even the parts he didn't expressly touch.

Xeyz0r
0 replies
1h2m

You took the words right out of my mouth

FirmwareBurner
8 replies
3h25m

Your definition is also vague. Someone still needs to do the legwork. One man armies who can do everything themselves don't really fit in standardized teams where everything is compartmentalized and work divided and spread out.

They work best on their own projects with nobody else in their way: no colleagues, no managers. But that's not most jobs. Once you're part of a team, you can't do too much work yourself no matter how good you are, as inevitably the slower/weaker team members will slow you down while you deal with the issues they introduce into the project, or the issues from management. Every team moves at the speed of its lowest common denominator, no matter its rockstars.

jollyllama
7 replies
3h10m

That rings true and is probably why the 10x engineers I have seen usually work on devops or modify the framework the other devs are using in some way. For example, an engineer who speeds up a build or test suite by an order of magnitude is easily a 10x engineer in most organizations, in terms of man hours saved.

FirmwareBurner
6 replies
2h48m

> For example, an engineer who speeds up a build or test suite by an order of magnitude is easily a 10x engineer in most organizations, in terms of man hours saved.

Yeah, but this isn't something scalable that can happen regularly as part of your job description. Most jobs/companies don't have so many low-hanging fruits to pick that someone can speed up builds by orders of magnitude on a weekly basis. It's usually a one-time thing. And one-time things don't usually make you a 10x dev. Maybe you just got lucky once and saw something others missed.

And oftentimes at big places most people know where the low-hanging fruit is and can fix it, but management, release schedules, and tech debt are perpetually in the way.

IMHO what makes you a 10x dev is that you always know how to unblock people no matter the issue, so that the project is constantly smooth sailing, not chasing order-of-magnitude improvement unicorns.

vdqtp3
2 replies
1h32m

> Like most jobs/companies don't have so many low hanging fruits to pick that someone can speed up builds by orders of magnitude on a weekly basis

You and I have worked at very different organizations. Everywhere I've been has had insane levels of inefficiency in literally every process.

ejb999
0 replies
1h0m

same here - it is especially bad in huge companies, the inefficiencies and waste are legendary.

FirmwareBurner
0 replies
43m

>insane levels of inefficiency in literally every process.

In processes, yes, not in code, and solo 10x devs alone can't fix broken processes, as those are the effect of broken management and engineering culture.

People know where the inefficiencies are, but management doesn't care.

tranceylc
1 replies
2h40m

Does anyone else feel like people follow these sorts of industry pop-culture terms a bit too intensely? What I mean is that the existence of the term tends to bring out people trying to figure out who that might be, as if it has to be 100% true.

I personally think that some people can provide "10x" (arbitrary) the value on occasion, like the low-hanging fruit you mentioned. I also believe some people are slightly more skilled than others and get more results out of their work. That said, there are so many ways for somebody to have an impact that doesn't have to be immediate that I find the term itself too prevalent.

lukan
0 replies
16m

"Does anyone else feel like people follow these sort of industry pop-culture terms a bit too intensely? "

Agreed, there is too much effort going into the "superstars" theme, but there are definitely people who get 10x done in the same time as others.

jollyllama
0 replies
2h38m

It really does depend on where you work. The order of magnitude improvements I'm describing involved interdisciplinary expertise involving both bespoke distributed build systems and assembly language. They're not unicorns, they do exist, but they are very rare and most engineers just aren't going to be able to find them, even with infinite time. Hence why a 10x engineer is so valuable and not everyone can be one. I myself am certainly not one, in most contexts.

throwitaway222
2 replies
1h37m

No one reading this during the hours of 9-5 is a 10x.

randomdata
1 replies
1h33m

Or is. If a 1x puts in an 8 hour day, a 10x only has to put in a 48 minute day. That leaves plenty of time to read this.

simmerup
0 replies
1m

That’s a bad take because you’re assuming that developer is capable of replicating that * 10

mettamage
2 replies
39m

When I was in college, I met a few people who coded _a lot_ faster than me. Typically, they started when they were 12 instead of 21 (like me). That's how 10x engineers exist: by the time they are 30, they have roughly 20 years of programming experience under their belt instead of 10.

Also, their professional experience is much greater. Sure, their initial jobs at 15 are the occasional weird gig for an uncle/aunt or cousin/nephew, but they get picked up by professional firms at 18 and work a job alongside their CS studies.

At least, that's how it used to be. Not sure if this is still happening due to the new job environment, but this was the reality from around 2004 to 2018.

For 10x engineers to exist, all it takes is a few examples. Everyone seems to be in agreement that they are rare. Let me point to a public 10x engineer. He'd never say it himself, but my guess is that this person is a 10x engineer [1].

If you disagree, I'm curious how you'd disagree. I'm just a blind man touching a part of the elephant [2]. I do not claim to see the whole picture.

[1] https://bellard.org/ (the person who created JSLinux)

[2] https://en.wikipedia.org/wiki/Blind_men_and_an_elephant - if you don't know the parable, it's a fun one!

QuercusMax
1 replies
19m

Yup, that's been my experience as someone who asked for a C++ compiler for my 12th birthday, worked on a bunch of random websites and webapps for friends of the family, and spent some time at age 16-17 running a Beowulf cluster and attempting to help postdocs port their code to run on MPI (with mixed success). All thru my CS education I was contributing (as much as I could) to OSS, reading lots of stuff on best practices, and leaning on my much older (12 years) brother who was working in the industry. He pointed me to Java and IntelliJ, told me to read Design Patterns (Gang of Four) and Refactoring (Fowler). I read Joel on Software religiously, even though he was a Microsoft guy and I was a hardcore Linux-head.

By the time I joined my first real company at age 21, I was ready to start putting a lot of this stuff into place. I joined a small med device software company which had a great product but really no strong software engineering culture: zero unit tests, using CVS with no branches, release builds were done manually on the COO's workstation, etc.

As literally the most junior person in the company I worked through all these things and convinced my much more senior colleagues that we should start using release branches instead of "hey everybody, please don't check in any new code until we get this release out the door". I wrote automated build scripts mostly for my own benefit, until the COO realized that he didn't have to worry about keeping a dev environment on his machine, now that he didn't code any more. I wrote a junit-inspired unit testing framework for the language we were using (https://en.wikipedia.org/wiki/IDL_(programming_language) - like Matlab but weirder).

Without my work as a "10x junior engineer", the company would have been unable to scale to more than 3 or 4 developers. I got involved in hiring and made sure we were hiring people who were on board with writing tests. We finally turned into a "real" software company 2 or 3 years after I joined.

mettamage
0 replies
13m

This sounds similar to the best programmer I personally know; at the time he was an intern working on LLVM. It's funny how companies treat that part of his life as "no experience". Then suddenly he goes into the HFT space and within a couple of years he has a rank similar to that of people twice his age.

10x engineers exist. To be fair, it does depend on which software engineer you see as "the standard software engineer", but if I take myself as the standard (an employed software engineer with 5 years of experience), then 10x software engineers exist.

didgetmaster
2 replies
1h39m

It seems like the industry would get a lot more 10x behavior if it were recognized and rewarded more often than it currently is. Too often, management will focus more on the guy who works 12-hour days to accomplish 8 hours of real work than on the guy who gets the same thing accomplished in an 8-hour day. Also, deviations from 'normal' are frowned upon. Taking time to improve the process isn't built into the schedule, so taking time to build a wheelbarrow is discouraged when they think you could be hauling buckets faster instead.

happytiger
0 replies
49m

That’s because most executives can’t understand technology deeply enough to know the difference.

Terretta
0 replies
1h13m

It's almost impossible to get executives to think in return on equity (“RoE”) for the future instead of “costs” measured in dollars and cents last quarter.

Which is weird, since so many executives are working in a VC-funded environment, and internal work should be “venture funded” as well.

andrei_says_
1 replies
1h13m

On my team, one of the main multipliers is understanding the need behind the requested implementation, and proposing alternative solutions - minimizing or avoiding code changes altogether. It helps that we work on internal tooling and are very close to the process and stakeholders.

"Hmmm, there's another way to accomplish this" being the 10x. Doing things faster is not it.

switch007
0 replies
1h3m

Exactly this. It’s why it’s so frustrating when product managers who think they’re above giving background run the show (the ones who think they’re your manager and are therefore too important to share that with you)

kretaceous
17 replies
3h54m

This might be the best introduction post I've read.

Lays the foundation (get it?) for who the people are and what they've built.

Then explains how the current thing they are building is a result of the previous thing. It feels that they actually want this problem solved for everyone because they have experienced how good the solution feels.

Then tells us about the teams (pretty big names with complex systems) that have already used it.

All of these wrapped in good writing that appeals to developers/founders. Landing page is great too!

foobarqux
14 replies
2h45m

Except it doesn't actually explain what it does: is it fuzzing? Do you supply your own test cases? Is it testing hardware non-determinism?

wwilson
11 replies
2h7m

Post author here. Sorry it was vague, but there's only so much detail you can go into in a blog post aimed at general audiences. Our documentation (https://antithesis.com/docs/) has a lot more info.

Here's my attempt at a more complete answer: think of the story of the blind men and the elephant. There's a thing, called fuzzing, invented by security researchers. There's a thing, called property-based testing, invented by functional programmers. There's a thing, called network simulation, invented by distributed systems people. There's a thing, called rare-event simulation, invented by physicists (!). But if you squint, all of these things are really the same kind of thing, which we call "autonomous testing". It's where you express high-level properties of your system, and have the computer do the grunt work to see if they're true. Antithesis is our attempt to take the best ideas from each of these fields, and turn them into something really usable for the vast majority of software.

We believe the two fundamental problems preventing widespread adoption of autonomous testing are: (1) most software is non-deterministic, but non-determinism breaks the core feedback loop that guides things like coverage-guided fuzzing. (2) the state space you're searching is inconceivably vast, and the search problem in full generality is insolubly hard. Antithesis tries to address both of these problems.

So... is it fuzzing? Sort of, except you can apply it to whole interacting networked systems, not just standalone parsers and libraries. Is it property-based testing? Sort of, except you can express properties that require a "global" view of the entire state space traversed by the system, which could never be locally asserted in code. Is it fault injection or chaos testing? Sort of, except that it can use the techniques of coverage guided fuzzing to get deep into the nooks and crannies of your software, and determinism to ensure that every bug is replayable, no matter how weird it is.
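The "sort of fuzzing, sort of property-based testing" family described above can be sketched in a few lines. This is a hypothetical illustration, not Antithesis's actual mechanism: a seeded random-input loop checks one property of a deliberately buggy sort, and the fixed seed is exactly the kind of determinism that makes every failure replayable. All names here are made up.

```python
import random

def buggy_sort(xs):
    # Deliberately buggy: silently drops duplicates by round-tripping
    # through a set.
    return sorted(set(xs))

def sort_property_holds(sort_fn, xs):
    # Property: the output equals the input sorted
    # (same elements, ascending order).
    return sort_fn(xs) == sorted(xs)

def fuzz(sort_fn, runs=1000, seed=42):
    # A fixed seed makes every failure replayable: rerunning with the
    # same seed regenerates the exact failing input, instead of a bug
    # report that says "it crashed once, sometime, somehow".
    rng = random.Random(seed)
    for _ in range(runs):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 8))]
        if not sort_property_holds(sort_fn, xs):
            return xs  # first counterexample found
    return None

counterexample = fuzz(buggy_sort)
```

Any counterexample found necessarily contains a duplicate (that is the only input on which `buggy_sort` disagrees with `sorted`), and calling `fuzz` again with the same seed returns the identical input, which is the replayability the comment is describing.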

It's hard to explain, because it's hard to wrap your arms around the whole thing. But our other big goal is to make all of this easy to understand and easy to use. In some ways, that's proved to be even harder than the very hard technological problems we've faced. But we're excited and up for it, and we think the payoff could be big for our whole industry.

Your feedback about what's explained well and what's explained poorly is an important signal for us in this third very hard task. Please keep giving it to us!

randomdata
4 replies
1h20m

> most software is non-deterministic

Doesn't Antithesis rely on the fact that software is always deterministic? Reproducibility appears to be its top selling feature – something that wouldn't be possible if software were non-deterministic.

wwilson
3 replies
1h9m

We can force any* software to be deterministic.

* Offer only good for x86-64 software that runs on Linux whose dependencies you can install locally or mock. The first two restrictions we will probably relax someday.

randomdata
2 replies
59m

Aren't you just 'forcing' determinism in the inputs, relying on the software to be always deterministic for the same inputs?

wwilson
1 replies
49m

Nope. We’re emulating a deterministic computer, so your software can’t act nondeterministically if it tries.

randomdata
0 replies
45m

Right, by emulating a deterministic computer you can ensure that the inputs to the software are always deterministic – something traditional computing environments are unable to offer for various reasons.

However, if we pretend that software was somehow able to be non-deterministic, it would be able to evade your deterministic computer. But since software is always deterministic, you just have to guarantee determinism in the inputs.

kodablah
2 replies
43m

Has any thought been given to repurposing this deterministic computer for more than just autonomous testing/fuzzing? For example, given an ability to record/snapshot the state, resumable software (i.e. durable execution)?

wwilson
1 replies
36m

Somebody once suggested to me that this could be very handy for the reproducible builds folks. I'm sure that now that we're out in the open, lots of people will suggest great applications for it.

Disclosure: Antithesis co-founder.

cperciva
0 replies
14m

My favourite application for "deterministic computer" is creating a cluster in order to have a virtual machine which is resilient to hardware failure. Potentially even "this VM will keep running even if an entire AWS region goes down" (although that would add significant latency).

jldugger
1 replies
1h26m

I remember watching the Strange Loop video on your testing strategy, and now I need to go back and relearn how it differed from model checking (ie Promela or TLA+). Model checking is probably the big QA story that tech companies ignore because it requires dramatically more education, especially from QA departments typically seen as "inferior" to SWE.

rhodin
0 replies
1h16m

crdrost
0 replies
1h4m

This vaguely reminds me of Jefferson's "Virtual Time" paper from 1985[1]. The underlying idea at the time didn't really take off because it required, like Zookeeper, a greenfield project: except that it kinda doesn't and today you could imagine instrumenting an entire Linux syscall table and letting any Linux container become a virtual time system -- but Linux didn't exist in 1985 and wouldn't be standard until much later.

So Jefferson just says: let's take your I/O-ful process, split it into a message-passing actor model, and monitor all the messages going in and coming out. The messages coming out won't necessarily do what they're supposed to do yet; they'll just be recorded with a plus sign and a virtual timestamp, and by assumption eventually you'll block on some response. So we have a bunch of recorded message timestamps coming in, and we have your recorded messages going out.

Well, there's a problem here, which is that if we have multiple actors we may discover that their timestamps have traveled out-of-order. You sent some message at t=532 but someone actually sent you a message at t=231 that you might have selected instead of whatever you actually selected to send the t=532 message. (For instance in the OS case, they might have literally sent a SIGKILL to your process and you might not have sent anything after that.) That's what the plus sign is for, indirectly: we can restart your process from either a known synchronization state or else from the very beginning, we know all of its inputs during its first run so we have "determinized" it up past t=231 to see what it does now. Now, it sends a new message at say t=373. So we use the opposite of +, the minus sign, to send to all the other processes the "undo" message for their t=532 message, this removes it from their message buffer: that will never be sent to them. And if they haven't hit that timestamp in their personal processing yet, no further action is needed, otherwise we need to roll them back too. Doing so you determinize the whole networked cluster.

The only other really modern implementation of these older ideas that I remember seeing was Haxl[2], a Haskell library which does something similar but rather than using a virtual time coordinate, it just uses a process-local cache: when you request any I/O, it first fetches from the cache if possible and then if that's not possible it goes out, fetches the data, and then caches it. As a result you can just offer someone a pre-populated cache which, with these recorded inputs, will regenerate the offending stack trace deterministically.

1: https://dl.acm.org/doi/10.1145/3916.3988

2: https://github.com/facebook/Haxl
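The Haxl-style idea above (a process-local cache that makes a recorded run replayable) can be sketched with a toy stdlib-only example. All names here are invented for illustration; Haxl itself is a Haskell library and works differently in the details. Every effectful call goes through a cache keyed by operation and argument, so handing someone the recorded cache regenerates the computation deterministically with no real I/O.

```python
import json

class RecordingCache:
    """Record I/O results on first run; replay them deterministically later.

    While recording, cache misses perform the real side effect and store
    the result. While replaying from a recorded cache, a miss is an error,
    because replay must never touch the outside world.
    """
    def __init__(self, recorded=None):
        self.cache = dict(recorded or {})
        self.replaying = recorded is not None

    def fetch(self, op, arg, do_io):
        key = json.dumps([op, arg])  # stable key for (operation, argument)
        if key in self.cache:
            return self.cache[key]
        if self.replaying:
            raise KeyError(f"uncached I/O during replay: {key}")
        result = do_io(arg)  # real side effect happens only while recording
        self.cache[key] = result
        return result

# Recording run: the lambda stands in for a real network call.
live = RecordingCache()
answer = live.fetch("user_name", 7, lambda uid: f"user-{uid}")

# Replay run: same computation, no real I/O, same answer.
replayed = RecordingCache(recorded=live.cache)
assert replayed.fetch("user_name", 7, lambda uid: "NEVER CALLED") == answer
```

The replay run returns the recorded result without ever invoking its `do_io` argument, which is how a pre-populated cache can regenerate an offending stack trace deterministically.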

kretaceous
0 replies
58m

Sure, it doesn't go into details. And that is exactly why I termed it an excellent introduction and a sales pitch.

I hadn't heard of deterministic testing before. Nor had I heard of FoundationDB or the related things. And I went from knowing zero things about them to being impressed and interested. This led me to go into their docs, blog, landing page, etc. to learn more.

Aeolun
0 replies
2h21m

Yeah. I could figure out the global idea, but then the mechanics of how it would actually work were very sparse.

k__
0 replies
41m

Did you read a different article than me?

The linked article is 3/4 history and rationale before it actually tells you what they built.

It's like those pesky recipe blogs that tell you about the author's childhood when you just want to make vegan pancakes.

getoffmycase
0 replies
3h23m

The entire testing system they describe feels like something I can strive toward, too. They make you want their solution because it offers a way of life and thinking and doing like you've never experienced before.

benrutter
13 replies
2h6m

This is a great pitch, and I don't want to come across as negative, but I feel like a statement like "we found all bugs" can only be true with a very narrow definition of bug.

The most pernicious, hard-to-find bugs that I've come across have all been around the business logic of an application, rather than it hitting an error state. I'm thinking of the category where you have something like "a database is currently reporting a completed transaction against a customer, but no completed purchase item; how should it be displayed on the customer's recent transactions page?". Implementing something where "a thing will appear and not crash" in those cases is one thing, but making sure that it actually makes sense as a choice, given all the context of everyone else's choices everywhere else in the stack, is a lot harder.

Or to take a database, something along the lines of "our query planner produces a really suboptimal plan in this edge-case".

Neither of those types of problems could ever be automatically detected, because they aren't issues of the program reaching an error state; the issue is figuring out in the first place what "correct" actually is for your application.

Maybe I'm setting the bar too high for what a "bug" is, but I guess my point is, it's one thing to fantasize about having zero bugs; it's another to build software in the real world. I'd probably still settle for 0 runtime errors, though, to be fair...

adamauckland
5 replies
2h3m

I consider a "bug" to be "it was supposed to do something and failed".

Issues around business logic are not failures of the system, the system worked to spec, the spec was not comprehensive enough and now we iterate.

rkangel
1 replies
1h37m

Systems Engineering has terminology for this distinction.

Verification is "does this thing do what I asked it to do".

Validation is "did I ask it to do the right thing".

crashabr
0 replies
16m

Tangentially related, but I've recently started distinguishing verification and validation in my data cleaning work:

verification refers to "is this dataset clean?" or the more precise "does this dataset confirm my assumptions about what a correct dataset should be, given its focus?"

validation refers to "can it answer my questions?" or the more rigorous "can I test my hypotheses against this dataset?"

So I find it interesting (but in hindsight unsurprising) that similar definitions are used in other fields. Would you have a source for your definitions?

repelsteeltje
1 replies
1h42m

...And now we could probably start debating your narrow definition of "system". ;-)

pipo234
0 replies
1h30m

Most of the software I've built doesn't have "a spec", but let me zoom in on specs around streaming media. MPEG-DASH, CMAF, or even the ISO base media file format (ISO/IEC 14496-12) can at times be pretty vague. In practice, this frequently turns up in actual interoperability issues where it's pretty difficult to point out which of two products is according to spec and which one has a bug.

So yes, I totally agree with GP and would actually go further: a phrase like "we found all the bugs in the database" is nonsense and makes the article less credible.

Aachen
0 replies
1h40m

What do you call it when the spec is wrong? Like clearly actually wrong, such as when someone copied a paragraph from one CRUD-describing page to the next and forgot to change the word "thing1" to "thing2" in the delete description.

Because I'd call that a bug. A spec bug, but a bug. It's no feature request to make the code based on the newer page delete thing2 rather than thing1; it's fixing a defect.

moritonal
2 replies
2h3m

Good summary of the hard part of being a software developer that deals with clients.

Aachen
1 replies
1h37m

What software developer does not deal with clients (and makes a living)?

ejb999
0 replies
55m

Lots of software developers never deal with clients (clients as in the people who will actually use the software); most of them, in fact, in any of the big companies I have worked for anyway. And that is probably not a good thing.

I myself prefer to work with the people who will actually use what I build; you get a better product that way.

nlavezzo
1 replies
47m

I think the reference to "all the bugs" here is basically that our insanely brutal deterministic testing system was not finding any more bugs after hundreds of thousands of runs. Can't prove a negative, obviously, but the fact that we'd gotten to that "all green" status gave us a ton of confidence to push forward in feature development, believing we were building on something solid, which time has shown we were.

dap
0 replies
3m

Thanks -- that's very clarifying! But isn't this circular? The lack of bugs is used as evidence of the effectiveness of the testing approach, but the testing approach is validated by...not finding any more bugs in the software?

amw-zero
1 replies
1h45m

I do think that it was a mistake to use the word "all" and imply that there are absolutely no bugs in FoundationDB. However, FoundationDB is truly known as having advanced the state of the art for testing practices: https://apple.github.io/foundationdb/testing.html.

So in normal cases this would reek of someone being arrogant / overconfident, but here they really have gotten very close to zero bugs.

spinningD20
0 replies
38m

The other issue I would point out is that building a database, however impressive their quality, is still fundamentally different from an application or set of applications like a larger SaaS offering would involve (API, web, mobile, etc.). Like the difference between API and UI test strategies, where an API has much more clearly defined and standardized inputs and outputs.

To be clear, I am not saying that you can't define all inputs and outputs of a "complete SaaS product offering stack", because you likely could, though if it's already been built by someone that doesn't have these things in mind, then it's a different problem space to find bugs.

As someone who has spent the last 15 years championing quality strategy for companies and training folks of varying roles on how to properly assess risk, it does indeed feel like this has a more narrow definition of "bug", in the sort of way that a developer could try to claim that robust unit tests would catch "any" bugs, or even most of them. The types of risk to a software product's quality have a larger surface area than that level can cover.

vinnymac
9 replies
4h7m

I wonder if they are working on a time travel debugger. If it is truly deterministic presumably you could visit any point in time after a record is made and replay it.

wwilson
3 replies
3h56m

No comment. :-)

Disclosure: I am a co-founder of Antithesis.

bloopernova
2 replies
3h4m

It looks amazing, nice work!

Do you have any plans to let small open source teams use the project for free? Obviously you have bills to pay and your customers are happy to do that, but I was wondering if you'd allow open source projects access to your service once a week or something.

Partly because I want to play with this and I can't see my employer or client paying for it! But also it fits neatly into "DX", the Developer Experience, i.e. making the development cycle as friction free for devs as possible. I'm DevOps with a lifelong interest in UX, so DX is something I'm excited about.

wwilson
0 replies
2h54m

Pricing suitable for small teams, and perhaps even a free tier, is absolutely on the roadmap. We decided to build the "hard", security-obsessed version of the infrastructure first -- single-tenant, with dedicated and physically isolated hardware and networking for every customer. That means there's a bit of per-customer overhead that we have to recoup.

In the future, we will probably have a multi-tenant offering that's easier for open source projects to adopt. In the meantime, if your project is cool and would benefit from our testing, you can try to get our research team interested in using it as part of the curriculum that makes our platform smarter.

Disclosure: I'm an Antithesis co-founder.

nlavezzo
0 replies
2h53m

We've actually done quite a bit of testing on open source projects as we've built this, and have discussed doing an on-going program of testing open source projects that have interested contributors. We'd probably find some interesting things and could do some write-ups. Reach out to us via our contact page or contact@antithesis.com and let's chat.

_dain_
2 replies
3h39m

[I work at Antithesis]

The system can certainly revisit a previous simulated moment and replay it. And we have some pretty cool things using that capability as a primitive. Check out the probability chart in the bug report linked from the demo page: https://antithesis.com/product/demo

xbar
1 replies
2h58m

Now I want a simulation-run replay scrubbing slider MIDI-connected to my Pioneer DJ rig to scratch through our troublesome tests as my homies push patched containers.

Seriously: impressive product revelation.

wwilson
0 replies
2h4m

Let's do it.

rdtsc
0 replies
3h27m

That’s what rr-project does essentially?

ismailmaj
0 replies
3h31m

That's exactly what Tomorrow Corporation uses for their hand written game engine and compiler: https://www.youtube.com/watch?v=72y2EC5fkcE

Rygian
7 replies
4h56m

The writing is really enjoyable.

Programming in this state is like living life surrounded by a force field that protects you from all harm. [...] We deleted all of our dependencies (including Zookeeper) because they had bugs, and wrote our own Paxos implementation in very little time and it _had no bugs_.

Being able to make that statement and back it by evidence must be indeed a cool thing.

llm_trw
4 replies
4h34m

I have proved my code has no bugs according to the spec.

I do not make the claim my spec has no bugs.

coldtea
2 replies
3h23m

With formal proof systems, you can also claim that for your spec.

svieira
0 replies
3h16m

A formal proof is only as good as what-you-are-proving maps to what-you-intended-to-prove.

AlotOfReading
0 replies
1h33m

I've written formal proofs with bugs more than once. Reality is much messier than you can encode into any proof and there will ultimately be a boundary where the real systems you're trying to build can still have bugs.

Formal verification is incredibly, amazingly good if you achieve it, but it's not the same as "perfect".

yasuocidal
0 replies
3h50m

"Its not a bug, its a feature"

btrettel
1 replies
3h44m

The earliest that I've seen the attitude that one should eliminate dependencies because they have more bugs than internally written code was this book from 1995: https://store.doverpublications.com/products/9780486152936

pp. 65-66:

The longer I have computed, the less I seem to use Numerical Software Packages. In an ideal world this would be crazy; maybe it is even a little bit crazy today. But I've been bitten too often by bugs in those Packages. For me, it is simply too frustrating to be sidetracked while solving my own problem by the need to debug somebody else's software. So, except for linear algebra packages, I usually roll my own. It's inefficient, I suppose, but my nerves are calmer.

The most troubling aspect of using Numerical Software Packages, however, is not their occasional goofs, but rather the way the packages inevitably hide deficiencies in a problem's formulation. We can dump a set of equations into a solver and it will usually give back a solution without complaint - even if the equations are quite poorly conditioned or have an unsuspected singularity that is distorting the answers from physical reality. Or it may give us an alternative solution that we failed to anticipate. The package helps us ignore these possibilities - or even to detect their occurrence if the execution is buried inside a larger program. Given our capacity for error-blindness, software that actually hides our errors from us is a questionable form of progress.

And if we do detect suspicious behavior, we really can't dig into the package to find our troubles. We will simply have to reprogram the problem ourselves. We would have been better off doing so from the beginning - with a good chance that the immersion into the problem's reality would have dispelled the logical confusions before ever getting to the machine.

I suppose whether to do this depends on how rigorous one is, how rigorous certain dependencies are, and how much time one has. I'm not going to be writing my own database (too complicated, multiple well-tested options available) but if I only use a subset of the functionality of a smaller package that isn't tested well, rolling my own could make sense.

voidmain
0 replies
3h18m

In the specific case in question, the biggest problem was that dependencies like Zookeeper weren't compatible with our testing approach, so we couldn't do true end to end tests unless we replaced them. One of the nice things about Antithesis is that because our approach to deterministic simulation is at the whole system level, we can do it against real dependencies if you can install them.

I was a co-founder of both FoundationDB and Antithesis.

jwr
6 replies
3h13m

FoundationDB is an impressive achievement, quite possibly the only distributed database out there that lives up to its strict serializability claims (see https://jepsen.io/consistency/models/strict-serializable for a good definition). The way they wrote it is indeed very interesting and a tool that does this for other systems is immediately worth looking at.

mcmoor
2 replies
2h59m

Is it that good? I've been tasked with deploying it for some time and it always bit me in the ass for one reason or another. And I'm not the one who uses it, so I don't know if it's actually good. For now I much prefer redis.

jwr
0 replies
23m

It depends how you define "good". I care mostly about my distributed database being correct, living up to its consistency claims, and providing strict serializability.

(see also https://aphyr.com/posts/283-jepsen-redis)

I care much less about how easy it is to use or deploy, but "good" is a subjective term, so other people might see things differently.

foobiekr
0 replies
1h15m

It's great, but operationally there are lots of gotchas and little guidance.

We got bitten _hard_ in production when we accidentally allowed some of the nodes to get above 90% of the storage used. The whole database collapsed into a state where it could only do a few transactions a second. Then the ops team, thinking they were clever, doubled the size of the cluster in order to give it the resources it needed to get the average utilization down to 45%; this was an unforced error as that pushed the size of the cluster outside the fdb comfort zone (120 nodes) which is itself a problem. The deed was done though and pulling nodes was not possible in this state, so slowly, slooooowly... things got fixed.

We ended up spending an entire weekend slowly, slowly getting things back into a good place. We did not lose data, but basically prod was down for the duration, and we found it necessary to _manually_ evict the full nodes one at a time over the period.

Now, this was a few years ago, and fdb has performed wickedly fast, with utter, total reliability before that and since, and to this day the ops team is butthurt about fdb.

From an engineering perspective, if you aren't using java fdb is pretty not great, since the very limited number of abstraction layers that exist are all java-centric. There are many, many issues with the maximum transaction time thing, the maximum key size and value size and total transaction size issue, the lack of pushdown predicates (e.g., filtered scans can't be done in-place which means that in AWS, they cost a lot in inter-az network charge terms and also are gated by the network performance of your instances), and so on.

What ALL of these issues have in common is that they bite you late in the game. The storage issue bites you when you're hitting the DB hard in production and have a big data set; the lack of abstractions means that even something as simple as finding leaked junk keys turns out to be impossible unless you were diligent enough to manually frame all your values so you could identify things as more than just bytes; the transaction time thing is very weird to deal with as you tend to have creeping-crud aspects, and the lack of libraries that instrument the transactions to give you early warning is an issue; likewise for certain kinds of key-value pairs, there's a creeping size problem - hey, this value is an index of other values; if you're not very careful up front, you _will_ eventually hit either the txn size limit or the key limit. The usual workaround for those is to do separate transactions - a staging transaction, then essentially a swap operation, and then a garbage collection transaction - but that has lots of issues over time when coupled with application failure.
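For concreteness, the usual staging-then-swap workaround for the value size limit looks roughly like this (a toy sketch of my own over a plain dict standing in for the store - not real fdb API, and with the chunk size shrunk for illustration):

```python
CHUNK = 4  # toy size; FDB's real per-value limit is on the order of 100KB

def write_chunked(kv: dict, key: str, value: bytes) -> None:
    """Stage the value across sub-keys, then publish a small manifest
    last (the 'swap'), so readers see the value all-or-nothing."""
    n = (len(value) + CHUNK - 1) // CHUNK
    for i in range(n):
        kv[f"{key}/chunk/{i}"] = value[i * CHUNK:(i + 1) * CHUNK]
    kv[f"{key}/manifest"] = n  # publish last: readers ignore unstaged chunks

def read_chunked(kv: dict, key: str) -> bytes:
    """Read the manifest first, then reassemble the chunks it names."""
    n = kv[f"{key}/manifest"]
    return b"".join(kv[f"{key}/chunk/{i}"] for i in range(n))
```

The garbage-collection step - cleaning up orphaned chunks left behind by a crashed writer - is the part that bites over time.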

There are answers to ALL of these, manual ones. For the popular languages other than java - Go, python, maybe Ruby - there _should_ be answers for them, but there aren't. These are very sharp edges. Those java layers are _also_ _not_ _bug_ _free_. So yeah, one has a reliable storage layer (a topic that has come up over and over again in the last few years), but it's the layer on top of that where all the bugs are, now with constraints and factors that are harder to reason about than the usual storage layer.

One might say, hey, SQL has all of these problems too, except no. You can bump into transaction limits, but the limits are vastly higher than fdb and the transaction time sluggishness will identify it long before you run into the "your transaction is rejected, spin retrying something that will _never_ recover" sort of issue that your average developer will eventually encounter in fdb.

That said, I love fdb as a software achievement. I just wish they had finished it. For my current project, I have designed it out. I might be able to avoid all of the sharp edges above at this point, but since we are not a java shop, I also can't rely on all the engineers to even know they exist.

candiddevmike
2 replies
3h1m

quite possibly the only distributed database out there that lives up to its strict serializability claims

Jepsen has never tested FoundationDB, not sure why you claim this and link to Jepsen's site.

nlavezzo
0 replies
2h56m

FDB co-founder here.

Aphyr / Jepsen never tested FDB because, as he tweeted "their testing appears to be waaaay more rigorous than mine." We actually put a screen cap of that tweet in the blog post linked here.

krisoft
0 replies
2h20m

not sure why you claim this and link to Jepsen's site.

They link to the website for a definition of the term they are using.

flgstnd
4 replies
4h48m

the palantir testimonial on the landing page is funny

CiPHPerCoder
1 replies
4h47m

Even funnier if you manage to click "Declassify" :)

flgstnd
0 replies
4h33m

your IP address is probably in the palantir databases anyway :o

zellyn
0 replies
3h38m

And if you highlight the redactions, it reads:

REDACTED REDACTED REDACTED REDACTED REDACTED REDACTED and REDACTED REDACTED? REDACTED REDACTED Antithesis REDACTED REDACTED REDACTED REDACTED, REDACTED REDACTED REDACTED REDACTED. REDACTED REDACTED Palantir REDACTED REDACTED REDACTED REDACTED REDACTED REDACTED REDACTED.

:-)

couchand
0 replies
3h33m

This sort of awkward joke made to cover for capitalist illogic makes us all dumber.

amw-zero
4 replies
3h24m

I'm trying to avoid diving into the hype cycle about this immediately - but this sounds like the holy grail right? Use your existing application as-is (assuming it's containerized), and simply check properties on it?

The blocker in doing that has always been the foundations of our machines: non-deterministic CPUs and operating systems. Re-building an entire vertical computing stack is practically impossible, so they just _avoid_ it by building a high-fidelity deterministic simulator.

I do wonder how they are checking for equivalence between the simulator and existing OS's, as that sounds like a non-trivial task. But, even still, I'm really bought in to this idea.

wilkystyle
3 replies
2h55m

Does it even need to be containerized? According to the post, it sounds like Antithesis is a solution at the hypervisor layer.

amw-zero
2 replies
2h41m

Yes it looks like containerization is required: https://antithesis.com/docs/getting_started/setup.html#conta...

voidmain
1 replies
1h40m

Containers are doing two jobs for us: they give our customers a convenient way to send us software to run, and they give us a convenient place to simulate the network boundary between different machines in a distributed system. The whole guest operating system running the containers is also running inside the deterministic hypervisor and under test (and it's mostly just NixOS Linux, not something weird that we wrote).

I'm a co-founder of Antithesis.

tikhonj
0 replies
55m

Oh, cool to hear you're using NixOS. The Nix philosophy totally gels with the philosophy described in the post.

But it's also probably fair to describe NixOS as something weird that somebody else wrote :)

timwis
3 replies
3h24m

Gosh, I know it's a bit late, but I wish they'd called the product _The Prime Radiant_

Fans of Asimov's _Foundation_ series will appreciate the analogue to how this system aims to predict every eventuality based on every possible combination of events, a la psychohistory.

P.S. amazing intro post. Can't wait to try the product.

rvnx
2 replies
3h22m

It would be the opposite of the product:

For software not interacting with the real world, there is only one possibility for frame N+1 if you know the state of the system.

https://en.wikipedia.org/wiki/Determinism

PRNGs are illusions, just misunderstood by humans.
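That determinism is trivial to demonstrate with any stdlib PRNG - two generators built from the same seed replay the same "random" history forever:

```python
import random

# Two generators built from the same seed agree on every draw:
# the "randomness" is a pure function of (seed, draw count).
a, b = random.Random(1234), random.Random(1234)
history_a = [a.random() for _ in range(1000)]
history_b = [b.random() for _ in range(1000)]
assert history_a == history_b
```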

timwis
0 replies
3h1m

Feels like I may have brought a spoon to a gun fight, but I would have considered psychohistory to be the ultimate extrapolation of determinism, and the fact that the prime radiant is able to predict _which_ version of events will happen is because it (somehow) knows the state of the system.

Of course, to argue against myself, it would surely be based on layers of probabilities, and they say several times in the series that it can't predict low-level specific things, just high-level things. And perhaps the whole underlying question posed by the series is whether the universe really is deterministic. But anyway I don't think it's all off-base.

samatman
0 replies
3h4m
loadzero
3 replies
5h9m

Sounds a bit like jockey applied to qemu. Very neat indeed.

https://www.cs.purdue.edu/homes/xyzhang/spring07/Papers/HPL-...

voidmain
2 replies
4h44m

There's indeed a connection between record/replay and deterministic execution, but there's a difference worth mentioning, too. Both can tell you about the past, but only deterministic execution can tell you about alternate histories. And that's very valuable both for bug search (fuzzing works better) and for debugging (see for example the graphs where we show when a bug became likely to occur, seconds before it actually occurred).

(Also, you won't be able to usefully record a hypervisor with jockey or rr, because those operate in userspace and the actual execution of guest code does not. You could probably record software cpu execution with qemu, but it would be slow)

I'm a co-founder of Antithesis.

pfdietz
0 replies
4h24m

I assume deterministic execution also lets you do failing test case reduction.

I've found this sort of high-volume random testing with test-case reduction is just a game changer for compiler testing, where it has much the same effect of quickly flushing out newly introduced bugs.
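For anyone curious, the core of a reducer fits in a few lines - a simplified ddmin-style sketch in Python (a toy of mine, not any production reducer):

```python
def shrink(failing_input: list, still_fails) -> list:
    """Greedy ddmin-style reduction: repeatedly try deleting chunks of
    the failing input, keeping any deletion that preserves the failure."""
    chunk = max(1, len(failing_input) // 2)
    while chunk >= 1:
        i = 0
        while i < len(failing_input):
            candidate = failing_input[:i] + failing_input[i + chunk:]
            if still_fails(candidate):
                failing_input = candidate  # deletion kept the bug: commit it
            else:
                i += chunk                 # deletion lost the bug: skip ahead
        chunk //= 2
    return failing_input
```

For example, if the failure needs both 3 and 7 present, `shrink(list(range(10)), lambda xs: 3 in xs and 7 in xs)` reduces the input to `[3, 7]`.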

I like the subtle dig at type systems. :)

loadzero
0 replies
3h28m

I have been down this road a little bit, applying the ideas from jockey to write and ship a deterministic HFT system, so I have some understanding of the difficulties here.

We needed that for fault tolerance, so we could have a hot synced standby. We did have to record all inputs (and outputs for sanity checking) though.

We did also get a good taste of the debugging superpowers you mention in your blog article. We could pull down a trace from a days trading and replay on our own machines, and skip back and forth in time and find the root cause of anything.

It sounds like what you have done is something similar, but with your own (AMD64) virtual machine implementation, making it fully deterministic and replayable, and providing useful and custom hardware impls (networking, clock, etc).

That sounds like a lot of hard but also fun work.

I am missing something though, in that you are not using it just for lockstep sync or deterministic replays, but you are using it for fuzzing. That is, you are altering the replay somehow to find crashes or assertion failures.

Ah, I think perhaps you are running a large number of sims with a different seed (for injecting faults or whatnot) for your VM, and then just recording that seed when something fails.
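That idea can be sketched in a few lines of Python (a toy of my own, not Antithesis's actual machinery): drive every random choice from one seeded PRNG, sweep seeds, and record only the seeds that falsify an invariant - replaying a recorded seed then reproduces the failing run exactly.

```python
import random

def simulate(seed: int) -> bool:
    """Toy deterministic simulation: every random choice flows from one
    seeded PRNG, so the run is reproducible from the seed alone."""
    rng = random.Random(seed)
    balance = 0
    for _ in range(100):
        op = rng.choice(["deposit", "withdraw"])
        amount = rng.randint(1, 10)
        if op == "deposit":
            balance += amount
        elif amount == 7 or balance >= amount:  # deliberate bug: amount 7 skips the check
            balance -= amount
        if balance < 0:  # invariant violated
            return False
    return True

def fuzz(n_seeds: int) -> list[int]:
    """Sweep seeds, recording only those that falsify the invariant;
    replaying simulate(seed) later reproduces the exact failing run."""
    return [s for s in range(n_seeds) if not simulate(s)]
```

Because the PRNG is the sole source of nondeterminism, the list of failing seeds is itself deterministic, and each failing seed is a complete, replayable bug report.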

indiv0
3 replies
4h40m

I've been super interested in this field since finding out about it from the `sled` simulation guide [0] (which outlines how FoundationDB does what they do).

Currently bringing a similar kind of testing in to our workplace by writing our services to run on top of `madsim` [1]. This lets us continue writing async/await-style services in tokio but then (in tests) replace them with a deterministic executor that patches all sources of non-determinism (including dependencies that call out to the OS). It's pretty seamless.

The author of this article isn't joking when they say that the startup cost of this effort is monumental. Dealing with every possible source of non-determinism, re-writing services to be testable/sans-IO [2], etc. takes a lot of engineering effort.

Once the system is in place though, it's hard to describe just how confident you feel in your code. Combined with tools like quickcheck [3], you can test hundreds of thousands of subtle failure cases in I/O, event ordering, timeouts, dropped packets, filesystem failures, etc.

This kind of testing is an incredibly powerful tool to have in your toolbelt, if you have the patience and fortitude to invest in it.

As for Antithesis itself, it looks very very cool. Bringing the deterministic testing down the stack to below the OS is awesome. Should make it possible to test entire systems without wiring up a harness manually every time. Can’t wait to try it out!

[0]: https://sled.rs/simulation.html

[1]: https://github.com/madsim-rs/madsim?tab=readme-ov-file#madsi...

[2]: https://sans-io.readthedocs.io/

[3]: https://github.com/BurntSushi/quickcheck?tab=readme-ov-file#...

michael_j_ward
2 replies
3h28m

Dealing with every possible source of non-determinism, re-writing services to be testable/sans-IO [2], etc. takes a lot of engineering effort.

Are there public examples of what such a re-write looks like?

Also, are you working at a rust shop that's developing this way?

Final Note, TigerBeetle is another product that was written this way.

wwilson
0 replies
2h41m

TigerBeetle is actually another customer of ours. You might ask why, given that they have their own, very sophisticated simulation testing. The answer is that they're so fanatical about correctness, they wanted a "red team" for their own fault simulator, in case a bug in their tests might hide a bug in their database!

I gotta say, that is some next-level commitment to writing a good database.

Disclosure: Antithesis co-founder here.

indiv0
0 replies
2h23m

Sure! I mentioned a few orthogonal concepts that go well together, and each of the following examples has a different combination that they employ:

- the company that developed Madsim (RisingWave) [0] [1] tries hardest to eliminate non-determinism with the broadest scope (stubbing out syscalls, etc.)

- sled [2] itself has an interesting combo of deterministic tests combined with quickcheck+failpoints test case auto-discovery

- Dropbox [3] uses a similar approach but they talk about it a bit more abstractly.

Sans-IO is more documented in Python [4], but str0m [5] and quinn-proto [6] are the best examples in Rust I’m aware of. Note that sans-IO is orthogonal to deterministic test frameworks, but it composes well with them.

With the disclaimer that anything I comment on this site is my opinion alone, and does not reflect the company I work at —— I do work at a rust shop that has utilized these techniques on some projects.

TigerBeetle is an amazing example and I’ve looked at it before! They are really the best example of this approach outside of FoundationDB I think.

[0]: https://risingwave.com/blog/deterministic-simulation-a-new-e...

[1]: https://risingwave.com/blog/applying-deterministic-simulatio...

[2]: https://github.com/spacejam/sled

[3]: https://dropbox.tech/infrastructure/-testing-our-new-sync-en...

[4]: https://fractalideas.com/blog/sans-io-when-rubber-meets-road...

[5]: https://github.com/algesten/str0m

[6]: https://docs.rs/quinn-proto/0.10.6/quinn_proto/struct.Connec...

dkyc
3 replies
2h28m

On mobile, the "Let's talk" button in the top right corner is cut off by the carousel menu overlay. Seems like CSS is still out of scope of the bug fixing magic for now.

On a more serious note, it's an interesting blog post, but it comes off as veeery confident about what is clearly an incredibly broad and complex topic. Curious to see how it will work in production.

wwilson
0 replies
1h32m

Aww... crap, you're right. I knew we should have finished the UI testing product and run it on ourselves before launching.

Disclosure: Antithesis co-founder.

wruza
0 replies
1h34m

Yeah, if only there was some scientific way to ensure that elements don't overlap, let's call it "constraints" maybe, so one could test layouts by simply solving, idk... something like a set of linear equations? Hope some day CSS will stop being "aweso"me and become nothing in favor of a useful layout system.

terpimost
0 replies
23m

Designer here, sorry, it is intentional. I thought a horizontally scrollable menu was more straightforward than a full-screen expander.

shermantanktop
2 replies
2h33m

I kept cringing when I read the words “no bugs.”

This is hubris in the classic style - it’s asking for a literal thunderbolt from the heavens.

It may be true, but…come on.

Everyone who has ever written a program has thought they were done only to find one more bug. It’s the fundamental experience of programming to asymptotically approach zero bugs but never actually get there.

Again, perhaps the claim is true but it goes against my instincts to entertain the possibility.

rkangel
1 replies
1h32m

I think there is something interesting about the fact that someone writing "no bugs" makes us all uncomfortable.

If they really did have a complex product, running in production with a sizeable userbase, and had 2 bug reports ever, then I think it's a reasonable thing to say.

The fact that it isn't a reasonable thing to say for most other software is a little sad.

shermantanktop
0 replies
1h26m

Right, the claim may be true, but I have a visceral reaction to it. And tbh I'd be hesitant to work with someone who made a zero-bugs claim about their own work.

samsquire
2 replies
3h10m

This is really exciting.

I am an absolute beginner at TLA+ but I really like this possible design space.

I have an idea for a package manager that combines type system with this style of deterministic testing and state space exploration.

Imagine knowing that your invocation of

   package-manager install <tool name>
will always work, because file system and OS state are part of the deterministic model.

or an next gen Helm with type system and state space exploration is tested:

   kubectl apply <yaml>
will always work when it comes up because all configuration state space exploration has been tested thanks to types.

__MatrixMan__
1 replies
2h40m

Coincidence, I'm reading this and thinking about test harnesses for my package manager idea, which is really just a thin wrapper around nix, designed under the assumption that the network might partition at any moment: keep the data nearest where it's needed, refer by hash not by name, gossip metadata necessary to find the hash for you, no single points of failure.

Tell me more about yours?

samsquire
0 replies
2h34m

I am thinking about state machine progressions and TLA+ style specifications which are invariants over a progression of variables.

If your package manager knows your operating system's current state, and the state space that the control flow graph of the program and configuration together can reach, it can verify that everything lines up and that there will be no error when executed - a bit like a compiler, but without running into the halting problem.

In TLA+ you can dump a state graph as a dot file, which I turn into an SVG and view with a TLA+ graph visualiser.

Types verify possible control flow is valid at every point. We just need to add types to the operating system and file system and represent state space for deterministic verification.

You could hide packages that won't work.

The package manager would have to lookup precached state spaces or download them as part of the verification process.

jitl
2 replies
3h44m

To me this is very reminiscent of time travel debugging tools like the one used for Firefox’s C++ code, rr / Pernosco: https://pernos.co/

rvnx
1 replies
3h37m

Seems more like a fuzzer for Docker images.

Like this: https://docs.gitlab.com/ee/user/application_security/coverag...

It won't tell you whether the software works correctly; it will just tell you if it raises an exception or crashes.

Put a fuzzer on Chrome, for example: you won't catch most of the issues it has (and Chrome actually has tons of bugs and issues), but you may find security issues if you devote a big enough budget to run your fuzzer long enough to cover all the branches.

So it's good in the case where you use "exceptions as tests", where any minor out-of-scope behavior raises an exception and all the cases are pre-planned (a bit like baking in runtime checks, which the fuzzer then explores)

jitl
0 replies
3h12m

The similarity is about obtaining determinism through something like a hypervisor. The way rr works is it basically writes down the result of all the system calls, etc, basically everything that ended up on the Turing machine’s tape, so you can rewind and replay.

binarymax
2 replies
39m

I got really excited about this, and I spent a little time looking through the documentation, but I can't figure out how this is different than randomizing unit tests? It seems if I have a unit test suite already, then that's 99% of the work? Am I misunderstanding? I am drawing my conclusions from reading the Getting Started series of the docs, especially the Workloads section: https://antithesis.com/docs/getting_started/workload.html

nlavezzo
0 replies
15m

Antithesis here - curious what part of the Getting Started doc gave you that impression? If you take a look at our How Antithesis Works page, it might help answer your question as to how Antithesis is different from just bundling your unit tests.

https://antithesis.com/docs/introduction/how_antithesis_work...

In short though, unit tests can help to inform a workload, but we don't require them. We autonomously explore software system execution paths by introducing different inputs, faults, etc., which discovers behaviors that may have been unforeseen by anyone writing unit tests.

jakewins
0 replies
17m

This is that, and the exact same vibe, except: it promises to keep being that simple even after you add threads, and locks, and network calls, and disk accesses and..

With this, if you write a test for a function that makes a network call and writes the result to disk, your test will fail if your code does not handle the network call failing or stalling indefinitely, or the disk running out of space, or the power going out just before you close the file, or..

So yes, it's that - but it expands the space where testing is as easy as unit testing to cover much more interesting levels of complexity
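The shape of that kind of test can be sketched in Python (my own toy, not Antithesis's API): wrap each simulated I/O call in a seeded fault injector, so the code under test is forced through its failure paths and every run replays exactly from its seed.

```python
import random

class FaultInjector:
    """A seeded PRNG decides whether each simulated I/O call fails,
    so every run - including its failures - replays exactly from the seed."""
    def __init__(self, seed: int, failure_rate: float = 0.3):
        self.rng = random.Random(seed)
        self.failure_rate = failure_rate

    def call(self, fn, *args):
        if self.rng.random() < self.failure_rate:
            raise IOError("injected fault")  # simulated network/disk failure
        return fn(*args)

def save_with_retry(injector: FaultInjector, data: bytes, attempts: int = 5) -> int:
    """Code under test: must tolerate injected failures by retrying."""
    for _ in range(attempts):
        try:
            return injector.call(len, data)  # len() stands in for a real write
        except IOError:
            pass
    raise IOError("gave up")
```

Sweep seeds and any seed whose fault schedule exposes a missing retry or error path is a reproducible failing test case.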

agumonkey
2 replies
2h38m

interesting, this kind of responsive environment is dear but rare

i can't recall the last time i went to a place and people even considered investing in such setups

i assume that except for hard problems and teams seeking challenges, most people will revert to the mean and refuse any kind of infrastructure work because it's mentally more comfortable piling features and fixing bugs later

ps: i wish there was a meetup of teams like this, or even job boards :)

nlavezzo
1 replies
2h25m

We'll be starting some meetups, attending conferences, etc. this year. Also hop into our Discord if you want to chat, lots of us are in there regularly. discord.gg/antithesis

agumonkey
0 replies
1h26m

oh, that's cool, thanks

User23
2 replies
3h56m

Reminds me of the clever hack of playing back TCP dump logs from prod on a test network, but dialed up. Neat.

Naturally I’d prefer professional programmers learn the cognitive tools for manageably reasoning about nondeterminism, but they’ve been around over half a century and it hasn’t happened yet.

What’s really interesting to me is that the simulation adequately replicates the real network. One of the more popular criticisms of analytical approaches is some variant of: yeah, but the real network isn’t going to behave like your model. Which, by the way, is an entirely plausible concern for anyone who has messed with that layer.

zamfi
0 replies
3h30m

Naturally I’d prefer professional programmers learn the cognitive tools for manageably reasoning about nondeterminism

It’s not an either-or here, though. Part of the challenge is you’re not always thinking about all the non-determinisms in your code, and the interconnections between your code and other code (whose behavior you can sometimes only assume) can make that close to impossible. Part of that is the “your model of the network” critique, but also part of that is “your model of how people will use your software” isn’t necessarily correct either.

Rygian
0 replies
3h42m

What is interesting here is that the solution could fuzz-test anything, including the network model, leading to failures even more implausible than reality.

ComputerGuru
2 replies
3h28m

I was mentally hijacked into clicking the jobs link (despite recently deciding I wasn’t going to go down that rabbit hole again!) but fortunately/unfortunately it is in-person and daily, so flying out from Chicago a week out of the month won’t work and I don’t even have to ask!

More to the point of the story (though I do think the actual point was indeed a hiring or contracting pitch), this reminds me a lot of the internal tests the SQLite team has. I would love to hear from someone with access to those if they feel the same way.

laiysb
1 replies
2h58m

I was mentally hijacked into clicking the jobs link (despite recently deciding I wasn’t going to go down that rabbit hole again!) but fortunately/unfortunately it is in-person and daily, so flying out from Chicago a week out of the month won’t work and I don’t even have to ask!

given their PLTR connection, probably not

ComputerGuru
0 replies
2h16m

Oh, suddenly I'm not interested, either! Thanks!

Communitivity
2 replies
1h49m

There are situations where "no bugs" is an important requirement, if that means no bugs that cause a noticeable failure. Things such as planes, submarines, nuclear reactors. For those there is provably correct code. That takes a long time to write, and I mean a really long time. Applying that to all software doesn't make sense from a commercial perspective. There are areas where improvements can have a big impact, though, such as language safety improvements (Rust) and cybersecurity requirements regarding private data protection. I see those as being the biggest win.

I don't see no bugs in a distributed database as important enough to delay shipping for 5 years, but (a) it's not my baby; (b) I don't know what industries/use cases they are targeting. For me it's much more important to ship something with no critical bugs early, get user feedback, iterate, then rinse and repeat continually.

amw-zero
0 replies
1h33m

This is a false dichotomy though. The proposed approach here has a (theoretically) great cost-to-value ratio. Spending time on a workload generation process and adding some asserts to your code is much lower cost than hand-writing tens of thousands of test cases.

So it's not that this approach is only useful for critical applications, it's that it's low-cost enough to potentially speed up "regular" business application testing.

0xbadcafebee
0 replies
1h27m

A lot of people underestimate the power of QA. Yeah, it would be great if we could just perfectly engineer something out of the gate. But you can also just take several months to stare at something, poke at it, jiggle it, and fix every conceivable problem, before shipping it. Heresy in the software world, but in every other part of the world it's called quality.

BoppreH
2 replies
2h6m

Three thoughts:

1. It's a brilliant idea that came at the right time. It feels like people are finally losing patience with flaky software, see developer sentiment on: fuzzers, static typing, memory safety, standardized protocols, containers, etc.

2. It's meant to be niche. $2 per hour per CPU (or $7000 per year per CPU if reserved), no free tier for hobby or FOSS, and the only way to try/buy is to contact them. Ouch. It's a valid business model, I'm just sad it's not going for maximum positive impact.

3. Kudos for the high quality writing and documentation, and I absolutely love that the docs include things like (emphasis in original):

If a bug is found in production, or by your customers, you should demand an explanation from us.

That's exactly how you buy developer goodwill. Reminds me of Mullvad, who I still recommend to people even after they dropped the ball on me.

wwilson
0 replies
1h51m

Thanks for your kind words! As I mention in this comment (https://news.ycombinator.com/item?id=39358526) we are planning to have pricing suitable for small teams, and perhaps even a free tier for FOSS, in the future.

Disclosure: Antithesis co-founder.

jerf
0 replies
26m

"It's meant to be niche. $2 per hour per CPU (or $7000 per year per CPU if reserved), no free tier for hobby or FOSS, and the only way to try/buy is to contact them. Ouch. It's a valid business model, I'm just sad it's not going for maximum positive impact."

This is the sort of thing that, if it takes off, will start affecting the entire software world. Hardware will start adding features to support it. In 30 years this may simply be how computing works. But the pioneers need to recover the costs of the arrows they got stuck with before it can really spread out. Don't look at this an event, but as the beginning of a process.

sneak
1 replies
3h48m

Imagine being proud of working for Palantir.

mgfist
0 replies
3h34m

Your life depends on lots of unsavory tasks.

Invictus0
1 replies
2h49m

Talk about bad writing. If I don't know what the hell your thing is in the first paragraph, I'm not going to read your whole blog post to find out. Homepage is just as bad.

tranceylc
0 replies
2h33m

The article is more of a history lesson and context than it is an ad. I see what you mean, but clicking “Product -> What Is Antithesis?” shows a clear description of what it does. Perhaps that could also be added to either the article or the home page?

xyzelement
0 replies
3h16m

I appreciated this post. Separately from what they are talking about, I found this bit insightful:

// This limits the value of testing, because if you had the foresight to write a test for a particular case, then you probably had the foresight to make the code handle that case too.

I often felt this way when I saw developers feel a sense of doing good work and creating safe software because they wrote unit tests like expect add(2,2) = 4. There is basically a 1-1 correlation between the cases you thought to test and the cases you coded for, which leaves you no better off in terms of unexplored scenarios.

I get that this has some incremental value in catching blatant miscoding and regressions down the road so it's helpful, it's just not getting at the main thing that will kill you.

I felt similarly about human QA back in my finance days that asked developers for a test plan. If the dev writes a test plan, it also only covers what the dev already thought about. So I asked my team to write the vaguest/highest level test plan possible (eg, "it should now be possible to trade a Singaporean bond" rather than "type the Singaporean bond ticker into the field, type the amount, type the yield, click buy or sell") - the vagueness made more room for the QA person to do something different (even things like tabbing vs clicking, or filling the fields out of sequence, or misreading the labels) than how the dev saw it, which is the whole point.

traspler
0 replies
2h58m

Checking their bug report which should contain "detailed information about a particular bug" I am not sure I can fully understand those claims: https://public.antithesis.com/report/ZsfkRkU58VYYW1yRVF8zsvU...

To my untrained eye I get: logs, a graph of when in time the bug happened over multiple runs, and a statistical analysis of which part of the application code could be involved. The statistical analysis is nice, but it is completely flat, without any hierarchical relationships, making it quite hard to parse mentally.

I kind of expected more context to be provided about inputs, steps and systems that lead to the bug. Is it expected to then start adding all the logging/debugging that might be missing from the logs and re-run it to track it down? I hoped that given the deterministic systems and inputs there could be more initial hints provided.

thomastraum
0 replies
2h16m

https://antithesis.com/images/people/will.jpg the look of the CEO is selling the software to me automatically. reliable and nice

swayvil
0 replies
1h30m

Https bugs me.

This blazing fetish for security really has gone too far. A thousand geeks fell into their own asshole and a thousand corporatroids took advantage. We really do have better things to do than march in that parade.

sackfield
0 replies
2h29m

"At FoundationDB, once we hit the point of having ~zero bugs and confidence that any new ones would be found immediately, we entered into this blessed condition and we flew. Programming in this state is like living life surrounded by a force field that protects you from all harm. Suddenly, you feel like you can take risks"

When this state hits it really is a thing to behold. It's very empowering to trust your system to this extent, and to know that if you introduce a bug, a test will save you.

norir
0 replies
52m

I think there is a lot of opportunity for integrating simulation into software development. I'm surprised it isn't more common though I suppose the upfront investment would scare many away.

mprime1
0 replies
1h16m

Great read. Great product. I've been an early user of Antithesis. My background is dependability and formal distributed systems.

This thing is magic (or rather, it's indistinguishable from magic ;-)).

If they had told me I could test any distributed system without a single line of code change, do things like step-by-step debugging, and even roll back time at will, I would not have believed it. But Antithesis works as advertised.

It's a game-changer for distributed systems that truly care about dependability.

mempko
0 replies
1h50m

In my career I learned two powerful tools to get bug free code. Design by Contract and Randomized testing.

I had to roll this by myself for each project I did. Antithesis seems to systematize it and created great tooling around it. That's Great!!!

However, looking at their docs they rely on assertion failures to find bugs. I believe Antithesis has a missed opportunity here by not properly pushing for Design by Contract instead of generic use of assertions. They don't even mention Design by Contract. I suspect the vast majority of people here on HN have never heard of it.

They should create a Design by Contract SDK for languages that don't have one (think most languages) that interacts nicely with tooling and only fallback to generic assertions when their SDK is not available. A Design by Contract SDK would provide better error messages over generic assertions, further helping users solve bugs. In fact, their testing framework is useless without contracts being diligently used. It requires a different training and mindset from engineers. Teaching them Design by Contract puts them in that frame of mind.

They have an opportunity to teach Design by Contract to a new generation of engineers. I'm surprised they don't even mention it.

larsiusprime
0 replies
4h40m

Was an eaaaaaaaarly tester for this. Pretty neat stuff.

kendallgclark
0 replies
41m

Happy customer here —— maybe the first or second? Distributed systems are hard; #iykyk.

Antithesis makes them less hard (not in an NP-hard sense, but still!).

karatekidd32v
0 replies
54m

Not directly related to this post, but clicking around the webpage I chuckled seeing Palantir's case study/testimonial:

https://antithesis.com/solutions/who_we_help

islandert
0 replies
2h9m

There’s a straightforward way to reach this testing state for optimization problems. Write 2 implementations of the code, one that is simple/slow and one that is optimized. Generate random inputs and assert outputs match correctly.

I’ve used this for leetcode-style problems and have never failed on correctness.

It is liberating to code in systems that test like this for the exact reasons mentioned in the article.

intrasight
0 replies
3h40m

It’s pretty weird for a startup to remain in stealth for over five years.

Not really. I have friends who work for a startup that's been in "stealth" for 20 years. Stealth is a business model not a phase.

iamnotsure
0 replies
24m

"I love me a powerful type system, but it’s not the same as actually running your software in thousands and thousands of crazy situations you’d never dreamed of."

Would not trust. Formal software verification is badly needed. Running thousands of tests means almost nothing in software world. This is math, not a fastfood joint. Don't fool beginners with your test hero stories.

fleaflicker
0 replies
1h45m

Business value is a good way to think about it:

As a software developer, fixing bugs is a good thing. Right? Isn’t it always a good thing?

No!

Fixing bugs is only important when the value of having the bug fixed exceeds the cost of fixing it.

https://www.joelonsoftware.com/2001/07/31/hard-assed-bug-fix...

coolThingsFirst
0 replies
1h43m

Why are all the cool people working on DBs and talking about Paxos?

chrsw
0 replies
1h9m

Could this work for embedded C projects? Bare metal or RTOS?

chrispy513
0 replies
3h59m

This looks to be an incredible tool that was years in the making. Excited to see where it goes from here!

angryguy555
0 replies
1h31m

sundar pichai is probably too busy reading techmeme to even know this exists for the next 6 months. they should re-name android to 'cockroach farm'.

agentultra
0 replies
3h28m

I got similar productivity boosts after learning TLA+ and Alloy.

Simulation is an interesting approach, but I am curious: if they ever implemented the simulation wrong, would it report errors that don't happen on the target platform, or fail to find errors that the target platform reports? How wide the gap is will matter... and how many possible platforms and configurations will the hypervisor cover?

aftbit
0 replies
1h57m

This "no bugs" maximalism is counterproductive. There are many classes of bugs that this cannot hope to handle. For example, let's say I have a transaction processing application that speaks to Stripe to handle the credit card flow. What happens if Stripe begins send a webhook showing that it rejected my transactions but report them as completed successfully when I poll them? The need to "delete all of our dependencies" (I presume they wrote their own OS kernel too?) in FoundationDB shows that upstream bugs will always sneak through this tooling.

aduffy
0 replies
3h12m

Looks like this coincides with seed funding[1], congrats folks! Did you guys just bootstrap through the last 5 years of development?

[1] https://www.saltwire.com/cape-breton/business/code-testing-s...

Qwuke
0 replies
2h7m

I met Antithesis at Strangeloop this year and got to talk to employees about the state of the art of automated fault injection that I was following when I worked at Amazon, and I cannot overstate how their product is a huge leap forward compared to many of the formal verification systems being used today.

I actually got to follow their bug tracking process on an issue they identified in Apache Spark streaming - going off the docs, they managed to identify a subtle and insidious correctness error in a common operation that would've caused headaches in low-visibility edge cases for years. In the end the docs were incorrect, but after that showing I cannot imagine how critical tools like Antithesis will be inside companies building distributed systems.

I hope we get some blog posts that dig into the technical weeds soon, I'd love to hear what brought them to their current approach.

A-Dmonkey
0 replies
1h34m

one of the best applications yet of AI in cyber

0xbadcafebee
0 replies
1h49m

The biggest effect was that it gave our tiny engineering team the productivity of a team 50x its size.

49 years ago, a man named Fred Brooks published a book, wherein he postulated that adding people to a late software project makes it later. It's staggering that 49 years later, people are still discovering that having a larger engineering team does not make your work more productive (or better). So what does make work more productive?

Productivity requires efficiency. Efficiency is expensive, complicated, nuanced, curt. You can't just start out from day 1 with an efficient team or company. It has to be grown, intentionally, continuously, like a garden of fragile flowers in a harsh environment.

Is the soil's pH right? Good. Is it getting enough sun? Good. Wait, is that leaf a little yellow? Might need to shade it. Hmm, are we watering it too much? Let's change some things and see. Ok, doing better now. Ah, it's growing fast now. Let's trim some of those lower leaves. Hmm, it's looking a little tall, is it growing too fast? Maybe it does need more sun after all.

If you really pay attention, and continue to make changes towards the goal of efficiency, you'll get there. No need for a 10x developer or 3 billion dollars. You just have to listen, look, change, measure, repeat. Eventually you'll feel the magic of zooming along productively. But you have to keep your eye on it until it blooms. And then keep it blooming...