
Stripe's Monorepo Developer Environment

aidos
33 replies
5h11m

Maybe a silly question, but why all this engineering effort when you could host the dev environment locally?

By running a Linux VM on your local machine you get a consistent environment that you can ssh to, you remove the latency issues, and you remove all the complexity of syncing that they’ve created.

That’s a setup that’s worked well for me for 15 years but maybe I’m missing some other benefit?

yeswecatan
10 replies
5h10m

I came to ask the same thing. We use docker-compose to describe all our services which works fine.
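
Something along these lines, to be concrete - the service names, images and ports here are made up, not our actual stack:

    # Hypothetical docker-compose.yml; services, images and ports are illustrative only.
    services:
      db:
        image: postgres:16
        environment:
          POSTGRES_PASSWORD: dev
        ports:
          - "5432:5432"
      redis:
        image: redis:7
      api:
        build: ./api             # the service actually being worked on
        depends_on: [db, redis]
        ports:
          - "8080:8080"
        volumes:
          - ./api:/app           # mount source so edits show up without a rebuild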

JasonSage
9 replies
5h7m

This does not scale to a large number of services once each service needs a certain amount of RAM/processing.

recroad
5 replies
3h32m

If you have 100 services in your org, you don't have to have all 100 running at the same time on your local dev machine. I only run the 5 I need for the feature I'm working on.

JasonSage
1 replies
1h47m

Your success with this strategy correlates more strongly with ‘Go’ than ‘100 services’, so it’s more anecdotal than generally applicable that you can run 100 services locally without issues. Of course you can.

Buying the biggest MacBook available as a baseline criteria for being able to run a stack locally with Docker Compose does not exactly inspire confidence.

At my last company we switched our dev environment from Docker Compose to Nix on those same MacBooks and CPU usage went from 300% to <10% overnight.

ikety
0 replies
1h14m

Have any details on how you've implemented Nix? For my personal projects I use nix without docker and the results are great. However I was always fearful that nix alone wouldn't quite scale as well as nix + docker for complicated environments.

I've used the FROM SCRATCH strat with nix:

https://mitchellh.com/writing/nix-with-dockerfiles

Is that how you implemented it?
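
For reference, the pattern from that post is roughly this - reconstructed from memory, so the flake output name and paths are illustrative rather than exact:

    # Rough sketch of the "build with Nix, ship FROM scratch" pattern from the post.
    FROM nixos/nix:latest AS builder
    WORKDIR /src
    COPY . .
    RUN nix --extra-experimental-features "nix-command flakes" build .
    # copy the runtime closure of the build result out of the store
    RUN mkdir -p /tmp/closure && cp -R $(nix-store -qR result/) /tmp/closure

    FROM scratch
    COPY --from=builder /tmp/closure /nix/store
    COPY --from=builder /src/result /app
    CMD ["/app/bin/app"]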

Daishiman
1 replies
2h58m

I've been down this path and as soon as you work on a couple of concurrent branches you end up having 20 containers on your machine, and setting these up to run successfully ends up being its own special PITA.

layer8
0 replies
41m

What exactly are the problems created by having a larger number of containers? Since you’re mentioning branches, these presumably don’t have to all run concurrently, i.e., you’re not talking about resource limitations.

aidos
2 replies
4h58m

You could still run the proxy they have that lazily boots services - that’s a nice optimisation.

I don’t think that many places are in a position where the machines would struggle. They didn’t mention that in the article as a concern - just that they struggled to keep environments consistent (brew install implies some are running on osx etc).

sulam
1 replies
3h40m

I think it’s safe to assume that for something with the scale and complexity of Stripe, it would be a tall order to run all the necessary services on your laptop, even stubs of them. They may not even do that on the dev boxes, I’d be a little surprised if they didn’t actually use prod services in some cases, or a canary at any rate, to avoid the hassles of having to maintain on-call for what is essentially a test environment.

aidos
0 replies
2h50m

I don’t know that’s safe to assume. Maybe it is an issue but it was not one of the issues they talk about in the article and not one of the design goals of the system. They have the proxy / lazy start system exactly so they can limit the services running. That suggests to me that they don’t end up needing them all the time to get things done.

n0us
6 replies
4h55m

You're limited by the resources available to you on your local laptop and when you close that laptop the dev environment stops running. Remote dev environments are more costly and complicated to maintain but they can be shared, can scale vertically (or horizontally) on demand, can persist when you exit them, and managing access to various internal services from dev environments can in some cases be simpler.

It also centralizes dev environment management to the platform team that owns them and provides them as a service which cuts down on support tickets related to broken dev environments. There are certainly some trade offs though and for most companies a local VM or docker compose file will be a better choice.

crabbone
2 replies
3h44m

Not even once did I want to share my dev. environment, nor did anyone want to share mine. We are talking about 25-odd years of being a developer.

Never in my life did I want to scale my dev. environment vertically or horizontally or in any other direction. Unless you work on a calculator, I don't know why you would need that.

I have no problems with my environment stopping when I close my laptop. Why is this a problem for anyone?

The overwhelming majority of programming projects out there fit on a programmer's laptop just fine. The rare exceptions are the projects which require very specialized equipment not available to the developers. In any case, a simulator would usually be a preferable way of dealing with this, and the actual equipment would only be accessed for testing, not for development. Definitely not as a routine development process.

Never in my life did I want the development process to be centralized. All developers have different habits, tastes and preferences. The last thing I want is to have centralized management of all environments, which would create unwanted uniformity. I've only once been at a company that tried to institute a centrally-managed development environment in the way you describe, and I just couldn't cope with it. I quit after a few months of misery. The most upsetting aspect of these efforts is the stupidity. These efforts solve no problems, but add a lot of pain that is felt continuously, all the time you have to do anything work-related.

otabdeveloper4
0 replies
3h22m

The overwhelming majority of programming projects out there fit on a programmer's laptop just fine.

What? No. You live in a very sheltered world, my friend.

marcosdumay
0 replies
5m

I get a serious feeling that interpreted languages, monorepos, environment orchestration, snapshot ecosystem aggregators, and per-function execution environments are all pushing software development in the wrong direction.

Those things are not bad by themselves. But people tend to do bad things with them, and those bad things spread remarkably well, disrupting every place they infect.

underdeserver
1 replies
4h2m

Most local laptops are much stronger than is needed to run the entire stack of your average startup with no resource issues.

And the dev environment stops running when you close the laptop, but you also don't need it since you're not developing.

Not saying it can work for absolutely all cases but it's definitely good enough for a lot of cases.

anthonypasq
0 replies
3h59m

... this is an article about Stripe, not your average startup

giido
0 replies
4h31m

There also tend to be security advantages that help mitigate/manage dev risks. Typically hosts will have security tooling installed (AV, EDR, etc.) that may not be installed on local VMs, hosts are ephemeral so quickly created and destroyed, there are network restrictions, etc.

bhuga
6 replies
4h13m

I work on this at Stripe. There's a lot of reasons:

* Local dev has laptop-based state that is hard to keep in sync for everyone. Broken laptops are _really hard_ to debug as opposed to cloud servers I can deploy dev management software to. I can safely say the oldest version of software that's in my cloud; the laptops skew across literally years of versions of dev tools despite a talented corpeng team managing them.

* Our cloud servers have a lot more horsepower than a laptop, which is important if a dev's current task involves multiple services.

* With a server, I can get detailed telemetry on how devs work and what they actually wait on, which helps me understand what to work on next; I'd have to have pretty invasive spyware on laptops to do the same.

* Servers in our QA environment can interact with QA services in a way that is hard for a laptop to do. Some of these are "real services", others are incredibly important to dev itself, such as bazel caches.

There's other things; this is an abbreviated list.

If a linux VM works for you, keep working! But we have not been able to scale a thousands-of-devs experience on laptops.

aidos
4 replies
3h1m

I want to double check we’re talking about the same thing here. I’m referring to running everything inside a single VM that you would have total access to. It could have telemetry, you’d know versions etc. I wonder if there’s some confusion around what I’m suggesting given your points above.

I’m sure there are a bunch of things that make it the right choice for Stripe. Obviously if you just have too many things to run at a time and a dev laptop can’t handle it then it’s a dealbreaker. What’s the size of the cloud instances you have to run on?

bhuga
2 replies
2h30m

I’m referring to running everything inside a single VM that you would have total access to. It could have telemetry, you’d know versions etc. I wonder if there’s some confusion around what I’m suggesting given your points above.

I don't think there's confusion. I only have total access when the VM is provisioned, but I need to update the dev machine constantly.

Part of what makes a VM work well is that you can make changes and they're sticky. Folks will edit stuff in /etc, add dotfiles, add little cron jobs, build weird little SSH tunnels, whatever. You say "I can know versions", but with a VM, I can't! Devs will update stuff locally.

As the person who "deploys" the VM, I'm left in a weird spot after you've made those changes. If I want to update everyone's VM, I blow away your changes (and potentially even the branches you're working on!). I can't update anything on it without destroying it.

In contrast, the dev servers update constantly. There's a dozen moving parts on them and most of them deploy several times a day without downtime. There's a maximum host lifetime and well-documented hooks for how to customize a server when it's created, so it's clear how devs need to work with them for their customizations and what the expectations are.

I guess it's possible you could have a policy about when the dev VM is reset and get developers used to it? But I think that would be taking away a lot of the good parts of a VM when looking at the tradeoffs.

What’s the size of the cloud instances you have to run on?

We have a range of options devs can choose, but I don't think any of them are smaller than a high-end laptop.

aidos
1 replies
2h0m

So the devs don’t have the ability to ssh to your cloud instances and change config? Other than the size issue, I’m still not seeing the difference. Take your point on it needing to start before you have control, but other than that a VM on a dev machine is functionally the same as one in a cloud environment.

In terms of needing to reset, it’s just a matter of git branch, push, reset, merge. In your world that sync complexity happens all the time, in mine just on reset.

Just to be clear, I think it’s interesting to have a healthy discussion about this to see where the tradeoffs are. Feels like the sort of thing where people try to emulate you and buy themselves a bunch of complexity where other options are reasonable.

I have no doubt Stripe does what makes sense for Stripe. I’d also wager that on balance it’s not the best option for most other teams.

PS thanks for chiming in. I appreciate the extra insights and context.

bhuga
0 replies
9m

So the devs don’t have the ability to ssh to your cloud instances and change config?

They do, but I can see those changes if I'm helping debug, and more importantly, we can set up the most important parts of the dev processes as services that we can update. We can't ssh into a VM on your laptop to do that.

For example, if you start a service on a stripe machine, you're sending an RPC to a dev-runner program that allocates as many ports as are necessary, updates a local envoy to make it routable, sets up a systemd unit to keep it running, and so forth. If I need to update that component, I just deploy it like anything else. If someone configures their host until that dev runner breaks, it fails a healthcheck and that's obvious to me in a support role.

Just to be clear, I think it’s interesting to have a healthy discussion about this to see where the tradeoffs are. Feels like the sort of thing where people try to emulate you and buy themselves a bunch of complexity where other options are reasonable.

100% Agree! I think we've got something pretty cool, but this stuff is coming from a well-resourced team; keeping the infra for it all running is larger than many startups. There's tradeoffs involved: cost, user support, flexibility on the dev side (i.e. it's harder to add something to our servers than to test out a new kind of database on your local VM) come immediately to mind, but there are others.

There are startups doing lighter-weight, legacy-free versions of what we're doing that are worth exploring for organizations of any size. But remote dev isn't the right call for every company!

drited
0 replies
2h31m

I see in another comment thread you mentioned downloading the VM iso, presumably from a central source. Your comment in this thread didn't mention that so perhaps this answer (incorrectly) assumes the VM you are talking about was locally maintained/created?

hibikir
0 replies
31m

To provide historical context, 10 years ago there was a local dev infrastructure, but it was already so creaky as to be unreliable. Just getting the ruby dependencies updated was a problem. The local dev was also already cheating: All the asynchronous work that was triggered via RabbitMQ/Kafka was getting hacked together, because trying to run everything that Infra/Queues did locally would have been very wasteful. So magic occurred in the calls to the message queue that instead triggered the crucial ruby code that would be hit in the end.

So if this was a problem back then, when the company had less than 1000 employees, I can't even imagine how hard it would be to get local dev working now.

dheera
2 replies
1h32m

By running a Linux VM

Or just run Linux on your local machine as the OS. I don't get the obsession with Macs as dev workstations for companies whose products run on Linux.

uncanneyvalley
0 replies
12m

The year of Linux on the laptop has yet to arrive for most of us. Windows and MacOS both offer better battery life, if for no other reason (and there are usually other reasons, like suspend/wake issues, graphics driver woes, etc.)

philwelch
0 replies
42m

Especially when they don’t even deploy to ARM servers.

simonw
1 replies
3h51m

In my opinion the single most important feature of any development environment is a reliable “reset” button.

The amount of time companies lose to broken development environments is incredible. A developer can easily lose half a day (or more) of productive time.

With cloud environments it’s much easier to offer a “just give me a brand new environment that works” button somewhere. That’s incredibly valuable.

aidos
0 replies
2h55m

For sure, but, a VM has that feature too. They have to run some services directly on the laptop to handle the code syncing. So if you accept a certain amount of “need to do some dev machine setup” as a cost, installing Parallels and running a script to download an iso is a pretty small surface area that allows for a full reset.

I don’t doubt that Stripe have a setup that works well for them, but I also bet they could have gone down a different path that also worked well, and I suspect that other path (local VMs) is a better fit for most other smaller teams.

crabbone
1 replies
3h52m

Working in a configuration where your development environment isn't on your computer is always a huge downgrade. Work with VM? -- sooner or later you'll have problems with forwarding your keyboard input to the VM. Work with containers? -- no good way to save state, no good way to guarantee all containers are in sync etc. God forbid any sort of Web browser-based solution. The number of times I accidentally closed the tab or did something else unintentionally because of key mapping that's impossible to modify...

However, in some situations you must endure the pain of doing this. For example, regulatory reasons. Some organizations will not allow you to access their data anywhere but on some cloud VM they give you very botched and very limited control over. While, technically, these are usually easy to side-step, you are legally required to not move the data outside of the boundaries defined for you by the IT. And so you are stuck in this miserable situation, trying to engineer some semblance of a decent utility set in a hostile environment.

Another example is when the infrastructure of your project is too vast to be meaningfully reduced to your laptop, and a lot of your work is exploratory in nature. I.e. instead of typical write-compile-upload-test you are mostly modifying stuff on the system you are working on to see how it responds. This is kind of how my day-to-day goes: someone reported they fail to install or use one of the utilities we provide in a particular AWS region with some specific network settings etc. They'd give me a tunnel to the affected cluster, and I'd have some hours to spend there investigating the problem and looking for possible immediate and long-term solutions. So, you are essentially working in a tech-support role, but you also have to write code, debug it, sometimes compile it etc.

aidos
0 replies
7m

Sounds like you’re talking about something else (more like the Citrix / virtual desktop type model - I don’t know the name).

The idea here is that you use a VM (cloud or local) to run your compute. Most people can run it in the background without explicitly connecting to it.

trevor-e
0 replies
1h18m

From what I remember (left Stripe in late 2022) much of Stripe's codebase was/is a tangled Ruby "big ball of mud" monorepo due to lack of proper modules. Basically a lot of the core modules all imported code from each other with little layering, so you couldn't deploy a lean service without pulling in almost all of the monorepo code. And due to the way imports worked it would load a ton of this code at runtime. This meant that even a simple service would have extremely high memory usage and be unsuitable for a local dev environment where you have N of these bloated services running at the same time. There was a big refactoring effort to get "strict modules" in place to cut down on this bloat which had some promising results. I'm not an expert in this area but I believe this was the gist of it.

domenkozar
21 replies
10h35m

We've been building https://devenv.sh for that reason, I expect more companies to go back to local development once they see DX has improved locally.

stavros
16 replies
9h19m

Nix is the right tool for this, developing a tool to make Nix's UX easier is a great idea. Thanks for this!

dirtbag__dad
8 replies
9h12m

What about dev containers?

stavros
4 replies
7h52m

You mean Docker? They tend to rot much more than I'd like, mostly because you forget to pin something at some point. With Nix, you can't forget.

1oooqooq
2 replies
5h22m

did the word rot change meaning recently?

pin is what causes rot, not what solves it.

otabdeveloper4
0 replies
3h20m

Good luck with your Docker containers in three years. (You're gonna need it.)

0x457
0 replies
47m

Different kind of rot. With nix and flakes, I can come back to a project 5 years later and as long as external dependencies (i.e. package sources) are still available it will bring me back straight to that environment like it was yesterday.

If you have a Dockerfile from 5 years ago...well good luck building it today.
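
A minimal flake is enough to get that property - the generated flake.lock records the exact nixpkgs revision, so entering the shell years later resolves to the same environment. The package list below is just an example:

    {
      # flake.nix - inputs get pinned in flake.lock, so `nix develop` later
      # resolves the exact same nixpkgs revision. Packages are illustrative.
      inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";

      outputs = { self, nixpkgs }:
        let
          pkgs = nixpkgs.legacyPackages.x86_64-linux;
        in {
          devShells.x86_64-linux.default = pkgs.mkShell {
            packages = [ pkgs.ruby pkgs.postgresql pkgs.nodejs ];
          };
        };
    }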

janjongboom
0 replies
7h8m

FYI, I've helped set up StableBuild (https://www.stablebuild.com) to help pin stuff in Docker that's normally virtually impossible to pin (e.g. OS package repos, Docker base images, random files from the internet, etc.)

1oooqooq
1 replies
5h23m

I missed any description of the actual container content in those examples.

0x457
0 replies
49m

IIRC, it uses what is defined for the shell environment. Just instead of activating it on your machine, it produces an OCI image with that environment.

I have nixOS definitions that I can use to make an SD card image, take over a running linux system via ssh, deploy to nixos via ssh, or deploy to a local system - all from one definition.

eadmund
6 replies
4h59m

Nix is the right tool for this

Or Guix, which has the advantage of a more pleasant language.

earthling8118
2 replies
4h40m

The language isn't the problem with nix.

nitsky
0 replies
31m

What is?

0x457
0 replies
53m

It's not "the problem", but it's a problem. It's better than alternatives, but it's hacky nature shows.

otabdeveloper4
1 replies
3h21m

That's, like, just your opinion, man.

Scheme and/or Lisp is literally the worst language choice for this problem domain.

0x457
0 replies
54m

I wouldn't say it's the worst. I don't like Lisp and co, but I think it's alright for this. I don't like Guix for a very different reason.

chpatrick
0 replies
3h47m

YMMV, I really don't like lisp braces personally.

evnix
3 replies
9h56m

How is this better or different from tools like dev which use docker

drakerossman
2 replies
9h41m

It (obviously) leverages Nix, which in turn means the environment is declarative and fully reproducible (not "reproducible" as in docker). Now, you can use just Nix's devShells, but with devenv you have a middle ground between just the Nix package manager and a full-fledged NixOS module system. Basically, write out one line of code - and you've got your Postgres, another one - a full linter set up for whatever language you're using, etc.
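
Roughly like this - a sketch of a devenv.nix using options from devenv's languages/services modules as I remember them, so treat the exact option names as illustrative:

    # devenv.nix - each line pulls in a whole service or toolchain.
    { pkgs, ... }: {
      packages = [ pkgs.git ];

      languages.ruby.enable = true;      # one line: Ruby toolchain in the shell
      services.postgres.enable = true;   # one line: a project-local Postgres

      pre-commit.hooks.shellcheck.enable = true;  # one line: a linter wired into git hooks

      processes.web.exec = "bundle exec puma";    # example long-running process
    }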

tmerse
1 replies
7h58m

Can I also get the security/isolation benefits that a duly configured docker/podman can provide (container can only act on mounted volumes, non-root user, other seccomp settings)?

I feel better doing my "npm install"s in such an environment (of course it's still not a VM – but that's another topic).

When I read about nix, reproducibility is a goal, but security/isolation is a non-goal.

ParetoOptimal
0 replies
5h48m

You can generate fully reproducible OCI/docker containers with devenv, so yes, I think so.

https://devenv.sh/containers/

rvz
17 replies
9h11m

This isn't recommended practice really and there is nothing about this which justifies having to maintain huge code bases in a single folder or multiple folders in one larger one.

Won't be surprised to see that many would probably need a safari map or README documentation in every single folder to navigate a repository as large as Stripe's.

Sounds like an emergence of a new bad practice if you are having to praise how large your code base is.

pavlov
6 replies
9h5m

Meta also has a massive monorepo accessed primarily through cloud devservers.

When several of the world’s most successful software companies use this approach, it’s hard to argue that it’s inherently bad. Of course it’s sensible to discuss what lessons apply to smaller companies who don’t have the luxury of dedicated tooling teams supporting the monorepo and dev environment.

n_ary
4 replies
7h17m

Just because some successful companies use some approach doesn't make it the best practice. I have seen firsthand the nuisance of a monorepo, which took almost 15 minutes to correctly switch branches on Intel machines (and decently spiked the CPU by causing Windows Defender to panic). It has the decent benefit of easy code sharing, but build and test are soul sucking experiences, and if someone decides to run some updated formatter and linter rule accidentally, the whole MR becomes a nightmare to correctly review (once had an MR with 2k+ changes and had to ask for a rollback so they would commit only what they actually wanted to change).

tail_exchange
0 replies
5h17m

took almost 15 minutes to correctly switch branches on Intel machines

This can probably be fixed with trivial tuning. Just configuring Git to fetch only your branches would speed up the branch switching significantly.
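
For example, something like this (the branch name is hypothetical):

    # fetch only the branch you work on instead of every ref
    git config remote.origin.fetch "+refs/heads/my-feature:refs/remotes/origin/my-feature"

    # built-in filesystem monitor + untracked cache (Git 2.37+) cut the stat() storm
    # that makes status/checkout slow on huge working trees
    git config core.fsmonitor true
    git config core.untrackedCache true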

build and test are soul sucking experiences

Why? It doesn't have to be. If you are going to build the entire monorepo, then yes, but this should only happen when you are running CI, and even then you can break down the builds into smaller components.

the whole MR becomes a nightmare to correctly review

Not if you set up code ownership properly. You also need to think what happens in case of emergencies, so having a selected list of "super users" and users with permissions to bypass reviews is important.

It sounds like this company wanted a monorepo, but nobody invested any money or time to actually think about developer productivity. When this happens, yes, of course it won't be good, because no project succeeds like this. The nice thing about a monorepo is that instead of 1,000 repos with tooling all over the place and no specialist to take care of them, you can have one repo with really good tooling and a team dedicated to just keep it running smoothly. But if nobody is actually taking care of the monorepo, it will rot just like any other codebase.

riwsky
0 replies
3h58m

“Someone autoformatted the whole thing under new settings at the same time as introducing a new feature” is hardly a monorepo problem. That could be a pain in the ass to review even in a single file. But the flip-side, of someone cleanly wanting to do a mass autoformat or autorefactor, is much easier in a monorepo than in split repos.

kccqzy
0 replies
3h2m

Nothing you describe is inherent to monorepos. Git is slow yes but go use hg. Build and test are slow? That's a CI problem: you didn't allocate enough resources to the build system. Someone ran a formatter accidentally? That's that someone's mistake.

aidos
0 replies
5h17m

Why would you feel obliged to accept a MR in which someone has accidentally changed large amounts of code?

mootoday
0 replies
4h22m

Meta also uses React and we know what mess that introduced to the world...

lijok
6 replies
8h50m

Won't be surprised to see that many would probably need a safari map or README documentation in every single folder to navigate a repository as large as Stripe's.

No different to having thousands of smaller repos instead.

I personally dislike monorepos, for very niche, in-the-weeds operational reasons (as an infra person), but their ergonomics for DX cannot be understated.

__jonas
4 replies
8h17m

The 'ergonomics for DX' benefit is that you can share code across projects without having to go down the path of creating a package / library pushed to some internal registry and pulled by each project right?

Or are there any other aspects to the monorepo architecture that make it beneficial for large companies like that?

Just curious, I've never worked in such an environment myself.

triceratops
0 replies
4h20m

Dependency versioning is much smoother.

Example: Service A requires version 1.1 of libFoo and libFoo 1.1 requires version 0.1 of libBar. But Service A also directly uses libBar version 0.2. Now you have a conflict.

If libFoo and libBar are internal code stored in a monorepo they're automatically version-compatible because there is only one version of both.

oftenwrong
0 replies
37m

To put it in the most general terms: It provides the same value that using a VCS has for a project, but applied to the entire company.

In a standalone project, would you accept a change that is incompatible with other code in the project? For example, would you allow a colleague to change a function in a way that breaks the call sites? No, you probably would not.

The attitude within monorepo shops is that this level of rigour should be applied to the entire company. Nobody should be able to make a change anywhere if it would break anything elsewhere, or they should only be permitted to do so with intention. There are caveats to this, but that is the general idea.

dezgeg
0 replies
6h59m

In addition to what you mentioned, the ability to atomically commit to a library and all of its consumers. And for a change to a library run the tests of all of its consumers as well.

bastawhiz
0 replies
6h16m

Every host running a particular commit is running the code you think it is. No submodules or internal packages. If you updated the Button component in the design system, when your commit is deployed, every service that gets deployed has the new button now.

chrisweekly
0 replies
7h20m

understated -> overstated

papruapap
0 replies
8h16m

imo monorepos are great, but the tooling is not there, especially the open-sourced ones. Most companies using monorepos have their own tailored tools for it.

bastawhiz
0 replies
6h18m

Won't be surprised to see that many would probably need a safari map or README documentation in every single folder

Is...documentation a bad thing?

reillys
12 replies
10h21m

I chatted to Nelson when I was designing brisk (https://github.com/brisktest/brisk) and his insight informed the development of it.

Among other things, Brisk allows you to run tests for your local code changes in the cloud (basically the pay mini test piece but for any test runner)

We also have a sync step much like the one described here and allow users to run one off commands (linters, tsc etc)

IshKebab
10 replies
9h54m

Can't you achieve all that just using a build system with reliable remote builds & caching e.g. Bazel, Buck, Please, etc?

That also avoids hacky sync scripts.

reillys
9 replies
9h18m

No you can’t.

They don’t work from your local development env and also work in your CI env.

Mostly Brisk was designed to run your complete test suite on every code save (i.e. local save) but it also works great from your CI.

We can run entire test suites in seconds which is performance you don’t get with those systems you named (which are generally for building/compiling)

Maxious
5 replies
8h54m

I'd suggest you revise your competitor analysis. Bazel definitely has a test command that, with remote execution and caching, absolutely allows you to run entire test suites in seconds* both locally and in CI, e.g. https://blog.aspect.build/typescript-with-rbe
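
The wiring is basically a few lines of .bazelrc - the endpoints below are placeholders for whatever remote cache/executor you run:

    # .bazelrc - remote cache and execution; endpoints are placeholders
    build --remote_cache=grpcs://cache.example.com
    build --remote_executor=grpcs://rbe.example.com
    build --remote_download_minimal   # don't drag intermediate outputs back to the laptop
    test --test_output=errors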

reillys
4 replies
6h33m

This blog post says 2 and a half minutes not seconds.

I know Bazel is a build system which distributes builds among remote machines.

In fact using any computer language you can achieve these goals - you just need to program it.

So yes you could probably do all the things with all the things, but Bazel does not solve this problem out of the box.

I wonder why stripe didn’t “just use Bazel”.

zrail
0 replies
6h9m

First release of Bazel was in 2015, when Stripe was already 5 years old and the progenitor of this tooling was already running with several dozen users.

nkohari
0 replies
4h10m

Stripe does use Bazel. It just didn't exist before Stripe built some of its own internal systems, but it's gradually replacing ~everything from a build standpoint.

The one thing to know about Bazel is that it's both incredibly impressive, and also one of the least ergonomic pieces of software ever created. It's very clearly an internal project which was cleaned up and open sourced without any attempt to make it more usable outside of Google.

Bazel's kind of like Kubernetes in a way -- you don't actually get enough benefits to adopt it until you're at a certain point in the company lifecycle, and to get to that point you usually have to build other systems first. Then you have to gradually replace those systems with Bazel.

IshKebab
0 replies
6h23m

This blog post says 2 and a half minutes not seconds.

It's meaningless to say "we can run tests in seconds". You can't run my tests in seconds because they're single threaded and take 10 minutes. The important thing is the speedup, and they got a pretty good speedup. Arguably the nop build/test time is important too but it doesn't look like they measured that.

Bazel does not solve this problem out of the box.

Yes it does.

I wonder why stripe didn’t “just use Bazel”.

In my experience it's because setting up Bazel is a) more work than setting up some ad-hoc build system (Make or CMake or whatever) and b) difficult to switch to retrospectively. So it only gets used where you have people who are experienced enough to know that you will wish you had started with it, and can convince the inexperienced people that it's worth the effort.

Usually you get too many inexperienced people saying "it's too difficult; we'll be fine with Make".

reillys
0 replies
9h17m

To be clear, the sync step is used for the test suite execution, not only the one-off command running - it’s just something we can also easily do because we have a hot env in the cloud

joshuamorton
0 replies
8h55m

They don’t work from your local development env and also work in your CI env.

This is one of the biggest selling points of bazel-like build systems. Like to the extent that, for some changes, bazel can say "even though you changed this source file, I can be 100% certain that that change didn't affect any tests and so I will not run them"

IshKebab
0 replies
7h32m

They don’t work from your local development env and also work in your CI env.

Err yes they do? Unless you mean something really specific that I'm not getting?

riffraff
0 replies
6h27m

Brisk allows you to run tests for your local code changes in the cloud

how does this work for interactive debugging?

I was going to ask the same about the system in TFA but I might as well ask you :)

delhanty
10 replies
7h36m

Some caveats: It’s been nearly five years, and I have no doubt that I have misremembered some of the specific details, even though I’m confident in the overall picture. I’m also certain that Stripe has continued evolving and I make no claim this document represents the developer experience at Stripe as of today.

Are there any more recently ex-Stripe folks here willing and able to comment on how Stripe's developer environment might have evolved since the OP left in 2019?

nkohari
3 replies
4h32m

I spent 4.5 years at Stripe, and left in March.

The biggest difference not mentioned in the article is that code is no longer kept on developer machines. The sync process described in the article was well-designed, but also was a fairly constant source of headaches. (For example, sometimes the file watcher would miss an update and the code on your remote machine would be broken in strange ways, and you'd have to recognize that it was a sync issue instead of an actual problem with your code.) As a result, the old devbox system was superseded by "remote devboxes", which also host the code. Engineers use VSCode remote development via SSH. It works shockingly well for a codebase the size of Stripe's.

There are actually several different monorepos at Stripe, which is a constant source of frustration. There have been lots of efforts to try to unify the codebase into a single git repo, but it was difficult for a lot of reasons, not the least of which was the "main" monorepo was already testing the limits of the solution used for git hosting.

Overall, maintaining good developer productivity is an extremely challenging problem. This is especially true for a company like Stripe, which is both too large to operate as a "small" company and too small to operate as a "big" company. Even with a well-funded team of lots of super talented people putting forth their best efforts, it's tough to keep all of the wheels fully greased.

jcmfernandes
1 replies
4h8m

Thanks for this. Can you share the experience of those who don't use VS Code?

tail_exchange
0 replies
3h9m

IntelliJ is also supported. If you want to use something else, like VIM, then you need to ssh into the remote devbox machine. They have support for custom dotfiles, so you can set up your cool VIM environment for all your remote devboxes.

If you don't want remote devboxes, the regular devboxes still work. You just need to deal with the additional pain for syncing the files.

cynicalpeace
0 replies
2h54m

Glad to see that they moved to code living with the execution environment. The code living separate from the execution environment seemed like too much overhead and complexity for not enough benefit.

Especially given VSCode, or Cursor ;), work so well via ssh.

To the engineers that don't want to use those IDEs it might suck temporarily, but that's it.

bhuga
2 replies
3h6m

Some important differences from 2019:

* Code is off of laptops and lives entirely on the dev server in many (but not all) cases. This has opened up a lot of use cases where devs can have multiple branches in flight at once.

* Big investments into bazel.

* Heavier investment into editor experiences. We find most developers are not as idiosyncratic in their editor choices as is commonly believed, and most want a pre-configured setup where jump-to-def and such all "just work".

eikenberry
0 replies
1h13m

That last point has long been a red flag when interviewing. A developer who doesn't care about their tooling also tends to not care about the quality of their work.

cynicalpeace
0 replies
2h57m

I'm glad to see that first bullet point. The code living separate from the execution environment seemed like too much overhead and complexity for not enough benefit.

artyom
1 replies
5h53m

Not ex-Stripe but in "close relationship" with them since its inception and there's a clear mark in my calendar circa end of 2018 when their decisions and output started to become... weird, or ill-designed.

I don't think it has to do with the dev environment itself, but I'd blame such a thing for allowing them to deliver "too fast" without thinking twice. Combine that with new blood in management and that's an accident waiting to happen *

They're the best in business still, but far from the well-designed easy-to-use API-first developer-friendly initial offering.

* Pure speculation based on very evident patterns

rattray
0 replies
5h22m

Ex-Stripe ('17-'20) here. Agree.

Though I am under the impression that things have gotten more sensical internally over the last year or so.

Note also that the devprod team has largely been shielded from the craziness, and may still be making good decisions (but I don't know what they are in this realm personally).

chaosphere2112
0 replies
3h31m

I was only there in 2022, but at that point there were in fact three or more monorepos (forked roughly based on toolchain - Go and Scala in one, primarily Ruby in the one detailed here, and there was one for the client Stripe API libs that was JS only). There may have been more.

KolmogorovComp
7 replies
9h57m

In addition, Stripe’s monorepo was (to our knowledge) the largest Ruby codebase in existence

Bigger than Shopify's?

spacemonkey92
3 replies
9h15m

I also wonder how they handle merge requests in a monorepo, especially when it comes to the code review process.

shepwalker
0 replies
3h21m

Hi! I work at Stripe on this. What're you curious about?

popinman322
0 replies
5h21m

It's possible to get stuck in merge hell where all your reviewers ok the PR but someone merged a conflict 2 seconds ago, or you've got a reviewer in Singapore while you're in SF and conflicts appeared overnight.

In general it was pretty rare, in my experience. The code bases were pretty well modularized.

azthecx
0 replies
8h51m

Typically you have owner files or similar in the subprojects that are read by automation tooling and humans alike

froydnj
1 replies
4h33m

The most recent publicly available numbers (that I know of, maybe there's a talk available somewhere that's more recent) are from https://stripe.com/blog/sorbet-stripes-type-checker-for-ruby

currently amounting to over 15 million lines of code spread across 150,000 files

The monorepo has only gotten bigger over the last two years (source: I work at Stripe).

froydnj
0 replies
2h27m

I should also note that number is Ruby files only.

Macha
0 replies
9h35m

So from a gut feeling that sounds right, finance is a pretty complicated domain with a lot of per vendor interactions, and Shopify outsources their payment stuff to Stripe.

Also on a headcount level, Google tells me Shopify has 3,500 employees to Stripe's 9,500. Obviously neither company is comprised entirely of engineers, so this is a ballpark estimate.

GitHub feels like the real case where there might be a larger codebase. It's in the middle for employees (6,500), but it's existed longer than Stripe (though not as much longer as my gut feeling told me, interestingly)

srvaroa
5 replies
7h59m

"This scale – the scale of devprod, and in turn the scale of the overall organization, such that it could afford 10 FTEs on tooling – was a major factor in our choices"

Is basically the summary for most mono/multi repo discussions, and a bunch of other related ones.

mhh__
1 replies
5h14m

Not sure.

I think a lot of this type of thing just comes because with a monorepo you can actually see the problems to solve, whereas you can easily end up with the same N engineers firefighting the same problems K times across all your polyrepos.

bluGill
0 replies
5h7m

You have different problems with both. Some problems are hidden in one, but there is no one best answer. (unless your project is small/trivial - which is what a lot of them are)

bluGill
1 replies
5h8m

It doesn't matter if you have a mono-rep or multi-repo, you will need engineers on tooling to make it work if your project is large. There are pros and cons to both multi-repo and mono-repo with no one right answer (despite what some will tell you). They are different pros and cons, but which is best depends on your particular context.

srvaroa
0 replies
4h30m

Yeah that was my point. In the end both approaches can be fine (depends on your context). The real difference is that whatever choice you take, it will need the right investment in tooling and support.

klodolph
0 replies
5h13m

Multirepo also comes with cost overhead. I think people talk about it somewhat less. I’ve worked at multirepo and monorepo places, both, before. My current company has a multirepo setup and it sure seems like it comes with plenty of tooling to fetch dependencies. That tooling has to be supported by FTEs.

jdtig
5 replies
6h27m

Does Stripe use RoR?

The author mentions the codebase was Ruby, but I didn't see if they talked about Rails.

bastawhiz
4 replies
6h21m

It is Ruby but not rails

jdtig
2 replies
6h11m

Thanks. I wonder what the experience is like working on a very large codebase with or without a framework. E.g. Stripe vs Shopify.

Or if the framework is barely noticeable at that scale and doesn't really matter anymore. That's the impression I get for Instagram (which was built with Django).

esprehn
0 replies
4h5m

At that scale there's certainly a framework and many in house libraries with opinions and patterns. It's just not rails.

bastawhiz
0 replies
3h59m

They had their own ORM, and a web framework built on Sinatra. It wasn't as though you needed to reach far for a tool if you needed one

jcmfernandes
0 replies
2h36m

Do they use zeitwerk?

Aeolun
5 replies
9h0m

They decided to keep the code on the local machine, but the language server on the remote one. That seems like a recipe for inconsistency. You only get relevant results from your language server once your code has synced.

bastawhiz
2 replies
6h19m

I was at Stripe until 2022 and inconsistency with the language server was never an issue

aidos
1 replies
5h2m

Due to the work that this team put in though, right?

The choice to run dev environment far away from the files puts you in the position of needing to engineer your way past the inconsistency.

bastawhiz
0 replies
3h57m

Yes, almost certainly.

On the other hand, there was so much code that running everything on your own laptop was essentially out of the question. Doing a git pull after a long vacation locked up your dev box for a hot minute while it checked all the types—doing the same thing on your MacBook would be painful at best.

paxys
0 replies
4h21m

The code syncs on every keystroke. Consistency isn't an issue unless you are having connection issues. And if you are then pretty much all development is broken anyways.

Hackbraten
0 replies
7h21m

The article mentions that the LSP itself already has baked-in support to enable editors to send chunks of unsaved edits to the language server (LS) as they happen.

What Stripe’s configuration introduced is that they used a remote LS instead of the default local LS. Regardless, VS Code already defers LSP communication until it feels idle, and developers are used to that. So I wouldn’t expect a remote LS to significantly impact the level of inconsistency that developers already accept when using a local LS.
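
For reference, this is the standard textDocument/didChange notification editors already stream on each edit; pointing it at a remote LS only changes where the connection goes. The URI and edit contents below are made up:

    {
      "jsonrpc": "2.0",
      "method": "textDocument/didChange",
      "params": {
        "textDocument": { "uri": "file:///home/user/project/example.rb", "version": 42 },
        "contentChanges": [
          {
            "range": {
              "start": { "line": 10, "character": 4 },
              "end": { "line": 10, "character": 9 }
            },
            "text": "amount"
          }
        ]
      }
    }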

p-o
3 replies
3h56m

It's always so enlightening to have articles like this one shed light on how companies operate at scale. It goes without saying that many of the problems Stripe faced with their monorepo aren't applicable to smaller businesses, but there are still bits and pieces that are applicable to many of us.

I've been working on an ephemeral/preview environment operator for Kubernetes (https://github.com/pier-oliviert/sequencer) and I can agree with a lot of the things OP said.

I think dev boxes are really the way to go, especially with all the components that make up an application nowadays. But the latency/synchronization issue is a hard topic and it's full of tradeoffs.

A developer's laptop always ends up being a bespoke environment (yes, Nix/Docker can help with that), and so, there's always a confidence boost when you get your changes up on a standalone environment. It gives you the proof that "hey things are working like I expected them to".

draw_down
2 replies
3h32m

Right, dev boxes do not need to do double duty as a personal computer plus development target, which allows them to more closely resemble the machine your code will actually run on. They also can be replaced easily, which can be helpful if you ever suspect something is wrong with the box itself - if the new one acts the same way, it wasn't the dev box.

I don't recall latency being a big problem in practice. In an organization like this, it's best to keep branches up to date with respect to master anyway, so the diffs from switching between branches should be small. There was a lot of work done to make all this quite performant and nice to use. The slowest part was always CI.

tmpz22
1 replies
2h27m

I feel like we're not getting the right lessons from this. It feels like we're focusing on HOW we can do something versus pausing for a brief moment to consider if we SHOULD in the first place.

To me the root issue is that the complexity of production environments has expanded to the point of impacting complexity in developer environments just to deploy or test - this is in conjunction with the expanding complexity of developer environments just to develop, i.e. webpack.

For very large well resourced organizations like Stripe that actually operate at scale that complexity may very well be unavoidable. But most organizations are not Stripe. They should consider decreasing complexity instead of investing in complex tooling to wrangle it.

I'd go as far as to suggest both monorepos and dev-boxes are complex toolchains that many organizations should consider avoiding.

epinephrinios
0 replies
2h3m

Absolutely, I worked on tech behemoths and smaller companies. The dev experience was significantly better when all development was local. I even worked on initiatives to move development away from the cloud, and although other devs were skeptical, they ended up loving it.

physicsguy
2 replies
2h36m

I think for smaller companies, you can get a long way towards a lot of this with judicious use of docker-compose and convenience scripts in a Makefile. As long as you don't do anything stupid like try and spin up 100 services when you're a team of 8, most laptops these days are sufficiently capable of handling a database, Redis, your codebase, and something like LocalStack.
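
The Makefile part can stay tiny - a hypothetical sketch (service and target names are made up, and recipes need tab indentation):

    # hypothetical convenience targets wrapping docker compose
    up:          ## start the backing services
    	docker compose up -d db redis localstack

    test: up     ## run the suite against the local stack
    	bundle exec rspec

    down:
    	docker compose down -v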

PedroBatista
1 replies
1h15m

I would say you can even go a looong way without any Docker at all.

And for the large majority of companies/projects, if your project is so complex and heavy on resources that it doesn't fit on a modern laptop, the problem is not the laptop, it's the whole project and the culture and cargo-cult around "modern" software development.

vlovich123
0 replies
32m

Containers/VMs are a nice way to isolate away any machine configuration discrepancies. Conversely, they do encourage the use of non-hermetic, non-deterministic build systems, which come with other issues too (e.g. speed differences surfacing race conditions in the build).

mleo
2 replies
5h0m

I use syncthing to manage the synchronization of files between a local laptop and a remote development server. The code base is upwards of 20 years old and has runtime dependencies on Windows. I can run unit tests locally on a very fast MacBook Pro or run them much slower on a Windows VM. With syncthing I can easily edit files locally or remotely and they are available locally for source control.

The worst problem is refining the ignore settings to ensure only code is synced, preventing conflicts on derived files, and making sure no rule accidentally matches code file names.
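
For illustration, the sort of .stignore I mean - these patterns are examples, not my actual rules:

    // .stignore - sync source, skip derived files (example patterns only)
    (?d)bin
    (?d)obj
    (?d)node_modules
    *.suo
    *.user
    .git

The (?d) prefix lets syncthing delete those matches if they would otherwise block removing a directory during sync.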

shepherdjerred
0 replies
4h11m

I like Unison, though I found Mutagen a bit better.

https://mutagen.io/

secondcoming
1 replies
3h29m

What's the easiest way of sharing things like protobuf definitions across multiple separate repos and making sure things are always in sync?

MrDarcy
0 replies
1h23m

buf.build

truetraveller
0 replies
4h9m

"I’ve described a lot of fairly-involved custom tooling; we needed enough engineers to build and maintain it, and enough “customer” engineers for that investment to pay off."

This is so important when deciding to re-invent the wheel. I've gotten bitten by this many times.

pjmlp
0 replies
7h23m

Yet another replay of timesharing development experiences. I guess we need a couple more generations to count how many times the pendulum swings back and forth during a developer's lifetime.

mootoday
0 replies
4h23m

I've worked with remote dev environments for many years, including some time with one of the providers of such a service.

It became clear to me that cloud-only is not the way to go, but instead a local-first, cloud-optional approach.

https://mootoday.com/blog/dev-environments-in-the-cloud-are-...

crabbone
0 replies
3h27m

NB. What the article describes isn't a developer environment in the cloud. It's testing in the cloud. The editor in their model lives on the programmers' laptops, the editing happens there as well and so on. The code is deployed to cloud infrastructure for testing.

bool3max
0 replies
7h53m

Off-topic but the font on this blog is stunning - after some digging it seems to be "Vollkorn".

anonzzzies
0 replies
6h12m

We use similar practices in our 3.5 person team; we work via code-server and Aider with our own tooling on VPSs and this gets synced to execution VPSs which run dev versions, a lot of sentry logging and tests (mostly playwright these days). There is also a vps which does builds all day and logs to Sentry too. We can almost instantly get on our own test versions and see what we did, and, over the space of some seconds to minutes we see test and build data coming in. It works incredibly well for many years already. Onboarding people is easy and no one ever has 'it doesn't build on my system' as that's not something we do (you can of course, all scripts are there but why waste the time?).

I grew up with mainframes, minis and unix batch and/or multiuser machines; for me this is the best way for business applications. I didn't particularly like the move to local all that much.

adamdecaf
0 replies
2h35m

We’ve been using a hundred repositories and a hundred Go services in a local docker-compose setup that’s worked fairly well. CI runners can struggle if their disks can’t keep up with Docker.

It comes up that we should make a devprod for front-end folks so the backend is abstracted away more.

Overall a lot of people prefer local dev because it gives them access to the entire stack, lets them run branch images easier, and has better performance than remote boxes.

https://moov.io/blog/education/moovs-approach-to-setup-and-t...