Random thought: Every IDE I've used gives me the folder structure of the project on the left as a standard directory tree. Does any support navigating a project as a graph of dependencies?
This approach sounds great as a low-maintenance model for open-source projects with many ad hoc contributors. For projects with dedicated engineers, consider ADRs instead. These require more maintenance, but capture the "why" and "alternatives considered", which can be immensely helpful when rearchitecting.
We have those pointless documents written by architects at my place of work. Most of them have something along the lines of:
* Microservices, Kafka, Kubernetes, because what if we have a billion users compared to the current 4k users.
* GraphDB, because what if SQL won't be enough.
* ElasticSearch, because what if we have to do full-text search along with stats.
But most of these documents are just shorthand for any of these: "I want to try this new architecture/technology because it's fun, my colleague at a FAANG uses it, I've read the BOOK, and it looks good on my CV".
Then when they jump in to design the next big project, we have to deal with their decisions: having twice as many services as team members, and having to keep all the above-mentioned DBs and technologies in sync (which is of course a simpler problem than making those "big architecture decisions").
Sorry for the rant :(
Uh, sounds like these documents are the opposite of pointless, and are in fact working exactly as intended.
depends on the likelihood of those "what if" questions.
A big chunk of good architecture is judgement: navigating the path between "building for the future" and YAGNI. It's much easier to end up at one end or the other, for example:
* k8s and microservices and sharded, cached data storage and Angular and NX in case we get to internet scale.... on our internal web app with max 400 users
* we're not gonna use a database because we don't know we'll need one... on an accounting system that needs to handle thousands of payments per hour.
I've seen both. I read GP's post as: we have ADRs that justify paying the complexity tax when the balance of probability/evidence doesn't justify it. Good judgement is hard. It's one area where good devs/architects can make a really meaningful impact.
I was being a bit of a prick with my comment, I could've written it more clearly. Sorry for that. My point is this: what do you think is likely to happen if the requirement to write these documents is removed? a) In the absence of an impetus to justify architectural decisions, people cease creating overengineered architectures. b) People with an inclination to overengineer systems continue to do so, without documenting their reasoning. If (a) is plausible, maybe the documents are indeed pointless (or worse). I'm arguing that (b) is far more likely, and it's a strictly worse state of affairs than one in which decision records are written.
thanks for the reply and agree with it. Understanding the "why" - and "why not" - is very valuable. Those are the best code comments at any level of abstraction.
More concretely, if I was asked to consider rearchitecting a service that documented that it "used Elasticsearch just in case full-text search was needed" and that requirement never materialized, I'd feel much more comfortable dropping ES than if I was going in blind.
I don’t think “instead” is the correct word here. “As well” I think fits better.
ARCHITECTURE.md will have the current state of the architecture. ADRs is the log of decisions that got you there. Both are very useful.
If you maintain an open-source project in the range of 10k-200k lines of code, I strongly encourage you to add an ARCHITECTURE document
I like this idea, but IMHO, regardless of repo size, architecture can still have some place in a Readme. For example, I purposely placed a Mermaid sequence diagram[1] in the main Readme because I think it's important that all readers see and understand its workflow[2]
[1] https://mermaid.js.org/syntax/sequenceDiagram.html
[2] https://github.com/hbcondo/revenut-app?tab=readme-ov-file#-w...
A natural language mermaid diagram builder would be really neat. Something to go think about...
ChatGPT had a good stab at it out of the box. It often only takes minor editing (just ask it to return mermaid)
It's wrong often enough that writing raw mermaid is still faster at this time, IME.
One of the lessons I’ve learned is that the biggest difference between an occasional contributor and a core developer lies in the knowledge about the physical architecture of the project. Roughly, it takes 2x more time to write a patch if you are unfamiliar with the project, but it takes 10x more time to figure out where you should change the code.
Hear, hear! This sounds like such awesome advice.
I wish we had better tools to visualize architecture of running systems. It's crazy to me that reading the code or a markdown file are still so state of the art. Maybe if someone's fancy they'll have some nice Mermaid diagrams. I want the architecture to be able to show itself live. Broadscale macroscale observability, baked in.
I think this would help everyone be able to appreciate & grok computing much more, would help humanity augment itself.
I'd like to see something like [dep-tree](https://github.com/gabotechs/dep-tree) 's visualization augmented with a keyword or LLM vector search. Your query would highlight relevant files and clusters.
That's pretty much what https://sourcegraph.com/ are selling, is it not? (granted, they don't visualize their graph quite so graphically)
My experience was that on every project I was onboarded to, I was shown such an architecture diagram with a brief explanation of its components.
Now I'm surprised how uncommon this is in open source.
The "explanation of its components" is the problem: someone needs to do it. Open source projects don't have an employee's first day, so they don't have this introduction.
They're not good if you don't have a sea of time, though. (An employee is expected to take a few weeks before getting anything done on their own.) I'm a security consultant, so we get to see a brand new one of these every two weeks and the problem with these explanations is that they are ad-hoc, unstructured, and mention lots of irrelevant details because the speaker has the curse of knowledge.
Perhaps someone new to the repository should write this thing once, after which it can just be maintained. Second best is to just have anybody write it down, taking a minute to think about what goes in there and what doesn't rather than doing it always on the fly, because as the author says:
this file should describe the high-level architecture of the project. Keep it short: every recurring contributor will have to read it. Additionally, the shorter it is, the less likely it will be invalidated by some future change.
The "explanation of its components" is the problem: someone needs to do it. Open source projects don't have an employee's first day, so they don't have this introduction.
Also a problem at my $DAYJOB: the previous rockstar devs didn't provide anything like this, so it's not an open-source thing, it's a the-code-is-the-docs mentality.
I've always found this to be a very useful practice. Many projects have a few core files (or packages / modules / whatever) where most of the changes happen. Being able to familiarize new contributors (or old returning ones) with those quickly really helps the startup time on a project.
I've added architecture files to projects at multiple jobs now [0], [1] and they've been well received. They're not perfect, but they're better than nothing.
[0]: https://github.com/zapier/zapier-platform/pull/324
[1]: https://github.com/stripe/stripe-cli/blob/master/ARCHITECTUR...
Many projects have a few core files (or packages / modules / whatever) where most of the changes happen
Is there any way to view this automatically on GitHub? E.g. some kind of file change heatmap.
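Not on GitHub itself as far as I know, but a rough heatmap is easy to compute locally from git history. A minimal sketch that counts how often each path appears in `git log --name-only` output (the `change_counts` helper is invented for illustration, not a real tool):

```python
from collections import Counter

def change_counts(git_log_output: str) -> Counter:
    """Count how often each path appears in `git log --name-only` output.

    Expects the output of: git log --name-only --pretty=format:
    (empty pretty format, so only file paths and blank lines remain).
    """
    return Counter(
        line.strip() for line in git_log_output.splitlines() if line.strip()
    )

if __name__ == "__main__":
    import subprocess
    out = subprocess.run(
        ["git", "log", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    for path, n in change_counts(out).most_common(20):
        print(f"{n:5d}  {path}")
```

The top of that list is usually a decent first approximation of the "few core files where most of the changes happen".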
A DEVELOPMENT or HACKING document citing the coding guidelines should be enough for contributors. When someone wants to contribute without understanding the code workflow, he should not contribute.
When someone wants to contribute without understanding the code workflow, he should not contribute.
There are definitely many projects where you don't need to understand every single aspect of the codebase to be able to contribute a meaningful fix or even a feature.
It benefits the contributor and the project if they have a simple pointer to e.g. where the application logic code is, or where the distribution-related code is, etc.
One particular aspect of project architecture I often see people do wrong: a failure to have clear dependency structure between directories (and, often, too much stuff within a directory). This is particularly common if there is a directory named something like "common", "util", or "misc" (I am not actually arguing against having such directories, only noting they are prone to confusion).
I developed the following rules, which can be automatically enforced if you explicitly write the ranks (don't try implicit ranking; that means you won't get sane errors when you violate the rules):
0. Every directory has an implicit dependency on all its contained children.
1. A directory that does not depend on any other directories has rank 0. It suffices to only use rank between sibling directories, and it's probably simplest to maintain, though global ranking does work.
2. A directory that depends on others has a rank of 1 + the highest rank among its dependencies. To ease refactoring you could loosen this to "has a rank that is strictly greater than the highest rank among its dependencies".
3. Thus, circular dependencies between directories are forbidden; refactor (preferably, by splitting directories; most projects are too merged already) until you have a DAG. (circular dependencies between files in a single directory are allowed, subject to language-specific caution)
4. Thus, it is forbidden for a subdirectory to depend on a parent (or ancestor). If you encounter this, move the relevant files to a new subdirectory (since depending on a sibling or uncle is okay; if rank only applies between siblings this means the parent has to add a dependency on the uncle so that it gets the correct rank).
5. Depending on a cousin (or nephew) directory should be treated as a dependency on that cousin's parent, though (depending on what amount of directory structure your language forces you to use) it may be a hint you're doing something wrong.
6. Each directory can produce at most one library (shared and static count as the same library) or executable (at least, user-facing ones; code generators and tests might not count). Note that for other reasons it's generally inadvisable to ship multiple shared libraries or multiple static libraries, though you might use them during development.
7. (YMMV) If a directory contains any generated files (or their inputs), it should not contain any other files. Note that there are at least 3 major workflows for generated files, so the details will vary, but isolating them is useful regardless.
Note again: this is both coarser and finer than build dependencies - we treat directories as units, but add conceptual dependencies. As a general rule, I find it useful to define that the client (which calls `connect`) depends on the server (which calls `listen` and `accept`), and/or the data consumer depends on the data producer. Admittedly I have not deeply considered the case of servers that are worker-like, but note that it is often still possible to satisfy both by splitting directories further.
Note that additionally defining a "weight" (1 + weight of dependencies), although possible, is not particularly useful at the directory level. Long chains are easier to understand than tangled messes, but have a higher weight, and we don't want to discourage splitting a directory into a chain.
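These rules can be enforced mechanically. A minimal sketch, assuming you keep the explicit ranks and the directory dependencies in two hand-maintained tables (all names here are invented for illustration; this uses the loosened form of rule 2, i.e. each dependency must just have a strictly lower rank):

```python
# Hand-maintained explicit ranks (per the rules: don't try implicit ranking).
RANKS = {"base": 0, "common": 1, "net": 2, "app": 3}

# Which directories each directory depends on.
DEPS = {
    "base": [],
    "common": ["base"],
    "net": ["common", "base"],
    "app": ["net", "common"],
}

def check_ranks(ranks, deps):
    """Return rule violations: every dependency must have a strictly
    lower rank (loosened rule 2), which also forbids cycles (rule 3)."""
    errors = []
    for d, ds in deps.items():
        for dep in ds:
            if ranks[dep] >= ranks[d]:
                errors.append(
                    f"{d} (rank {ranks[d]}) depends on {dep} (rank {ranks[dep]})"
                )
    return errors
```

Run it in CI and you get a sane error message naming the offending pair, rather than a mysterious build failure when someone introduces a cycle.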
=====
In my experience, a project of about 100 kLoC organized itself into about 20 immediate subdirectories of src/ with a max rank of about 10. This was roughly as follows:
0-1: core support for the language, compiler, and replacements for the standard library; these almost never change. If somehow you have other kinds of code that doesn't depend on these, consider artificially inflating their rank to at least 2. About 1 each.
2-3: most fairly-project-agnostic (but not polyfill-like) "common"-like directories. About 2-3 each, probably.
4-6: semi-project-specific common stuff, most executables/libraries. IME generated files tend to belong here, and may be responsible for splitting a directory into a chain of length 3. There are a lot of these, usually with 1-2 dependencies of the previous rank and several more of the rank before that; likely there is nothing that depends on the entirety of any previous rank.
7-9: dependencies of the most complicated executable. About 1 each, a single chain at this point, though they might still have low-rank dependencies. At this point there's a decent chance that a directory might depend on the entirety of rank 2.
10: the most complicated executable (at least for me, this was the executable that had a conceptual dependency on other executables, even if it didn't have a source dependency. If you have a simple executable that conceptually depends on a complicated one, the simple one would be the highest rank instead). Note that only one of its dependencies is a 9; the rest are in the 3-6 range.
An alternate structure with deeper nesting, which I considered but never bothered to implement:
src/base/ - the directories ranked 0-1 above
src/lib/ - everything that either is part of a shared library, or is used by multiple executables (all directories ranked 2-3, and some ranked 4-6)
src/each-executable/various/ - dependencies of a particular executable (including all directories ranked 7-9 above)
src/each-executable/main/ or just src/each-executable/ - the file containing `main` and as little else as possible (otherwise we're likely to confuse "code closely related to `main`" and "code that is executable-specific but not involved in many dependencies")
At top level, I never bothered to formalize it, but it was basically a single chain (though exactly what a "dependency" is isn't quite as clear here):
0: scripts/ - executable scripts used during the build that don't have to be built, or that might be installed with no more than a shebang update.
1: src/ - all source code, including that for tests and tool/
2: tool/ - built executables needed for later parts of the build. Refactoring to split this out from shebang scripts is nice for your sanity, but may be noisy, especially if your build system's dependencies are sloppy. In contrast to build/ these are never cross-compiled.
3: build/ (all other output (data and potentially-cross-compiled code); contains bin/, lib/, and share/ at least)
Hopefully, projects that get much bigger than this can be split into further subdirectories so that the dependency ranking only need be done separately within them. I'm not sure if any big project is actually that nice in whole, but you should at least be able to create sanity in part without too much distress.
That said, keep in mind that this is just one approach to this particular problem, and it is just one architecture-related problem. Particularly, if your tooling makes it difficult to split directories (or files for that matter), fix your tooling first (related: "recursive make considered harmful")
Very interesting approach, thanks for taking the time to write it! I'll be evaluating my own projects with these guidelines- I think I've structured some of them like this without thinking about WHY
I experimented with something similar in one of my larger side projects a couple of years ago:
https://github.com/shipmight/shipmight/blob/master/src/ARCHI...
At the top of each file there was a tree of links to other ARCHITECTURE.md-files in the repo, like this:
* ARCHITECTURE.md <- you are here
* backend/ARCHITECTURE.md
* backend/api/ARCHITECTURE.md
* backend/cli/ARCHITECTURE.md
* backend/ui/ARCHITECTURE.md
* backend/utils/ARCHITECTURE.md
* frontend/ARCHITECTURE.md
* internal-charts/ARCHITECTURE.md
A README.md in each package/module/top-level directory is also rendered by GitHub's UI in the file list at the bottom (in locations like this: https://github.com/shipmight/shipmight/tree/master/src/backe...), I've seen this in some projects as well.
I would be wary of extrapolating what the author is writing about here to general software projects. I think it makes a lot of sense on large open-source projects, where there are many contributors with little context; it is worth the effort to maintain such a document in that case. But all the developer-committed documentation I've seen on smaller work projects has inevitably become unmaintained.
I've done a few Architecture sessions with teams. It's always valuable. If anything, you'll learn that people in the team have very different ideas about the current and the ideal architecture. Making that explicit, alone, is worth a document.
And also: "documentation becomes unmaintained" is a very poor argument for not writing documentation, because any documentation, even outdated or subtly wrong documentation, is better than no documentation.
I think that's just more documentation to read which becomes outdated (read: lies) at the point in time someone moves code around with refactoring tooling in their IDE.
What about aspiring to "screaming architecture" instead? Don't hide your application domain in a "crates" directory. Do it the other way around.
I’m 100% on board with this. Even if it’s just a "see docs" pointer within the repo. I should be able to understand the intent of things without having to go read your website (should you remember to write one).
ASCII diagrams of components or your draw.io diagrams would go here. Knowing what you have is half the battle.
I love the idea here, but I loathe the example. TMI. Give the core idea, any communication points (sockets, APIs etc), maybe an abstract diagram.
I was a fan of all these little docs/diagrams-as-code standards:
- README-driven development
- ARCHITECTURE.md
- ADRs
- arc42
- C4
- etc.
Now I just put an Obsidian vault inside the /docs folder of the git repo. Instead of using somebody else’s standard, I just organize and refactor docs as I go, in the same way as I manage my personal notes in Obsidian.
Initially I wanted to use a common subset of Markdown that will work both in GitHub (GFM) and Obsidian, but then I gave up, and just use Obsidian flavor of markdown with all its proprietary features like Dataview plugin, templates, etc.
Mermaid and LaTeX are built into Obsidian, and there is a plugin for PlantUML.
For visual drawings/diagrams there are the built-in Canvas, plus DrawIO and Excalidraw.
This sounds like it's suggesting adding documentation of the major units, what their purposes are, as well as their interfaces.
I agree with this. A map of the code is great. It's like an exploded-view drawing of a mechanical component. It helps highlight what goes where and connects how to what.
...but is this what "architecture" means? I was under the impression that architecture went beyond major-units-and-their-interfaces and had more to do with the decisions and assumptions that lead to those specific units and interfaces -- the why behind the what.
In other words, the "architecture", in my view, is the thing we might go against when we refactor things. Not because names and interfaces change, but because the rationale for having things a certain way might still apply only we don't know about it because those assumptions and decisions -- the architecture -- is rarely documented and still would not be under this proposal.
I try to write docs like this. (Not markdown files, heaven forfend, and definitely not a random file in the root, of course, but these are quibbles.)
A problem, though, is that many things lack an architecture; they were grown and are a mess, and the procedural knowledge can't be put to a document without just pointlessly recapitulating the code itself in what is possibly a more confusing form.
the shorter it is, the less likely it will be invalidated by some future change. This is the main rule of thumb for ARCHITECTURE — only specify things that are unlikely to frequently change. Don’t try to keep it synchronized with code.
Interfaces are less likely [and harder!] to change. (On the Criteria To Be Used in Decomposing Systems into Modules, Parnas.) I agree it is the difficulty in grokking a codebase. "Pattern" naming sort of helps, but I end up having to read a fair bit.
On github, I always keep thinking the commit messages for each file are descriptions. Would that be more useful?
I don't know, man. I think README.md works just fine.
Discussed at the time:
Architecture.md - https://news.ycombinator.com/item?id=26048784 - Feb 2021 (153 comments)
In my company we called it `COMPASS.md`
Compass because it helped you navigate. It's short. Every 'module' needed to have such a file. It was also enforced in CI.
What would that practically look like? How would circular dependencies be resolved, for example?
The two usual ways I've seen it is either by following the code/control-flow (calls or inverted as called by) or by following the data flow. You can select any code function and see the call (or called) graph (shown as a tree), similarly for any data element and see the data elements that use (or is used by) graph and pruning cycles.
Call hierarchy trees usually cut cycles with a "there is already a node for this method elsewhere in the tree" end stop.
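To illustrate, a sketch of how such a tree can be rendered from a call graph, with that IDE-style end stop for cycles (the call graph and function names here are a toy example):

```python
# Toy call graph: function -> functions it calls.
CALLS = {
    "main": ["parse", "run"],
    "run": ["step"],
    "step": ["run", "log"],  # step -> run closes a cycle
    "parse": [],
    "log": [],
}

def call_tree(fn, calls, path=()):
    """Render a call hierarchy as indented lines, cutting cycles with
    an end stop instead of recursing forever."""
    lines = ["  " * len(path) + fn]
    if fn in path:  # already on the current path: cut the cycle here
        lines[0] += "  (recursive, see above)"
        return lines
    for callee in calls.get(fn, []):
        lines += call_tree(callee, calls, path + (fn,))
    return lines
```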
Is this meant to say it's good, bad, or just is? (which is what I meant by pruning cycles).
I'm OK with this behavior :)
As far as I'm concerned, if you have circular dependencies between directories, you're doing something wrong (see also my top-level comment).
If you're sane and have a DAG of directories, you can just toposort.
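For example, a minimal Kahn-style toposort over a directory-dependency table (directory names here are hypothetical), which doubles as a cycle detector:

```python
def toposort(deps):
    """Order directories so that each one comes after its dependencies.
    `deps` maps each directory to the directories it depends on.
    Raises ValueError if the graph has a cycle (i.e. is not a DAG)."""
    remaining = {d: set(ds) for d, ds in deps.items()}
    order = []
    while remaining:
        # Directories with no unemitted dependencies are ready.
        ready = sorted(d for d, ds in remaining.items() if not ds)
        if not ready:
            raise ValueError(f"circular dependency among {sorted(remaining)}")
        for d in ready:
            order.append(d)
            del remaining[d]
        for ds in remaining.values():
            ds.difference_update(ready)
    return order
```

Whatever order this produces is a sensible order for an IDE to present the directories in, lowest-level first.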
In Rust I have a file that defines a struct and its implementations, then I have another file that has a static array of elements of that struct. But in the struct file, one of the implementations is a TryFrom<usize> (which runs when you have a variable of type usize and "cast" it into my struct); this TryFrom implementation returns the value of the nth element of the static array in the second file. I don't see anything wrong in having this circular dependency.
Potentially I could extract the TryFrom implementation into a 3rd file, breaking the circle, but tbh that feels like I'm doing it just for the sake of doing it, and it offers no real benefit.
In this case I see a benefit in keeping the struct and its implementations in one file, and the static variable (which btw is around 600 lines, yeah it's a big array) in a separate file.
Don't follow rules blindly and try not to have absolute rules in your life, it'll make things simpler and more flexible.
Two nodes, and either two edges with one arrow each, or one edge with two arrows.
¯\_(ツ)_/¯
Maybe you want a multitree.
See for example https://adrenaline.ucsd.edu/kirsh/Articles/In_Process/MultiT...
I have the exact same thought.
What would it look like? I have two ideas.
1. Multiple directory trees that use symlinks to organize files orthogonally. Your typical directory structure may have things split by client / server. But what if I want to split things based on feature? An IDE could make this a lot easier.
2. Along the lines of this post, I’d love it if an IDE would make it easier to create bookmarks and navigate between them to walk people through the code. I’d love to leave a comment sometimes that I can click to jump me to another location in the codebase. Stringing these together allows you to weave a narrative throughout the codebase to explain how things work!
Is anyone working on these kinds of things??