It's important to remember that not all best practices are created equal. I'd prioritize readability over DRY. I'd prioritize cohesion over extensibility. When people talk about best practices, they don't talk about how a lot of them are incompatible, or at least at odds with each other. Writing code is about choosing which best practices you want to prioritize as much as it's about avoiding bad practices.
Sometimes it's best to be DRY right from the start.
Several years ago, I did some contract work for a company that needed importers for airspace data and various other kinds of data relevant to flying.
In the US, the Federal Aviation Administration (FAA) publishes datasets for several kinds of airspace data. Two of them are called "Class Airspace" and "Special Use Airspace".
The guy who wrote the original importers for these treated them as completely separate and unrelated data. He used an internal generic tool to convert the FAA data for each kind of airspace into a format used within the company, and then wrote separate C++ importers for each, thousands of lines of code apiece.
Thing is, the data for these two kinds of airspace is mostly identical. You could process it all with one common codebase, with separate code for only the 10% of the data that is different between the two formats.
When I asked him about this, he said, "I have this philosophy that says if you only have two similar things, it's best to write separate code for each. Once you get to a third, then you can think about refactoring and making some common code."
That is a good philosophy! I have often followed it myself.
But in this case, it was obvious that the two data formats were mostly the same, and there was never going to be a third kind of almost-identical airspace, only the two. So we had twice the code we needed.
I don't know, that sounds like a complex kind of ingest which could be arbitrarily subtle and diverge over time for legal and bureaucratic reasons.
I would kind of appreciate having two formats, since what are the odds they would change together? While there may never be a 3rd format, a DRY importer would imply that the source generating the data is also DRY.
In such case I think I'd go for an internal-DRYing + copy-on-write approach. That is, two identical classes or entry points, one for each format; internally, they'd share all the common code. Over time, if something changes in one format but not the other, that piece of code gets duplicated and then changed, so the other format retains the original code, which it now owns.
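A rough sketch of the shape I mean, in Python (hypothetical names, not the actual importer code): two thin entry points over shared helpers, so either side can fork a helper the moment it needs to diverge.

    # Hypothetical sketch of "two entry points, shared internals".
    # Names are illustrative, not the actual FAA importer code.

    def _parse_boundary(rows):
        """Common boundary parsing used by both formats today."""
        return [row["segment"] for row in rows]

    def _parse_altitudes(rows):
        """Common altitude parsing used by both formats today."""
        return [(row["floor"], row["ceiling"]) for row in rows]

    def import_class_airspace(rows):
        # The format-specific 10% lives here; the rest is shared.
        return {"kind": "class",
                "boundary": _parse_boundary(rows),
                "altitudes": _parse_altitudes(rows)}

    def import_special_use_airspace(rows):
        # If this format ever diverges, copy the helper it needs into
        # this entry point and change the copy ("copy on write").
        return {"kind": "special_use",
                "boundary": _parse_boundary(rows),
                "altitudes": _parse_altitudes(rows)}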
I like that approach.
I've had the mantra "inheritance is only for code reuse" and it's never steered me wrong.
Inheritance is only good for code reuse, and it’s a trick you only get to use once for each piece of code, so if you use it you need to be absolutely certain that the taxonomy you’re using it to leverage code across is the right one.
All ‘is-a so it gets this code’ models can be trivially modeled as ‘has-a so it gets this code’ patterns, which don’t have that single-use constraint… so the corollary to this rule tends towards ‘never use inheritance’.
Single use? No way, that's what multiple inheritance and mixins are for. Inheritance being only for code reuse is explicitly about not creating a taxonomy. No more is-a, just "I need this code here." Hey, this thing behaves like a mapping? Inherit from MutableMapping and get all the usual mapping methods for free. Hey, this model needs created_at/updated_at? Inherit from ChangeTracking and get those fields and helper methods for free.
Has-a doesn't make sense for code like the literal text reuse. It makes sense for composition and encapsulation.
Edit: I'm now realizing that Python has one of the only sane multiple inheritance implementations. It's no wonder the rest of y'all hate it.
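A quick sketch of what I mean (ChangeTracking here is a hypothetical mixin; MutableMapping is the real collections.abc one):

    from collections.abc import MutableMapping
    from datetime import datetime, timezone

    class AttrDict(MutableMapping):
        """Implement five methods, inherit the rest of the mapping API."""
        def __init__(self):
            self._data = {}
        def __getitem__(self, key):
            return self._data[key]
        def __setitem__(self, key, value):
            self._data[key] = value
        def __delitem__(self, key):
            del self._data[key]
        def __iter__(self):
            return iter(self._data)
        def __len__(self):
            return len(self._data)

    class ChangeTracking:
        """Hypothetical mixin: adds created_at/updated_at plus a helper."""
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.created_at = self.updated_at = datetime.now(timezone.utc)
        def touch(self):
            self.updated_at = datetime.now(timezone.utc)

    class Settings(ChangeTracking, AttrDict):
        """Gets mapping methods and timestamps for free; no taxonomy implied."""
        pass

    s = Settings()
    s["theme"] = "dark"  # mapping API via AttrDict/MutableMapping
    s.touch()            # helper from the ChangeTracking mixin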
It seems like OP is describing a shared interface, not necessarily inheritance.
I believe this method is very common in games: you have similar logic for entities, but some have divergences that could occur in unknown ways after playtesting or future development.
Though if done haphazardly by someone inexperienced, you might end up with subtly diverged copies that still look like they're meant to be identical, and debugging them in the future by another developer (without the history or knowledge) can get hard.
Then someone would wonder why there are these two very similar pieces of code, and mistakenly try to DRY it in the hopes of improving it, causing subtle mistakes to get introduced...
I prefer the FP approach of separating data and logic. You could end up with a box of functions (logic) that can be reused by the different "entities".
Last time I checked, the FP world was slowly producing the ECS frameworks that are needed to make games performant. Those used to be nearly C++ (or OO) exclusive.
This is really good advice and a great way to think about it.
In such case I think I'd go for an internal-DRYing + copy-on-write approach.
I agree. The primary risk presented by DRY is tightly coupling code which only bears similarities at a surface level. Starting off by explicitly keeping the external bits separate sounds like a good way to avoid the worst tradeoff.
Nevertheless I still prefer the Write Everything Twice (WET) principle, which means mostly the same thing, but following a clear guideline: postpone all de-duplication efforts until it's obvious there's shared code (semantics and implementation) in >2 occurrences, and always start by treating separate cases as independent cases.
I don't know. I've seen this approach go bad on projects before: people didn't want to DRY because the implementations might diverge. Except they never did. Our 3rd+ scenarios we abstracted.
But what basically ended up happening was we had 2 codebases: 1 for that non-DRY version, and then 1 for everything else. The non-DRY version limped along and no one ever wanted to work on it. The ways it did things were never updated. It was rarely improved. It was kinda left to rot.
Why wasn't the original implementation swapped for the new one? The unwillingness/inability to do that seems to be most likely the core of the issues here?
The majority of our business was through the 1st implementation. Because of that it was the base we used to refactor into a more abstract solution for further scenarios. It was never deemed "worth it" to transition the 2nd non-DRY version. Why refactor an existing implementation if it's working well enough and we could expand to new markets instead?
Yes, why do it? :p I mean, there are pros and cons - costs and benefits. And I can see both scenarios where it is better to spend the time on something else (that has better chance of bringing in money), and cases where it would be the right thing to do the cleanup (maybe original is just about to fall apart, or the new has straight up benefits to the business, or the act of doing it will greatly improve testing/QA in a critical area, etc).
Writing it DRY in the first place would also have had costs, including the opportunity costs. Would it have been better to take those there and then?
But what basically ended up happening was we had 2 codebases: 1 for that non-DRY version, and then 1 for everything else. The non-DRY version limped along and no one ever wanted to work on it. The ways it did things were never updated. It was rarely improved. It was kinda left to rot.
It sounds to me like you're trying to pin the blame for failing to maintain software on not following DRY, which makes no sense to me.
Advocating against mindlessly following DRY is not the same as advocating for not maintaining your software. Also, DRY does not magically earn you extra maintenance credits. In fact, it sounds to me that the bit of the code you called DRY ended up being easier to maintain because it wasn't forced to pile on abstractions needed to support the non-DRY code. If it was easy, you'd already have done it and you wouldn't be complaining about the special-purpose code you kept separated.
In my experience, once you copy code, it's bound to diverge, intentionally or not. Bugs become features and you can never put the cat back in the bag without a monumental amount of work.
Undoing an abstraction is way easier. Eventually, they all turn bad anyways.
Good point. This may be a case where domain knowledge is helpful.
One of the reasons they brought me in on this project is that besides knowing how to wrangle data, I'm also an experienced pilot. So I had a good intuitive sense of the meaning and purpose of the data.
The part of the data that was identical is the description of the airspace boundaries. Pilots will recognize this as the famous "upside down wedding cake". But it's not just simple circles like a wedding cake. There are all kinds of cutouts and special cases.
Stuff like "From point A, draw an arc to point B with its center at point C. Then track the centerline of the San Seriffe River using the following list of points. Finally, from point D draw a straight line back to point A."
The FAA would be very reluctant to change this, for at least two reasons:
1. Who will provide us the budget to make these changes?
2. Who will take the heat when we break every client of this data?
I see, so it's a procedural language that is well understood by those who fly (not just some semi-structured data or ontology). This is a great example of the advantage of domain experience. Thanks for sharing!
a procedural language that is well understood by those who fly
That is a great way to describe it!
Of course it is all just rows in a CSV file, but yes, it is a set of instructions for how to generate a map.
In fact the pilot's maps were being drawn long before the computer era. Apparently the first FAA sectional chart was published in 1930! So the data format was derived from what must have been human-readable descriptions of what to plot on the map using a compass and straightedge.
I just remembered a quirk of the Australian airspace data. Sometimes they want you to draw a direct line from point F to point G, but there were two different kinds of straight lines. They may ask for a great circle, a straight path on the surface of the Earth. Or a rhumb line, which looks straight on a Mercator projection but is a curved path on the Earth.
You would often have some of each in the very same boundary description!
For anyone curious about this stuff, I recommend a visit to your local municipal airport and stop by the pilot shop to buy a sectional chart of your area.
Paper charts are great (they're fairly cheap and printed quite nicely in the USA at least) but you can get a good look at these boundaries through online charts.
https://skyvector.com is a good way to view these.
Thank you! I was trying to remember the name of that site and it slipped my mind. Yes, SkyVector is great.
I think if you know the domain well, it's not "premature" at all.
I vaguely recall Fred Brooks, in The Mythical Man-Month, using a somewhat similar situation, but involving the various US states' income tax rules and their relationship to the federal tax, as an example in order to make some sort of point (that point being 'know your data', IIRC.)
In a situation where there is a base model with specific modifications - which is, I feel, how airspace regulation mostly works - then I suspect that a DRY approach would make it easier to inspect and test, so long as it stays that way.
Sometimes it's best to be DRY right from the start.
3 things matter most in real estate: Location, location, location!
3 things matter most in programming: Context, context, context!
DRY – like almost every other programming tool/paradigm/principle – is very often misused, owing to the programmer's inability to discern correctly whether the tool/paradigm/principle fits the specific context.
It's not just a science. It's an art, too.
It's not just a science. It's an art, too.
It was... I will miss that until retirement, too, but the artisan part of this craft has been gradually dying for over a decade. I think this change started when "popular kids" started confidently saying they wanted to work with computers when they grew up. The effect is the proliferation of normies throughout the trade, now many of them with 10+ years of experience. The average developer's appreciation for the elegant and the inspiring grew weaker, and the idea of putting more work into a task than absolutely necessary (like, for example, stopping for a moment to consider the context before deciding on a tool or technique to apply...) lost all appeal. There's a thin line between aggressively pragmatic and ignorant, and the newer generations seem to treat crossing that line as a non-issue as long as the ticket can be presented as resolved. This mindset used to be confined to cubicles and neckties, but now it's seemingly everywhere...
Don't mind me; I just feel unusually old and grumpy today...
Come check out some of the array languages[1,2,3], or perhaps the retro-computing comfy vibes of a system like Decker[4]. Some of us still appreciate code as poetry.
[1]: https://mlochbaum.github.io/BQN/
Oh, I know! While I got discouraged trying to learn K (I was a little too hung up on the notion of free software back in the day...), I learned J and had a terrific time interacting with the community. There were some bad apples, but the basket labeled "Smalltalk" is filled with many marvels. I enjoyed Factor while it was actively developed, which led me deep into Forth-land - an unforgettable experience. By the time I arrived on the other side of s-expressions, Smug Lisp Weenies were out and lots of friendly, curious, intelligent folk lived there instead. More recently, I invested some time into learning Raku - a beautifully eclectic, shockingly expressive language whose development is severely understaffed, underfunded, and underappreciated. I had a great time in all those instances - I know there are passionate people approaching programming creatively in all kinds of shapes and forms, and it's indeed heartening.
The problem is, when I go to work, I see exactly none of those people among my coworkers. I feel like breaking down on the spot and declaiming:
I've seen things you people wouldn't believe... Attack ships on fire off the shoulder of Orion... I watched C-beams glitter in the dark near the Tannhäuser Gate. All those moments will be lost in time, like tears in rain... Time to die.
I don't, mostly out of respect to Rutger Hauer. Anyway, while in absolute terms people like that are probably much more numerous now than 20 years ago, they feel much more distant to me than ever before.
Could you not just fire the people who are not like that and hire the people who are? If not, I think you might not have enough power in the organization! Seek more power!
Also consider finding a job where you need to know category theory to understand the system! That'll give you coworkers who are magical! And a little insane!
Ps. The "C" in "C-beams" now stands for "Category Theory"!
Yeah, the younger generation must just be wired up differently, it's all those popular kids, can't be anything to do with the proliferation of scrum, agile, and crunch that make them focused on doing all those tickets ;)
Trying to educate a boomer (new official term for a millennial) into not complaining about zoomers is like trying to convince a zoomer into not dying inside when they think they have committed an act of cringe.
It's best not to try! These are the cultural delusions that drive us!
If you're dealing with those kinds of people, make demands that they begin studying category theory. I'm talking the hard stuff - Categories for the Working Mathematician, Toposes, Triples and Theories, Sketches of an Elephant, McCurdy's 2012 paper on graphical methods for Tanaka reconstruction, and Roman's 2019+ work on coend calculus and diagrammatic optics.
If they do it, if they actually learn categories, then you need to shut the f** up and let them live. They now have the power to do what they want. If they refuse to learn categories then you get to laugh at them and tell them that they are cringe, and fire them.
And if you don't have the power to just fire them for not learning categories, then you very likely need to learn categories yourself! And also gain the power to fire people, lol! There is no rational point to ever complain about a programmer, when you can just fire them instead.
So we had twice the code we needed.
Was that necessarily a bad thing and something that must be corrected for that code base?
I usually follow the same rule of thumb until I find myself repeatedly updating both at the same time. If I can't update one without updating the other then they must be the same thing and it's time to DRY.
Don't Repeat Yourself when updating code.
Good points, thanks for bringing them up.
Yes, there was some ongoing maintenance of this code where both versions had to be updated. The original author was not a pilot and was unfamiliar with some of the nuances of FAA airspace. One of the reasons they brought me in was that I am a pilot and knew how the FAA's data should be interpreted.
In the end, not a huge deal, but it was annoying when I had to make the same changes in two places.
Knowing to DRY there depended on business knowledge that the original author did not have.
While they were wrong in this case, I would say it was a reasonable move to not DRY based on the code pattern itself at the time. And that's the big difference imo - DRYing based strictly on the structure of code vs business processes.
But this means you have to guess when and where to DRY, which basically implies that there's no good way to decide but via experience and domain knowledge!
That's not what people want to hear - they want a silver bullet; a set of criteria by which DRYing could be determined from the outset!
Sometimes it's best to be DRY right from the start.
Never zealously adopt a programming practice that can be summarized in a headline!
I feel like the "sometimes" suggests that in most cases we should zealously not dry things, but in some cases we might want to. Doesn't that make you curious what cases those might be?
Yeah, don't be DRY from the start, YAGNI.
Did the repetition become obvious before or after you saw both implementations? It’s possible that if you dried it up right away, you would’ve abstracted the wrong thing, and it’s way more obvious only in hindsight.
This was some years ago, so my memory may be foggy - and I'm not instrument rated!
As I recall, it was after I saw the two implementations. I got curious and looked at the original FAA data and specifications and saw how much the two kinds of airspace have in common.
How was the dried code to own and maintain before and after? Was there pain before, and did it go away? Or did the operation prove to be more of a nitpick/bike shed? How long did you stick around with eyes on the code?
I find this is a case where different pipelines utilizing common functions in different compositions can be a great strategy. If something diverges and a function no longer makes sense in a pipeline, that’s not a big deal. Just pull it out and replace it with something bespoke that does the right thing.
I’ve had a lot of success with this in embedded settings where data is piped into storage or OTA, and I want to format and pack it/send it up consistently but I might want to treat the data itself slightly differently.
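A toy sketch of that shape (hypothetical Python names, not my actual firmware code): common stages composed into per-destination pipelines, so swapping one stage out later is a purely local change.

    from functools import reduce

    # Hypothetical sketch: shared stages, composed differently per destination.

    def compose(*stages):
        """Left-to-right function composition over a single value."""
        return lambda value: reduce(lambda acc, stage: stage(acc), stages, value)

    def validate(sample):
        assert "reading" in sample
        return sample

    def add_timestamp(sample):
        return {**sample, "ts": 1700000000}

    def pack_for_storage(sample):
        return ("STORE", sample)

    def pack_for_ota(sample):
        return ("OTA", sample)

    # Two pipelines reuse the common stages; if one destination diverges,
    # swap a stage for something bespoke in that pipeline only.
    storage_pipeline = compose(validate, add_timestamp, pack_for_storage)
    ota_pipeline = compose(validate, pack_for_ota)

    print(storage_pipeline({"reading": 42}))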
A related concept that IMO still aligns with DRY is that you should only eliminate seeming code duplication when things are _semantically_ the same. No matter the mechanism (codegen, generics, macros, inheritance, ...), if you can't give a concept a meaningful [0] name then you usually shouldn't DRY it up with any mechanism. Your example is a technique I also use a lot, but the critical point is that you're choosing to break out functionality which _is_ easy to name.
[0a] More generally, I like a concept of "total" functions -- those which have sensible outputs for all their inputs. It's a bit of a tomayto/tomahto situation defining "all their inputs" (e.g., I'm personally okay using a function name like `unsafe_foo` and expecting a person to read the docs, and on the other extreme some people want sensible answers to anything the type system allows you to input), but the desired end-state is that when the project's requirements change you don't muck around with the ABI and implementation of `count_or_maybe_sort_for_these_three_special_customers_or_else_hit_the_db(...)`, or whatever much more generic and very wrong name the method actually has; the individual components are already correct, so you make the changes at the few methods which are actually wrong given the new requirements.
[0b] Another way of thinking about it is whether the two things should always change in tandem. For two largely overlapping bureaucratic data formats? Maybe; there's a comment somewhere in this chain suggesting that they'll never go out of sync, but I'm a bit paranoid of that sort of thing. For the particular data structures that are currently shared by those formats? Absolutely not; if one diverges then you can build the new structure and link it in. The old structure is still valid in its own right.
Sometimes it's best to be DRY right from the start.
In your case I don't see how it couldn't be refactored later on once you have the domain knowledge. In our company we always have 2 types of projects planned. We can't always ship new features under type 1. And refactoring is something we must do under the 2nd type to reduce the tech debt. With 2 types of projects in mind we always plan the capacities for both.
And that's exactly how you should do DRY. Once you see the repeated pattern, you should refactor it.
I'm in the finance sector and we are seeing this kind of 'similar but not the same' problem a lot. We are constantly in the process of onboarding new payment use cases while doing refactors to abstract away the common patterns.
In some cases I would prefer to have two separate clear yet repetitive use cases, than to have, for example, a single abstract use case, that gets injected with two different factories at configuration time, depending on which sub case you want.
In that case, reading and maintaining two simple use cases might be less work than reading an abstract use case, backtracking to the available factories, and then mentally interpreting the injection and factory behavior.
Unless your abstractions really really make things just way simpler, being explicit could be better.
Another place where this repetition tends to help readability more than it hurts maintainability is in test cases. Often abstracting things out with a little test fixture is helpful. But then being obsessive about this ends up making tests harder to maintain since there's all this long distance coupling that you constantly have to maintain.
It seems that outside of test cases and use cases, we want to be much more diligent about DRY and picking the right abstractions - that makes logic in our use cases much simpler and more coherent. While inside the use case or test case, a little duplication of business logic is not so bad, and can actually improve the narrative of the code.
I'm a fan of the "Rule of Three", but another great rule is to not mindlessly follow rules.
Call me crazy, but it's almost as if these should be guidelines considered by a thinking person, with experience to help inform them, rather than hard-and-fast rules that must be applied to everything.
It is easy to come in after the fact and say this.
In reality: had he DRYed it up from the beginning, you would probably have complained about a codebase that takes corner cases into account deep in the code, and the story would have been turned around.
Yes, it is annoying to get into an existing code base.
We've got a large number of customer-specific file integrations, and a lot of them are indeed very similar as the customers have the same system on the other side. However almost all the time there's some tweaking needed. Customer A used field X for this but customer B used the field for that.
So if a new customer comes and needs an integration to a system we already support, even if we think they'll start out being identical, we just copy the code.
Thing is, these things evolve. Suddenly we have to patch over some process-related issues in the other system for customer A, while customer B does not have that issue. Now we can fix A's integration without worrying at all about affecting B.
Of course we write library and helper functions, and use those actively throughout, so we only repeat the "top level" stuff.
This is just being hindsight oriented. The way it worked out it was better to DRY from the start, but the person implementing it didn't know how the chips would fall.
If you do what this guy did and are wrong, it looks like yes, obviously it was the same thing and you should have DRYed right from the start.
If you DRY from the start and are wrong, someone writes a blog post about not DRYing from the start.
It's a bet on which mistake is worse, because you will be wrong sometimes. DRYing from the start IMO is a worse mistake than duplication.
This happens when people follow rules mindlessly. 3 is an arbitrary number anyway. Even if 3 is the right number in most cases there will be cases where abstracting after 2 cases is best and others where abstracting after 4 cases is best, or any other number, really.
But you're refactoring an existing codebase so it's not prematurely dry. You know exactly how these classes diverge.
One of the biggest issues I’ve seen with DRY over the years seems to stem from a misunderstanding of what it means.
DRY is not just about code duplication, it’s about information/knowledge duplication, and code happens to be one representation of information.
Hyper focusing on code duplication quickly gets into premature optimization territory, and can result in DRYing things that don’t make sense. Focusing on information duplication leaves some leeway for the code and helps identify which parts of the code actually need DRY.
The difference is important, and later editions of the Pragmatic Programmer call this out specifically. But the concept of DRY often gets a bit twisted in my experience.
This is why some advice from Sandy Metz really stuck with me.
It is not a problem to /have/ the same code 2, 3 or even 4 times in a code base. In fact, sometimes just straight up copy-paste driven development can be a valid development technique. Initially that statement horrified me, but by now I understand that just straight up copy-pasting some existing code can be one of these techniques that require some discipline to not overdo, but it's legit.
And in quite a few cases, these same pieces of code just start developing in different directions and then they aren't the same code anymore.
However, if you have to /change/ the same code in the same way in multiple places, then you have a problem. If you have to fix the same bug in multiple places in similar or same ways, or have to introduce a feature in multiple places in similar way - then you have a problem.
Once that happens, you should try to extract the common thing into a central thing and fix that central thing once.
It feels weird to work like that at first, but I've found that often it results in simpler code and pretty effective abstractions, because it reacts to the actual change a code base experiences.
The challenge is that if you're not careful, you can end up copy-pasting the same bit of code hundreds of times before realizing it has to be changed.
I once worked in a year-old startup of ~5 developers that found it had written the same line of code (not even copy-pasted, it was only one line of code so the devs had just written it out) 110 times. A bug was then discovered in that line of code, and it had to be fixed in 110 places, with no guarantee that we'd even found all of them. This was a very non-obvious instance of DRY, too, because it was only one line of code and the devs believed it was so simple that it couldn't possibly be wrong. But that's why you sometimes need to be aware of what you're writing even on the token level.
That's why we have principles like "3 strikes and then you refactor". 3 times fixing a bug isn't too onerous; even 4-6 is pretty manageable. Once you get to 20+, there starts to be a strong disincentive to fixing the bug, and even if you want to, you aren't sure you got every instance.
This really makes me think we should be focusing on cost/benefit, risk/reward, pros/cons at all times. If we have a bug in these 5 copies, will it be too hard to fix in all of them? No? What about these 10 copies? If that sounds like its starting to get difficult, maybe now is the time.
If we have a bug in these 5 copies, will it be too hard to fix in all of them?
Yes, it will be, because copy-pasted code is never the same verbatim. First and foremost, name changes make it almost impossible to identify different copies. Then, there are different tweaks for each copy to make it suitable for the context. I always DRY early, because it's always free to copy-paste later.
I could make the same argument for not using DRY. The DRY-ed code is hard to change, programmers feel honour-bound to keep using it and tweaking it by adding a variety of parameters to more and more cases, and at the end becomes impossible to understand or update, slowing down development.
Now, what probably should've been 3 abstractions is one incredibly convoluted "abstraction" that makes no sense, and it's 3x harder than 3 individual abstractions to deduplicate and inline. It further pulls in and invites complexity, as its current size is an implicit invitation to include additional cases and places.
Furthermore, while without DRY fixing bugs may've been tedious, now with DRY it may be almost impossible due to the high risk of breaking a lot of things that depend on that code. (You might be lucky enough to be able to do it, and to have written extensive tests with 100% edge case coverage for it; if that's the case then you've postponed the moment of pain somewhat.)
Both can be true. It depends on the context whether benefits exceed costs. Decisions should be made based on a specific context and with thinking applied, not generic rules.
5 copies are already extremely difficult.
It's not like they just jump out at you when you edit some code.
It depends on the context. In some context, they might actually jump out. In some context, even if they don't, it might be fine, because the larger modules containing the code already have excellent tests and are solid and stable
That means you have to think. Most people hate thinking. Seriously.
Metz says she adds TODOs and comments that it has been duped. It's one of those things that requires thought, and she even says it's an advanced technique. How many times is too many? I'm not sure, but I can safely say over 100 is WAY too many. Probably 10 is too many. Heck, if you find yourself updating the same code in four different places over and over and over, it's time to abstract. The idea is to let the code sit and let the abstraction reveal itself if there isn't already an OBVIOUS one. As mentioned by the parent poster, you're looking out for these copies to diverge. If four or five copied codepaths haven't diverged after some time, there's a good chance that just from working on it every day you will have realized the proper way to abstract it.
You absolutely do have to be careful. But even so, it's arguable that having to update something in 100 different places is better than updating in one place and having it affect 100 different paths where you only want 99 of them (this is some hyperbole, of course).
How do you monitor all code duplication in the code base, including copies that have been modified slightly (such as optimizations, name changes, additional statements in between, etc.)?
Tests. AFAIC, this isn't something that should be long living. If it is only duplicated in a couple of places and remains unchanged for years, that's probably fine too, because ya... no one is touching it. If one place does need to change and tests still pass, that should mean that the other one didn't need to change and you've reaped the benefit from not prematurely abstracting. There are a lot of ways it could play out, though. Often the duplication is very local and obvious. I think a lot of people take "duplication is cheaper than the wrong abstraction" WAY more seriously than it's intended. It's an actionable way of saying "don't abstract early" as the counter to what is usually: "But then I'll have duplication and DRY is the law." Like EVERY piece of programming advice, though, it's not universal.
This is why you shouldn't write one line of code, ever again. /s
We've all been there though, at some point in our careers. Possibly multiples of times (try changing thousands of "echo" statements to call a logger because it was initially meant to be a simple script that just kept growing).
It sucks but I've also been on the other side, where it was DRY but 20% of the calls to the function now needed different behavior. Finding all of those usages was just as hard.
We've all been there though, at some point in our careers. Possibly multiples of times (try changing thousands of "echo" statements to call a logger because it was initially meant to be a simple script that just kept growing).
Been there - now unless it is a very simple / throwaway code, I always start with logging setup from the start. It also helps with print based debugging because you can tune the output.
Oh yeah we've had those as well. I kinda feel two things about these at the same time.
At a practical level, these situations sucked. Someone had to search for the common expression, look at each instance, decide to change it to the central place or not. They spent 2-3 days on that. And then you realize that some people were smart and employed DRY - if they needed that one expression 2-3 times, they'd extracted one sub-expression into a variable and suddenly there was no pattern to find those anymore. Those were 2-4 fun weeks for the whole team.
But at the same time, I think people learned an important concept there: To see if you are writing the same code, or if you're referring to the same concept and need the same source of truth, like the GP comment says. I'm pretty happy with that development. Which is also why my described way is just one tool in the toolbox.
Like, one of our code bases is an orchestration system and it defines the name of oidc-clients used in the infrastructure. These need to be the same across the endpoints for the authentication provider, as well as the endpoints consumed by the clients of the oidc provider - the oauth flows won't work otherwise.
And suddenly it clicked for a bunch of the dudes on the team why we should put the pedestrian act of jamming some strings together to get that client-id into some function. That way, we can refer to the concept or naming pattern and ensure the client will be identical across all necessary endpoints, over hoping that a million different string joins all over the place result in the same string.
In such a case, early or eager DRY is the correct choice, because this needs to be defined once and exactly once.
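Concretely, it's as small as something like this (hypothetical names, not our real naming scheme):

    # Hypothetical sketch: one function owns the client-id naming pattern,
    # instead of ad-hoc string joins scattered across endpoints.

    def oidc_client_id(environment: str, service: str) -> str:
        return f"{environment}-{service}-oidc"

    # Every producer and consumer of the id calls the same function,
    # so the provider config and the client config can't drift apart.
    assert oidc_client_id("prod", "billing") == "prod-billing-oidc"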
This happens with SQL a lot, where people copy and paste queries all over the place. Especially for reports, there's always the case where some quick and dirty report that was thrown together in 20 mins ends up as something managers can't live without.
Making changes quickly gets onerous when the query (or a slight variation on it) is pasted into multiple places. Nowadays my org has started to use Power BI, so there are also multiple dashboards that all need to be updated.
You're thinking of Sandi Metz: https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction
Conversely, trying too hard to DRY when requirements at call sites start to diverge can lead to an unnecessary complex single implementation of something where there could be two very similar but still straightforward pieces of code.
I feel like I hear de-duplication of information / knowledge often referred to as "Single Source of Truth"
The Two Generals Problem is mentioned a lot in databases and networking, and you can take some liberties to extend it to human orgs.
The Two Generals Problem is mentioned a lot in databases and networking, and you can take some liberties to extend it to human orgs.
In fact, the name probably already gives some hint as to what sorts of human orgs this sort of principle would be applicable in.
And, well, the vast majority of orgs require some sort of coordination.
I think the best way to understand DRY is by thinking about the practical problem it solves: you don't want footguns in the codebase where you could change something in one place, but forget to change the same thing in other places (or forget to change the complementary logic/data in other components.)
The goal of DRY as a refactoring, is first-and-foremost to obviate such developer errors.
And therefore — if you want to be conservative about applying this "best practice" — then you could do that by just never thinking "DRY" until a developer does in fact trip over some particular duplication in your codebase and causes a problem.
you don't want footguns in the codebase where you could change something in one place, but forget to change the same thing in other places
This. Ironically the example in TFA is vulnerable to this issue. Each of the deadline-setting methods has a copy of the validation ensuring that the date is in the future. If it's discovered that we need to ensure deadlines are set no later than the project deadline (since a later date wouldn't generally make sense), it's awfully easy to only update one and miss the others, especially after code has been added and these implementations are no longer visually near each other. I'm not saying that this means the code must be DRY'ed, but it is a risk from the beginning of the project, so one that needs to be weighed during initial implementation.
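A sketch of that failure mode (hypothetical Python, not the article's actual code): the same rule pasted into each setter, where a new constraint is easy to add to one copy and miss in the others.

    from datetime import date

    # Hypothetical sketch of the duplication risk described above,
    # not the article's actual example.

    class Project:
        def __init__(self, project_deadline: date):
            self.project_deadline = project_deadline
            self.design_deadline = None
            self.review_deadline = None

        def set_design_deadline(self, deadline: date):
            if deadline <= date.today():
                raise ValueError("deadline must be in the future")
            # New rule added here...
            if deadline > self.project_deadline:
                raise ValueError("deadline must not exceed project deadline")
            self.design_deadline = deadline

        def set_review_deadline(self, deadline: date):
            if deadline <= date.today():
                raise ValueError("deadline must be in the future")
            # ...but forgotten here: the copies have silently diverged.
            self.review_deadline = deadline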
How is applying DRY entering premature optimization territory (maybe relative to LOC?)? I argue it is instead: premature abstraction.
Optimization is specialization (which is the opposite of DRY): to enable DRY you likely need to generalize the problem (i.e. abstract) such that you remove duplication.
I've always seen "Premature Optimization" as an umbrella that covers a variety of cross-cutting concerns, ranging from:
- Performance
- Code structure / abstraction
- Data structure
- Team organization / org structure
I'd argue that DRY (and a focus on abstractions more generally) are optimizations of the codebase. Not all optimizations are optimizing the same thing.
Yeah, it's like reminding people that code can change, so it's ok to have known flaws day 1. Something forgotten too often.
One thing that really goes against the usual programming grain is DBMSes. We're taught to always decouple/abstract things, but I'm convinced that it's impossible to abstract away your DBMS in most applications. It's just too big of an interface, and performance considerations leak right through it. It's always one of the selling points of an ORM, "you can switch databases later," and then nobody actually switches.
Indeed - the acronym comes from The Pragmatic Programmer, and the author defined it in this way. Every blog post I've read criticizing/cautioning against DRY were not doing DRY as originally defined.
DRY is almost always a good thing to do. Coupling superficially similar code is definitely not a good thing to do.
Yeah, here's the quote from the later editions addressing this:
Let’s get something out of the way up-front. In the first edition of this book we did a poor job of explaining just what we meant by Don’t Repeat Yourself. Many people took it to refer to code only: they thought that DRY means “don’t copy-and-paste lines of source.” That is part of DRY, but it’s a tiny and fairly trivial part.
DRY is about the duplication of knowledge, of intent. It’s about expressing the same thing in two different places, possibly in two totally different ways.
Here’s the acid test: when some single facet of the code has to change, do you find yourself making that change in multiple places, and in multiple different formats? Do you have to change code and documentation, or a database schema and a structure that holds it, or...? If so, your code isn’t DRY.
Coupling superficially similar code is definitely not a good thing to do.
I've taken to calling that activity (removing syntactic redundancy that is only coincidental) "Huffman coding".
misunderstanding of what it means
And in response, people will complain that they're being dismissed with "you're doing it wrong!" Because that happens with everything in programmer-land.
The easy response to someone feeling this way is to point them to the origin of DRY: The Pragmatic Programmer.
In the book, the authors explicitly call out that many people took the wrong idea from the original writing. They clarify that DRY is not about code, it's about what they call "knowledge", and that code is just one expression of it.
People can still disagree, but the original intent behind DRY is very well articulated.
Someone “yes and”-ed a comment of mine awhile ago to teach me DRY SPOT. Don’t repeat yourself - Single Point of Truth.
I.e. what you said. Couple logic that needs to be coupled. Decouple logic that shouldn’t be coupled.
This is a more insightful comment than the comment at the top, and also more useful than the blog post.
Yeah, applies to databases and documentation especially. Databases have the ol' 3NF, you also want to avoid copying data from one source of truth to another in a multi-service environment, and sometimes I intentionally avoid writing docs because I want the code or API spec (with its comments) to be the only documentation.
Couldn't agree more. There's a great decade-old blog post by Mathias Verraes which illustrates this well, I keep coming back to it: https://verraes.net/2014/08/dry-is-about-knowledge/
Not a conclusive example.
In the industry code that isn't DRY is a much bigger problem than code that is too DRY.
I've been in the industry for over 10 years now. Whenever I have to work with a project where someone used DRY consciously, I know I am in for a world of pain. Consolidating code is easy, pulling it apart is a lot harder.
How do you consolidate code?
A good way to go at it is to isolate the functionality that is used many times and pull it out into its own function (or similar). That's just good code practice and also makes it easy to refactor and modify as needed.
It’s not about being used many times, but about the necessity to evolve in the same direction. When that happens, it usually manifests as toil for the team. Consolidating code means to change the structure of the code so that only one piece needs to be modified in the future. That can take many forms, but it usually involves creating a new shareable component.
Shareable components are more effort to maintain, so just creating them because they consolidate code is not always a good idea. You really want to have positive ROI here, and you only get that if you actually reduce maintenance burden. For raw code duplication that doesn't have a maintenance issue on its own, the bar is a lot higher than most people think.
Raw code duplication is always a maintenance issue, given that centralising it when you notice the duplication (instead of continuing to copy-paste it) costs nothing.
Well, this morning I just fixed a case where somebody had used btoa to base64 encode something in Javascript and used methods from Buffer somewhere else because they'd been intimidated away from using btoa. (OK, it is dirty that it treats codepoints as byte values: you can write btoa("Á") but btoa("中") throws.)
It would have been OK if they'd used the right methods on Buffer but they didn't.
These encoding/decoding methods are a very good example of code that should be centralized, not least so you can write tests for them. (It is a favorable case for testing because the inputs and outputs are well defined and there are no questions about how execution is done, like you might encounter testing a React component.) It is so easy to screw this kind of thing up in a gross way or a subtle way (I'm pretty sure btoa's weirdness doesn't affect my application because codepoints > 255 never show up... I think).
There's the meme that you should wait until something is used 3 times before you factor it out, but here is a case where two repetitions were too many and it had a clear impact on customers.
I basically agree, but doesn't this just mean, if I'm consolidating non-DRY code, that I'm now the one using DRY consciously, and the next dev will be cursed with all of my newly introduced DRY abstractions?
If you don’t have another reason for consolidation than consolidation then yes :)
"Consolidating code is easy, pulling it apart is a lot harder."
My experience is the opposite. The less code, the better. I just spent a week on refactoring UI automation test code where they had copied the same 30 lines of code into almost 100 places. Every time with an ID changed and some slightly different formatting. It took me a few days to figure out that these sections do the same thing so I decided to introduce a function with ID as parameter. It was a lot of work to identify all sections and then to make sure they are really equivalent.
Saved us 3000 lines of code, and now we can be sure that timeouts and other stuff is handled correctly everywhere. And we can respond to changes quickly.
That's DRY to me. Don't copy/paste code; introduce functions, ideally in the simplest way. When you have functions, you declare the same behavior everywhere.
Consolidating code is easy, pulling it apart is a lot harder.
I absolutely agree with this, and the only thing I would add is that the difference is even more pronounced in codebases using a dynamic language.
Sure it's not easy to navigate a bowl of duplicated spaghetti, but navigating opaque DRY service classes without explicit types is a nightmare.
Luckily as an industry we've realized the benefits of static typing, but your point still holds true there.
Can concur. Mostly it was me causing the pain, earlier on.
Whenever I have to work with a project where someone used DRY *consciously*, I know I am in for a world of pain.
Huh. When you put it that way, that's actually a good point. In my experience, competent programmers will try to consolidate repeated code, and then cite "because DRY" if asked why, but I can't think of any case where I or anyone else competent started with "needs more DRY" as the original motivation (as opposed to "this is an incomprehensibly verbose mess" or the like).
Conversely, starting with "don't repeat yourself [and don't let anything else repeat itself]" as a design goal does seem to correlate well with cases where someone temporarily (newbie) or permanently (moron/ideologue) incompetent followed that design principle off a cliff.
I have 20 years in the industry and one of the rules I learned is: Articles justifying laziness are ALWAYS warmly welcomed and praised.
To get internet points easily, write something of that:
“Clean code is overrated”
“SOLID is holding you back”
“Tests are less important than profits”
“KISS is the only important principle”
“Declarative programming is only suitable for pet projects”
“Borrow checker is the plague of Rust”
and so on.
Having specialized in project rescue, touring all over "the industry", you can't possibly make that generalization.
For every purported best practice, there are teams/orgs that painted themselves into a corner by getting carried away and others that really would have benefited from applying it more than they did.
In the case of DRY, it's an especially accessible best practice for inexperienced developers and the project leads many of them become. Many, many teams do get carried away, mistaking "these two blocks of code have the same characters in the same sequence" for "these two delicate blocks of code are doing the same thing and will likely continue to do so".
Having advice articles floating around on both sides of practices like this helps developers and teams find the guidance that will get them from where they are to where they need to be.
Context, nuance, etc, etc
If that's what they wanted to prove they should have shown a better example.
That's fair. I think the insight/concept behind the essay is sound, but I agree that the example (and writing) could be a lot better.
In Zion National Park, there's a hike called Angel's Landing. For part of the hike, you go along this ridge, where on one side you have a cliff of 500 feet straight down, and on the other side, you have a cliff of 1000 feet straight down. And in places, the ridge is only a couple of feet wide.
Best practices can be like that. "Here's something to avoid!" "OK, I'll back far away from that." Yeah, but there's another cliff behind you, of the opposite error that is also waiting to wreck your code base.
Listen to best practices. Don't apply them dogmatically, or without good judgment.
In the industry code that isn't DRY is a much bigger problem than code that is too DRY.
As with anything dogmatic, it truly depends. There are times when the abstraction cost isn't worth it for a few semi-duplicate implementations you want to combine into a single every-edge-case function/method.
There's a certain psychological attraction to messy and confused situations which people are just too comfortable with, and it explains why things like GraphQL (which for years didn't have a definition of how it worked, because "Facebook is going to return whatever it wants to return") inevitably win out over SPARQL (which has a well-defined algebra).
One of my biggest gripes (related to the post) is the data structure
create table student (
...
applied_date datetime,
transcript_received datetime,
recommendation_letter1_received datetime,
recommendation_letter2_received datetime,
rejected_date datetime,
accepted_date datetime,
started_classes_date datetime,
suspended_date datetime,
leave_of_absence_start_date datetime,
leave_of_absence_end_date datetime,
...
graduated_date datetime,
...
gave_money_date datetime,
died_date datetime
)
which is of course an academic example, but one I've seen in many kinds of e-business applications. Nobody ever seems to think of it until later, but two obvious requirements are: (1) query to see what state a user was in at a given time, (2) show the history of a given user. The code to do that with the above is highly complex and will change every time a new state gets added. The customer also has experiences like "we had a student who took two leaves of absence" or "some students apply, get rejected, apply again, then get accepted". When you find data designs like this you also tend to find some of the records are corrupted, and when you are recovering the history of users there will be some you'll never get right.
If you think before you code you might settle on this design
create table history (
id integer primary key,
student_id integer not null,
status integer not null,
begin_date datetime not null,
end_date datetime
)
which solves the above problems and many others in most situations. (For one thing the obvious queries are trivial, and even complex queries about times and events can be written with the better schema.) I can't decide if the thing I hate the most about being a programmer is having to clean up messes like the above or having to argue with other developers about why the first example is wrong.
If "No code" is to really be revolutionary it's going to have to have built-in ontologies so that programmers get correct data structures for situations like the above that show up every day in everyday bizapps, where there is a clear right answer but it is usually ignored.
Two points here just for fine grain discussion:
1. The first table structure is a flat, non-normalized table that trades normalization for easy querying and easy selection of computed properties.
2. The second structure is a normalized table that pays for that normalization with joins.
Either one is normalized so far as I know.
It is easy to write a query for the first that gets a list of students' names and the dates they applied. That query is harder for the second one. On the other hand, figuring out what state a user was in at time t could be a very hard problem with the first table.
My experience with the first is that you find corrupted data records, one cause of that will be that people will cut and paste the SQL queries so maybe 10% of the time they wind up updating the wrong date. Systems like that also seem to have problems with data entry mistakes.
The biggest advantage of #2 is ontological and not operational, which is that in a business process an item is usually in exactly one state out of a certain set of possible states. Turns out that this invariant influences the set of reasonable requirements that people could write, the subconscious expectations of what users expect, needs to be implicitly followed by an application, etc.
Granted some of the dates I listed up there don't quite correspond to a state change, for instance the system needs to keep track of when a student started an application and when the last document (transcripts, letters, etc.) has been received. With 5 documents you would have 32 possible states of received or not and that's unreasonable, particularly considering that a student with just one letter and a very strong application in every other way might get accepted despite that. It's fair to say the student can have an "open application" and a "complete application". Similarly you could say the construction of an airplane or a nuclear power plant can be defined by several major phases but that these systems have many parts installed so if the left engine is installed but the right engine is not installed these are properties of the left and right engine as opposed to the plane.
"In the industry code that isn't DRY is a much bigger problem than code that is too DRY."
which industry is that?
in general programming, absolute nope
not-DRY code can be weaseled out with a good IDE
badly abstracted code, not so much
in fact in a way, DRY is the responsibility of the IDE not the programmer - an advanced IDE would be able to sync all the disparate code segments, and even DRY them if necessary
but when I read DRYed code, the abstraction better be a complete and meaningful summary, like 'make a sandwich', and without many parameters (and no special cases), or else I'd rather read the actual code
I understand the impulse to try to factorize everything, but it just doesn't work beyond a certain point in the real world; it's too difficult to read, and there's always an 'oh, can you just' requirement that upends the entire abstract tower.
You didn’t provide any evidence for this, you just stated your coding preference. Which is usually the case in these discussions. Some anecdotes, and then people making grand claims based on personal preference. Obviously, some programmers have thought the opposite and have their own anecdotes.
the comment I replied to was merely a strong opinion
same same
I don't believe there is much evidence, certainly nothing conclusive, in this debate
but factorizing code concentrates the logic
that can be an advantage, to a certain degree, but it also reduces resilience, by specializing the code, and can reduce readability by forcing lookups of nested abstractions
In your part of the industry, perhaps. My experience has been the opposite.
Same. From what I've seen, most code is written with abstractions and DRY as a high priority rather than writing code that is performant and doesn't take jumping between 5 different files to make sense of it.
I started writing Go around 2012 or so because of the file jumping thing. Drove me nuts. I'm sure there were many folks doing the same thing.
Not from my experience. Unnecessarily duplicated code, even when there are small differences which are likely accidental, is usually much easier to fix than too DRY code. Pulling apart false sharing can be really hard.
Example? Duplicating (literally copy-paste) is easier than even finding duplicated code with small differences.
Abstraction too early is usually a mistake; no one is smart enough to predict all the possible edge cases. Repeated code allows someone to go in there and add an edge case easily. It's a more foolproof way of programming.
The number of person-hours wasted on over-engineered products that never even made it to release could have: solved the halting problem, delivered AGI v2.0, made C memory-safe without compromising backward-compatibility, or made it easy to adjust mouse pointer speed on Linux.
Generality can really hurt performance. Duplicating specialized code to handle different cases can really help optimize specific code hot spots for certain data patterns or use cases.
So DRY isn’t an obvious default for me.
I'd love examples where DRY can really hurt performance. Typically what matters most in terms of performance is the algorithm used, and that won't change.
More importantly, cleverer people than me said "premature optimization is the root of all evil"
Premature optimization is about not making a micro-implementation change (e.g. `++i` vs `i++`) for the sake of perceived performance. You should always measure to identify slow points in expected workloads, profile to identify the actual slow areas, make high-level changes (data structure, algorithm) first, then make more targeted optimizations if needed.
In some cases it makes sense, like writing SIMD/etc. specific assembly for compression/decompression or video/audio codecs, but more often than not the readable version is just as good -- especially when compilers can do the optimizations for you.
A lot of times I've found performance increases have come from not duplicating work -- e.g. not fetching the same data each time within a loop if it is fixed.
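A tiny illustration of that kind of win (the fetch_config and process names here are hypothetical): hoist work that doesn't change out of the loop instead of repeating it on every iteration.

# before: the same lookup is repeated on every iteration
def process_all_slow(records, fetch_config, process):
    return [process(r, fetch_config()) for r in records]

# after: the invariant work is done once and reused
def process_all_fast(records, fetch_config, process):
    config = fetch_config()  # fixed for the whole batch
    return [process(r, config) for r in records]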
Not really. Knuth was talking about putting effort into making a non-critical portion of the software more optimized. He's saying put effort into the smaller parts where performance is critical and don't worry about the rest. It's not about `++i` vs. `i++` (which is semantically different but, with modern compilers, otherwise not an optimization anyway; but I digress).
That was my point, though. Don't worry about minor possible changes to the code where the performance doesn't matter. For example, if the ++i/i++ is only ever executed at most 10 times in a loop, is on an integer (where the compiler can elide the semantic difference) and the body of the loop is 100x slower than that.
If you measure the code's performance and see the ++i/i++ is consuming a lot of the CPU time then by all means change it, but 99% of the time don't worry about it. Even better, create a benchmark to test the code performance and choose the best variant.
That's not my interpretation. If you're profiling and benchmarking you're already engaging in (premature) optimization. This process you're describing of finding out whether `i++` is taking a lot of CPU time and then changing it is exactly what Knuth is saying not to worry about for 97% of your code. Knuth is saying it doesn't matter if `i++` is slow if it's in a non-performance critical part of your code. Any large piece of software has many parts where it doesn't matter for any practical purpose how fast they run and certainly one loop in that piece of software doesn't matter. For example, the software I'm working on these days has some fast C code and then a pile of slow Python code. In your analogy all the Python code is known to be much slower than the C code, we don't need a profiler or benchmarks to tell that, but it also doesn't matter because the core performant functionality is in that C code.
Knuth says forget about small efficiencies in 97% of your code. Indeed, the `i++` optimization isn't apt to make more than a small difference, even with the most naive compiler, but other decisions could lead to larger chasms. It seems he is still in favour of optimizing for the big wins across the entire codebase, even if it doesn't really matter in practice.
But it's your life to live. Who cares what someone else thinks?
The optimizations he was talking about were things like writing in assembly or hand-unrolling loops. It was assumed that you've already picked a performant algorithm / architecture and are writing in a performant low-level language like C.
Also, your digression about modern compilers is irrelevant to the context of the quote, since Knuth talked about premature optimization at a time when compilers were much simpler than today.
This quote is often taken out of context, here's the full quote: "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
If you want a specific example look at something that needs to be performant, i.e. in those 3%, let's say OpenSSL's AES implementation for x86, or some optimized LLM code, you'll see the critical performance sections include things that could be reused, but they're not.
Also the point Knuth is making is don't waste time on things that don't matter. Overuse of DRY falls squarely into that camp as well. It takes more work and doesn't really help. I like Go's proverb there of "A little copying is better than a little dependency."
Knuth was talking about a very specific thing, and the generalization of that quote is a misunderstanding of his point.
Source: Donald Knuth on the Lex Fridman podcast, when Lex asks him about that phrase
I wasn't aware this was discussed, thanks for the pointer! I'm curious now what he says he was talking about ;)
Here's that segment: https://www.youtube.com/watch?v=74RdET79q40
I'd love examples where DRY can really hurt performance.
A really common example is overhead of polymorphism, although that overhead can vary a lot between stacks. Another is just the effect caused by the common complaint about premature abstraction: proliferation of options/special cases, which add overhead to every case even when they don’t apply.
use compile time polymorphism
premature abstraction -> not understood dry (https://news.ycombinator.com/item?id=40525064#40525690)
This might not be a perfect example, but there's a paper by Michael Stonebraker "One size fits all": an idea whose time has come and gone
It might not specifically be DRY, but still related generic vs specialized code/systems.
In the general case, it usually depends on the latency of what you'd DRY your code to vs the latency of keeping the implementation local and specialized.
If you're talking about consolidating some code from one in-process place to another in the same language, you're mostly right: there's only going to be an optimization/performance concern when you have a very specific hotspot -- at which point you can selectively break the rule, following the guidance you quoted. This need for rule-breaking can turn out to be common in high-performance projects like audio, graphics, etc but is probably not what the GP had in mind.
In many environments, though, DRY'ing can mean moving some implementation to an out-of-language/runtime, out-of-process, or even out-of-instance service.
For many workloads, the overhead of making a bridged, IPC, or network call swamps your algorithm choice and this is often apparent immediately during design/development time. It's not premature optimization to say "we'll do a lot better to process these records locally using this contextually tuned approach than we will calling that service way out over there, even if the service can handle large/different loads more efficiently". It's just common sense. This happens a lot in some teams/organizations/projects.
IMO it hurts developer productivity more than performance, because it introduces indirection and potentially unhelpful abstractions that can obscure what is actually going on and make it harder to understand the code.
In raw performance this could manifest as issues with data duplication bloating structures and resulting in cache misses, generic structures expressed in JSON being slower than a purpose-built struct, or chasing pointers because of functions buried in polymorphic hierarchies. But I doubt that any of this would really matter in 99% of applications.
Langchain. Helps on the initial productivity, is a nightmare on the debugging and performance improvement end.
In an effort to DRY, you add a bunch of if statements to handle every use case.
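A sketch of how that tends to look (the shipping-cost example is made up): the "shared" version grows a flag and a branch for every caller, while the duplicated versions stay flat.

# one DRY'd function accumulating a flag per use case
def shipping_cost(weight_kg, express=False, international=False, gift=False):
    cost = 5.0 + 1.2 * weight_kg
    if express:
        cost += 10.0
    if international:
        cost *= 2.5
    if gift:
        cost += 3.0  # only one caller ever uses this
    return cost

# versus separate, flat functions per actual use case
def domestic_shipping_cost(weight_kg):
    return 5.0 + 1.2 * weight_kg

def international_express_cost(weight_kg):
    return (5.0 + 1.2 * weight_kg) * 2.5 + 10.0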
I think it really depends and it's a case where a lot of engineering judgment and taste comes to bear. For example right now I'm maintaining a Jenkins system that has two large and complicated pipelines that are about 90% overlapping but for wretched historical reasons were implemented separately and the implementations have diverged over the years in subtle ways that now make it challenging to re-unify them.
There is no question in my mind that this should always have been built as either a single pipeline with some parameters to cover the two use-cases, or perhaps as a toolbox of reusable components that are then used for the overlapping area. But I expect the mentality at the time the second one was being stood up was that it would be less disruptive to just build the new stuff as a parallel implementation and figure out later how to avoid the duplication.
You are describing technical debt, not conscious decisions to be DRY or not DRY.
Hmm. Certainly there's no doubt that there's technical debt ("do it this way for now, we'll clean it up later") here too, but I think there was also a conscious decision to build something parallel rather than generalizing the thing that already existed to accommodate expanding requirements.
not drying is technical debt
Generality can really hurt performance.
Only in critical regions of code though.
I agree for very specific situations, but compilers tend to get better at optimization over time, and it can be better to express plain intent in the code and leave low-level optimization to the compiler, rather than optimizing in code and leaving future hardware/compiler improvements on the table.
The example is hilariously terrible. Firstly, this is the currently required code:
from datetime import datetime

def set_deadline(deadline):
    if deadline <= datetime.now():
        raise ValueError("Date must be in the future")

set_deadline(datetime(2024, 3, 12))
set_deadline(datetime(2024, 3, 18))
There simply is no trade-off to be made at this point. Perhaps there will be eventually, but right now, there is one function needed in two places. Turning two functions that already could be one into a class is absurd.

Now, as far as teaching best practices goes, I also dislike this post because it doesn't explicitly explain the pros and cons of refactoring vs not refactoring in any detail. There is no guidance whatsoever (ie: Martin Fowler's Rule of Three). This is Google we're talking about, and newer developers could easily be led astray by nonsense like this. Addressing the two extremes, and getting into how solving this problem requires some nuance and practical experience, would be much more productive.
Almost all programming tutorials, and even books to a certain extent, suffer from the problem of terrible examples. Properly motivating most design patterns requires context of a sufficiently complex codebase that tutorials and books simply do not have the space to get into. This particular case is especially bad, probably because they had the goal of having the whole article fit on one page. ("You can download a printer-friendly version to display in your office.")
There is no guidance whatsoever (ie: Martin Fowler's Rule of Three).
That is completely unfair imo. Although not properly motivated, the advice is all there. "When designing abstractions, do not prematurely couple behaviors that may evolve separately in the longer term." "When in doubt, keep behaviors separate until enough common patterns emerge over time that justify the coupling."
Simplified maxims like "Rule of Three" do more harm than good. Don't couple unrelated concerns is a much higher programming virtue than DRY.
Properly motivating most design patterns requires context of a sufficiently complex codebase
As someone that's made a best selling technical course, I strongly disagree.
It's 100% laziness and/or disregard for the reader.
The reason examples are as bad as they are is that people rush to get something published rather than put themselves in the audience's position and make sure it's concise and makes sense.
It's not like webpage space is expensive. There's plenty of room to walk through a good example, it just requires a little effort.
What does sales have to do with what you're claiming? Please share the course and or examples of it being done well without requiring that excessive context, so that there's something to support your claim.
Well if my course and teaching was crap I wouldn't get good reviews and therefore many sales. I've spent $0 on marketing.
https://www.udemy.com/neo4j-foundations/
There are many people who do teach and explain topics well. Richard Feynman comes to mind.
I've found Abdul Bari on YouTube to also be an excellent teacher around technical topics.
Not related to the topic at hand, but who buys these courses? Going off the chapter titles it looks like it’s all basic ‘read the documentation’ kind of stuff (to me). I could imagine it being useful to beginners, but not anyone with a moderate amount of experience (they’d just go to the Neo4j documentation).
On the other hand, what beginner starts with Neo4j and Cypher? Is there really enough of them to justify a whole course? Apparently there are, it just feels weird to me.
You're right in that if you go through the docs you can find all the info you might need.
It's really catered for beginners, people that have next to no knowledge of graph databases or Neo4j and want to get up to speed in just a few hours.
I imagine some people may not even be super technical, but may want to learn just the basics of querying a DB at work to get some basic info out of it.
Apart from lessons there are also exercises for people to practice what they just learnt, and I do my best to point out gotchas and keep it mildly entertaining with a gentle progression in difficulty.
It's not like webpage space is expensive. There's plenty of room to walk through a good example, it just requires a little effort.
Right at the top of the page:
A version of this post originally appeared in Google bathrooms worldwide as a Google Testing on the Toilet episode. You can download a printer-friendly version to display in your office.
So no, there isn't room for a longer example.
It's not like webpage space is expensive.
It is not the webpage space. It is people's limited attention spans and ability to focus. A complex example is needed to properly motivate certain concepts, but too complex an example also contains too many other details that the reader gets bogged down/distracted from the main concept being discussed.
At least that is my hypothesis for why almost all programming books and tutorials have terrible examples. I am happy to be proven wrong.
Coming back to the article, I looked at some of the previous articles from the same series, and to me it feels like a very conscious decision to only include 3-4 line code examples.
Your example, deduplicating the two functions into one, illustrates an interesting point, although I'd prefer still having the two specialized functions there:
from datetime import datetime

def set_deadline(deadline):
    if deadline <= datetime.now():
        raise ValueError("Date must be in the future")

def set_task_deadline(task_deadline):
    set_deadline(task_deadline)

def set_payment_deadline(payment_deadline):
    set_deadline(payment_deadline)

set_task_deadline(datetime(2024, 3, 12))
set_payment_deadline(datetime(2024, 3, 18))
You lose absolutely nothing. If you later want to handle the two cases differently, most IDEs allow you to inline the set_deadline method in a single key stroke.

So the argument from the article...
Applying DRY principles too rigidly leads to premature abstractions that make future changes more complex than necessary.
...does not apply to this example.
There clearly are kinds of DRY code that are less easy to reverse. Maybe we should strive for DRY code that can be easily transformed into WET (Write Everything Twice) code.
(Although I haven't worked with LISPs, macros seem to provide a means of abstraction that can be easily undone without risk: just macro-expand them)
In my experience, it can be much harder to transform WET code into DRY code because you need to resolve all those little inconsistencies between once-perfect copies.
You lose absolutely nothing. If you later want to handle the two cases differently, most IDEs allow you to inline the set_deadline method in a single key stroke.
Problem with unintentional coupling isn't that you can't undo it. It is that someday someone from some other team is going to change the method to add behaviour they need for their own use case that is different from your own and you won't even notice until there is a regression.
In this case (which shouldn't happen, because it requires that you merged things that don't belong together; see accidental duplication), at least the one changing the method has all the information at hand and doesn't have to keep a potentially complex graph of copied code in their head.
I can only assume the Google example would be part of a script/cli program that is meant to crash with an error on a bad parameter or similar. Perhaps the point is to catch the exception for control flow?
My personal goal is to get things done in as few lines of code as possible, without cramming a bunch on one line. Instead of coming up with fancy names for things, I try to call it by the simplest name to describe what it's currently doing, which can be difficult and is subjective.
If we wanted to define a function which crashes like the example, I would probably write this:
from datetime import datetime

def throw_past_datetime(dt):
    if dt <= datetime.now():
        raise ValueError("Date must be in the future")

If the point is not to crash/throw for control flow reasons, I'd write this in non-cli/script code instead of defining a function:

dt = datetime(2024, 5, 29)
if dt < datetime.now():
    # Handle past date gracefully?
    ...
If it needs to do more in the future, I can change it then.

I was going to say you were talking nonsense, but then realized I'd replaced the original post in my mind with this much nicer post that someone else linked in this thread:
https://verraes.net/2014/08/dry-is-about-knowledge/
They essentially say the same thing, but one is better than the other.
DRY code (usually with a lot of IF blocks to handle special cases, or various OOP lasagna) eventually turns into an unmaintainable nightmare where every trivial new feature can take hours to implement and is very difficult, full of the cussing, hair-pulling kind of programming where every 5 minutes you think "we need to rewrite everything from scratch, the system wasn't designed for this". Every change breaks a million different unrelated things because of the complexity of extremely DRY functions.
In WET code (write everything twice) everything looks primitive, as if it was written by a complete newbie, and every change needs to be added in multiple places, but each change is trivial and time to finish is predictable. I would go as far as calling the code boring. The most difficult thing is to resist the temptation to remove the duplication.
In WET code (write everything twice) everything looks primitive, as if it was written by a complete newbie, and every change needs to be added in multiple places, but each change is trivial and time to finish is predictable. I would go as far as calling the code boring. The most difficult thing is to resist the temptation to remove the duplication.
This only scales so far. After some point, it's very easy to run into cases where you meant to change something everywhere but forgot/didn't know about others. Not to say everything should be so compartmentalized as to restrict change, but there is a balance to be had.
Yes, what actually happens is that many code changes are released half-baked because logic only got updated in 1 (or 13) of the 14 places that needed to be updated, and the cussing and hair pulling just starts later.
Tests, baby
Tests don't really help you when a newly discovered bug affects logic copied in ten places and you're only aware of two of them. You can add a regression test to the places that you update, but not the others. And then if there's another bug discovered in the duplicated code, a different subset of the copies might get changed and have tests added. Suddenly it looks like these different versions of the repeated logic are intended to be behaving differently for some unknown reason even though the divergence is purely accidental.
Which is why you need a balance between WET and DRY. DAMP = Don't Alter in Many Places.
I've never heard this one before, but I love it. Unfortunately we've also got "Don't Abstract Methods Prematurely" and "Descriptive And Meaningful Phrases".
Or, use a sufficiently well designed type-checking compiler, like GHC.
I've seen code like this, what eventually happens is that all your 'copies' drift to be slightly different. Fixes get applied to some but not all of them, people copy from old code vs. new code, etc. And whenever you need to apply a fix you spend hours trying to figure out where each copy is, what it is supposed to be doing (since they're all different), and how the fix can be applied to it. You inevitably don't find them all and repeat the cycle.
This is also true. Some versions can be buggy and then a fix might not get everywhere. My favorite example is C code bases with a multitude of linked list implementations.
Like many things in software, knowing when to do something and when not to can be hard. Premature DRY, as the article mentions, can lead to difficulty when the use case eventually diverges. Re-implementing everything every time, everywhere, is also silly. As mentioned elsewhere, I like the rule of 3: if you have 3 examples that point strongly in a certain direction, that's probably a good direction to follow.
Which tool do you use to manage the copies of code segments, including possible modifications?
I imagine it's difficult to keep all of them in your head
I think the key is more to ask yourself what it is you're abstracting away and whether the two things are actually doing the same thing. Just because the code is "shaped" in the same way (as in the article) doesn't mean it's actually conveying the same idea. If they're not really the same then the abstraction won't make sense and will just make things messier down the road.
It's the same thing as naming constants, but with code. If I have 3 `10` magic numbers in my code, I don't just immediately abstract them into a `const ten = 10` because they look the same, I abstract them into constants based on their actual purpose.
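For instance (made-up constant names):

# same literal, different meanings: name them by purpose, not by value
MAX_RETRIES = 10
ITEMS_PER_PAGE = 10
DISCOUNT_PERCENT = 10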
Why does everything need to be either DRY or WET? Like most things in life, there are no absolutes in programming, only tradeoffs.
Those may be good principles to think about when you are starting out with programming, but the key is to gain experience trying to solve problems in different ways.
Over time, you get better at making the right decision about whether duplicative code should be refactored or left alone.
To be fair, we can’t always make the “right” decision, but we can at least try to make the best decision we can based on the knowledge we have.
but each change is trivial and time to finish is predictable
Until your ‘if’ soup reaches all the locations in your codebase, and now you have 10 different places with too many if statements instead of one. Likely all touching slightly different things, so you can’t simply copy from one file to the other any more.
DRY is _not a best practice_. Repetition is a "code smell": it often suggests a missing abstraction that would allow for code reuse (what sort of abstraction depends on the language and context), but "blindly-drying" is, in my experience, the _single most frequent mistake_ made by mid-to-senior engineers.
My experience is mostly in Ruby though, so I'm not sure how well it generalizes here :-)
Premature DRY can lead to the wrong abstractions. Sometimes code looks similar but actually isn't.
At my first big corporate jobs, I got to work on a codebase that was nothing but premature DRY’d code, but I didn’t know it at the time. As someone who was self taught, and suffered from imposter syndrome as many of us do/did in that situation, I thought I was missing something huge until I was talking to a senior developer and these strange design decisions came up, to which he said something like
Yeah, that was written by <ex-engineer> and he couldn't abstract his way out of a paper bag
I guess the real lessons were the crappy decisions that someone else made along the way.
It would be better to make a class for languages where DRY is not a best practice, then create classes of languages where it is a best practice or may be a best practice through multiple inheritance. To keep things simple.
:)
FWIW I completely agree in Python, Java, TypeScript, and Go. I've seen people just parrot dogma about DRY and SOLID principles where their DRY'd code is completely not open to extension, etc.
Premature dry'ing is the same as premature engineering. And lest someone go 'oh so YAGNI is all you need'... no, sometimes you are going to need it and it's better to at least make your code easily moldable to 'it' now instead of later. Future potential needs can absolutely drive design decisions
My whole point is that dogma is dumb. If we had steadfast easy rules that applied in literally every situation, we could just hand off our work to some mechanical turks and the role of software engineer would be redundant. Today, that's not the case, and it's literally our job to balance our wisdom and experience against the current situation. And yes, we will absolutely get it wrong from time to time, just hopefully a lower percentage of occasions as we gain experience.
The only dogma I live by for code is 'boring is usually better', and the only reason I stick by that is because it implicitly calls out that it's not a real dogma in that it doesn't apply in all cases.
(Okay, I definitely follow more principles than that, but I don't want to distract from the topic at hand)
My experiences are the same in C++ and Python. C++ in particular can get way out of hand in service of DRY.
Yeah I've had so many problems with understanding and working with other people's code bases when the person was obsessed with DRY.
You wrote that code 4 years ago with tons of abstractions designed for some day someone not having to repeat themselves... but it's been years and they've never been useful. However I've had to dig through a dozen files to make the change I needed to make which by all rights should have been entirely contained in a few lines.
My most common reaction to a new codebase is "where the hell does anything actually get done" because of silly over-abstraction which aspires to, one day, save a developer five minutes or three lines of copied code.
"blindly-drying"
Right. It's not an optimization problem!
Remember in school when you learned to turn a truth table into a Karnaugh map and then use it to find the smallest equivalent logic expression? Well, your code is not a Karnaugh map, is it?
Your code should be WET before it’s DRY (Write Everything Twice).
The rule of three[1] also comes to mind and is a hard learned lesson.
My brain has a tendency to desire refactoring when I see two similar functions, and it's almost always a bad idea. More often than not, I later find out that the premature refactoring would've forced me to split the functions again.
1: https://en.m.wikipedia.org/wiki/Rule_of_three_(computer_prog...
Nice, I advocate for this but never knew it was a more formal thing.
better understand DRY first, before you add lots of technical debt to your code: https://news.ycombinator.com/item?id=40525064#40525690
Yes, why?
Because you’re unlikely to write a good abstraction until you need it more than twice.
And if you only need the code twice, you very likely wasted time writing the abstraction because copying updates between the two locations is not hard.
This is a rule of thumb, I’m not trying to tell anyone how to do their job.
Or alternatively, Write Everything Today.
DRY, when it's wielded as a premature optimization (like all other premature optimization), prevents working code that is tailored to solving a problem from shipping quickly.
I don't know about anyone else, but I've been deeply unimpressed with the output of the google testing blog.
This example is not wrong, but it's not particularly insightful either. Sandi Metz said it better here, 8 years ago https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction
The testing pyramid nonsense is probably the worst one though. Instead of trying to find a sensible way to match the test type to the code, they pulled some "one size fits all" shit while advertising that they aren't that bothered about fixing their flaky tests.
I think you're holding some of these to too high of a bar. This is a one-page article intended to be posted in company bathrooms. Of course it's less comprehensive than a longer blog post.
It's not like the subtitle says "not to be taken seriously" and they are representing a brand that is supposed to stand for engineering excellence.
Google doesn't test! That is what production and SREs and users are for.
Most seasoned software engineers stopped following Google in that respect a long time ago. They are not a tech shop any more; it's just an ad business now with lots of SRE work.
The article builds a straw man though. The "bad example" is bad because it introduces OOP for no reason at all.
What's wrong with:
from datetime import datetime

def set_deadline(deadline):
    if deadline <= datetime.now():
        raise ValueError("Date must be in the future")

set_task_deadline = set_deadline
set_payment_deadline = set_deadline
You don't need code duplication to avoid bad abstractions.

Later down the line, if you want to have separate behaviour for task deadlines vs payment deadlines, you're going to have to go through your codebase and look at every call to set_deadline and figure out whether it's being used to set a task deadline or a payment deadline. If you have an inkling that the deadlines might need different behaviour, the “good example” can save you an annoying refactor in the future.
Don’t make the symbol public, or call it _set_deadline, or whatever is the idiom in Python. The point of this example is ofc not having set_deadline be used, but the other symbols.
Again, you don’t need to duplicate a function body just to have semantic names.
Its not about OOP but the probability that those two functions will diverge. Linked elsewhere in the comments too, this article (https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction) is probably better at articulating the point.
Exactly. I was happy to see a code example, but facepalmed when I actually read it.
I mean, sure. I'm generally more WET than most of my colleagues, but this...
Applying DRY principles too rigidly leads to premature abstractions that make future changes more complex than necessary.
... is just one of those things that sounds wise, but is just basically a tautology. Use the best tool for the job, etc. No kidding? Would never have thought of that on my own, thanks sensei.
Seriously, the issue with the quoted statement is not that it's new to anyone, it's that no one thinks they ARE applying DRY principles "too rigidly". This is just chin beard stroking advice for "everyone else".
Well, then, here's some advice:
Learn when to DRY, and when not.
No, you probably don't know as well as you think you do. No, you're not going to get there by grinding leetcode. No, you aren't going to get there quickly, or without a lot of interaction with more-experienced peers, or without being told that your judgment is bad a few times. (And if you don't listen - really listen - then you don't learn.)
Good judgment in these things takes time and experience. If you have a year of experience and think you know, you're probably wrong.
takes time and experience
Or maybe asking yourself what end goal you're trying to achieve by building an abstraction.
it's that no one thinks they ARE applying DRY principles "too rigidly"
But they should think twice if they are building abstractions only for the sake of DRY.
Reminds me of a conversation I had with a project manager. To match the example, I'll recast it in terms of deadlines.
Project manager: Sam is working on a deadline validator. You made a deadline validator last sprint right? Could Sam use yours?
Me: No, unfortunately not. My deadline validator enforces that deadlines are in the future and are aligned with midnight UTC, to ensure correct date calculations in the database. The deadline validator Sam is working on does not enforce those restrictions. Sam's deadline validator will be applied to user input for an entirely different field, where deadlines don't have to be at midnight and are just as often in the past as in the future. In fact, Sam's validator only checks that a deadline string has the expected format and is within twenty years of the present day. My validator operates on timestamps sent as integers from another service, not string values uploaded by users.
Project manager: So your deadline validator is not reusable at all? That's unfortunate. Is there something we could have done differently to avoid this redundant work?
This really hit close to home. Many years ago a project manager said the same thing to me. Unfortunately, in that case, Sam's new deadline validator had much simpler requirements than the developer's. The developer's deadline validator took ten function arguments to customize its behavior in various ways. One of the arguments, a stringly typed one, allowed the validator to match the fractional seconds against a pattern string. Sam thought providing the empty string "" would ask the validator to accept all possible fractional values, but unfortunately it required the magic string "*" for that, and the empty string "" only allowed zero fractional values, i.e. the deadline had to be an integral number of seconds.

An outage happened because the system rejected every single deadline that wasn't an integral number of seconds.
Developer, next time: No, I made a real-time database constraint policy enforcement engine. Totally different thing.
In fact, Sam's validator only checks that a deadline string has the expected format and is within twenty years of the present day.
You two have been talking about this longer than it took Sam to implement by now ;)
People talk about DRY and then happily type pip/gem/npm into their terminal, and never look at 99 percent of what they just downloaded...
Did we all forget leftpad? https://qz.com/646467/how-one-programmer-broke-the-internet-...
Isn't leftpad the natural conclusion of DRY? Everything is a unique, small, contained and tested library that other code can depend on instead of reimplementing it? The ultimate one source of truth, where if it breaks half the internet breaks.
Isn't leftpad the natural conclusion of DRY? Everything is a unique, small, contained and tested library that other code can depend on instead of reimplementing it?
There is nothing in dry that says "util" or "frameworks" or "toolchains" are bad.
The ultimate one source of truth, where if it breaks half the internet breaks.
Dry says nothing about versioning, or vendoring or deleting your code from the internet...
The reality is that leftpad wasn't used by that many things. It's just that the things that did use it were all over the dependency graph...
My rule: repeat yourself 3 times. On the 4th, re-factor.
A good alternative to DRY is WET, or “Write Everything Twice.” Or, in your case, “Write Everything Thrice”. Both better alternatives than automatic, dogmatic DRY.
they aren't better and lead to maintenance hell
I think the only hard and fast rule is to DRY the code that will introduce a bug if you change it in one place and not the other. And if it will, at least do a fat comment in both places for posterity.
Whenever I have to have a "mental model" of the code, I know I screwed up.
+1. If I go with the comment option, then I'll sometimes write a comment like "If you change this here, then you must change it everywhere with this tag: UNIQUE-TAG".
This way, the reader can just do a global grep to find all the places to change, and you don't have to list them in each place and keep them in sync.
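Something like this (the tag name is made up):

# PRICE-ROUNDING-V2: if you change this rounding, grep for the tag and update every copy.
def round_invoice_total(cents):
    return (cents // 5) * 5

# PRICE-ROUNDING-V2: duplicated on purpose; see the note on the tag above.
def round_cart_total(cents):
    return (cents // 5) * 5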
A comment is a nice addition, but the very least is to ensure that your test suite properly covers cases where changing one and not the other will introduce a problem. This not only ensures that both are changed, but that both are changed in the way they need to be. A comment alone may prompt you to change both (if you ever read it - I bet a lot of developers don't), but you may not notice when you fail to change them in the same way, which is no better than not changing one.
A practical rule for the presented problem is "wait until you have 3". The number being in reference to the amount of different cases which need to be handled. You're not likely to catch everything that will come up but you'll get enough to think of an extensible abstraction if you don't realize that you already have a workable one.
It really depends on what this code is doing. If it is dialog window rendering - yes, not so important. If it's complicated data validation - you better make it reusable and pure from the beginning.
I agree. The data validation example doesn't seem contrary to the advice; you would generally have at least three data types you need to handle in such a case.
The difference with the dialog window is that (presumably) you don't know the different flavors of window you'll need to render so adding an abstraction on top of the existing rendering abstraction fits squarely in "premature optimization".
The term "DRY" as commonly used conflates distinct situations and objectives that should be handled differently in most cases.
There is the "single source of truth" problem where you need to compute something exactly the same way at all points in the software that need to compute that thing. In these cases you really want a single library implementation so that the meaning of that computation does not accidentally diverge over time since there is a single implementation to maintain.
There is the "reuse behaviors in unrelated contexts" problem where you want to create implementations of common useful behaviors that can be abstracted over many use cases, often in the context of data structures and algorithms. In these cases you really want generics and metaprogramming to codegen a context-specific implementation rather than sharing a single implementation with a spaghetti mess of conditionals.
DRY works best when it fits neatly and exclusively into one of these two categories. Cases that fit in neither category, such as the practice of decomposing every non-trivial function into a bunch of micro-functions that call each other, are virtually always a maintenance nightmare for no obvious benefit. Cases that fit into both categories, such as expansive metaprogramming libraries, become difficult to maintain by virtue of the combinatorial explosion of possible implementations that might be generated across the allowable parameter space -- the cognitive overhead grows exponentially for what is often a linear increase in value.
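A rough sketch of the second category in Python (the BoundedStack class is hypothetical): the behavior is written once and parameterized by type, rather than one shared implementation sprouting runtime conditionals.

from typing import Generic, TypeVar

T = TypeVar("T")

class BoundedStack(Generic[T]):
    """One generic behavior, reused across unrelated element types."""

    def __init__(self, limit):
        self._items = []
        self._limit = limit

    def push(self, item):
        if len(self._items) >= self._limit:
            raise OverflowError("stack is full")
        self._items.append(item)

    def pop(self):
        return self._items.pop()

# context-specific uses share the behavior without a pile of conditionals
undo_history: BoundedStack[str] = BoundedStack(limit=100)
pending_jobs: BoundedStack[int] = BoundedStack(limit=10)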
Common Lisp works well for the second case. But it seems that some programmers are uncomfortable with the notion of code generating code. And as you said, it does require discipline as you need to focus the language's power. Other languages don't let you solve the boilerplate problem so readily. Instead you have a mess of utility functions or a huge class tree.
such as the practice of decomposing every non-trivial function into a bunch of micro-functions that call each other,
That has approximately nothing to do with DRY. At best it might technically be a violation of DRY if (and only if) some of those micro-functions are identical, but the correct way to fix that is to recompose them into the non-trivial (and non-repeated) functions they're more usefully expressed as. And more often it's just a totally independent refucktoring that makes the codebase worse, entirely orthogonally to 'DRY-ness'.
https://grugbrain.dev/#grug-on-dry
grug begin feel repeat/copy paste code with small variation is better than many callback/closures passed arguments or elaborate object model: too hard complex for too little benefit at times
hard balance here, repeat code always still make grug stare and say "mmm" often, but experience show repeat code sometimes often better than complex DRY solution
something i have learned the hard way is that DRYing out too fast paints you into architectural corners you don't even know are there yet.
read the whole article and wow!
grug tell facts
is free country sort of
I think DRY should be more "Don't repeat assumptions".
Or rather, don't assume the same thing in two different places, especially not implicitly.
Avoiding code duplication mostly follows from that.
That's barely possible because code has to coherently work on a shared goal. But I agree in principle -- reduce duplication of assumptions as much as possible.
One way that helps with that is creating abstractions. I don't mean clever grandiose abstractions, those are extremely hard to get right. I mean precisely those abstractions that factor shared code.
But even before that, it's important to get the control flow right to minimize doing the same thing multiple times in the codebase. Because even when the implementation is in a central place, calling it multiple times from different locations is what you say, duplicating assumptions.
I think it's quite often possible! Most code acts as an implication. Given this piece of data (of this shape), I can compute that. Or: given that I'm a valid object (constructor completed successfully), I can do this and that.
However, very often, unstated assumptions sneak in.
For example, assuming to understand how to interpret a certain string, which usually causes all kind of escaping issues.
This DRY principle has ruined so many codebases, merging so many facets into one giant monster of complexity that later has to be specialised with flags and enums, that I can't count how many times I have seen such clever PRs.

IMHO, some of the clean code books have ruined the industry as much as the preachers of microservice virtues have.
Second place goes to the JavaScript tooling.
Can you honestly say with a straight face that those same code authors would be any better if they hadn't read those books? Sometimes the problem isn't the book but a lack of critical thinking. Using a tool or or method because a book said to use it and not because it was the time and place, is an obvious sign of that to me.
And here's the kicker: if the individual was so easily taking inspiration from a book, what made it difficult for them to take inspiration from you through a dialogue? Hopefully your arguments were stronger than your HN comment.
This reads like a paraphrase of this widely circulated post from 8 years ago…
Thanks! I was looking for this blog post for a while now
Reminds me of this[1] great blogpost: "This abstraction adds overhead. 'Abstracting' the common operation has made it more difficult to read, not less difficult to read. People who consider meta-programming some sort of Black Magic often make this exact point: the mechanism for removing duplication adds complexity itself. One view is that the overall effect is only a win if the complexity added is small compared to the duplication removed."
[1]: http://weblog.raganwald.com/2007/12/golf-is-good-program-spo...
maintainability is more important than readability alone
To the author: please do not use non-ascii quotes (“”) in code.
Some blog engines try to be too fancy and do it automatically
The problem with all the "best practices" is that quite often they are sensible within some context but can be a detrimental tradeoff in a different one.
"it depends" is almost always the correct answer.
But what we see is young, inexperienced and zealous coders trying too hard and implementing the so-called best practices before they understand them.
And I don't think there are too many shortcuts to replace experience.
My advice for beginners and intermediate is to first stick to the simplest solution that works, and don't be afraid to rewrite.
And I don't think there are too many shortcuts to replace experience.
But that's the point of these blogs: helping those without experience. Should we leave them to flounder on their own until they "figure it out" instead of trying to pass along wisdom?
There's evidence that the best approach is, yes, experience – but with Expert Feedback. In practice, this looks like pairing and informal apprenticeship with competent, seasoned engineers.
I can confirm from my own experience how much you can learn from working with engineers "further down the road".
On an unrelated note, this "Google Blog" thing appears to have [at least] three different domain names that redirect to the same url: https://blog.google.com, https://blog.google and https://googleblog.com, why is that?!
History? The Google TLD is relatively new and would have been created last
Fixing duplication is far easier than the wrong abstraction.
Yes. If you abstract without a specific need, you are likely to end up with an abstraction that either wastes time because it's never used or that you will later need to fight against because the changes you need to make don't mesh well with it. At that point you have to choose between a lengthy rework of the code or awful hacks to bypass the abstraction.
Don't use DRY but WET: Write Everything Twice. :o)
In practice: The second time you implement something, start out by copy-pasting the first implementation. Once you're done with the second version, figure out if and how to abstract the two implementations.
Basically never ever abstract something if you only have two copies. Copy/paste and move on and be productive. It's usually only when you have a third or fourth copy that you start to see whether there is any inherent abstraction to be gleaned.
I think it depends on how you deduplicate your code. Creating a DeadlineSetter as illustrated is definitely too much, but creating a function:
from datetime import datetime

def assert_datetime_in_future(dt):
    if dt <= datetime.now():
        raise ValueError("Date must be in the future")
and then calling that from both places seems fairly reasonable to me.

Right? Creating a noun instead of a verb is the real anti-pattern I see here. (Once you have a DeadlineSetter, it's a slippery slope down to ClassInstanceFactoryConfigProxyManager, etc.)
I find the term DRY to be pretty vague.
Let's say you have a business rule that you can never have more than 5 widgets. You can make this assumption in multiple places, even with totally different code, and that duplicated knowledge is damaging when the rule changes to allow 6. On the other hand, a bit of duplicated HTML can help, as the copies may only be the same by accident.
In this case, '5 widgets max' would be a parameter that should be defined as a global constant, instead of hard-coding 5 all over the place or, worse, copy-pasting pieces of code 5 times... That's a standard good coding practice.
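Something like this, with made-up names:

MAX_WIDGETS = 5  # the business rule lives in exactly one place

def can_add_widget(current_count):
    return current_count < MAX_WIDGETS

def validate_order(widget_count):
    if widget_count > MAX_WIDGETS:
        raise ValueError(f"Orders are limited to {MAX_WIDGETS} widgets")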
DRY is more about support and maintenance than anything else.
I see a lot of attacks on DRY these days, and it boggles my mind. Maybe it is being conflated with over-engineering/parameterization/architecting. I don't know.
But I do know that having to fix the same bug twice in the same code base is not a good look.
It’s not that. It’s when you need to change how the function behaves but for only one of the callers.
Boilerplate that you can't get wrong is better than DRY in most cases
by "get wrong" I mean through static analysis (linters or type checkers) or if it is plainly obvious by running it.
isn't encoding a requirement in the type applying DRY as well then?
There is an unpraised advantage of keeping code, uh wet?, for as long as possible. When you do decide that a refactor is required, you have real use cases to test your abstraction against. While I am broadly in agreement with the article because of that, people designing public binary APIs don't have the luxury of delaying these choices.
Seems like a strawman. The thing being repeated here is something which raises if the datetime isn't in the future. So abstract that out and you then get both methods calling raiseIfDateTimeNotInFuture() which then also serves as documentation.
(But yes, if the actual code is as simple as this example, you may as well just repeat it.)
Especially with AI. It’s better to teach the AI many examples and let it understand the implicit abstractions. Who needs to worry about reusable higher order abstractions when the machine just busts out the code you need in a single file?
Related: I do believe starting off at the "second" level of abstraction (as opposed to implementing the direct surface area of the service) is not premature, as it helps to better understand the problem space; and on the implementation side, as soon as you identify the building blocks, the rest is really just boilerplate. If you have time, rinse and repeat.
This comes from within Google, which strongly embraces Go -- a language famously impaired when it comes to abstraction capabilities. This opinion has been voiced here before: https://news.ycombinator.com/item?id=8316520
Domain knowledge is just as important when programming as the craft itself. This is why I have to have a great relationship with subject matter experts and develop domain knowledge fast, so abstractions can be a better fit.
I wonder why nobody mentioned it: there is one more advanced principle, AHA (Avoid Hasty Abstractions): https://kentcdodds.com/blog/aha-programming#aha- Overusing the DRY principle can make software almost unsupportable.
This is a poorly-selected example, as the real problem here is not the DRY validation, it's that the programmer is abstracting the wrong thing.
Ending up with an awkward class name like `DeadlineSetter` is a dead giveaway that your abstraction boundaries don't make sense - if instead you abstract `Deadline`, and put the invariant check in the constructor thereof, you solve both problems.
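A rough sketch of that alternative (this Deadline class is mine, not the article's): the invariant lives with the concept, and callers stay domain-specific without duplicating the check.

from datetime import datetime

class Deadline:
    """A deadline is, by definition, a moment in the future."""

    def __init__(self, value):
        if value <= datetime.now():
            raise ValueError("Date must be in the future")
        self.value = value

task_deadline = Deadline(datetime(2030, 3, 12))
payment_deadline = Deadline(datetime(2030, 3, 18))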
It sometimes takes time to discover the best approach to writing code that handles similar inputs. I once worked on an ingestion pipeline that was supposed to clean up data sent to us by a person whose sole job was to sit in front of a Windows PC running three pieces of software:
a) a terminal running a text mode app extracting data from a mainframe
b) Excel
c) Outlook
Their sole job was to copy data from the terminal, paste it into Excel on a daily basis, save it as a CSV file, and send it as an attachment to the address monitored by a script that was responsible for running the data processing pipeline. Because of the manual nature of the job and the way data was presented in the terminal there were errors, which were really unpredictable. I was not allowed to talk to the person doing this job or visit them in their office, so it took me three months to find out that what was shown on the terminal screen was essentially an 80x25 version of a punchcard and that the position of the fields mattered. Sometimes the user would not copy the whole screen, sometimes there would be an extra character added while pasting data into Excel (not always a stray "v"), sometimes a gremlin character would be added between the mainframe and the terminal (there must have been a serial connection somewhere). Forget proper encoding, JSON or XML, it was raw data, really raw. When I started working on the pipeline it would break on every incorrectly formatted record; when I finished the job and left, it was only barfing on the 5% of bad records, and those that it could not process would be neatly put aside and emailed to the support people responsible for dealing with this client. There was a lot of repetition initially, but I then discovered patterns (random offsets, encoding errors, extra characters) and that allowed me to build a set of generic classes to quickly implement problem-specific handlers.
People focus too much on DRY and not enough on modularisation.
If your functions do everything, DRYing them can be awkward and ineffective. If you stop injecting your business logic into every line and attempt to create "pure" functions, it's so much easier to sprinkle the occasional `if(input.type2) pow(input, 2)` in the business part.
If your codebase isn't at least 20% "utils", I don't want to touch it. I regularly dip into code I wrote months/years back and reuse it without much thought.
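A small illustration of that split (all names made up): the util stays pure and reusable, and the business-specific twist lives at the call site.

# pure, reusable, lives in utils
def normalize(values):
    total = sum(values)
    return [v / total for v in values] if total else list(values)

# business logic stays at the edge, where the special case is easy to see
def weight_scores(scores, boost_premium=False):
    if boost_premium:
        scores = [s * 2 for s in scores]
    return normalize(scores)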
“Anytime you apply a rule too universally, it turns into an anti-pattern“.
Quote from Will Larson found in another HN post (https://review.firstround.com/unexpected-anti-patterns-for-e...) right after checking out this one.
Jimmy Koppel wrote about this 6 years ago [1]. It's one of the first exercises in his software design course [2].
Two identical pieces of code can have different specifications.
# x is the age of a person; the code checks if the person is past the retirement age in the US
def is_of_retirement_age(x):
    return x >= 65

# x is an ASCII character code which is already known to be alphanumeric; this code checks if it's a letter
def is_ascii_letter(x):
    return x >= 65
[1] https://www.pathsensitive.com/2018/01/the-design-of-software...

My rules on code duplication:
- For code in different files you get 3
- For code in the same file you get half a dozen
- For code in the same function you get more like a dozen
- For code in a single repeated block you get as many as you want
It is especially hurtful when people apply DRY immediately on some spaghetti code already mixing abstractions.
Then you find yourself untangling intertwined factorized code on top of leaky abstractions, losing hours/days and pulling your hair out… (I'm bald already but I'm pretty sure I'm still losing hair in these situations)
This is especially true for a data scientist, where most code is throwaway. If you make it all spectacular, you aren't getting anything done. Data scientists' code should be "eventually good," that is to say it gets refactored as it approaches a production environment. I talk about this in my last book, Agile Data Science 2.0 (Amazon 4.1 stars 7 years after publishing).
https://www.amazon.com/Agile-Data-Science-2-0-Applications/d...
I will say that after 20 years of working as a software engineer, data engineer, data scientist and ML engineer, I can write pretty clean Python all the time but this isn't common.
Sedat wrote about this and other useful stuff.
So refreshing to see this kind of wisdom in a concise blog post!
My take:
In beginners, over-emphasis on DRY is a mistake made because they don't yet understand why DRY is considered a best practice.
In more senior developers, over-emphasis on DRY comes from a few psychological desires... 1) to mitigate the uneasy feeling of not knowing what direction the product will take, and 2) the warm feeling that comes from finding a refactor that makes the code more DRY.
What is overlooked is the cognitive overhead required to un-DRY pieces of code when requirements change. Often the result is a DRY but convoluted series of refactors that obscure the intention of the code and (often) obscure system design intention that would otherwise have been quite clear.
Sadly, many otherwise talented software engineers have the kind of minds that prefer micro-level problem solving and are challenged at big-picture reasoning. There is often actual discomfort when too much big-picture reasoning or synthesis is involved. I view this as more of an emotional than a cognitive limitation, and something that is amplified by the conformist culture found in most large organizations (and which many small ones believe it is best to emulate).
Conformity with best practices is valued above real problem solving. Worse still, there are often elaborate discussions of PRs relating to minutia associated with DRYing up code for which it wasn't necessary in the first place.
Sure, as a system matures there are opportunities to remove cruft and DRY code where it is obviously helpful, but it is silly to waste too much time on it until the true requirements of the system are well understood.
Is the duplication truly redundant or will the functionality need to evolve independently over time?
"Looks the same right now" != "Is the same all the time"
Bad abstraction is worse than no abstraction
I still practice DRY, but I try not to overdo it with unnecessary abstractions. More recently I've been practicing SPOT (Single Point of Truth). I interpret this in two ways. One, every piece of data should have a location that reigns over all others. It's okay to have duplicates of the data, such as caches, but any copies of that data should be treated as ephemeral and possibly inconsistent with the source of truth. Two, there is some overlap with DRY where logic that answers a question or computes a result should not be duplicated. A specific function or class which computes something important should probably not be duplicated, but implemented once and reused. A great example is authentication: you most likely should not duplicate code that checks whether a user is authorized to do something. In a sense, the code which computes whether a user is authorized itself becomes a "source of truth".
There are still good reasons to DRY early on. For actions that need to be synchronized, rather than acquiring the same lock in several places, consolidate your code so there are at most a few places where you acquire and release that lock. Same with cache invalidation: having a single class for reading and writing some piece of data makes it much easier to keep the cache consistent.
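A rough sketch of the lock point (BalanceStore and its methods are hypothetical names, not from the comment): funnel every write through one small class so the lock and the cache invalidation live in exactly one place.

import threading

class BalanceStore:
    # Hypothetical single owner of the data, its lock, and its cache.
    def __init__(self):
        self._lock = threading.Lock()
        self._balances = {}   # source of truth
        self._cache = {}      # ephemeral copy, invalidated on write

    def credit(self, account, amount):
        with self._lock:                    # the one place the write lock is taken
            self._balances[account] = self._balances.get(account, 0) + amount
            self._cache.pop(account, None)  # cache kept consistent in the same place

    def balance(self, account):
        if account not in self._cache:
            with self._lock:
                self._cache[account] = self._balances.get(account, 0)
        return self._cache[account]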
Anecdotal but I've found it much easier to start DRY and then later duplicate code that diverges a little bit from the original. What is hard is working with a large code base with lots of duplicate code. When you need to change one thing, but there's multiple places you need to make that one change, I've found it very difficult to track down all the places which need to be changed.
When you can't DRY, or when it's simply not practical, at least try to find a way of keeping track of duplicate code. Using enums and global constants can help: by finding all the references to those symbols, you can locate all the places in your code which need to be updated or refactored.
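For instance (a toy sketch with an invented constant): even if the rule still has to be consulted in several functions, giving it one named symbol means "find references" or a plain grep lists every site that must change together.

# One symbol instead of a scattered literal 5; "find references" now shows
# every place that has to be updated when the policy changes.
MAX_LOGIN_ATTEMPTS = 5

def should_lock_account(failed_attempts):
    return failed_attempts >= MAX_LOGIN_ATTEMPTS

def remaining_attempts(failed_attempts):
    return max(0, MAX_LOGIN_ATTEMPTS - failed_attempts)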
I tend to follow "1, 2, many"
Duplication in one place, I'm often fine with, because you don't yet know the level of abstraction needed with only two examples.
More than twice however and you should be able to see common patterns across all three implementations and be able to isolate it.
This DRY-sceptical viewpoint is a bit similar to database denormalization.
Sure, in theory you want to store every bit of information only once. But in practice it can make a real difference in smoothing out the access pattern if you don’t follow this normalization religiously.
The same applies to code. If you have to jump through hoops to avoid repeating yourself, it will also make it harder for someone else reading the code to understand what’s going on. A bit of “code denormalization” can help the reader get to the point more quickly.
I know it's supposed to be catchy but "Don't Repeat Yourself" is a bit too dogmatic. A little redundancy can absolutely help readability. Obviously you don't want to repeat complicated code blocks that you have to maintain twice.
What this industry has taught me:
1. DRY initially when your organization is afraid of refactoring because they'll never let you touch it a second time if "it works".
2. DRY later when it's clear the code will not change into genuinely separate branching workflows.
3. DRY it initially when mid-level management uses duplication as a metric to evaluate "good" engineers, especially when your salary is impacted by that perception.
1 and 3 have been a symptom of micro-management from the business side. Only 2 is valid.
When I asked him about this, he said, "I have this philosophy that says if you only have two similar things, it's best to write separate code for each. Once you get to a third, then you can think about refactoring and making some common code."
There is also the angle of when eventually the third usecase comes out, how much willingness/buy-in is going to be there to make changes to the running code for the refactoring?
More often than not, nobody wants to take the risk just for the sake of introducing DRYness, and you end up with three copies.
I don't think the distinction between DRY or don't DRY is interesting at all. Instead what matters is how the abstraction is achieved/performed. Good abstractions stand the test of time, poor ones leak.
The main benefit of abstraction isn't to reduce keystrokes; it's to break a program into comprehensible chunks of operation.
Routines that are conceptually identical should share an abstraction. A concept might benefit from an abstraction even if it is only used once. It is never too early to add more intuitive abstractions to your code.
On the other hand, code that is only coincidentally similar in execution should never be forced to share an abstraction.
Cool tip! I learned that by reading other people's code and seeing how much they could get done with far fewer lines of code, and much faster than I could, because they didn't over-engineer from the start. I got the bad habit of making a big architecture from the beginning because I read many books from academics talking about beauty and elegance, while admiring people in the industry who could get cool things done.
Now I have the experience that most of my projects I only understand after months (or years!) of development, observing the users and testing. Only after I have that experience with the project can I actually know what the focus of my engineering should be.
The example is terrible. It's understandable that OP wants to keep it as short as possible. But it is made so simple that it fails to convey the point. You would obviously not want to use the DeadlineSetter class here. It doesn't even ever access its "entity_type" field.
All code bases I have seen in the past 15 years have too little DRY, not too much. Yes, every technique we use has pros and cons and we need to decide in each case whether DRY is worth it. But I worry that people will come away from the article (or even just the headline) with the feeling that "ah, I don't need to DRY". I've been in the situation too many times where somebody copy-pasted and I later had to make it DRY to achieve consistent behavior. Let's err on the side of DRY.
In the example, the right-hand side could either be left as-is. Or it could extract a function:
from datetime import datetime

def set_task_deadline(task_deadline):
    _ensure_is_in_future(task_deadline)

def set_payment_deadline(payment_deadline):
    _ensure_is_in_future(payment_deadline)

def _ensure_is_in_future(deadline):
    if deadline <= datetime.now():
        raise ValueError("Date must be in the future")
This is much better than the straw man example employed by the author.
Just grug it out.
also: never write tests for code that doesn't exist yet, because you gradually slow learning down to a crawl and you are no longer writing features but tests and mocks that offer nothing to the end user.
My maxim: is "it" intrinsically the same, or coincidentally the same?
Intrinsically the same means a rule, and so there should be 1 source of truth for it. Coincidentally the same means it has the same shape but this just happens to be the case, and they should be left separate to evolve independently.
Ultimately, it boils down to really thinking about the domain.
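A toy illustration of that distinction (the domain and names are invented): an intrinsic rule gets one source of truth, while a value that merely happens to match today stays separate so it can diverge.

# Intrinsically the same: one business rule, one source of truth.
MINIMUM_ORDER_TOTAL_CENTS = 20_00

def can_checkout(cart_total_cents):
    return cart_total_cents >= MINIMUM_ORDER_TOTAL_CENTS

def qualifies_for_free_shipping(cart_total_cents):
    # Coincidentally the same number today, but a different rule that may
    # well diverge, so it deliberately does not reuse the constant above.
    FREE_SHIPPING_THRESHOLD_CENTS = 20_00
    return cart_total_cents >= FREE_SHIPPING_THRESHOLD_CENTS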
In my experience DRY and many (any?) other coding principles are only problematic when misused. They're typically misused because the user doesn't understand the motivation or underlying value of the principle in the first place.
I think the example in the article does a bit of that as well. The example sets a deadline on a thing (a task or payment) by validating the deadline against the current time, and then presumably doing something else that isn't shown. The article argues that in the future a task might have different validation requirements than a payment, and they're only coincidentally the same today; so it would be foolish to abstract the deadline setting logic today. BUT, the reality is that the real coincidence is that payments and tasks have the same set of validations, not that the logic to validate a deadline is coincidentally the same. In my opinion "good" code would be fine to have separate set_task_deadline and set_payment_deadline methods, but only one validate_deadline_is_in_future (or whatever) method, alongside other validation methods which can be called as appropriate by each set_x_deadline implementation.
Disclaimer: the code is so short and trivial that it doesn't matter, I think we can all assume that this concept is extrapolated onto a bigger problem.
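A small sketch of the shape described above (the business-day check and the field names are invented for illustration): the setters stay separate and each composes whichever shared validators actually apply to it.

from datetime import datetime

# Shared, single-purpose validators: each one is a real rule, reused on purpose.
def validate_deadline_is_in_future(deadline):
    if deadline <= datetime.now():
        raise ValueError("Deadline must be in the future")

def validate_deadline_is_business_day(deadline):
    if deadline.weekday() >= 5:  # Saturday or Sunday
        raise ValueError("Deadline must fall on a business day")

# Separate setters: only coincidentally similar today, free to diverge.
def set_task_deadline(task, deadline):
    validate_deadline_is_in_future(deadline)
    task.deadline = deadline

def set_payment_deadline(payment, deadline):
    validate_deadline_is_in_future(deadline)
    validate_deadline_is_business_day(deadline)  # payments get an extra rule
    payment.deadline = deadline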
Reading a lot of this discussion I'm thinking whether DRY itself is the problem or it's more about mixing different (but perhaps comparable) things into one function (be it for the sake of appearing DRY or otherwise).
Rules of thumb are meant to be broken, they aren't laws.
The visualizations in Dan Abramov’s talk “The wet codebase” [1] really burned this concept in for me.
Seeing what a premature, wrong abstraction looks like visually was eye opening.
My rule of thumb is the third time I rewrite some code, I DRY it.
Like any rule this can be taken too far. It happens all the time. People like simple rules. They want everything to be like assembling IKEA furniture: no thought required, just follow the instructions. We all like it because it frees up the mind to think about other things.
There are rules like "don't stick your fingers in the plug socket". But, if you're an electrician, you can stick your fingers in the plug socket because you've isolated that circuit. DRY is similar. As a programmer, you can repeat yourself, but you should be aware that it's thoroughly unwise unless you know you have other protections in place, because you know why such a rule exists.
I suck in the kitchen. If you asked me to make you a sandwich, I would have to go to the cupboard or refrigerator a few times to end up with all the right ingredients. Then I could at least competently assemble the sandwich. My family also loves antipasto salads, which are basically just like a sandwich without bread.
If you asked me to assemble 10 different sandwiches, and 1 antipasto salad, some of which I'm seeing for the first time, I would attempt to gather all the ingredients, but ultimately end up going back and forth between the cupboard and refrigerator still. I might even think, on one of those trips, hey, I don't need the mayo anymore, so I can put it away, only to have to go back and get it again for a later sandwich. The end result would probably be all the ingredients for every sandwich on the counter at the same time, as I should have done from the start.
I'm pretty smart though. I'm good at Abstraction. So, I assume I'm going to get another order from the family for a sizable amount of sandwiches and some more antipasto salad. I name each sandwich and salad type and then write down a list of ingredients for each sandwich so I can cross-reference it to assemble a master list of all required ingredients per sandwich when the next order comes in. I can then go to the cupboard and refrigerator once.
I then order each sandwich type by their shared ingredients, so that I can apply ingredients only once until I'm done with that ingredient (and then I could put it away, but I'm not a premature optimizer). The only issue is that some ingredients require slicing, like tomatoes, and tomatoes aren't sliced in the same manner for the salad as the sandwiches, and my daughter can't stand when the tomatoes and lettuce touch on her sandwich, and my other daughter wants the cheese and the meat separate. I don't want to overcomplicate the problem, but I don't want to Repeat Myself either, since I know I can grab the tomatoes and slice them all up in the same step, so I need to remember when I assemble my list of ingredients per sandwich and salad that some are exempt from the ordered application of ingredients and must be handled by a single, separate script for assembly.
I run this process a few times, and it works, but I learn that it takes me 35 minutes to do, and that there's now a hard requirement on a frozen item involved with one of the sandwiches that it not be out for more than 10 minutes, so now this ingredient itself must be exempted from the step where I grab all ingredients and my assembly instructions for the one sandwich that involves this ingredient must be very clear that I will still also need to grab that ingredient.
Then I learn/realize:
In fact, 90% of the time I make a sandwich, or salad, I only make one at a time.
OR
Nobody wants to order sandwiches by name, they just want to give me a list of ingredients in the right order
OR
I am gradually making so many more sandwiches every day that my kitchen counterspace cannot support getting all the ingredients at once
OR
I only make the same sandwiches + one salad every day to the exact same specification
There can be lots of other factors that make a particular refactor more or less desirable. Is the code actually that long or not really, is it already complex or straightforward, was it written well in the first place, etc. Without seeing the particular code, people can jump to any conclusion or justify any bias towards or against any particular refactor including attempts at DRY.
My experience has been that the worst code was also the most poorly tested, if at all. In many cases, you can't really test the code without refactoring it, but you can't refactor it without risking regressions due to lack of tests.
Breaking this cycle requires going back to requirements, whether explicit or by painstakingly inferring every valid use case supported by the intentions of the original code, even if its defects meant it couldn't actually serve those use cases anyway and so cannot even act as a reference implementation for those use cases.
Once you've understood the intended behavior of the old code enough, you now have a test suite to use for any future code. This is usually the hard part [1], and it's going to seem that way because once you have it, finding the simplest code that passes all tests is just programming. Importantly, even if a future maintainer disagrees with you on the best solution, at least they can rewrite it without worrying about regressions against any tested case.
Aside: Performance regressions are more difficult to detect but preparing standard test workloads is a necessary part of that too.
After you ship this rewritten solution you're going to get user issue reports that you broke some edge case nobody has ever explicitly considered before but someone had somehow come to rely on. Now you're only adding a test and logic for that one edge case, you know that no other case was broken by this change, and that this case will never be broken again.
Now you have leverage over the complexity of the project instead of giving it leverage over you. Now you're free to refactor any which way you prefer, and can accurately judge the resulting code entirely on its own merits. You know you're comparing two correct solutions to the same problem, neither one is subtly hiding bugs or missing edge cases. Your code reviewers will also find it much easier to review the code because they can trust that it works and focus the review on its other merits.
[1] You know if your problem domain is an exception better than I do, like if you work on LLVM.
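A tiny sketch of that workflow, assuming pytest and a made-up legacy function: pin down the behavior the old code actually exhibits first, so any rewrite has to keep those cases green.

import pytest

# Made-up legacy function we want to rewrite safely.
def legacy_parse_amount(text):
    return int(text.replace(",", "").strip() or "0")

# Characterization tests: each case records observed behavior, including the
# odd edge cases someone may have come to rely on.
@pytest.mark.parametrize("raw, expected", [
    ("1,234", 1234),
    ("  42 ", 42),
    ("", 0),  # surprising, but preserved on purpose
])
def test_parse_amount_keeps_observed_behavior(raw, expected):
    assert legacy_parse_amount(raw) == expected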
I’ve always found that duplicating and editing over-DRY code is easier than fixing code that’s under-DRY. I strongly prefer working with people that care about DRY code and accidentally go too far than the reverse. Additionally, the worst problems I’ve had in inherited code have been due to duplication and insufficient abstractions leading to logical inconsistency.
This is one of my favorite principles: don't try to make the codebase "too DRY". I often see it cause stress and complexity in platform or infra teams trying to support large communities of developers.
It's right to be concerned that a bunch of teams might be a wasting time implementing the same thing slightly differently, or that you'll end up fixing the same bug across all these "copies".
Often that kind of duplication is good for the business though, and the platform team doesn't have the insight to see all the divergent requirements on the horizon. Letting the teams innovate separately without having to coordinate changes in all these DRY-ied up systems can be the best way to support them.
Or even more abstractly, zero codebase chaos is not the optimal amount.
But don’t stop telling new developers to be DRY, it’s really just a way to remind them they’re allowed to make functions.
Just step in when they go too far
This is what I found hard about Haskell. It's so tempting to DRY to such a phenomenal degree, but then the slightest requirement change somewhere in the middle breaks some abstraction and bubbles all the way out. It's almost like how a zip encoding of two files that differ by one byte could be completely different.
While the examples may be quite facetious, they often are for demonstrative purposes...
I myself have been burnt by Over-DRYed code - Over-DRYed code tends to lead to God Functions, where 1 change has many unintended consequences. Unintended side effects should be considered a worse outcome than a "tedious chore".
In any case, (at least in a API context) a robust test suite focused on end user requirements SHOULD make the mythical "code quality" metric irrelevant.
Just don't prematurely anything and write code that works. If you know how it works you automatically get an intuition what can be made better and where bottlenecks might be. Then you refactor it or just do a plain rewrite.
It's really that simple. (There are always exceptions obviously)
I thought it was Don’t Repeat Yourself more than three times.
Back in my ruby on rails days, we used to have a saying: Don't 'dry out' your code. I still find that to be a good saying to bring up when you're trying to find a good balance between repetitive code vs painfully generic code.
I'm not particularly bothered by having to change several places. The situations where I couldn't solve it with things like grep, find, sed have been very rare, and this kind of solution crops up every now and then.
One of the more common examples is generated code, where I prefer to put that on the side and then copy in the files to the appropriate place in the application and then update things like package declarations in bulk. It makes it harder to overwrite manually added comments or code by mistake.
Tools like ast-grep help too, when more sophisticated search-replace is required.
So I agree, at least in less mature systems DRY is a bad idea that causes a lock-in that will bite you at a later time. Often it's much harder to tear apart a DRY abstraction than creating one.
Yeah, premature DRY is a pet peeve of mine. Especially since the "size" of the code necessary to trigger DRY is totally subjective: some people apply DRY when they see similar blocks of code, others are so averse to repetition they start abstracting out native syntax.
Software development is so varied that blanket statements like this never work.
Never.
I prefer: do the thing when doing so reduces the expected cost of (time-discounted) future outcomes by more than the expected utility of the next best thing you can do now.
The problem with DRY occurs when it contravenes this principle - when deduplication is too expensive and/or unlikely to decrease the cost of future mutations enough to be worth it.
The proposed problem isn't a binary - that you should or shouldn't make the assumption yet - but rather that the assumption has a cost based on what you believe is likely to occur in the future and the value produced by making the assumption now needs to outweigh the cost.
Ah, yes, Google testing blog. From the same company shipping products recommending we eat "at least one small rock per day".
Upper management trying to drive software results by metrics is like trying to win a war with metrics unrelated to battle outcomes.
You must produce X number of tanks, your forces must fire Y bullets, you should minimize the number of retreats.
If you try to manage with no understanding of what's happening on the front lines (and upper management generally can't understand the front lines unless they've worked there recently), you're not going to win the war.
You can never make fast and hard rules about when to repeat yourself and when not to, it probably takes a lifetime to know when to do it correctly. I'm pretty certain programmers are not going to be out of work anytime soon.
Do we have (Python) codebase analyzers that detect subtle duplicates (bane of WET) and overly complex functions called from n places (bane of DRY)?
I have a sticker on my laptop of a yin-yang with DRY and YAGNI instead of the dots.
"premature optimization is the root of all evil"
That's an example for the wrong abstraction, not an example for "no DRY".
Checking if a date is in the future does actually make sense; I would not name it like that (it's more of a `raise_if_not_in_future`), but whatever:
from datetime import datetime

def check_if_in_future(date):
    if date <= datetime.now():
        raise ValueError("Date must be in the future")

def set_task_deadline(task_deadline):
    check_if_in_future(task_deadline)

def set_payment_deadline(payment_deadline):
    check_if_in_future(payment_deadline)
Can someone also write an article on how not to write code like the code in this article?
`DeadlineSetter` should not be a class, and besides that the implementation makes zero sense. The whole thing should probably just be a single if statement.
The example in the article is too short and incomplete to be meaningful. In a real program there would be something that does actual work for "tasks" and "payments" and deals with errors, providing context for technical decisions instead of forcing the choice of minimum complexity (i.e. two plain functions instead of a class) as the only applicable design guideline.
used once? don't worry, don't think about "what about the possibility it's repeated in the future?"
used twice? okay, maybe I will, maybe I won't
used three+? "don't be lazy ya bum"
I like simple rules, and I don't care if someone wants to turn it into a philosophical debate, I probably won't participate :)
I like the "rule of three" a lot when it comes to choosing when to DRY.
https://en.wikipedia.org/wiki/Rule_of_three_(computer_progra...
I think this is the best way to think about it:
Ask yourself, if some fact or functionality changes, in how many places would the code have to change?
If it's more than 1, you have a design problem. Of course, the solution does not at all have to be about DRYing.
The question then becomes: when do you break out code or not? Unfortunately (or fortunately if the art and craft of programming fascinates you), the answer is not easy. It seems to have to do with avoiding over-fitting or under-fitting the domain and purpose, with getting the best fit in a Bayesian Occam's Razor sense: minimizing unnecessary code, but also doing "Dependency Length Minimization" of the parse tree of your program so that it's maximally understandable and the abstractions increase the potential of your program to correctly interpolate into unknown future use cases. I reflect on some of these points here: https://benoitessiambre.com/abstract.html . It's about entropy minimization, calibration of uncertainty. It's about evolving your code so that it tends toward an optimal causal graph of your domain so that your abstractions can more easily answer "what if" questions correctly. These things are all related.
Like the article ends with, DRY goes hand in hand with YAGNI. The point isn't to build a million abstractions; it's to find the places where you have duplication and de-duplicate it, or where you know there'll be duplication and abstract it, or to simply rearchitect/redesign to avoid complexity and duplication. This applies to code, data models, interfaces, etc.
The duplication is typically bad because it leads to inconsistency which leads to bugs. If your code is highly cohesive and loosely coupled, this is less likely [across independent components].
And on this:
When designing abstractions, do not prematurely couple behaviors
Don't ever couple behaviors, unless it's within the same component. Keep your code highly cohesive and loosely coupled. Once it's complete, wall it off from the other components with a loosely-coupled interface. Even if that means repeating yourself. But don't let anyone make the mistake of thinking they both work the same because they have similar-looking interfaces or behaviors, or you will be stuck again in the morass of low cohesion. This is probably one of the 3 biggest problems in software design.
Libraries are a great help here, but libraries must be both backwards compatible, and not tightly coupled. Lack of backwards compatibility is probably the 4th biggest problem...
"Read" isn't quite the right word for code. "Decode" is better. We have to read to decode, but decoding is far less linear than reading narrative text. Being DRY usually makes decoding easier, not harder, because it makes the logic more cohesive. If I know you only fromajulate blivers in one place I don't have to decode elsewhere.
"Usually" being the keyword and what the article is all about IMHO. I work in a codebase so DRY that it takes digging through dozens of files to figure out what one constant string will be composed as. It would have been simpler to simply write it out, ain't nobody going to figure out OCM_CON_PACK + OCM_WK_MAN means at a glance.
I don't know the codebase, but to my mind that level of abstraction means it's a system-critical string that justifies the work it takes to find.
I mean, sure, I guess API urls could be system-critical. But generally, I prefer to grep a codebase for a url pattern and find the controller immediately. Instead, you have to dig through layers of strings composed of other strings and figure it out. Then at the end, you’re probably wrong.
Sorry, but this doesn't make sense. Why should system critical things be more difficult to understand? Surely you want to reduce room for error, not increase it?
Function calls, the essence of DRY, are only readable if what the called function does is well known and well understood.
When code is serial, with comment blocks to point out different sections, it is much easier to read, follow, and debug.
This is also a little bit of a tooling problem
One area I find DRY particularly annoying is when people overly abstract Typescript types. Instead of a plain interface with a few properties, you end up with a bunch of mushed-together props like { thing: boolean } & Pick<MyOtherObj, 'bar' | 'baz'> & Omit<BaseObj, 'stuff'> instead of a few duplicated but easily readable interfaces:
interface MyProps { thing: boolean; bar: string; baz: string; stuff: string; }
Am I crazy for almost exclusively just using type and sum types and no generics or interfaces and somehow being able to express everything I need to express?
Kind of wondering what I'm missing now.
Hmm, you can do pretty nice things with generics to make some things impossible (or at least fail on compile), but I agree it’s hardly readable. In some cases you need that though.
I was just mulling this over today. DRY = easier-to-decode is probably true if you're working on grokking the system at large. If you just want to peek in at something specific quickly, DRY code can be painful.
I wanted to see what compile flags were used by guix when compiling emacs. `guix edit emacs-next` brings up a file with nested definitions on top of the base package. I had to trust my working memory to unnest the definitions and track which compile flags are being added or removed. https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages...
It'd be more error prone to have each package using redundant base information, but I would have decoded what I was after a lot faster.
Separately, there was a bug in some software aggregating cifti file values into tab separated values. But because any cifti->tsv conversion was generalized, it was too opaque for me to identify and patch myself as a drive-by contributor. https://github.com/PennLINC/xcp_d/issues/1170 to https://github.com/PennLINC/xcp_d/pull/1175/files#diff-76920...
Bazel solves this exact problem (coming from its macro system) by allowing you to ask for what I term the "macroexpanded" BUILD definition using `bazel query --output=build //some/pkg/or:target`. When bazel does this, it also comments the file, macro, and line number the expanded content came from for each block.
This gives us reuse without obscuring the real definition.
I automated this in my emacs to be able to "macroexpand" the current build file in a new buffer. It saves me a lot of time.
Does it? Every time I see DRY'd code, it usually makes the project it's in more difficult to understand. It's harder to understand where values come from, where values are changed, what parts of the codebase affect what. And that's before trying to figure out where to change something in the right place, because it's often unclear what other parts of the code are coupled to it through all the abstractions.
At a high level, at first glance, the code might look good and it "makes sense". But once you want to understand what's happening and why, you're jumping through five different classes, two dozen methods and you still don't know for sure until you run a test request against the API and see what shows up where in the debugger. And you realize your initial glimpse of understanding was just window dressing and actually nothing makes sense unless you understand every level of the abstractions being used.
It's suddenly a puzzle to understand another software developer instead of software engineering.
An IDE can help a lot. Coming from Perl, everything you said was true. I wanted everything in one file as much as possible, and breaking tasks off into functions just meant I had to jump around to try and rebuild the flow in my head. I spent so much time inside the debugger since reading the code would only go so far.
Now I work in C#, we have a lot of classes with a few functions, a lot of helper functions. Doesn't matter since it's so easy to use the tooling to build a mental picture - let alone refactor it in an instant if that variable name feels a bit off, or we think a function is not used (such things were always a risky exercise in Perl).
We refactored one insurance based project to use generic base classes extensively since all insurance shares some attributes and features - this really helped cut down complexity of changes and overall just reduced code on the screen to sift through. I had a lot of fun doing this, I'm a weirdo who almost likes deleting code more than writing it. Once you hit the lowest level it is a little less intuitive due to being generic but at the higher levels we mostly work at, it's simpler, and rolling out a new product we get a lot of stuff for free. They got a long way copy-pasting the product logic (4 or 5 product lines) but at this point it made sense to revisit, and I sneak a bit more in each time I have a change to do.
Well, "read" is still the verb we use most often to describe a human interpreting code. Also, many information-dense books are not intended to be read linearly, yet we still say we're "reading" (or "studying") the book.
Visually parse.
Readability doesn't matter much when you have 10,000+ lines of code. You aren't going to read all that code, and new code introduced by other people continuously isn't something you can keep track of, so even if you understand one tiny bit of code, you won't know about the rest. You need a system of code management (documentation, diagram, IDE, tests, etc), to explain in a human-friendly way what the hell is going on. Small chunks of code will be readable enough, and the code management systems will help you understand how it relates to other code.
I think this is where AI could be helpful in explaining and inspecting large codebases, as an assist to a developer.
Maybe but hallucinations become a real problem here. Even with publicly available API's that are just slightly off the beaten path, I've gotten full-on hallucinations that have derailed me and wasted time.
That's a great point. Everyone lauds the benefits of chatgpt/copilot in generating new code, but I'm starting to learn that the places they might shine are onboarding onto projects and preliminary code reviews. What LLMs excel at is context, and they should excel in activities where context-awareness is key.
10KLoC is a very small app. Ours isn't that big and it's 140KLoC and I have read almost all of it.
To be fair, not all lines of code are equal. A project with a state machine, commands, strategy patterns, etc requires an awful lot of repetitive boilerplate.
A number-crunching app or a data processing pipeline packed with spaghetti business logic is far harder to read.
And that is why KLoC is a very piss poor metric.
Yeah, good thing we're not a java shop...
As someone who has read 10,000+ lines in order to track down surprising behavior in other people's code, I can say without a doubt that readability still matters at that scale.
Code management systems can sometimes be helpful, but they are no substitute.
Ravioli code is a real problem though. Saying small chunks are readable is not enough. The blast radius of a five byte change can be fifteen code paths and five million requests per hour.
Even if you’re not going to read 10,000+ lines, if the few you read are easy to understand you’re still going to have a much better time maintaining the codebase.
You got it entirely backwards. Readability becomes far more important with the size of your project.
When you get a bug report of a feature request, you need to dive into the code and update the relevant bits. With big projects, odds are you will need to change bits of the code you never knew they existed. The only way that's possible is if the code is clear and it's easy to sift through, understand, and follow.
That system of code management is the code itself. Any IDE supports searching for references, jump to definitions, see inheritance chains, etc. Readable code is code that is easy to navigate and whose changes are obvious.
Readability is almost always (almost only because there are some rare exceptions) the most important thing to me, even for low-level systems software. I always ask myself, “If I don’t touch this code for a year and then come back to it, how long will it take me to understand it again? How long will it take someone who’s never been exposed to this code to understand it?”
Luckily, our compilers and interpreters have gotten so good and advanced that, in 95%+ of cases, we need not make premature “optimizations” (or introduce hierarchies of “design patterns”) that sacrifice readability for speed or code size.
Was reading 1978 Elements of Programming Style a while ago. It's mostly Fortran and PL/I. Some of it is outdated, but a lot applies today as well. See e.g. https://en.wikipedia.org/wiki/The_Elements_of_Programming_St...
They actually have a Fortran example of "optimized" code that's quite difficult to follow, but allegedly faster according to the comments. But they rewrote it to be more readable and ... turns out that's actually faster!
So this already applied even on 197something hardware. Also reminds me about this quote about early development of Unix and C:
"Dennis Ritchie encouraged modularity by telling all and sundry that function calls were really, really cheap in C. Everybody started writing small functions and modularizing. Years later we found out that function calls were still expensive on the PDP-11, and VAX code was often spending 50% of its time in the CALLS instruction. Dennis had lied to us! But it was too late; we were all hooked..."
And Knuth's "premature optimisation is the root of all evil" quote is also decades old by now.
Kind of interesting we've been fighting this battle for over 50 years now :-/
(It should go without saying there are exceptions, and cases where you do need to optimize the shit out of things, after having proven that performance may be an issue. Also at scale "5% faster" can mean "need 5% less servers", which can translate to millions/dollars saved per year – "programmers are more expensive than computers" is another maxim that doesn't always hold true).
The old salty professor who taught numerical physics at my uni insisted that function calls were slow and that it was better to write everything in main. He gave all his examples in Fortran 77. This was in the 2010s...
In fact he is right. The advantage of writing modular code, however, is that we can test the locations where performance is needed and optimize later. With a big main it becomes very hard to do anything complex.
Was he, though? I mean, yeah having to push and pop a call stack does indeed require more work than not having to do that. However, compilers can and do inline and optimize out function calls.
And what's the real performance impact of calling functions a constant number of times outside of the hot path? Is an untestable spaghetti salad of things better than a few hypothetical push and pops?
There's wisdom behind Knuth's remarks on premature optimization.
This is why I liked it when the language I was coding in supported inline expansion: I could keep my code modular but nevertheless avoid the penalty of function calls in performance-critical functions in the compiled code.
The one gotcha with optimizing for “readability” is that at least to some extent it’s a metric that is in the eye of the beholder. Over the years I’ve seen far too many wars over readability during code review when really people were arguing about what seemed readable *to them*
This is the reason I refuse to use the word "clean" to describe code anymore. It's completely subjective, and far too many times I've seen two people claim that their preferred way of doing things is better because it's "clean", and the other's way is worse because it's "less clean", no further justification added. It's absolutely pointless.
There are a lot of topics in software development where everyone can agree that X is correct. However, *defining* X gets into subjective arguments. And yep, readability and clean code are both in that category.
I am of the opinion that code should be written to be readable. The rest of the desirable properties are just side effects.
Most commonly, code should be optimized for being easy to change.
That's almost entirely coincidental with being easy to read. But even easiness to read is a side effect.
I agree with this. Easy to change often means good tests too.
I worked in Perl. Yes it has a reputation for being hard to read, but that was not the problem. Our scripting was pretty basic and easy to read. It's the loose typing, the runtime evals, the lack of strict function parameters, no real IDE, “Only perl can parse Perl” - the fact you can load a module from a network share at runtime, import it, and call a function, based on a certain run flag - and so on. Refactoring was always a mine field and there was a lot I wanted to do in my old job but could not justify it due to the risk.
Fully agree. I think this is something that takes some time/experience to appreciate though. Junior engineers will spend countless hours writing pages of code that align with the “design patterns” or “best practices” of the day when there’s a simpler implementation of the code they’re writing. (I’m not saying this condescendingly—I was once a junior engineer who did that too!)
It’s impossible to know what “good” looks like when you’re new and haven’t seen a few codebases of varying quality and made some terrible mistakes
I think it's fair to say that between behavior and maintainability, one is inflexible and the other hangs from it in tension.
"Side effects" are not the same as "less important traits."
Side effects are usually unrelated or unwanted.
I have a pessimistic view that ultimately the only best practices that matter are the ones your boss or your tech lead likes.
What about when you are the boss or tech lead?
Then the only best practices that matter are the ones that your team believes are correct
The best practices are the ones that allow you to do business and where the maintenance work is relatively not too painful considering the budgeted development time.
Your task is to deliver a good product, not necessarily good code.
The problem is that even that, in concrete terms, can be controversial. Everyone wants to minimize maintenance work; not everyone agrees on what kind of code will achieve that.
DRY is IMHO a maintenance thing.
If "I don't want to maintain three copies of this" is your reaction unifying likely makes sense.
But that assumes the maintenance would be similar which is obviously a big assumption.
DRY often gives you the wrong or a leaky abstraction and creates dependencies between sometimes unrelated pieces of code. It’s got tradeoffs rather than being a silver bullet for improving codebases.
Having 0% DRY is probably bad, having 100% DRY is probably unhinged
you are using it wrong
https://news.ycombinator.com/item?id=40525064#40525690
Yes. Especially at the beginning when it's critical to ensure that the logic is correct.
You can then go back and DRY it up while making sure your unit tests (you did write those, right?) still pass.
PS: same applies to "fancy" snippets that save you a few lines; write it the "long way" first and then make it fancy once you're sure it runs the way it's supposed to
not gonna happen once merged
aka "Engineering is about trade-offs"
Agree and would add that software projects also run through different phases in their lifespans with each phase having their own objectives [1].
So while - as you say - best practices can be at odds with each other - dev teams might be following both over time, just prioritizing one in some phase while completely disregarding it during another.
[1] E.g. the UI of the actual product might pivot multiple times at phase 1 because the product has yet to find its niche or core offering. While at a later stage the focus might be on massive scaling, either in numbers of devs or rolling out the product in new jurisdictions. Other phases might be a maintenance one, when an "offshore" team is given ownership or a sundown of an application.
Maintenance is 90% of a project's lifetime. Sometimes those "best practices", rigidly implemented, mean the project won't live to see even its 1st birthday.
Imo the very best approach is a codebase that's small enough that you can just do chunky refactors every so often rather than building in extensibility as a "thing". Not applicable to all problem spaces (I'd hate to do this for UI code), but for a lot of stuff it works really nicely.
For me this often looks like an external DSL/API that stays relatively constant (but improving), with guts that are always changing.
I place copy-pastability somewhere into those priorities too :)
I agree and would add that one of the goals of technical design or architecture work is to choose the architecture that minimizes the friction between best practices. For example, if your architecture makes cohesion decrease readability too much, then perhaps there is a better architecture. I see this tradeoff pop up from time to time at my work, for example when we deal with features that support multiple "flavors" of the same data model: then we have either a bunch of functions for each, providing extensibility, or a messy root function that provides cohesion. In the end, both best practices can be supported by using an interface (or similar construct depending on the language), in which cohesion is provided by logic that only cares about the interface and extensibility is provided by having the right interface (offloading details to the specific implementations).