One of the things I like to talk about with my teams is the goal of getting a new line of code into production as quickly, as safely, and as cheaply as possible.
The first part of the goal – getting code into production quickly – is something every team strives for.
The second part? In my experience, it tends to be eclipsed by the first part. After all, everyone is under pressure to get the feature they’re working on into people’s hands, and there’s always a backlog of work to do that’s longer than the time available to do it in.
That third part, about the cost? It’s very seldom considered.
But for any organisation, cost is a vital part of software development. More importantly, “cost” is something that needs to be measured at the team, organisation, and company levels: looking at just one isn’t enough. Allow me to explain…
I like to loosely define safety as “confidence that a change doesn’t break anything”. That doesn’t mean that the change is entirely perfect; after all, defects always slip through. It just means that we have a sense of confidence that a change won’t make our systems worse. It’s not a perfect description, but it helps to guide some of my thinking.
This definition also leaves open the question of “what can break?” with a change.
There are the obvious things that spring instantly to mind, such as the feature not working at all, or having unforeseen and undesirable side-effects (using too many resources, accidentally wiping data, or being slower than is useful, for example). There are also less obvious concerns, such as how we interact with other parts of the system, or how changes in APIs may cause code that depends on ours to fail to compile. If you’re using services (micro, macro, I don’t mind. Whatever makes you happy), then the contracts between those services are also places where we can expect – and frequently find – breakages.
The worst possible time to find out about breakages is in production. For applications and services, what “production” means is clear. For libraries and shared utilities, “production” may be the point where someone else takes a dependency on that code (that is, when someone updates the version of the library to the recently released version and tries to recompile). Depending on how frequently dependencies are updated, there may be a lag of months before there’s proper confidence that a change is safe.
For all changes, it’s wise to rely on some level of automated testing. Pull the update in, compile and run any small tests, deploy if necessary to some environment, and then run the medium and large tests. If we’re in a single company with many repositories, it may be possible to identify other repos that depend on the artefacts we’re producing, and to “grind and fix” each of them with the latest change.
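That tiered workflow can be sketched in a few lines. This is a hypothetical illustration, not any particular CI tool: each “tier” here is a list of callables standing in for real commands (compiling, running small tests, deploying, running the large suites), and we stop at the first tier that fails, because that’s the cheapest possible feedback.

```python
from typing import Callable, Dict, List

def run_tiers(tiers: Dict[str, List[Callable[[], bool]]]) -> str:
    """Run test tiers cheapest-first; stop at the first failing tier."""
    for name, checks in tiers.items():
        if not all(check() for check in checks):
            return f"failed at: {name}"  # shortest feedback loop wins
    return "all tiers passed"

# Example run: the medium tier fails, so the expensive large tests
# never execute at all.
result = run_tiers({
    "compile":      [lambda: True],
    "small tests":  [lambda: True],
    "medium tests": [lambda: False],
    "large tests":  [lambda: True],  # never reached
})
print(result)
```

In practice each callable would shell out to your build tool; the point is the ordering, cheapest checks first.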
So far, we’ve been considering this from a pure engineering perspective, but now we need to don the hat of some kind of manager-crossed-with-an-accountant, and consider costs. How do we make sure our changes are delivered quickly, safely, and at as low a cost as possible?
Back in the Old Days, we used to talk a lot about the cost of change curve, which posits that finding and fixing issues earlier in the development life cycle is cheaper than doing so later on. I think that’s an axiomatic truth, even if the exact details might be something we can quibble over. Put another way, the longer the feedback loop, the more expensive it is to react to the results of that feedback loop; shorter feedback loops are cheaper.
With a compiled language, the earliest point we can get feedback about a change is at compile time. Change an API, and the code won’t even compile. Magical!
The next cheapest way is to run the tests in our repo. Assuming those tests pass, we then need to publish snapshots and coordinate changes across multiple downstream repos (pulling in the snapshot, recompiling, and running all those other tests). Of course, each of those downstream projects needs to be updated and tested in a specific order: the repos form a graph of dependencies, and we need to follow that graph, so each repo tested needs to publish more snapshots that can be consumed further down the line, and so on, and so on.
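Following that graph is a topological ordering problem. Here’s a minimal sketch using Kahn’s algorithm, with invented repo names: each repo is rebuilt only after everything it depends on has published a fresh snapshot, and a cycle between repos (which would make the coordination impossible) is detected and reported.

```python
from collections import deque

def update_order(deps):
    """Order repos so each is rebuilt only after its upstreams.

    deps maps repo -> set of repos it depends on (hypothetical names).
    """
    indegree = {repo: len(d) for repo, d in deps.items()}
    dependants = {repo: set() for repo in deps}
    for repo, d in deps.items():
        for upstream in d:
            dependants[upstream].add(repo)
    # Repos with no dependencies can be rebuilt immediately.
    ready = deque(sorted(r for r, n in indegree.items() if n == 0))
    order = []
    while ready:
        repo = ready.popleft()
        order.append(repo)
        for downstream in sorted(dependants[repo]):
            indegree[downstream] -= 1
            if indegree[downstream] == 0:
                ready.append(downstream)
    if len(order) != len(deps):
        raise ValueError("dependency cycle between repos")
    return order

# Hypothetical graph: 'app' depends on 'service'; both depend on 'lib'.
print(update_order({"lib": set(), "service": {"lib"}, "app": {"service", "lib"}}))
```

Trivial for three repos; the coordination cost is in doing this across dozens of repos, each with its own publish step.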
The Apache folks tried this with Gump, for Ant, Maven, and other build tools they maintain. Gump “builds and compiles software against the latest development versions of those projects”. It is relatively limited in scope, but it’s already pretty complicated. It’s not a cheap thing to do. Coordinating between the Apache projects is done on a “best efforts” basis, rather than being something that’s mandated, which mirrors what happens in organisations – if you identify something that needs fixing in someone else’s repo, often you have to report it as an issue rather than diving in to fix it yourself. I’m sure we’ve all experienced how slow that process can be.
Attempting to detect and follow the graph of dependencies between repos in a company would be challenging, especially if the dependencies are indirect (for example, if a URL for something is hard-coded somewhere, and that’s how the dependency between components is expressed).
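Even the crudest version of this detection is a heuristic, not a guarantee. As a sketch, here’s a tiny scanner that greps file contents for hard-coded service URLs – the kind of implicit dependency no manifest will reveal. The file names and hosts are made up, and the files are in-memory strings to keep the example self-contained.

```python
import re

# Matches http(s) URLs and captures just the host part.
URL_PATTERN = re.compile(r"https?://([a-z0-9.-]+)")

def implicit_deps(files):
    """Map each file to the set of hosts it hard-codes."""
    found = {}
    for name, text in files.items():
        hosts = set(URL_PATTERN.findall(text))
        if hosts:
            found[name] = hosts
    return found

# In-memory stand-ins for repo files (names and hosts are invented).
repo = {
    "config.py": 'BILLING_URL = "https://billing.internal.example/api"',
    "readme.md": "No URLs here.",
}
print(implicit_deps(repo))
```

And this only finds URLs that are literal strings: anything assembled at runtime slips straight past it, which is exactly why the real problem is so hard.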
The complexity and cost of building and maintaining infrastructure to test and detect this has to be factored into the cost of making the change. You might take a shortcut, and say that you’re only interested in specific downstream consumers of your change, but even then, there’s a cost to be borne, and it’s higher than making a change in a single repo. How come? Because there’s more coordination to manage, and longer feedback loops. As I’ve already mentioned, the inference from the cost of change curve is that the longer feedback loop is more expensive.
In the “farm or grind” blog post, the missing first step is “find out where the changes need to be made”. In the post, Jesse says, “you use a combination of GitHub search, ripgrep and zoekt to find the impacted codebases”, which sounds like something that might work for a majority of cases, but I’m also confident that things would be missed (if, for example, the repos weren’t public, or accessible to the person making the change). Worse, you’ve still got to figure out the graph of dependencies between repos to increase the safety of the change. It ain’t cheap.
So, how do we reduce the cost of building our confidence?
Co-locating code helps an awful lot. Running “ripgrep and zoekt” over a single repo is cheaper than doing so over dozens. Taken to an extreme, this leads you to a monorepo (Yay! Monorepo! Yay!), but there may be perfectly sensible reasons why that’s impractical. In any case, reducing the number of repos reduces the cost of a change. The downside is that the cost of a change becomes more readily visible, and the visibility of future pain seldom excites developers; from the perspective of the organisation as a whole, though, the cost of the change is lower.
A second strategy is to reduce the number of dependencies and, where that’s not possible, to have clear and explicit tests in each repo that describe the contract between dependencies. Nat Pryce talks about simplicators, Eric Evans about anti-corruption layers, and Alistair Cockburn introduced the world to hexagonal architectures. All of these help provide that insulation and isolation.
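To make the simplicator idea concrete, here’s a minimal sketch: a thin interface that narrows a third-party dependency down to exactly what our code needs, an in-memory fake, and a contract check that both the fake and the real adapter should pass. All the names here (`PaymentsClient`, `FakeGateway`) are hypothetical.

```python
class PaymentsClient:
    """The narrow interface our code depends on (the simplicator)."""
    def charge(self, pennies: int) -> bool:
        raise NotImplementedError

class FakeGateway(PaymentsClient):
    """In-memory stand-in used by our own tests."""
    def __init__(self):
        self.charged = []
    def charge(self, pennies: int) -> bool:
        if pennies <= 0:
            return False
        self.charged.append(pennies)
        return True

def check_contract(client: PaymentsClient) -> bool:
    """The explicit contract: positive amounts succeed, non-positive
    amounts are refused. Run this against the fake *and* the real
    adapter, so the two can never silently drift apart."""
    return client.charge(250) and not client.charge(0)

print(check_contract(FakeGateway()))
```

The contract test is the important part: it’s the executable description of the dependency that lives in your repo, so a breakage shows up in your tests rather than in production.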
Put another way, the looser the coupling between repositories in an organisation, the cheaper a change in one is likely to be, since it’s less likely to affect the others. Conversely, tight, implicit coupling between repositories is an argument for merging those repos — a change in one is very likely to require a change in another, and inter-repo testing is expensive.
A third strategy is to use a modern build tool which understands the build graph within a single repository, supports caching, and can identify the subset of targets that need to be rebuilt for each change. Right now, I advocate for something like Bazel to support this, but really any tool that properly supports caching, avoids unnecessary rebuilds, and that you and your team are happy with is a great choice.
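The core computation such a tool does is simple to sketch: given the build graph and a changed target, find every target that transitively depends on it, because that’s the subset that needs rebuilding. Bazel exposes this via queries like `bazel query "rdeps(//..., //lib:core)"`; the toy version below uses invented target names.

```python
def affected(graph, changed):
    """Return every target that must be rebuilt when `changed` changes.

    graph maps target -> set of targets it depends on.
    """
    # Invert the graph: for each target, who depends on it?
    reverse = {t: set() for t in graph}
    for target, deps in graph.items():
        for dep in deps:
            reverse[dep].add(target)
    # Walk the reverse edges from the changed target.
    dirty, stack = set(), [changed]
    while stack:
        target = stack.pop()
        if target in dirty:
            continue
        dirty.add(target)
        stack.extend(reverse[target])
    return dirty

# Hypothetical build graph. Note //lib:extra is untouched by a change
# to //lib:core, so it (and its cached outputs) can be skipped.
graph = {
    "//lib:core": set(),
    "//lib:extra": set(),
    "//service:api": {"//lib:core"},
    "//app:main": {"//service:api", "//lib:extra"},
}
print(sorted(affected(graph, "//lib:core")))
```

Everything outside the affected set is served from cache, which is where the cost saving comes from.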
Finally, we need to be conscious that someone has to pay the cost of each change. As an engineer on a team, the smaller the repository, the cheaper the change appears to me. However, all we’ve done is distribute, delay, and escalate the cost of validating a change, because we’ve extended the feedback loop all the way to production. So, while our cost appears reduced at the team level, the cost to the company is larger.
Worse, I don’t believe that the cost to the team is really as small as it appears. Bug reports arriving from other teams many months after a feature has landed sit later on the cost of change curve, and so are more expensive to fix. On top of that, the context for the change is no longer readily available, so it takes extra engineering effort to respond properly to those bug reports and change requests when they do eventually arrive.
Does this all mean that a single repository with everything in it is “cheap”? Absolutely not. It’s astonishingly expensive, and the tooling required is cutting edge. However, the alternative is more complex, requires an array of tooling that doesn’t even exist yet, and has lengthened feedback loops. It’s definitely not cheaper in the long run.