If you’ve worked in software, you’re almost certainly familiar with the term technical debt. There are many ways of classifying technical debt: essential vs. accidental, short-term vs. long-term, deliberate vs. inadvertent, whether it is caused by state, control or volume of code. These are all useful constructs.
However, one of the key questions around technical debt is knowing when to pay it down. To answer that, the most useful breakdown is by how the technical debt manifests itself. Once you know that, it becomes a lot easier to decide which debt to “pay down”.
Before we start with that breakdown, let’s choose a definition of technical debt:
Technical debt is any potential technical improvement where the cost of doing the work now would be less than the cost of waiting to do it in the future. (Me)
This definition is admittedly somewhat vague, and there are other, better ways of defining technical debt, but for the purposes of this blog post, this definition will suit us well. Importantly, this definition has two attributes:
- It maps quite well to the financial debt metaphor. If you owe a debt, paying it now will cost less (in absolute terms) than paying it in the future, since as long as it is outstanding, it incurs interest.
- It is broad enough to encompass improvements that don’t stem from fixes. In many people’s minds, technical debt is something bad that needs fixing. In reality, we might also want to include improvements that aren’t necessarily fixes to bad things, because even good things can be improved.
So let’s break down how technical debt can manifest itself. While all technical debt incurs cost (interest) over time, that cost can manifest in different ways:
- Deferred Cost: Cost that you don’t pay now, but that builds up over time, so that the eventual pay-down is quite large.
- Active Cost: Cost that you pay now and on an ongoing basis until you pay down the debt. This can be further split into:
  - Interrupts: Ongoing maintenance work that is required whether or not the software is being modified.
  - Inertia: Something that makes it slow to understand, modify, and improve an existing system.
Let’s dig into each type, but for the purposes of this post, I’m going to make a few assumptions:
- You are working on a product that is actively being used, but still heavily evolving. For example, a startup with early product-market fit but a product that isn’t mature yet.
- Your user base, product surface area, and team are all growing over time.
Under different assumptions, you might arrive at different conclusions below, but the framework/breakdown should still apply.
There are two types of active cost to technical debt. The first is interrupts. Interrupts are costs that occur whether or not you are trying to modify or evolve your code. A few examples:
- An operational burden requiring manual intervention, like a server going down and requiring a restart, infrastructure that has to be scaled manually, etc.
- Bugs or obscure product behavior that require debugging to explain. For instance, a user sees their billing account in some state that they don’t understand, and someone has to trace down what happened to explain it to that user or whoever is interfacing with that user.
Interrupts are generally the worst kind of technical debt. For one, because they “interrupt” regular workflows, they require context-switching, which has high cost. More importantly, they are really bad for morale. Outside of cultural problems, very few things can burn a team out more quickly than a constant stream of interrupts (and often, cultural problems and a high rate of interrupts tend to be correlated and self-reinforcing). Having a high level of interrupts is usually a sign that a team or company is unhealthy, but it also decreases health over time, so it creates a pretty bad cycle. Interrupts should usually be addressed as quickly as possible compared to other forms of technical debt.
The second type of active cost is technical inertia. Inertia means something is hard to change or move. Many things can cause inertia:
- Slow workflows: Like slow-running tests, difficult deployments or migrations, etc.
- Poor understandability / high complexity: Code that is difficult to understand and/or modify has cost since it slows down anyone who is trying to read or modify it.
- Lack of consistency: Similar to understandability. When code lacks consistent patterns and structure, the patterns and structure themselves get in the way of understanding or changing the logic, which is what really matters. It also slows down code review (“I know the method above does X but please don’t do X because we’re moving towards Y and we don’t want more X in the code base”).
Unlike interrupts, inertia doesn’t cost too much if a piece of software is well-abstracted and doesn’t need to be frequently changed/modified. It’s like a messy dresser drawer: it’s a pain if you frequently need to put things in or take them out, but if you don’t, it’s pretty much out of sight.
You might think that because both types of active cost (interrupts and inertia) are highly visible, they would tend to get prioritized for improvements. Counterintuitively, I’ve often found the opposite to be true. Active cost is often ignored because:
- We tend to normalize bad things. If you’re used to your test suite being slow, or a piece of code being difficult to navigate, the amount of annoyance tends to fade over time.
- People who have been on the team longer have either normalized the bad things, found hacky work-arounds, or are just carrying more context around. Newer team members, on the other hand, are easily overwhelmed and slowed down. But newer team members have less authority / confidence to propose/implement fixes.
- A small number of long-tenured team members have all the context, and they either subconsciously enjoy being the hero who is needed to fix something or get work done, or are too burnt out / overloaded to push for improvements.
Deferred cost is a little different than active cost, and probably best illustrated by an example. Let’s say you’re on a certain version of a third-party library and know that you will need to upgrade that version eventually. There’s nothing wrong with the current version. The newer version offers some new features, but you don’t need those new features. So upgrading immediately doesn’t create direct value.
However, the new version is not backwards compatible. If you migrate now, it’ll cost one engineer a week. If you migrate in a year, it’ll cost one engineer a month, because the code base has grown and there is more to migrate and test.
In other words, this piece of debt has no active cost, but it does have a deferred cost. The longer we wait, the larger the eventual cost to pay it down. The interest accrues.
Some examples of causes of deferred cost:
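To make the “interest accrues” idea concrete, here is a minimal sketch of the library-upgrade example above. It assumes, purely for illustration, that the migration cost grows by a constant factor each year and that a month is roughly four working weeks. The function name `deferred_cost` and its parameters are hypothetical, not a real estimation model.

```python
def deferred_cost(cost_now_weeks, yearly_growth, years_waited):
    """Estimated migration cost in engineer-weeks after waiting `years_waited`.

    Assumption: the cost compounds by `yearly_growth` per year as the
    code base grows. Real growth is rarely this regular.
    """
    return cost_now_weeks * yearly_growth ** years_waited


# One engineer-week today, roughly one engineer-month (4 weeks) in a year.
cost_today = deferred_cost(1, 4, 0)
cost_in_a_year = deferred_cost(1, 4, 1)

# The "interest" is the extra work created purely by waiting.
interest = cost_in_a_year - cost_today
print(f"today: {cost_today} weeks, in a year: {cost_in_a_year} weeks, interest: {interest} weeks")
```

With these made-up numbers, waiting a year adds three engineer-weeks of work, even though nobody felt any cost in the meantime.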
- Growth in the size of your data (particularly relevant for changes that might require a data migration).
- Growth in the size of your codebase and increases in product complexity.
The argument to fix tech debt with deferred cost is obvious: pay a smaller cost today to avoid a larger cost in the future. However, there are also many reasons to not do the work immediately:
- The opportunity cost of fixing technical debt today might be higher given the size of your team and where your product is. For example, if you’re a small team with three engineers, having an engineer (i.e. a third of your team) spend one month on a technical project today is a big deal and will severely slow down your product roadmap. On the other hand, if you anticipate having a team of ten or fifteen engineers in a year, having one engineer spend two months has a lower relative cost.
- You might have people with more advanced/specialized skill-sets in the future. For instance, a database migration might be easier to pull off once you’ve hired engineers with deep database experience.
- Migrations might subsume each other. Migrating now might mean you migrate from A to B today, only to migrate from B to C tomorrow. (This can cut both ways, though. While migrating early might mean you do the work twice, if you defer too long the delta can become so big that the migration is nearly impossible, e.g. it’s probably easier to do X small migrations to keep up with Python 3.X than to do one major upgrade all at once.)
- Related to the above, the longer you wait, the more information you have. For instance, you may have some data model that needs to be migrated, but its ultimate use in the product is unclear, or your understanding of the domain may not be fully developed. Waiting makes it more likely that you redesign it correctly. In other words, you may want to avoid doing work prematurely if you don’t know the “end state”.
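The team-size argument in the first bullet above is just arithmetic, so here is a rough sketch of it. The helper `share_of_capacity` and the choice of a three-month planning window are assumptions made for illustration; the team sizes and durations are the ones from the example.

```python
from fractions import Fraction


def share_of_capacity(work_engineer_months, team_size, window_months=3):
    """Fraction of the team's total capacity in the window that the work consumes.

    Assumption: capacity is simply team_size * window_months engineer-months.
    """
    return Fraction(work_engineer_months, team_size * window_months)


# Today: one engineer-month of work on a three-person team.
today = share_of_capacity(1, 3)

# In a year: the work has doubled to two engineer-months, but the team has grown.
later_ten = share_of_capacity(2, 10)
later_fifteen = share_of_capacity(2, 15)

for label, share in [("today", today), ("10 engineers", later_ten), ("15 engineers", later_fifteen)]:
    print(f"{label}: {float(share):.1%} of the quarter's capacity")
```

Under these assumptions the absolute cost doubles, yet it eats a smaller slice of the quarter: roughly a ninth of capacity today versus well under a tenth once the team has grown.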
There are a few cases where paying down debt with deferred cost is pretty critical.
The first is invisible walls. Invisible walls are problems you run into suddenly. One way to think about this is that when you “borrow” from the technical debt overlords, you’re not borrowing from the friendly community bank down the street with a clear repayment schedule. You’re borrowing from Stewie Griffin, who’s going to show up randomly while you’re getting out of the shower to demand you pay him back.
Invisible walls can take many forms, but here are some examples:
- Scaling inflection points. For instance, you can only scale your database up vertically to a point, after which you might need to completely rearchitect your data storage. Ideally, you’re not trying to rearchitect your storage after you’ve maxed out your other options.
- Forced updates of library dependencies. Going back to our earlier example of an inessential third-party library update, imagine that suddenly the old version of the library requires updates, e.g. for a vulnerability that’s not easy to patch.
Granted, you can usually predict these invisible walls if you’re paying enough attention, but it’s not always easy. That said, if you anticipate running into an invisible wall with some piece of technical debt, it’s wise to pay it off before you do so. The team at Notion wrote a great blog post about how they sharded their database, and how having a head start helped avoid disaster.
A second failure mode is intractable migrations. The future cost of a migration can balloon super-linearly. If you wait too long to update a third-party library, you might accumulate other third-party libraries that depend on that version, and at some point, your dependencies are too tangled to easily break down any update into workable chunks.
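One way to see how a library upgrade becomes tangled is to count everything that transitively depends on the old version. The sketch below is illustrative only: the package names and both dependency graphs are invented, and `transitive_dependents` is a hypothetical helper, not part of any packaging tool.

```python
from collections import deque


def transitive_dependents(reverse_deps, package):
    """Return every package that directly or indirectly depends on `package`.

    `reverse_deps` maps a package name to the packages that depend on it.
    """
    seen = set()
    queue = deque([package])
    while queue:
        current = queue.popleft()
        for dependent in reverse_deps.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    seen.discard(package)  # guard against cycles leading back to the start
    return seen


# Hypothetical snapshots of the same code base, one year apart.
year_one = {"oldlib": ["billing"]}
year_two = {
    "oldlib": ["billing", "reports", "auth-plugin"],
    "auth-plugin": ["admin-ui"],
    "reports": ["exports"],
}

print(len(transitive_dependents(year_one, "oldlib")), "package to re-test in year one")
print(len(transitive_dependents(year_two, "oldlib")), "packages to re-test in year two")
```

In this made-up graph, the number of direct dependents triples but the number of packages affected by the upgrade grows fivefold, because new dependents bring their own dependents with them.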
This can also happen with data. In addition to having dependencies, data can also grow in size to a point that makes migrations really difficult (without large downtime, complicated double-write logic, etc).
In general, good coding practice and hygiene (like keeping things modular and decoupled where possible) can help minimize intractable migrations, but not always. If a piece of technical debt keeps coming up in different contexts, it’s probably best dealt with before it becomes the center of a future intractable migration.
Prioritizing Using This Framework
So when thinking about technical debt, here are the questions to ask:
- What level of interrupts is this piece of debt causing? Anything with high interrupts should be fixed as soon as possible, which will improve morale and free up time/energy to deal with other issues.
- How much inertia is this piece of debt causing? Inertia is less dangerous than interrupts, but the more you can reduce inertia, the faster you’ll be able to both improve the product and pay down other technical debt.
- If the debt is not causing immediate cost, what will the ultimate cost be when the debt needs to be paid down? Is there a chance of running into an invisible wall or creating an intractable migration?
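The questions above can be read as a rough triage order. Here is a minimal sketch of that ordering; the `DebtItem` fields, the thresholds, and the labels are all hypothetical and would need tuning for a real team.

```python
from dataclasses import dataclass

# Hypothetical thresholds; pick values that match your own team.
INTERRUPT_THRESHOLD = 1.0  # interrupts per week
INERTIA_THRESHOLD = 5.0    # engineer-hours lost per week


@dataclass
class DebtItem:
    name: str
    interrupts_per_week: float = 0.0
    inertia_hours_per_week: float = 0.0
    invisible_wall_risk: bool = False
    intractable_migration_risk: bool = False


def triage(item):
    """Ask the three questions in order and return a coarse recommendation."""
    if item.interrupts_per_week >= INTERRUPT_THRESHOLD:
        return "fix as soon as possible"
    if item.inertia_hours_per_week >= INERTIA_THRESHOLD:
        return "schedule soon"
    if item.invisible_wall_risk or item.intractable_migration_risk:
        return "pay down before it bites"
    return "defer and revisit"


backlog = [
    DebtItem("manual server restarts", interrupts_per_week=3),
    DebtItem("slow test suite", inertia_hours_per_week=8),
    DebtItem("database nearing vertical limit", invisible_wall_risk=True),
    DebtItem("optional library upgrade"),
]

for item in backlog:
    print(f"{item.name}: {triage(item)}")
```

The order of the checks mirrors the order of the questions, which is a simplification: a looming invisible wall could reasonably outrank moderate inertia, so treat the output as a starting point for discussion rather than a verdict.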
Immediate User Perceived Impact
As a final note, one thing you’ll notice I didn’t talk too much about is the immediate user-perceived impact of technical debt (i.e. bugs, downtime, etc). Obviously, if technical debt is causing immediate problems to your users and your business, that debt is high-priority. It should be easy to reason about the importance of that debt and to get approval from both engineering and product teams, and in fact, the cost of that problem can be more directly compared to other product work.