Addressing Tech Debt

The Agile philosophy has left us poorly equipped to deal with this reality.

Tue Aug 15 2017

Let's define the term:

Tech debt is any part of the technology a project uses that causes:

slower development, because adding features becomes more complex
a bad or ineffective product
users, operations, sales, or developers to spend time manually fixing things the product should do
maintenance cost of fixing bugs
maintenance cost of upgrading legacy systems
maintenance cost of not breaking legacy systems when adding new things (land mines)
increased surface area that makes the code harder to grok

How we (don't) think about Tech Debt

Modern software development is strongly influenced by Agile. There are many good aspects to Agile, but it has its share of flaws. That full list is a separate debate :)

This ties back to Tech Debt through Kanban boards. Originally designed for factories, Kanban boards visualize the flow of inventory through the factory. In the software profession, we've largely adopted that methodology under the theory that software can be built in a similar fashion.

The problem is this: factories take in inventory and output goods. Software takes in feature requests and outputs products, tech debt, and bugs. The latter two become future inputs. We don't have a factory's uni-directional flow. That makes Kanban like a 2D representation of a 3D reality, yet we still tend to think of a generic "backlog" when prioritizing tasks.

A Fourth Thing

There's a concept I've been struggling to name for a while, so bear with me:

There's actually an implicit category between new features and tech debt that we don't really have a good label for: inputs that are both. The best name I've come up with is "unrealized gains".

	Tech Debt	Not Tech Debt
Feature	"unrealized gain"	new feature
Not a Feature	traditional tech debt	bug

These are things that can provide new or better functionality and pay down tech debt at the same time. Re-writes are generally offered as solutions to these problems. Leaving an unrealized gain undone carries a double opportunity cost: the ongoing interest on the tech debt and the lost chance to provide better functionality to users.
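The grid above can be written down as a tiny classifier, which might be handy for tagging backlog items. This is only a sketch: the `Category` enum and the `classify` function are hypothetical names made up for illustration, not part of any tool.

```python
from enum import Enum


class Category(Enum):
    UNREALIZED_GAIN = "unrealized gain"
    NEW_FEATURE = "new feature"
    TECH_DEBT = "traditional tech debt"
    BUG = "bug"


def classify(is_feature: bool, is_tech_debt: bool) -> Category:
    """Map the two axes of the grid onto one of its four cells."""
    if is_feature:
        return Category.UNREALIZED_GAIN if is_tech_debt else Category.NEW_FEATURE
    return Category.TECH_DEBT if is_tech_debt else Category.BUG


# A change that ships better functionality and also pays down debt
# lands in the top-left cell of the grid.
print(classify(is_feature=True, is_tech_debt=True).value)  # unrealized gain
```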

Existential Tensions

In addition to the tension of features vs. tech debt vs. unrealized gains, there's the meta tension of company/project viability. Meeting a code coverage threshold or architecting the perfect system doesn't matter if the company fails.

Startups (and projects) go through an arc that looks roughly like:

Is this a thing? (What if we put a motor on a frame with 4 wheels? Could we get to Kansas in time for Christmas?)
Is it a big enough thing? Can this team do the thing? (It's working! But it sure would be nice to have seats, and maybe windows. Kansas here we come!)
Can this be a big business? (Kansas is pretty cool. But now they all want one of these car things for Christmas.)

At step 1, there's no time to slow down. You've got to figure out as quickly as possible if a motor can even power four wheels together. At the same time, make future-looking decisions: don't build your frame from matchsticks just because they're cheap. Metal is a better choice.

At step 2, you can't afford to slow down. If anything, things are harder now because you're building the car while it's moving. If you picked lightweight aluminum for your frame but now realize it makes the car too fragile in crashes, it's too late to change. Keep going with what you've got and plan to improve on the next iteration.

If you make it to step 3, it's time to worry about scale. Re-visit fundamental decisions if you have to, but avoid re-writing. Iterate on small things and plan to release quickly.

When do we work on tech debt?

For a small startup, with just a few engineers, here are some useful questions to ask:

Is it actually an "unrealized gain"? Prefer this over normal tech debt.
If you don't have this, how will the development team be affected in 6 months? The customers? The company? Prefer things that have a larger impact over a longer time scale.
Same question, but what if you do solve the problem?
What can go wrong by doing this task? How bad would that be? Balance great risk with great reward.
How many people are affected by the second order effects?
Can this task be broken into smaller tasks that are individually shippable? Prefer things you can validate early.
If you don't do this, how many person-hours will it cost you each week? Prefer to save person-hours only when the fix costs less than the time it saves.
How big is the improvement? A 10% improvement is good for a company at step 3, but at step 1 we're expecting things more on the order of 1000%. At step 2 you can slow down to 500%.
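
The person-hours question above lends itself to back-of-the-envelope arithmetic. Here's a minimal sketch, assuming you can estimate both the one-off cost of the fix and the hours the problem wastes each week. The function name and the numbers are made up for illustration.

```python
def weeks_to_break_even(fix_cost_hours: float, weekly_cost_hours: float) -> float:
    """Weeks until the hours saved equal the hours spent on the fix."""
    if weekly_cost_hours <= 0:
        raise ValueError("weekly_cost_hours must be positive")
    return fix_cost_hours / weekly_cost_hours


# Hypothetical example: a 40-hour fix for a chore that eats 5 hours a week.
print(weeks_to_break_even(40, 5))  # 8.0
```

If the break-even point lands well past the horizon you care about (6 months is roughly 26 weeks), the task probably isn't cheaper than living with the problem for now.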