The Bottleneck is Not Where You Think

(This post was written by a human. AI did no writing, only brainstorming)

Something unprecedented is happening in software and how we build it.

As a quick example: at Monarch, a few months ago, most of our code was written by hand, with maybe 20% written by AI. And the AI-written code came primarily from auto-complete, with heavy human involvement.

Today, 70-80% of our code is written by AI, increasingly through agentic methods (Claude Code, Cursor, etc). This is a huge shift, and we’re still reeling, trying to understand what it means.

And we’re, by design, a step behind the bleeding edge. Many other teams are ahead of us, implementing Code Factories or Dark Factories or whatever the term of the month is.

This had me thinking back to my (brief) days managing operations at a factory.

Bottlenecks

I’ve spent most of my career in tech (particularly, consumer software). I’ve started a few companies, done the FAANG thing, and been through acquisitions. But for a brief moment, as a detour, I ran operations at a factory. No need to get into the details of what that factory produced, but suffice it to say, the finished product was large (6 meters in length), heavy (multiple tons), involved a mix of concrete and steel, and took over a week to produce.

We experienced a period of rapid growth, which resulted in many operational hiccups as we grew. I quickly learned to think about bottlenecks.

You can think of manufacturing as some sort of Directed Acyclic Graph (DAG). There are multiple steps, and each step requires some inputs (some mix of labor, machinery, and raw material). At any point in time, only one step is, by definition, the system’s bottleneck.

Now, you have to draw your system boundaries pretty widely. For instance, you could have really high production capacity, but if your sales team isn’t selling the product, that’s the bottleneck. Or, you could be low on a certain type of skilled labor, or a certain type of material (a particular type of fuel, bolt, or clip). Or, it could be your ability to assure or control quality (QA/QC).

You can still draw smaller boundaries of subsystems to reason about those particular subsystems, but if you want to influence the overall capacity of the system, your boundary needs to be very wide.

Bottlenecks are actually pretty easy to reason about, and if you know a little about them, it ends up being common sense. Surprisingly, a lot of people don’t reason about them. But if you do, you soon realize:

  • There can only be one system bottleneck at a time (stated again). Short of changing your entire process, the only way to increase the system’s throughput is to increase the capacity of that bottleneck.
  • As soon as you “fix” that bottleneck, the bottleneck shifts somewhere else. This, again, follows from the definition of a bottleneck, but it’s often surprising to people how quickly the bottleneck shifts. In other words, doubling the capacity of your bottleneck probably won’t double overall capacity, because you’ll likely hit a new bottleneck. This happens very frequently during growth.
  • If you don’t understand your bottlenecks, bad things can happen. If the bottleneck is upstream (eg, a prior step), we call the downstream process starved. If the bottleneck is downstream (eg, a later step), we call the upstream process blocked. Inventory can build up, quality can degrade, workers can lose morale. It’s bad for business.

So, an ops manager always has (written down somewhere or in their head) some understanding of their process, the capacity at each step, and where the bottleneck is (ideally in measurements like cycle time, throughput, etc). That lets them answer questions like:

  • “We just doubled our ability to produce concrete, but our production didn’t increase.” -> Yes, because that wasn’t our bottleneck. Something else was.
  • “Inventory is building up at Step X” -> yes, because Step X – 1 is overproducing and Step X is the bottleneck.
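The intuition above can be sketched in a few lines of code. This is a toy model, not anything from a real factory: the step names and capacity numbers are made up for illustration, and it assumes a simple serial line where each step has a fixed capacity in units per day.

```python
# Toy model of a serial production line. Each step has a fixed
# capacity (units/day); step names and numbers are hypothetical.

def throughput(capacities: dict) -> float:
    """System throughput is capped by the slowest step (the bottleneck)."""
    return min(capacities.values())

def bottleneck(capacities: dict) -> str:
    """The step with the lowest capacity is, by definition, the bottleneck."""
    return min(capacities, key=capacities.get)

line = {"concrete": 10, "steel_frame": 6, "assembly": 8, "qa": 7}
print(bottleneck(line), throughput(line))  # steel_frame 6

# Doubling a non-bottleneck step changes nothing overall...
line["concrete"] = 20
print(throughput(line))  # still 6

# ...and fixing the bottleneck helps only until the next one binds.
line["steel_frame"] = 12
print(bottleneck(line), throughput(line))  # qa 7, not 12
```

Note how doubling concrete capacity leaves throughput untouched, and doubling the old bottleneck raises throughput only to 7, where QA now binds, exactly the two surprises described above.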

Back to Software

Our ability to write code has sped up dramatically. Initially, that moved the bottleneck to reviewing, testing, and verifying code. But even those steps are being collapsed by AI. What this has meant is that software development has, overall, sped up pretty dramatically. And it will continue to do so as the coding models, the tools, and the workflows all improve (they cascade in that order by the way: first the models improve, then the tools, then the workflows).

But on many teams, that was never really the bottleneck. People think it is, and for certain teams at certain times, it is, but the real bottlenecks are often:

  • Deciding what should be built (ie deciding what problem to solve).
  • Building it well (ie it delivers customer value).
  • Communicating about it to customers (sales, marketing, distribution, whatever else you want to call it).
  • Reducing/eliminating risk and liability (security, legal, brand/reputational, etc).

Doing these four things has always required time, coordination, and lots of communication. Now, these steps are collapsing too, but within them, new bottlenecks (to the overall system and to each subsystem) emerge.

The challenge is that software development is not like manufacturing. It is a much softer craft. And despite many attempts to quantify and measure it, every seasoned software developer knows that Lines of Code, Pull Requests, and Commits are vanity metrics; DORA metrics are useful but measure speed, not value; and Features Shipped doesn’t measure product quality or code quality.

So reasoning about bottlenecks becomes much more difficult. You can and should try to quantify it, but ultimately, your best bet will be asking the people doing the actual work (especially the ones who have experience) where they think the bottleneck is.

If you do this exercise on your team, my guess is you will find that most people will tell you that the bottleneck is human judgment, human verification, and human coordination. You can build Gas Town and try to take the human out of the loop, but now, you lose the confidence of human judgment, verification, and accountability (accountability is actually a very, very expensive thing to lose, but that’s a topic for another post).

So What’s the Answer

In case it isn’t clear, I think the first step is to try and build a solid understanding of your system, with different boundaries (wide and narrow), and where the bottlenecks actually are. Continuing to speed up the writing of code won’t help if your bottleneck is elsewhere.

You need to take two lenses:

  • For narrow system boundaries, what are the bottlenecks and how can we speed them up? At each point in the process, how can we give models the context they need and the ability to verify their own work, incorporating human judgment only where needed?
  • At the same time, for broader system boundaries, explore what end-to-end automation might look like. This will involve changes to workflows. Should we prototype more, and earlier? Probably. Do we still need legal/security review? Absolutely. Can we speed up those reviews with AI? Certainly.

We need to understand that this is all a moving target, as more capable models become available, as we build better tools around them, and as we adapt our workflows to them. And we need to understand that if we attack the wrong bottlenecks, we can make things worse. For example, you can quickly ship a lot of junk code and junk features, which will feel great now, but slow things down in the future.

It’s a crazy time to be a builder of software. An even crazier time to be a builder of systems of humans and machines that build software. Happy building.
